[
  {
    "path": ".gitignore",
    "content": ".DS_Store\n.idea/\n.csv/\ntests.py\nnotes.txt\nbuild/\n.pytest_cache\nupdate_season_data.py\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib/hockey_scraper\nlib64/\nparts/\nsdist/\nvar/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n.static_storage/\n.media/\nlocal_settings.py\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n*.csv\n"
  },
  {
    "path": ".vscode/settings.json",
    "content": "{\n}"
  },
  {
    "path": "CHANGELOG.rst",
    "content": "v1.2.6\n------\n\n  * Added test coverage for most modules using pytest\n  * Refactored large portion of 'html_pbp.py' and corrected minor parser fixes in regards to penalties\n  * Added the module 'save_pages.py' which allows one to saves scraped files\n  * Added keyword arguments 'rescrape' and 'docs_dir' to the three main scraping functions. Specifying a valid directory using 'docs_dir' will make us check if a file was already scraped and saved before getting it from the source. It will also provide a location for us to save it if we don't have it yet. 'rescrape' only applies when a valid directory is provided with 'docs_dir'. Setting 'rescrape' equal to True will have us scrape the file from the source even if it's saved and save this new one.\n\nv1.2.7\n------\n  * Added functionality to easier scrape live games\n  * Fixed user warnings\n\nv1.3\n----\n  * Added functionality to scrape NWHL data\n\nv1.31\n-----\n  * Added functionality to automatically create docs_dir\n  * Added folder to store csv files\n\nv1.33\n-----\n  * Fixed bug with nhl changing contents of eventTypeId\n  * Updated ESPN scraping after they changed the layout of the pages\n\nv1.34\n-----\n  * Reflected change in url for ESPN scoreboard\n  * Deprecated NWHL usage due as pbp parser isn't applicable due to UI changes (new source unknown)\n\nv1.35\n-----\n  * Added nhl.scrape_function.scrape_schedule function\n  * Now chunk calls to the nhl schedule api\n  * Fixed nhl shift json endpoint\n\nv1.36\n-----\n  * Refactored and cleaned up code across modules\n  * Added names to utils.shared.Names\n  * Changed errors/warning to print red in the console\n\nv1.37\n-----\n  * Now saves scraped pages in docs_dir as a GZIP\n  * Only print full error summary when the number of games scraped is >= 25\n  * Remove hardcoded exception for Sebastian Aho. 
Updated process to work without it.\n  * Always rescrape schedule pages\n\nv1.38\n------\n  * Convert tri-codes from new format to old in HTML PBP. Mappings are stored in utils/tri_code_conversion.json.\n  * Added verbose option to top-level scrape functions\n  * Replaced \"lxml\" with \"html5lib\" as the default parser for HTML PBP. lxml was having issues with older games.\n  * Reduced chunk size in nhl.json_schedule.chunk_schedule_calls from 50 to 30 after some issues during tests.\n\nv1.39\n------\n  * Changed API endpoints"
  },
  {
    "path": "LICENSE.txt",
    "content": "                    GNU GENERAL PUBLIC LICENSE\n                       Version 3, 29 June 2007\n\n Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>\n Everyone is permitted to copy and distribute verbatim copies\n of this license document, but changing it is not allowed.\n\n                            Preamble\n\n  The GNU General Public License is a free, copyleft license for\nsoftware and other kinds of works.\n\n  The licenses for most software and other practical works are designed\nto take away your freedom to share and change the works.  By contrast,\nthe GNU General Public License is intended to guarantee your freedom to\nshare and change all versions of a program--to make sure it remains free\nsoftware for all its users.  We, the Free Software Foundation, use the\nGNU General Public License for most of our software; it applies also to\nany other work released this way by its authors.  You can apply it to\nyour programs, too.\n\n  When we speak of free software, we are referring to freedom, not\nprice.  Our General Public Licenses are designed to make sure that you\nhave the freedom to distribute copies of free software (and charge for\nthem if you wish), that you receive source code or can get it if you\nwant it, that you can change the software or use pieces of it in new\nfree programs, and that you know you can do these things.\n\n  To protect your rights, we need to prevent others from denying you\nthese rights or asking you to surrender the rights.  Therefore, you have\ncertain responsibilities if you distribute copies of the software, or if\nyou modify it: responsibilities to respect the freedom of others.\n\n  For example, if you distribute copies of such a program, whether\ngratis or for a fee, you must pass on to the recipients the same\nfreedoms that you received.  You must make sure that they, too, receive\nor can get the source code.  
And you must show them these terms so they\nknow their rights.\n\n  Developers that use the GNU GPL protect your rights with two steps:\n(1) assert copyright on the software, and (2) offer you this License\ngiving you legal permission to copy, distribute and/or modify it.\n\n  For the developers' and authors' protection, the GPL clearly explains\nthat there is no warranty for this free software.  For both users' and\nauthors' sake, the GPL requires that modified versions be marked as\nchanged, so that their problems will not be attributed erroneously to\nauthors of previous versions.\n\n  Some devices are designed to deny users access to install or run\nmodified versions of the software inside them, although the manufacturer\ncan do so.  This is fundamentally incompatible with the aim of\nprotecting users' freedom to change the software.  The systematic\npattern of such abuse occurs in the area of products for individuals to\nuse, which is precisely where it is most unacceptable.  Therefore, we\nhave designed this version of the GPL to prohibit the practice for those\nproducts.  If such problems arise substantially in other domains, we\nstand ready to extend this provision to those domains in future versions\nof the GPL, as needed to protect the freedom of users.\n\n  Finally, every program is threatened constantly by software patents.\nStates should not allow patents to restrict development and use of\nsoftware on general-purpose computers, but in those that do, we wish to\navoid the special danger that patents applied to a free program could\nmake it effectively proprietary.  To prevent this, the GPL assures that\npatents cannot be used to render the program non-free.\n\n  The precise terms and conditions for copying, distribution and\nmodification follow.\n\n                       TERMS AND CONDITIONS\n\n  0. 
Definitions.\n\n  \"This License\" refers to version 3 of the GNU General Public License.\n\n  \"Copyright\" also means copyright-like laws that apply to other kinds of\nworks, such as semiconductor masks.\n\n  \"The Program\" refers to any copyrightable work licensed under this\nLicense.  Each licensee is addressed as \"you\".  \"Licensees\" and\n\"recipients\" may be individuals or organizations.\n\n  To \"modify\" a work means to copy from or adapt all or part of the work\nin a fashion requiring copyright permission, other than the making of an\nexact copy.  The resulting work is called a \"modified version\" of the\nearlier work or a work \"based on\" the earlier work.\n\n  A \"covered work\" means either the unmodified Program or a work based\non the Program.\n\n  To \"propagate\" a work means to do anything with it that, without\npermission, would make you directly or secondarily liable for\ninfringement under applicable copyright law, except executing it on a\ncomputer or modifying a private copy.  Propagation includes copying,\ndistribution (with or without modification), making available to the\npublic, and in some countries other activities as well.\n\n  To \"convey\" a work means any kind of propagation that enables other\nparties to make or receive copies.  Mere interaction with a user through\na computer network, with no transfer of a copy, is not conveying.\n\n  An interactive user interface displays \"Appropriate Legal Notices\"\nto the extent that it includes a convenient and prominently visible\nfeature that (1) displays an appropriate copyright notice, and (2)\ntells the user that there is no warranty for the work (except to the\nextent that warranties are provided), that licensees may convey the\nwork under this License, and how to view a copy of this License.  If\nthe interface presents a list of user commands or options, such as a\nmenu, a prominent item in the list meets this criterion.\n\n  1. 
Source Code.\n\n  The \"source code\" for a work means the preferred form of the work\nfor making modifications to it.  \"Object code\" means any non-source\nform of a work.\n\n  A \"Standard Interface\" means an interface that either is an official\nstandard defined by a recognized standards body, or, in the case of\ninterfaces specified for a particular programming language, one that\nis widely used among developers working in that language.\n\n  The \"System Libraries\" of an executable work include anything, other\nthan the work as a whole, that (a) is included in the normal form of\npackaging a Major Component, but which is not part of that Major\nComponent, and (b) serves only to enable use of the work with that\nMajor Component, or to implement a Standard Interface for which an\nimplementation is available to the public in source code form.  A\n\"Major Component\", in this context, means a major essential component\n(kernel, window system, and so on) of the specific operating system\n(if any) on which the executable work runs, or a compiler used to\nproduce the work, or an object code interpreter used to run it.\n\n  The \"Corresponding Source\" for a work in object code form means all\nthe source code needed to generate, install, and (for an executable\nwork) run the object code and to modify the work, including scripts to\ncontrol those activities.  However, it does not include the work's\nSystem Libraries, or general-purpose tools or generally available free\nprograms which are used unmodified in performing those activities but\nwhich are not part of the work.  
For example, Corresponding Source\nincludes interface definition files associated with source files for\nthe work, and the source code for shared libraries and dynamically\nlinked subprograms that the work is specifically designed to require,\nsuch as by intimate data communication or control flow between those\nsubprograms and other parts of the work.\n\n  The Corresponding Source need not include anything that users\ncan regenerate automatically from other parts of the Corresponding\nSource.\n\n  The Corresponding Source for a work in source code form is that\nsame work.\n\n  2. Basic Permissions.\n\n  All rights granted under this License are granted for the term of\ncopyright on the Program, and are irrevocable provided the stated\nconditions are met.  This License explicitly affirms your unlimited\npermission to run the unmodified Program.  The output from running a\ncovered work is covered by this License only if the output, given its\ncontent, constitutes a covered work.  This License acknowledges your\nrights of fair use or other equivalent, as provided by copyright law.\n\n  You may make, run and propagate covered works that you do not\nconvey, without conditions so long as your license otherwise remains\nin force.  You may convey covered works to others for the sole purpose\nof having them make modifications exclusively for you, or provide you\nwith facilities for running those works, provided that you comply with\nthe terms of this License in conveying all material for which you do\nnot control copyright.  Those thus making or running the covered works\nfor you must do so exclusively on your behalf, under your direction\nand control, on terms that prohibit them from making any copies of\nyour copyrighted material outside their relationship with you.\n\n  Conveying under any other circumstances is permitted solely under\nthe conditions stated below.  Sublicensing is not allowed; section 10\nmakes it unnecessary.\n\n  3. 
Protecting Users' Legal Rights From Anti-Circumvention Law.\n\n  No covered work shall be deemed part of an effective technological\nmeasure under any applicable law fulfilling obligations under article\n11 of the WIPO copyright treaty adopted on 20 December 1996, or\nsimilar laws prohibiting or restricting circumvention of such\nmeasures.\n\n  When you convey a covered work, you waive any legal power to forbid\ncircumvention of technological measures to the extent such circumvention\nis effected by exercising rights under this License with respect to\nthe covered work, and you disclaim any intention to limit operation or\nmodification of the work as a means of enforcing, against the work's\nusers, your or third parties' legal rights to forbid circumvention of\ntechnological measures.\n\n  4. Conveying Verbatim Copies.\n\n  You may convey verbatim copies of the Program's source code as you\nreceive it, in any medium, provided that you conspicuously and\nappropriately publish on each copy an appropriate copyright notice;\nkeep intact all notices stating that this License and any\nnon-permissive terms added in accord with section 7 apply to the code;\nkeep intact all notices of the absence of any warranty; and give all\nrecipients a copy of this License along with the Program.\n\n  You may charge any price or no price for each copy that you convey,\nand you may offer support or warranty protection for a fee.\n\n  5. Conveying Modified Source Versions.\n\n  You may convey a work based on the Program, or the modifications to\nproduce it from the Program, in the form of source code under the\nterms of section 4, provided that you also meet all of these conditions:\n\n    a) The work must carry prominent notices stating that you modified\n    it, and giving a relevant date.\n\n    b) The work must carry prominent notices stating that it is\n    released under this License and any conditions added under section\n    7.  
This requirement modifies the requirement in section 4 to\n    \"keep intact all notices\".\n\n    c) You must license the entire work, as a whole, under this\n    License to anyone who comes into possession of a copy.  This\n    License will therefore apply, along with any applicable section 7\n    additional terms, to the whole of the work, and all its parts,\n    regardless of how they are packaged.  This License gives no\n    permission to license the work in any other way, but it does not\n    invalidate such permission if you have separately received it.\n\n    d) If the work has interactive user interfaces, each must display\n    Appropriate Legal Notices; however, if the Program has interactive\n    interfaces that do not display Appropriate Legal Notices, your\n    work need not make them do so.\n\n  A compilation of a covered work with other separate and independent\nworks, which are not by their nature extensions of the covered work,\nand which are not combined with it such as to form a larger program,\nin or on a volume of a storage or distribution medium, is called an\n\"aggregate\" if the compilation and its resulting copyright are not\nused to limit the access or legal rights of the compilation's users\nbeyond what the individual works permit.  Inclusion of a covered work\nin an aggregate does not cause this License to apply to the other\nparts of the aggregate.\n\n  6. 
Conveying Non-Source Forms.\n\n  You may convey a covered work in object code form under the terms\nof sections 4 and 5, provided that you also convey the\nmachine-readable Corresponding Source under the terms of this License,\nin one of these ways:\n\n    a) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by the\n    Corresponding Source fixed on a durable physical medium\n    customarily used for software interchange.\n\n    b) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by a\n    written offer, valid for at least three years and valid for as\n    long as you offer spare parts or customer support for that product\n    model, to give anyone who possesses the object code either (1) a\n    copy of the Corresponding Source for all the software in the\n    product that is covered by this License, on a durable physical\n    medium customarily used for software interchange, for a price no\n    more than your reasonable cost of physically performing this\n    conveying of source, or (2) access to copy the\n    Corresponding Source from a network server at no charge.\n\n    c) Convey individual copies of the object code with a copy of the\n    written offer to provide the Corresponding Source.  This\n    alternative is allowed only occasionally and noncommercially, and\n    only if you received the object code with such an offer, in accord\n    with subsection 6b.\n\n    d) Convey the object code by offering access from a designated\n    place (gratis or for a charge), and offer equivalent access to the\n    Corresponding Source in the same way through the same place at no\n    further charge.  You need not require recipients to copy the\n    Corresponding Source along with the object code.  
If the place to\n    copy the object code is a network server, the Corresponding Source\n    may be on a different server (operated by you or a third party)\n    that supports equivalent copying facilities, provided you maintain\n    clear directions next to the object code saying where to find the\n    Corresponding Source.  Regardless of what server hosts the\n    Corresponding Source, you remain obligated to ensure that it is\n    available for as long as needed to satisfy these requirements.\n\n    e) Convey the object code using peer-to-peer transmission, provided\n    you inform other peers where the object code and Corresponding\n    Source of the work are being offered to the general public at no\n    charge under subsection 6d.\n\n  A separable portion of the object code, whose source code is excluded\nfrom the Corresponding Source as a System Library, need not be\nincluded in conveying the object code work.\n\n  A \"User Product\" is either (1) a \"consumer product\", which means any\ntangible personal property which is normally used for personal, family,\nor household purposes, or (2) anything designed or sold for incorporation\ninto a dwelling.  In determining whether a product is a consumer product,\ndoubtful cases shall be resolved in favor of coverage.  For a particular\nproduct received by a particular user, \"normally used\" refers to a\ntypical or common use of that class of product, regardless of the status\nof the particular user or of the way in which the particular user\nactually uses, or expects or is expected to use, the product.  
A product\nis a consumer product regardless of whether the product has substantial\ncommercial, industrial or non-consumer uses, unless such uses represent\nthe only significant mode of use of the product.\n\n  \"Installation Information\" for a User Product means any methods,\nprocedures, authorization keys, or other information required to install\nand execute modified versions of a covered work in that User Product from\na modified version of its Corresponding Source.  The information must\nsuffice to ensure that the continued functioning of the modified object\ncode is in no case prevented or interfered with solely because\nmodification has been made.\n\n  If you convey an object code work under this section in, or with, or\nspecifically for use in, a User Product, and the conveying occurs as\npart of a transaction in which the right of possession and use of the\nUser Product is transferred to the recipient in perpetuity or for a\nfixed term (regardless of how the transaction is characterized), the\nCorresponding Source conveyed under this section must be accompanied\nby the Installation Information.  But this requirement does not apply\nif neither you nor any third party retains the ability to install\nmodified object code on the User Product (for example, the work has\nbeen installed in ROM).\n\n  The requirement to provide Installation Information does not include a\nrequirement to continue to provide support service, warranty, or updates\nfor a work that has been modified or installed by the recipient, or for\nthe User Product in which it has been modified or installed.  
Access to a\nnetwork may be denied when the modification itself materially and\nadversely affects the operation of the network or violates the rules and\nprotocols for communication across the network.\n\n  Corresponding Source conveyed, and Installation Information provided,\nin accord with this section must be in a format that is publicly\ndocumented (and with an implementation available to the public in\nsource code form), and must require no special password or key for\nunpacking, reading or copying.\n\n  7. Additional Terms.\n\n  \"Additional permissions\" are terms that supplement the terms of this\nLicense by making exceptions from one or more of its conditions.\nAdditional permissions that are applicable to the entire Program shall\nbe treated as though they were included in this License, to the extent\nthat they are valid under applicable law.  If additional permissions\napply only to part of the Program, that part may be used separately\nunder those permissions, but the entire Program remains governed by\nthis License without regard to the additional permissions.\n\n  When you convey a copy of a covered work, you may at your option\nremove any additional permissions from that copy, or from any part of\nit.  (Additional permissions may be written to require their own\nremoval in certain cases when you modify the work.)  
You may place\nadditional permissions on material, added by you to a covered work,\nfor which you have or can give appropriate copyright permission.\n\n  Notwithstanding any other provision of this License, for material you\nadd to a covered work, you may (if authorized by the copyright holders of\nthat material) supplement the terms of this License with terms:\n\n    a) Disclaiming warranty or limiting liability differently from the\n    terms of sections 15 and 16 of this License; or\n\n    b) Requiring preservation of specified reasonable legal notices or\n    author attributions in that material or in the Appropriate Legal\n    Notices displayed by works containing it; or\n\n    c) Prohibiting misrepresentation of the origin of that material, or\n    requiring that modified versions of such material be marked in\n    reasonable ways as different from the original version; or\n\n    d) Limiting the use for publicity purposes of names of licensors or\n    authors of the material; or\n\n    e) Declining to grant rights under trademark law for use of some\n    trade names, trademarks, or service marks; or\n\n    f) Requiring indemnification of licensors and authors of that\n    material by anyone who conveys the material (or modified versions of\n    it) with contractual assumptions of liability to the recipient, for\n    any liability that these contractual assumptions directly impose on\n    those licensors and authors.\n\n  All other non-permissive additional terms are considered \"further\nrestrictions\" within the meaning of section 10.  If the Program as you\nreceived it, or any part of it, contains a notice stating that it is\ngoverned by this License along with a term that is a further\nrestriction, you may remove that term.  
If a license document contains\na further restriction but permits relicensing or conveying under this\nLicense, you may add to a covered work material governed by the terms\nof that license document, provided that the further restriction does\nnot survive such relicensing or conveying.\n\n  If you add terms to a covered work in accord with this section, you\nmust place, in the relevant source files, a statement of the\nadditional terms that apply to those files, or a notice indicating\nwhere to find the applicable terms.\n\n  Additional terms, permissive or non-permissive, may be stated in the\nform of a separately written license, or stated as exceptions;\nthe above requirements apply either way.\n\n  8. Termination.\n\n  You may not propagate or modify a covered work except as expressly\nprovided under this License.  Any attempt otherwise to propagate or\nmodify it is void, and will automatically terminate your rights under\nthis License (including any patent licenses granted under the third\nparagraph of section 11).\n\n  However, if you cease all violation of this License, then your\nlicense from a particular copyright holder is reinstated (a)\nprovisionally, unless and until the copyright holder explicitly and\nfinally terminates your license, and (b) permanently, if the copyright\nholder fails to notify you of the violation by some reasonable means\nprior to 60 days after the cessation.\n\n  Moreover, your license from a particular copyright holder is\nreinstated permanently if the copyright holder notifies you of the\nviolation by some reasonable means, this is the first time you have\nreceived notice of violation of this License (for any work) from that\ncopyright holder, and you cure the violation prior to 30 days after\nyour receipt of the notice.\n\n  Termination of your rights under this section does not terminate the\nlicenses of parties who have received copies or rights from you under\nthis License.  
If your rights have been terminated and not permanently\nreinstated, you do not qualify to receive new licenses for the same\nmaterial under section 10.\n\n  9. Acceptance Not Required for Having Copies.\n\n  You are not required to accept this License in order to receive or\nrun a copy of the Program.  Ancillary propagation of a covered work\noccurring solely as a consequence of using peer-to-peer transmission\nto receive a copy likewise does not require acceptance.  However,\nnothing other than this License grants you permission to propagate or\nmodify any covered work.  These actions infringe copyright if you do\nnot accept this License.  Therefore, by modifying or propagating a\ncovered work, you indicate your acceptance of this License to do so.\n\n  10. Automatic Licensing of Downstream Recipients.\n\n  Each time you convey a covered work, the recipient automatically\nreceives a license from the original licensors, to run, modify and\npropagate that work, subject to this License.  You are not responsible\nfor enforcing compliance by third parties with this License.\n\n  An \"entity transaction\" is a transaction transferring control of an\norganization, or substantially all assets of one, or subdividing an\norganization, or merging organizations.  If propagation of a covered\nwork results from an entity transaction, each party to that\ntransaction who receives a copy of the work also receives whatever\nlicenses to the work the party's predecessor in interest had or could\ngive under the previous paragraph, plus a right to possession of the\nCorresponding Source of the work from the predecessor in interest, if\nthe predecessor has it or can get it with reasonable efforts.\n\n  You may not impose any further restrictions on the exercise of the\nrights granted or affirmed under this License.  
For example, you may\nnot impose a license fee, royalty, or other charge for exercise of\nrights granted under this License, and you may not initiate litigation\n(including a cross-claim or counterclaim in a lawsuit) alleging that\nany patent claim is infringed by making, using, selling, offering for\nsale, or importing the Program or any portion of it.\n\n  11. Patents.\n\n  A \"contributor\" is a copyright holder who authorizes use under this\nLicense of the Program or a work on which the Program is based.  The\nwork thus licensed is called the contributor's \"contributor version\".\n\n  A contributor's \"essential patent claims\" are all patent claims\nowned or controlled by the contributor, whether already acquired or\nhereafter acquired, that would be infringed by some manner, permitted\nby this License, of making, using, or selling its contributor version,\nbut do not include claims that would be infringed only as a\nconsequence of further modification of the contributor version.  For\npurposes of this definition, \"control\" includes the right to grant\npatent sublicenses in a manner consistent with the requirements of\nthis License.\n\n  Each contributor grants you a non-exclusive, worldwide, royalty-free\npatent license under the contributor's essential patent claims, to\nmake, use, sell, offer for sale, import and otherwise run, modify and\npropagate the contents of its contributor version.\n\n  In the following three paragraphs, a \"patent license\" is any express\nagreement or commitment, however denominated, not to enforce a patent\n(such as an express permission to practice a patent or covenant not to\nsue for patent infringement).  
To \"grant\" such a patent license to a\nparty means to make such an agreement or commitment not to enforce a\npatent against the party.\n\n  If you convey a covered work, knowingly relying on a patent license,\nand the Corresponding Source of the work is not available for anyone\nto copy, free of charge and under the terms of this License, through a\npublicly available network server or other readily accessible means,\nthen you must either (1) cause the Corresponding Source to be so\navailable, or (2) arrange to deprive yourself of the benefit of the\npatent license for this particular work, or (3) arrange, in a manner\nconsistent with the requirements of this License, to extend the patent\nlicense to downstream recipients.  \"Knowingly relying\" means you have\nactual knowledge that, but for the patent license, your conveying the\ncovered work in a country, or your recipient's use of the covered work\nin a country, would infringe one or more identifiable patents in that\ncountry that you have reason to believe are valid.\n\n  If, pursuant to or in connection with a single transaction or\narrangement, you convey, or propagate by procuring conveyance of, a\ncovered work, and grant a patent license to some of the parties\nreceiving the covered work authorizing them to use, propagate, modify\nor convey a specific copy of the covered work, then the patent license\nyou grant is automatically extended to all recipients of the covered\nwork and works based on it.\n\n  A patent license is \"discriminatory\" if it does not include within\nthe scope of its coverage, prohibits the exercise of, or is\nconditioned on the non-exercise of one or more of the rights that are\nspecifically granted under this License.  
You may not convey a covered\nwork if you are a party to an arrangement with a third party that is\nin the business of distributing software, under which you make payment\nto the third party based on the extent of your activity of conveying\nthe work, and under which the third party grants, to any of the\nparties who would receive the covered work from you, a discriminatory\npatent license (a) in connection with copies of the covered work\nconveyed by you (or copies made from those copies), or (b) primarily\nfor and in connection with specific products or compilations that\ncontain the covered work, unless you entered into that arrangement,\nor that patent license was granted, prior to 28 March 2007.\n\n  Nothing in this License shall be construed as excluding or limiting\nany implied license or other defenses to infringement that may\notherwise be available to you under applicable patent law.\n\n  12. No Surrender of Others' Freedom.\n\n  If conditions are imposed on you (whether by court order, agreement or\notherwise) that contradict the conditions of this License, they do not\nexcuse you from the conditions of this License.  If you cannot convey a\ncovered work so as to satisfy simultaneously your obligations under this\nLicense and any other pertinent obligations, then as a consequence you may\nnot convey it at all.  For example, if you agree to terms that obligate you\nto collect a royalty for further conveying from those to whom you convey\nthe Program, the only way you could satisfy both those terms and this\nLicense would be to refrain entirely from conveying the Program.\n\n  13. Use with the GNU Affero General Public License.\n\n  Notwithstanding any other provision of this License, you have\npermission to link or combine any covered work with a work licensed\nunder version 3 of the GNU Affero General Public License into a single\ncombined work, and to convey the resulting work.  
The terms of this\nLicense will continue to apply to the part which is the covered work,\nbut the special requirements of the GNU Affero General Public License,\nsection 13, concerning interaction through a network will apply to the\ncombination as such.\n\n  14. Revised Versions of this License.\n\n  The Free Software Foundation may publish revised and/or new versions of\nthe GNU General Public License from time to time.  Such new versions will\nbe similar in spirit to the present version, but may differ in detail to\naddress new problems or concerns.\n\n  Each version is given a distinguishing version number.  If the\nProgram specifies that a certain numbered version of the GNU General\nPublic License \"or any later version\" applies to it, you have the\noption of following the terms and conditions either of that numbered\nversion or of any later version published by the Free Software\nFoundation.  If the Program does not specify a version number of the\nGNU General Public License, you may choose any version ever published\nby the Free Software Foundation.\n\n  If the Program specifies that a proxy can decide which future\nversions of the GNU General Public License can be used, that proxy's\npublic statement of acceptance of a version permanently authorizes you\nto choose that version for the Program.\n\n  Later license versions may give you additional or different\npermissions.  However, no additional obligations are imposed on any\nauthor or copyright holder as a result of your choosing to follow a\nlater version.\n\n  15. Disclaimer of Warranty.\n\n  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY\nAPPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT\nHOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \"AS IS\" WITHOUT WARRANTY\nOF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,\nTHE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\nPURPOSE.  
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM\nIS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF\nALL NECESSARY SERVICING, REPAIR OR CORRECTION.\n\n  16. Limitation of Liability.\n\n  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING\nWILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS\nTHE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY\nGENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE\nUSE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF\nDATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD\nPARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),\nEVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF\nSUCH DAMAGES.\n\n  17. Interpretation of Sections 15 and 16.\n\n  If the disclaimer of warranty and limitation of liability provided\nabove cannot be given local legal effect according to their terms,\nreviewing courts shall apply local law that most closely approximates\nan absolute waiver of all civil liability in connection with the\nProgram, unless a warranty or assumption of liability accompanies a\ncopy of the Program in return for a fee.\n\n                     END OF TERMS AND CONDITIONS\n\n            How to Apply These Terms to Your New Programs\n\n  If you develop a new program, and you want it to be of the greatest\npossible use to the public, the best way to achieve this is to make it\nfree software which everyone can redistribute and change under these terms.\n\n  To do so, attach the following notices to the program.  
It is safest\nto attach them to the start of each source file to most effectively\nstate the exclusion of warranty; and each file should have at least\nthe \"copyright\" line and a pointer to where the full notice is found.\n\n    <one line to give the program's name and a brief idea of what it does.>\n    Copyright (C) <year>  <name of author>\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see <https://www.gnu.org/licenses/>.\n\nAlso add information on how to contact you by electronic and paper mail.\n\n  If the program does terminal interaction, make it output a short\nnotice like this when it starts in an interactive mode:\n\n    <program>  Copyright (C) <year>  <name of author>\n    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.\n    This is free software, and you are welcome to redistribute it\n    under certain conditions; type `show c' for details.\n\nThe hypothetical commands `show w' and `show c' should show the appropriate\nparts of the General Public License.  
Of course, your program's commands\nmight be different; for a GUI interface, you would use an \"about box\".\n\n  You should also get your employer (if you work as a programmer) or school,\nif any, to sign a \"copyright disclaimer\" for the program, if necessary.\nFor more information on this, and how to apply and follow the GNU GPL, see\n<https://www.gnu.org/licenses/>.\n\n  The GNU General Public License does not permit incorporating your program\ninto proprietary programs.  If your program is a subroutine library, you\nmay consider it more useful to permit linking proprietary applications with\nthe library.  If this is what you want to do, use the GNU Lesser General\nPublic License instead of this License.  But first, please read\n<https://www.gnu.org/licenses/why-not-lgpl.html>.\n\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include README.rst\n\n"
  },
  {
    "path": "README.rst",
    "content": "\nThis repository is no longer maintained. Feel free to fork it.\n==================================================================================\n\n\n.. .. image:: https://badge.fury.io/py/hockey-scraper.svg\n..    :target: https://badge.fury.io/py/hockey-scraper\n.. .. image:: https://readthedocs.org/projects/hockey-scraper/badge/?version=latest\n..    :target: https://readthedocs.org/projects/hockey-scraper/?badge=latest\n..    :alt: Documentation Status\n\n\nHockey-Scraper\n==============\n\n.. inclusion-marker-for-sphinx\n\n\nPurpose\n-------\n\nScrape NHL data off the NHL API and website. This includes the Play by Play and Shift data for each game and the schedule information. \nIt currently supports all preseason, regular season, and playoff games from the 2007-2008 season onwards. \n\nPrerequisites\n-------------\n\nYou are going to need to have python installed for this. This should work for both python 2.7 and 3. I recommend having\nfrom at least version 3.6.0 but earlier versions should be fine.\n\nInstallation\n------------\n\nTo install all you need to do is open up your terminal and run:\n\n::\n\n    pip install hockey_scraper\n\n\nNHL Usage\n---------\n\nThe full documentation can be found `here <http://hockey-scraper.readthedocs.io/en/latest/>`_.\n\nStandard Scrape Functions\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\nScrape data on a season by season level:\n\n::\n\n    import hockey_scraper\n\n    # Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file\n    hockey_scraper.scrape_seasons([2015, 2016], True)\n\n    # Scrapes the 2008 season without shifts and returns a dictionary containing the pbp Pandas DataFrame\n    scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')\n\nScrape a list of games:\n\n::\n\n    import hockey_scraper\n\n    # Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file\n    hockey_scraper.scrape_games([2014020001, 
2015020001, 2016020001], True)\n\n    # Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a Dictionary with the Pandas DataFrames\n    scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')\n\nScrape all games in a given date range:\n\n::\n\n    import hockey_scraper\n\n    # Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a Csv file\n    hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)\n\n    # Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a Dictionary with the pbp Pandas DataFrame\n    scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')\n\n\nThe dictionary returned by setting the keyword argument \"data_format\" equal to \"Pandas\" is structured like:\n\n::\n\n    {\n      # This is always included\n      'pbp': pbp_df,\n\n      # This is only included when the argument 'if_scrape_shifts' is set equal to True\n      'shifts': shifts_df\n    }\n\n\nSchedule\n~~~~~~~~\n\nThe schedule for any past or future games can be scraped as follows:\n\n::\n\n    import hockey_scraper\n\n    # As opposed to the other calls the default format is 'Pandas' which returns a DataFrame\n    sched_df = hockey_scraper.scrape_schedule(\"2019-10-01\", \"2020-07-01\")\n\nThe columns returned are: `['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']`\n\n\nPersistent Data\n~~~~~~~~~~~~~~~\n\nAll the raw game data files retrieved can also be saved to your disk. This allows for faster rescraping (we don't need to re-retrieve them) \nand the ability to parse the data yourself.\n\nThis is achieved by setting the keyword argument `docs_dir=True`. This will store the data in a directory called `~/hockey_scraper_data`. 
\nYou can provide your own directory where you want everything to be stored (it must exist beforehand). By default `docs_dir=False`.\n\nFor example, let's say we are scraping the JSON PBP data for game `2019020001 <http://statsapi.web.nhl.com/api/v1/game/2019020001/feed/live>`_. \nIf `docs_dir` isn't `False` it will first check if the data is already in the directory. If so, it will load in the data from that file and not make a GET \nrequest to the NHL API. However, if it doesn't exist, it will make a GET request and then save the output to the directory. \nThis will ensure that next time you are requesting that data it can load it from a file.\n\nHere are some examples.\n\nThe default saving location is `~/hockey_scraper_data`.\n\n\n::\n\n    # Create or try to refer to a directory in the home directory\n    # Will create a directory called 'hockey_scraper_data' in the home directory (if it doesn't exist)\n    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=True)\n\n\nUser defined directory\n\n::\n\n    USER_PATH = \"/....\"\n    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)\n\n\nYou can override the existing files by specifying `rescrape=True`. It will retrieve all the files from source and save the newer versions to `docs_dir`.\n\n::\n\n    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)\n\n\n\nLive Scraping\n~~~~~~~~~~~~~\n\nHere is a simple example of a way to set up live scraping. I strongly suggest checking out\n`this section <https://hockey-scraper.readthedocs.io/en/latest/live_scrape.html>`_ of the docs if you plan on using this.\n::\n\n   import hockey_scraper as hs\n\n\n   def to_csv(game):\n       \"\"\"\n       Store each game DataFrame in a file\n\n       :param game: LiveGame object\n\n       :return: None\n       \"\"\"\n\n       # If the game:\n       # 1. Started - We recorded at least one event\n       # 2. Not in Intermission\n       # 3. 
Not Over\n       if game.is_ongoing():\n           # Print the description of the last event\n           print(game.game_id, \"->\", game.pbp_df.iloc[-1]['Description'])\n\n           # Store in CSV files\n           game.pbp_df.to_csv(f\"../hockey_scraper_data/{game.game_id}_pbp.csv\", sep=',')\n           game.shifts_df.to_csv(f\"../hockey_scraper_data/{game.game_id}_shifts.csv\", sep=',')\n\n   if __name__ == \"__main__\":\n       # Before we start, set the directory to store the files\n       # You don't have to do this but I recommend it\n       hs.live_scrape.set_docs_dir(\"../hockey_scraper_data\")\n\n       # Scrape the info for all the games on 2018-11-15\n       games = hs.ScrapeLiveGames(\"2018-11-15\", if_scrape_shifts=True, pause=20)\n\n       # While all the games aren't finished\n       while not games.finished():\n           # Update for all the games currently being played\n           games.update_live_games(sleep_next=True)\n\n           # Go through every LiveGame object and apply some function\n           # You can of course do whatever you want here.\n           for game in games.live_games:\n               to_csv(game)\n\n\n\nContact\n-------\n\nPlease contact me for any issues or suggestions. For any bugs or anything related to the code please open an issue.\nOtherwise you can email me at Harryshomer@gmail.com.\n\n\nCopyright\n---------\n::\n\n    Copyright (C) 2019-2022 Harry Shomer\n    This file is part of hockey_scraper\n\n    hockey_scraper is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  
See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see <https://www.gnu.org/licenses/>.\n"
  },
  {
    "path": "docs/Makefile",
    "content": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHINXBUILD   = sphinx-build\nSPHINXPROJ    = hockey_scraper\nSOURCEDIR     = source\nBUILDDIR      = build\n\n# Put it first so that \"make\" without argument is like \"make help\".\nhelp:\n\t@$(SPHINXBUILD) -M help \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)\n\n.PHONY: help Makefile\n\n# Catch-all target: route all unknown targets to Sphinx using the new\n# \"make mode\" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).\n%: Makefile\n\t@$(SPHINXBUILD) -M $@ \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)"
  },
  {
    "path": "docs/make.bat",
    "content": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\nset SOURCEDIR=source\r\nset BUILDDIR=build\r\nset SPHINXPROJ=hockey_scraper\r\n\r\nif \"%1\" == \"\" goto help\r\n\r\n%SPHINXBUILD% >NUL 2>NUL\r\nif errorlevel 9009 (\r\n\techo.\r\n\techo.The 'sphinx-build' command was not found. Make sure you have Sphinx\r\n\techo.installed, then set the SPHINXBUILD environment variable to point\r\n\techo.to the full path of the 'sphinx-build' executable. Alternatively you\r\n\techo.may add the Sphinx directory to PATH.\r\n\techo.\r\n\techo.If you don't have Sphinx installed, grab it from\r\n\techo.http://sphinx-doc.org/\r\n\texit /b 1\r\n)\r\n\r\n%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%\r\ngoto end\r\n\r\n:help\r\n%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%\r\n\r\n:end\r\npopd\r\n"
  },
  {
    "path": "docs/source/cli.rst",
    "content": "Command Line Interface\n======================\n\nThere also exists a cli tool called `hockey-scraper` which can be used to pull data. Users may find this more convenient than using python directly for simple queries.\n\nThe usage for the tool can be found below: \n\n.. code-block:: console\n\n    usage: hockey-scraper [-h] [-t REPORTTYPE] [--shifts] [-d DATERANGE [DATERANGE ...]] [-s SEASONS [SEASONS ...]]\n                      [-g GAMES [GAMES ...]] [-f FILEDIR] [-r] [-p]\n\n    CLI tool for the hockey_scraper project\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -t REPORTTYPE, --reportType REPORTTYPE\n                            Type of report to scrape. Either game or schedule.\n      --shifts              Whether to include shifts.\n      -d DATERANGE [DATERANGE ...], --dateRange DATERANGE [DATERANGE ...]\n                            Date range to scrape between.\n      -s SEASONS [SEASONS ...], --seasons SEASONS [SEASONS ...]\n                            Seasons to scrape.\n      -g GAMES [GAMES ...], --games GAMES [GAMES ...]\n                            Game IDs to scrape.\n      -f FILEDIR, --fileDir FILEDIR\n                            Whether to store scraped files. If the flag is specified and no argument is passed, a directory is created\n                            in the root. If an argument is passed with the flag the files are stored there (assuming the directory\n                            exists).\n      -r, --rescrape        Whether to re-scrape pages already scraped and stored in --fileDir.\n      -p, --preseason       Whether to scrape preseason data.\n\n\nCLI\n~~~\n.. automodule:: hockey_scraper.cli\n   :members:"
  },
  {
    "path": "docs/source/conf.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# hockey_scraper documentation build configuration file, created by\n# sphinx-quickstart on Sun Dec  3 03:00:09 2017.\n#\n# This file is execfile()d with the current directory set to its\n# containing dir.\n#\n# Note that not all possible configuration values are present in this\n# autogenerated file.\n#\n# All configuration values have a default; values that are commented out\n# serve to show the default.\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\n#\nimport os\nimport sys\nsys.path.insert(0, os.path.abspath('../..'))\n\n# -- General configuration ------------------------------------------------\n\n# If your documentation needs a minimal Sphinx version, state it here.\n#\n# needs_sphinx = '1.0'\n\n# Add any Sphinx extension module names here, as strings. They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = ['sphinx.ext.autodoc']\nautodoc_mock_imports = ['BeautifulSoup4', 'requests', 'lxml', 'html5lib', 'pandas', 'pytest', 'pytz', 'tqdm']\n\n# The suffix(es) of source filenames.\n# You can specify multiple suffix as a list of string:\n#\n# source_suffix = ['.rst', '.md']\nsource_suffix = '.rst'\n\n# The master toctree document.\nmaster_doc = 'index'\n\n# General information about the project.\nproject = 'hockey_scraper'\ncopyright = '2023, Harry Shomer'\nauthor = 'Harry Shomer'\n\n# The version info for the project you're documenting, acts as replacement for\n# |version| and |release|, also used in various other places throughout the\n# built documents.\n#\n# The short X.Y version.\nversion = '1.40'\n# The full version, including alpha/beta/rc tags.\nrelease = '1.40'\n\n# The language for content autogenerated by Sphinx. 
Refer to documentation\n# for a list of supported languages.\n#\n# This is also used if you do content translation via gettext catalogs.\n# Usually you set \"language\" from the command line for these cases.\nlanguage = None\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\n# This patterns also effect to html_static_path and html_extra_path\nexclude_patterns = []\n\n# The name of the Pygments (syntax highlighting) style to use.\npygments_style = 'sphinx'\n\n# If true, `todo` and `todoList` produce output, else they produce nothing.\ntodo_include_todos = False\n\n\n# -- Options for HTML output ----------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  See the documentation for\n# a list of builtin themes.\n#\nhtml_theme = 'nature'\n\n# Theme options are theme-specific and customize the look and feel of a theme\n# further.  For a list of options available for each theme, see the\n# documentation.\n#\n# html_theme_options = {}\n\n\n# -- Options for HTMLHelp output ------------------------------------------\n\n# Output file base name for HTML help builder.\nhtmlhelp_basename = 'hockey_scraperdoc'\n\n\n# -- Options for LaTeX output ---------------------------------------------\n\nlatex_elements = {\n    # The paper size ('letterpaper' or 'a4paper').\n    #\n    # 'papersize': 'letterpaper',\n\n    # The font size ('10pt', '11pt' or '12pt').\n    #\n    # 'pointsize': '10pt',\n\n    # Additional stuff for the LaTeX preamble.\n    #\n    # 'preamble': '',\n\n    # Latex figure (float) alignment\n    #\n    # 'figure_align': 'htbp',\n}\n\n# Grouping the document tree into LaTeX files. 
List of tuples\n# (source start file, target name, title,\n#  author, documentclass [howto, manual, or own class]).\nlatex_documents = [\n    (master_doc, 'hockey_scraper.tex', 'hockey\\\\_scraper Documentation',\n     'Harry Shomer', 'manual'),\n]\n\n\n# -- Options for manual page output ---------------------------------------\n\n# One entry per manual page. List of tuples\n# (source start file, name, description, authors, manual section).\nman_pages = [\n    (master_doc, 'hockey_scraper', 'hockey_scraper Documentation',\n     [author], 1)\n]\n\n\n# -- Options for Texinfo output -------------------------------------------\n\n# Grouping the document tree into Texinfo files. List of tuples\n# (source start file, target name, title, author,\n#  dir menu entry, description, category)\ntexinfo_documents = [\n    (master_doc, 'hockey_scraper', 'hockey_scraper Documentation',\n     author, 'hockey_scraper', 'One line description of project.',\n     'Miscellaneous'),\n]\n\n\n\n"
  },
  {
    "path": "docs/source/index.rst",
    "content": "Hockey-Scraper\n==============\n\nContents\n--------\n.. toctree::\n   :maxdepth: 1\n\n   nhl_scrape_functions\n   live_scrape\n   license_link\n\n\n.. include:: ../../README.rst\n   :start-after: inclusion-marker-for-sphinx\n\nIndices and tables\n------------------\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :ref:`search`\n"
  },
  {
    "path": "docs/source/license_link.rst",
    "content": "License\n=======\n.. include:: ../../LICENSE.txt"
  },
  {
    "path": "docs/source/live_scrape.rst",
    "content": "Live Scraping\n=============\n\nStandard Usage\n--------------\n\nTo get all the info for every game on a specific day we create a ScrapeLiveGames object.\n::\n\n    import hockey_scraper as hs\n\n    todays_games = hs.ScrapeLiveGames(\"2018-11-15\", if_scrape_shifts=True, pause=15)\n\n\nOnce created this object will contain an attribute called 'live_games' that holds a list of LiveGame objects for that\nday. LiveGame objects hold all the pertinent game information for each game. This includes the most recent\npbp and shift data for that game. Here are all the attributes for the LiveGame class:\n::\n\n   class LiveGame:\n    \"\"\"\n    This is a class holds all the information for a given game\n\n    :param int game_id: The NHL game id (ex: 2018020001)\n    :param datetime start_time: The UTC time of when the game begins\n    :param str home_team: Tricode for the home team (ex: NYR)\n    :param str away_team: Tricode for the home team (ex: MTL)\n    :param int espn_id: The ESPN game id for their feed\n    :param str date: Date of the game (ex: 2018-10-30)\n    :param bool if_scrape_shifts: Whether or not you want to scrape shifts\n    :param str api_game_status: Current Status of the game - [\"Final\", \"Live\", \"Intermission]\n    :param str html_game_status: Current Status of the game - [\"Final\", \"Live\", \"Intermission\"]\n    :param int intermission_time_remaining: Time remaining in the intermission. 0 if not in intermission\n    :param dict players: Player info for both teams\n    :param dict head_coaches: Head coaches for both teams\n    :param DataFrame _pbp_df: Holds most recent pbp data\n    :param DataFrame _shifts_df: Holds most recent shift data\n    :param DataFrame _prev_pbp_df: Holds the previous pbp data (for just in case)\n    :param DataFrame _prev_shifts_df: Holds the previous shift data (for just in case)\n    \"\"\"\n\nHere's a simple example of scraping the games continuously for a single date. 
This will run until every game is finished:\n\n::\n\n   import hockey_scraper as hs\n\n\n   def to_csv(game):\n       \"\"\"\n       Store each game DataFrame in a file\n\n       :param game: LiveGame object\n\n       :return: None\n       \"\"\"\n\n       # If the game:\n       # 1. Started - We recorded at least one event\n       # 2. Not in Intermission\n       # 3. Not Over\n       if game.is_ongoing():\n           # Print the description of the last event\n           print(game.game_id, \"->\", game.pbp_df.iloc[-1]['Description'])\n\n           # Store in CSV files\n           game.pbp_df.to_csv(f\"../hockey_scraper_data/{game.game_id}_pbp.csv\", sep=',')\n           game.shifts_df.to_csv(f\"../hockey_scraper_data/{game.game_id}_shifts.csv\", sep=',')\n\n   if __name__ == \"__main__\":\n       # Before we start, set the directory to store the files\n       hs.live_scrape.set_docs_dir(\"../hockey_scraper_data\")\n\n       # Scrape the info for all the games on 2018-11-15\n       games = hs.ScrapeLiveGames(\"2018-11-15\", if_scrape_shifts=True, pause=20)\n\n       # While all the games aren't finished\n       while not games.finished():\n           # Update for all the games currently being played\n           games.update_live_games(sleep_next=True)\n\n           # Go through every LiveGame object and apply some function\n           # You can of course do whatever you want here.\n           for game in games.live_games:\n               to_csv(game)\n\n\nIn the above example, we set a directory to store the most recent version of every scraped file. We then grab the\ninitial game info for each game for that day. We decide we want to include shifts and to pause 20 seconds after updating\nall the games. We then enter a loop that will be terminated once every game is finished. Once in the loop we first
Once in the loop we first\nscrape the new info for every game and then pause for the specified time (default is 15).\n\nOnce we process the new data we then, presumably, want to do something with it. Here, I decided to merely print the last\nevent in the game and store the newer data in files. We do this by iterating through each LiveGame object in the 'live_games'\nattribute and calling the function 'to_csv'. In 'to_csv', before doing anything we check if the game is 'ongoing'.\nThis checks whether the game is either over or in intermission. If it is there isn't a whole lot to update. If it's\nneither we print the last event and then store the data for both the pbp & shifts.\n\nAnother option we have is for the program to sleep until the first game starts. Unless you want to start this yourself\neveryday, you'll probably be scheduling it to start at some time every day. This means from when you start the program\nto when the first game starts may be a significant amount of time (fwiw, it will just loop and not scrape anything). But\nyou can set it to sleep until the first game is scheduled to start. This can be done by setting the keyword 'sleep_next'\nto True. This check to see if the only games left are scheduled games yet to start. If so it sleeps until the next\nearliest game starts.\n::\n\n   # Causes the program to sleep until the first game starts\n   games.update_live_games(sleep_next=True)\n\n\nYou can also specify which games you want to scrape for that day (maybe you only care about one game), by setting the\nkeyword 'game_ids' equal to a list of NHL Game ID's of the games you want when instantiating a ScrapeLiveGame object. You\ncan of course to choose to filter it however you want as the list of LiveGame objects is a attribute of the object. 
Either\nway I strongly suggest creating a ScrapeLiveGames object and then either extracting the game you want or filtering it\nrather than instantiating a LiveGame object (you will be on the hook for a lot of information)\n::\n\n   # Only want those two games.\n   games = hs.ScrapeLiveGames(\"2018-11-15\", if_scrape_shifts=True, pause=15, game_ids=[2018020280, 2018020281])\n\n\nFurther Usage\n-------------\n\nIf you would like more control over what you are doing then you should be dealing directly with LiveGame objects. As\nmentioned previously, still use ScrapeLiveGames to get the game info but you can then just extract the list of games\nand do as you please.\n\nUsing the previous example here we are scraping each game individually:\n::\n\n   import time\n\n   # Scrape the info for all the games on 2018-11-15\n   games = hs.ScrapeLiveGames(\"2018-11-15\", if_scrape_shifts=True, pause=15)\n\n   while not games.finished():\n       # Go through every LiveGame object\n       for game in games.live_games:\n           # Scrape each game individually\n           game.scrape()\n\n           # Apply some function to every game\n           to_csv(game)\n\n       # Pause after each scraping chunk\n       time.sleep(15)\n\nIf you don't trust when I choose to not scrape (when the game is over or in intermission), you can make the keyword\n'force' equal to True. This will re-scrape it as long as the game already started.\n::\n\n   game.scrape(force=True)\n\nThis will override everything and will attempt to scrape the game no matter what. This means you have to be on top\nof when to stop scraping the game. You are also on the hook for any potential errors.\n\nYou may also want to handle things like the start time of games yourself. As mentioned using ScrapeLiveGames we can\nset 'sleep_next' equal to True to sleep until the next game starts if no game is going on. 
You can also use the attribute\n'start_time' of a LiveGame object, which will give you a datetime object with the scheduled starting time for a given\ngame in UTC time. Lastly, you can also use the function 'time_until_game()' that will return how many seconds until the\ngame starts.\n::\n\n   >>> games = hs.ScrapeLiveGames(\"2018-11-09\", if_scrape_shifts=True, pause=15)\n   >>> games.live_games[0].start_time\n   datetime.datetime(2018, 11, 10, 0, 0)\n   >>> games.live_games[0].time_until_game()\n   64599\n\nYou can use this how you please. For example, you may want to create a separate thread for each game and have it sleep\nuntil the game starts. Or maybe you want to use it another way. Either way it's there.\n\nThere are also a few methods that give information about the current status of the game. The first two return\nwhether the game is in intermission or whether it's over.\n::\n\n    def is_game_over(self, prev=False):\n        \"\"\"\n        Check if the game is over for both the html and json pbp. If prev=True check for the previous event\n        \n        :param prev: Check the game status for the previous event\n        \n        :return: Boolean - True if over\n        \"\"\"\n        if not prev:\n            return self.html_game_status == self.api_game_status == \"Final\"\n        else:\n            return self.prev_html_game_status == self.prev_api_game_status == \"Final\"\n\n    def is_intermission(self, prev=False):\n        \"\"\"\n        Check if in intermission for both the html and json pbp. 
If prev=True check for the previous event\n        \n        :param prev: Check the game status for the previous event\n\n        :return: Boolean - True if yes\n        \"\"\"\n        if not prev:\n            return self.html_game_status == self.api_game_status == \"Intermission\"\n        else:\n            return self.prev_html_game_status == self.prev_api_game_status == \"Intermission\"\n\nTwo things that probably stand out are the option to check the status for the previous event (why do we care what it was earlier?)\nand the fact that two statuses exist.\n\nFirst let's talk about the two statuses. There are currently two pages that always need to be scraped for data for\nthe Play-By-Play. One is an html file and one is the json api. The issue is that the api updates faster than the html.\nSo the api may say the game is over but the html version is still missing a few events. For this reason we need to check\nthat both are aligned.\n\nThe 'prev' keyword for both comes into play when we consider the last method 'is_ongoing'. This checks whether the game\nis currently being played. Which means the game: Started, is not in intermission, and is not over. Here's the method:\n::\n\n    def is_ongoing(self):\n        \"\"\"\n        Check if the game is currently being played.\n\n        The logic here is that we run into an issue with intermission and the end of game. If the game is just changed\n        to Final or Intermission the end user will assume the game isn't ongoing and will not update with the most\n        recent events. They'll be delayed for intermission and won't place it at all for Final games. So we use the\n        previous event as a guide. If it's currently in intermission or Final - we check the previous status. If it's\n        the same the user already has the data. 
Otherwise we 'lie' and say the game is still ongoing.\n\n        :return: Boolean\n        \"\"\"\n        # The game is currently being played\n        if self.time_until_game() == 0 and not self.is_game_over() and not self.is_intermission() and self.pbp_df.shape[0] > 0:\n            return True\n        # Since it's not being played check if game is over and if it wasn't for the previous\n        elif self.is_game_over() and not self.is_game_over(prev=True):\n            return True\n        # Check if it's in intermission and if it wasn't for the previous event\n        elif self.is_intermission() and not self.is_intermission(prev=True):\n            return True\n        else:\n            return False\n\nI recommend looking at the function definition written above. Basically, checking the previous event makes sure we got\nthe most recent event if the game is over or in intermission. So if the last status was intermission and this one is\ntoo, we know we don't need to scrape. But if the last status wasn't, that means we are missing some information (presumably\nsomething happened between the last event and the end of the period).\n\nLive Scrape\n~~~~~~~~~~~\n.. automodule:: hockey_scraper.nhl.live_scrape\n   :members:"
  },
  {
    "path": "docs/source/nhl_scrape_functions.rst",
    "content": "NHL Scraping Functions\n======================\n\nScraping\n--------\n\nThere are three ways to scrape games:\n\n\\1. *Scrape by Season*:\n\nScrape games on a season by season level (Note: A given season is referred to by the first of the two years it spans.\nSo you would refer to the 2016-2017 season as 2016).\n::\n\n   import hockey_scraper\n\n    # Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file (both are equivalent!!!)\n    hockey_scraper.scrape_seasons([2015, 2016], True)\n    hockey_scraper.scrape_seasons([2015, 2016], True, data_format='Csv')\n\n    # Scrapes the 2008 season without shifts and returns a dictionary with the DataFrame\n    scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')\n\n    # Scrapes 2014 season without shifts including preseason games\n    hockey_scraper.scrape_seasons([2014], False, preseason=True)\n\n\\2. *Scrape by Game*:\n\nScrape a list of games provided. All game ID's can be found using `this link\n<https://statsapi.web.nhl.com/api/v1/schedule?startDate=2016-10-03&endDate=2017-06-20>`_\n(you need to play around with the dates in the url).\n::\n\n    import hockey_scraper\n\n    # Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file\n    hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)\n\n    # Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a a dictionary with the DataFrames\n    scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')\n\n\\3. *Scrape by Date Range*:\n\nScrape all games between a specified date range. 
All dates must be written in a \"yyyy-mm-dd\" format.\n::\n\n    import hockey_scraper\n\n    # Scrapes all games in the date range without shifts and stores the data in a Csv file (both are equivalent!!!)\n    hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)\n    hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False, preseason=False)\n\n    # Scrapes all games between 2015-01-01 and 2015-01-15 without shifts and returns a dictionary with the DataFrame\n    scraped_data = hockey_scraper.scrape_date_range('2015-01-01', '2015-01-15', False, data_format='Pandas')\n\n    # Scrapes all games from 2014-09-15 to 2014-11-01 with shifts including preseason games\n    hockey_scraper.scrape_date_range('2014-09-15', '2014-11-01', True, preseason=True)\n\n\n\\4. *Scrape Schedule*:\n\nScrape the schedule for any given date range, for past and future games. All dates must be written in a \"yyyy-mm-dd\" format. The default data_format is 'Pandas'. This returns a DataFrame and not a dictionary like the others. The columns returned are: ['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']\n\n::\n\n    import hockey_scraper\n\n    sched_df = hockey_scraper.scrape_schedule(\"2019-10-01\", \"2020-07-01\")\n\n\n**Persistent Data**\n\nThe option also exists to save the scraped files in another directory. This speeds up re-scraping any games since\nwe already have the docs needed for it. It is also useful if you want to grab any extra information from them,\nas some of them contain a lot more information. In order to do this you can use the 'docs_dir' keyword. One can specify\nthe boolean value True to either create or refer to (an already created) directory in the home directory called\nhockey_scraper data. Or you can specify the directory with the string of the path. 
If this is a valid directory,\nthe scraper first checks, for each page, whether it was already scraped (therefore saving us the time of scraping it).\nIf it hasn't been scraped yet, it will then grab it from the source and save it in the given directory.\n\nSometimes you may have already scraped and saved a file but you want to re-scrape it from the source and save it again\n(this may seem strange but the NHL frequently fixes mistakes so you may want to update what you have). This can be done\nby setting the keyword argument rescrape equal to True.\n\n::\n\n    import hockey_scraper\n\n    # Path to the given directory\n    # Can also be True if you want the scraper to take care of it\n    USER_PATH = \"/....\"\n\n    # Scrapes the 2015 & 2016 seasons with shifts and stores the data in a Csv file\n    # Also includes a path for an existing directory for the scraped files to be placed in or retrieved from.\n    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)\n\n    # One could choose to re-scrape previously saved files by setting the keyword argument rescrape=True\n    hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)\n\n\n**Additional Notes**:\n\n\\1. For all three functions you must specify if you want to also scrape shifts (TOI tables) with a boolean. The Play by\nPlay is automatically scraped.\n\n\\2. When scraping by date range or by season, preseason games aren't scraped unless otherwise specified. Also preseason\ngames are scraped at your own risk. There is no guarantee it will work or that the files are even there!!!\n\n\\3. For all three functions the scraped data is deposited into a Csv file unless it's specified to return the DataFrames.\n\n\\4. 
The Dictionary with the DataFrames (and scraping errors) returned by setting data_format='Pandas' is structured like:\n::\n\n   {\n      # This is always included\n      'pbp': pbp_df,\n\n      # This is only included when the argument 'if_scrape_shifts' is set equal to True\n      'shifts': shifts_df\n   }\n\n\\5. When including a directory, it must be a valid directory. It will not create it for you. You'll get an error message\nbut otherwise it will scrape as if no directory was provided.\n\n\n\nScrape Functions\n~~~~~~~~~~~~~~~~\n.. automodule:: hockey_scraper.nhl.scrape_functions\n   :members:\n\nGame Scraper\n~~~~~~~~~~~~\n.. automodule:: hockey_scraper.nhl.game_scraper\n   :members:\n\nHtml PBP\n~~~~~~~~\n.. automodule:: hockey_scraper.nhl.pbp.html_pbp\n   :members:\n\nJson PBP\n~~~~~~~~\n.. automodule:: hockey_scraper.nhl.pbp.json_pbp\n   :members:\n\nEspn PBP\n~~~~~~~~\n.. automodule:: hockey_scraper.nhl.pbp.espn_pbp\n   :members:\n\nJson Shifts\n~~~~~~~~~~~\n.. automodule:: hockey_scraper.nhl.shifts.json_shifts\n   :members:\n\nHtml Shifts\n~~~~~~~~~~~\n.. automodule:: hockey_scraper.nhl.shifts.html_shifts\n   :members:\n\nSchedule\n~~~~~~~~\n.. automodule:: hockey_scraper.nhl.json_schedule\n   :members:\n\nPlaying Roster\n~~~~~~~~~~~~~~\n.. automodule:: hockey_scraper.nhl.playing_roster\n   :members:\n\nSave Pages\n~~~~~~~~~~\n.. automodule:: hockey_scraper.utils.save_pages\n   :members:\n\nShared Functions\n~~~~~~~~~~~~~~~~\n.. automodule:: hockey_scraper.utils.shared\n   :members:\n"
  },
  {
    "path": "docs/source/nwhl_scrape_functions.rst",
    "content": "NWHL Scraping Functions\n=======================\n\nScraping\n--------\n\nThere are three ways to scrape games:\n\n\\1. *Scrape by Season*:\n\nScrape games on a season by season level (Note: A given season is referred to by the first of the two years it spans.\nSo you would refer to the 2016-2017 season as 2016).\n::\n\n   import hockey_scraper\n\n    # Scrapes the 2015 & 2016 season and stores the data in a Csv file (both are equivalent!!!)\n    hockey_scraper.nwhl.scrape_seasons([2015, 2016])\n    hockey_scraper.nwhl.scrape_seasons([2015, 2016], data_format='Csv')\n\n    # Scrapes the 2008 season and returns a Pandas DataFrame\n    scraped_data = hockey_scraper.nwhl.scrape_seasons([2017], data_format='Pandas')\n\n\n\\2. *Scrape by Game*:\n\nScrape a list of games provided.\n::\n\n    import hockey_scraper\n\n    # Scrapes games and store in a Csv file\n    hockey_scraper.nwhl.scrape_games([14694271, 14814946, 14689491], True)\n\n    # Scrapes games and return DataFrame with data\n    scraped_data = hockey_scraper.nwhl.scrape_games([14689624, 18507470, 20575219, 22207005], data_format='Pandas')\n\n\\3. *Scrape by Date Range*:\n\nScrape all games between a specified date range. All dates must be written in a \"yyyy-mm-dd\" format.\n::\n\n    import hockey_scraper\n\n    # Scrapes all games between 2016-10-10 and 2017-01-01 and returns a Pandas DataFrame containing the pbp\n    hockey_scraper.nwhl.scrape_date_range('2016-10-10', '2017-01-01', data_format='pandas')\n\n\nScrape Functions\n~~~~~~~~~~~~~~~~\n.. automodule:: hockey_scraper.nwhl.scrape_functions\n   :members:\n\nHtml Schedule\n~~~~~~~~~~~~~\n.. automodule:: hockey_scraper.nwhl.html_schedule\n   :members:\n\nJson PBP\n~~~~~~~~\n.. automodule:: hockey_scraper.nwhl.json_pbp\n   :members:"
  },
  {
    "path": "hockey_scraper/__init__.py",
    "content": "from .nhl.live_scrape import ScrapeLiveGames, LiveGame\nfrom .nhl.scrape_functions import scrape_games, scrape_date_range, scrape_seasons, scrape_schedule\nfrom .nhl import live_scrape\nfrom .utils import shared\nfrom . import utils\n\n#from .nwhl import scrape_schedule as nwhl_scrape_schedule"
  },
  {
    "path": "hockey_scraper/cli.py",
    "content": "\"\"\"\nInterface for running cli commands\n\"\"\"\nimport sys\nimport argparse\nfrom .utils.shared import print_error\nfrom .nhl.scrape_functions import scrape_games, scrape_date_range, scrape_seasons, scrape_schedule\n\n\ndef validate_args(user_args):\n    \"\"\"\n    Validate that the passed args are sufficient enough to call the corresponding scraping function.\n    Detailed checks are done later by the packcage after the specific function is called.\n\n    :param user_args: ArgumentParser object\n\n    :return: Boolean indicating if args are good\n    \"\"\"\n    if user_args.reportType.lower() not in ['game', 'schedule']:\n        print_error(\"Invalid parameter passed for -t/--reportType. Must be either `game` or `schedule`\")\n        return False\n\n    # One of 3 not empty\n    if not any([user_args.dateRange, user_args.seasons, user_args.games]):\n        print_error(\"Must supply one of the following args: -d/--dateRange, -g/--games, or -s/--seasons. You passed none.\")\n        return False  \n\n    # Date range - should only pass two\n    # Whether or not they are valid is assessed later after calling one of the functions\n    if user_args.dateRange and len(user_args.dateRange) != 2:\n        print_error(\"Only 2 parameters should be passed for -d/--dateRange. You passed {}.\".format(len(user_args.dateRange)))\n        return False\n\n    ### Everything else should be handled by just calling the functions\n    return True\n\n\ndef run_cmd(user_args):\n    \"\"\"\n    Run the appropriate command. 
Args already validated by this point.\n\n    :param user_args: ArgumentParser object\n\n    :return: None\n    \"\"\"\n    if user_args.reportType.lower() == 'schedule': \n        scrape_schedule(user_args.dateRange[0], user_args.dateRange[1], rescrape=user_args.rescrape, docs_dir=user_args.fileDir, data_format='csv')\n    else:\n        if user_args.dateRange:\n            scrape_date_range(user_args.dateRange[0], user_args.dateRange[1], user_args.shifts, \n                              docs_dir=user_args.fileDir, rescrape=user_args.rescrape, preseason=user_args.preseason)\n        elif user_args.seasons:\n            scrape_seasons(user_args.seasons, user_args.shifts, docs_dir=user_args.fileDir, rescrape=user_args.rescrape, preseason=user_args.preseason)\n        else:\n            scrape_games(user_args.games, user_args.shifts, docs_dir=user_args.fileDir, rescrape=user_args.rescrape)\n\n    \n\ndef main():\n    parser = argparse.ArgumentParser(description='CLI tool for the hockey_scraper project')\n\n    ### Default to scraping games without shifts\n    parser.add_argument('-t', \"--reportType\", help='Type of report to scrape. Either game or schedule.', default='game', type=str, required=False)  \n    parser.add_argument(\"--shifts\", help='Whether to include shifts.', action='store_true', default=False, required=False)\n\n    ### Must pass one of these\n    parser.add_argument('-d', \"--dateRange\", help='Date range to scrape between.', nargs='+', type=str, required=False, default=[])\n    parser.add_argument('-s', \"--seasons\", help='Seasons to scrape.', nargs='+', type=int, required=False, default=[])\n    parser.add_argument('-g', \"--games\", help='Game IDs to scrape.', nargs='+', type=str, required=False, default=[])\n\n    ### Additional optional args\n    parser.add_argument('-f', \"--fileDir\", \n                        help='''Whether to store scraped files. 
If the flag is specified and no argument is passed, a directory is created in the root.\n                                If an argument is passed with the flag the files are stored there (assuming the directory exists).\n                             ''', \n                        # nargs='?' with const=True lets the flag be given without a value, as the help text describes\n                        default=None, nargs='?', const=True, type=str, required=False)\n\n    parser.add_argument(\"-r\", \"--rescrape\", help='Whether to re-scrape pages already scraped and stored in --fileDir.', \n                        action='store_true', default=False, required=False)\n\n    parser.add_argument(\"-p\", \"--preseason\", help='Whether to scrape preseason data.', action='store_true', default=False, required=False)\n\n    args = parser.parse_args()\n\n    if validate_args(args):\n        run_cmd(args)\n\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "hockey_scraper/nhl/__init__.py",
    "content": ""
  },
  {
    "path": "hockey_scraper/nhl/game_scraper.py",
    "content": "\"\"\"\nThis module contains code to scrape data for a single game\n\"\"\"\n\nimport pandas as pd\n\nimport hockey_scraper.nhl.pbp.espn_pbp as espn_pbp\nimport hockey_scraper.nhl.pbp.html_pbp as html_pbp\nimport hockey_scraper.nhl.pbp.json_pbp as json_pbp\nimport hockey_scraper.nhl.playing_roster as playing_roster\nimport hockey_scraper.nhl.shifts.html_shifts as html_shifts\nimport hockey_scraper.nhl.shifts.json_shifts as json_shifts\nimport hockey_scraper.utils.shared as shared\n\nbroken_shifts_games = []\nbroken_pbp_games = []\nplayers_missing_ids = []\nmissing_coords = []\n\n\npbp_columns = [\n    'Game_Id', 'Date', 'Period', 'Event', 'Description', 'Time_Elapsed', 'Seconds_Elapsed', 'Strength',\n    'Ev_Zone', 'Type', 'Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID',\n    'p3_name', 'p3_ID', 'awayPlayer1', 'awayPlayer1_id', 'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3',\n    'awayPlayer3_id', 'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id', 'awayPlayer6',\n    'awayPlayer6_id', 'homePlayer1', 'homePlayer1_id', 'homePlayer2', 'homePlayer2_id', 'homePlayer3',\n    'homePlayer3_id', 'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id', 'homePlayer6',\n    'homePlayer6_id',  'Away_Players', 'Home_Players', 'Away_Score', 'Home_Score', 'Away_Goalie',\n    'Away_Goalie_Id', 'Home_Goalie', 'Home_Goalie_Id', 'xC', 'yC', 'Home_Coach', 'Away_Coach'\n]\n\n\ndef check_goalie(row):\n    \"\"\"\n    Checks for bad goalie names (you can tell by them having no player id)\n    \n    :param row: df row\n    \n    :return: None\n    \"\"\"\n    if row['Away_Goalie'] != '' and row['Away_Goalie_Id'] is None:\n        if [row['Away_Goalie'], row['Game_Id']] not in players_missing_ids:\n            players_missing_ids.extend([[row['Away_Goalie'], row['Game_Id']]])\n\n    if row['Home_Goalie'] != '' and row['Home_Goalie_Id'] is None:\n        if [row['Home_Goalie'], row['Game_Id']] not in 
players_missing_ids:\n            players_missing_ids.extend([[row['Home_Goalie'], row['Game_Id']]])\n\n\ndef get_players_json(game_json):\n    \"\"\"\n    Return dict of players for that game by team\n\n    :param game_json: json pbp for game (contains the roster spots for both teams)\n\n    :return: {team -> players}\n    \"\"\"\n    players = {\"Home\": {}, \"Away\": {}}\n    homeid = game_json['homeTeam']['id']\n    awayid = game_json['awayTeam']['id']\n\n    for player in game_json['rosterSpots']:\n        # Names are nested under the 'default' key in the json\n        if player['teamId'] == homeid:\n            players[\"Home\"][str(player[\"firstName\"]['default'] + \" \" + player[\"lastName\"]['default']).upper()] = {\n                \"id\": player['playerId'], \n                \"last_name\": player[\"lastName\"]['default'].upper()\n            }\n\n        if player['teamId'] == awayid:\n            players[\"Away\"][str(player[\"firstName\"]['default'] + \" \" + player[\"lastName\"]['default']).upper()] = {\n                \"id\": player['playerId'],\n                \"last_name\": player[\"lastName\"]['default'].upper()\n            }\n    \n    return players\n\n\n# TODO: Assumes no two players on the same team can have the same name\n#       Could potentially differentiate by number or position\ndef combine_players_lists(json_players, roster_players, game_id):\n    \"\"\"\n    Combine the json list of players (which contains id's) with the list 
in the roster html\n\n    :param json_players: dict of all players with id's\n    :param roster_players: dict with home and away keys for players\n    :param game_id: id of game\n\n    :return: dict containing home and away keys -> which contains list of info on each player\n    \"\"\"\n    players = {'Home': dict(), 'Away': dict()}\n\n    for venue in players:\n        for player in roster_players[venue]:\n            try:\n                name = shared.fix_name(player[2])\n                player_id = json_players[venue][name]['id']\n                players[venue][name] = {'id': player_id, 'number': player[0], 'last_name': json_players[venue][name]['last_name']}\n            except KeyError as e:\n                # Only record him if he wasn't listed as a scratch and isn't a goalie (check_goalie deals with goalies)\n                # As a whole the scratch list shouldn't be trusted but if a player is missing an id # and is on the\n                # scratch list I'm willing to assume that he didn't play\n                if not player[3] and player[1] != 'G':\n                    player.extend([game_id])\n                    players_missing_ids.extend([[player[2], player[4]]])\n                    players[venue][name] = {'id': None, 'number': player[0], 'last_name': ''}\n\n    return players\n\n\n\ndef get_teams_and_players(game_json, roster, game_id):\n    \"\"\"\n    Get list of players and teams for game\n\n    :param game_json: json pbp for game\n    :param roster: players from roster html\n    :param game_id: id for game\n\n    :return: dict for both - players and teams\n    \"\"\"\n    try:\n        teams = json_pbp.get_teams(game_json)\n        player_ids = get_players_json(game_json)\n        players = combine_players_lists(player_ids, roster['players'], game_id)\n    except Exception as e:\n        shared.print_error('Problem with getting the teams or players')\n        return None, None\n\n    return players, teams\n\n\ndef combine_html_json_pbp(json_df, html_df, game_id, 
date):\n    \"\"\"\n    Join both data sources. First try merging on event id (which is the DataFrame index) if both DataFrames have the\n    same number of rows. If they don't have the same number of rows, merge on: Period, Event, Seconds_Elapsed, p1_ID. \n    \n    :param json_df: json pbp DataFrame\n    :param html_df: html pbp DataFrame\n    :param game_id: id of game\n    :param date: date of game\n    \n    :return: finished pbp or None if the merge fails\n    \"\"\"\n    # Don't need those columns to merge in\n    json_df = json_df.drop(['p1_name', 'p2_name', 'p2_ID', 'p3_name', 'p3_ID'], axis=1)\n\n    try:\n        # If they aren't equal it's usually due to the HTML containing a challenge event\n        if html_df.shape[0] == json_df.shape[0]:\n            json_df = json_df[['period', 'event', 'seconds_elapsed', 'xC', 'yC']]\n            game_df = pd.merge(html_df, json_df, left_index=True, right_index=True, how='left')\n        else:\n            # We always merge if they aren't equal but we check if it's due to a challenge so we can print out a better\n            # warning message for the user.\n            # NOTE: May be slightly incorrect. It's possible for there to be a challenge and another issue for one game.\n            if 'CHL' in list(html_df.Event):\n                shared.print_warning(\"The number of rows in the Html and Json pbp are different because the\"\n                                     \" Json pbp, for some reason, does not include challenges. Will instead merge on \"\n                                     \"Period, Event, Time, and p1_id.\")\n            else:\n                shared.print_warning(\"The number of rows in the Html and json pbp are different for an \"\n                                     \"unknown reason. 
Will instead merge on Period, Event, Time, and p1_id.\")\n\n            # Actual Merging\n            game_df = pd.merge(html_df, json_df, left_on=['Period', 'Event', 'Seconds_Elapsed', 'p1_ID'],\n                               right_on=['period', 'event', 'seconds_elapsed', 'p1_ID'], how='left')\n\n        # This is always done - because merge doesn't work well with shootouts\n        game_df = game_df.drop_duplicates(subset=['Period', 'Event', 'Description', 'Seconds_Elapsed'])\n    except Exception as e:\n        shared.print_error('Problem combining Html Json pbp for game {}'.format(game_id))\n        return\n\n    game_df['Game_Id'] = game_id[-5:]\n    game_df['Date'] = date\n\n    return pd.DataFrame(game_df, columns=pbp_columns)\n\n\ndef combine_espn_html_pbp(html_df, espn_df, game_id, date, away_team, home_team):\n    \"\"\"\n    Merge the coordinate from the espn feed into the html DataFrame\n    \n    Can't join here on event_id because the plays are often out of order and pre-2009 are often missing events. 
\n    \n    :param html_df: DataFrame with info from html pbp\n    :param espn_df: DataFrame with info from espn pbp\n    :param game_id: json game id\n    :param date: ex: 2016-10-24\n    :param away_team: away team\n    :param home_team: home team\n    \n    :return: merged DataFrame\n    \"\"\"\n    if espn_df is not None and not espn_df.empty:\n        try:\n            game_df = pd.merge(html_df, espn_df, left_on=['Period', 'Seconds_Elapsed', 'Event'],\n                               right_on=['period', 'time_elapsed', 'event'], how='left')\n\n            # Shit happens\n            game_df = game_df.drop_duplicates(subset=['Period', 'Event', 'Description', 'Seconds_Elapsed'])\n\n            df = game_df.drop(['period', 'time_elapsed', 'event'], axis=1)\n        except Exception as e:\n            shared.print_error('Error combining espn and html pbp for game {}'.format(game_id))\n            return None\n    else:\n        df = html_df\n\n    df['Game_Id'] = game_id[-5:]\n    df['Date'] = date\n    df['Away_Team'] = away_team\n    df['Home_Team'] = home_team\n\n    return pd.DataFrame(df, columns=pbp_columns)\n\n\ndef scrape_pbp_live(game_id, date, roster, game_json, players, teams, espn_id=None):\n    \"\"\"\n    Wrapper for scraping the live pbp\n\n    :param game_id: json game id\n    :param date: date of game\n    :param roster: list of players in pre game roster\n    :param game_json: json pbp for game\n    :param players: dict of players\n    :param teams: dict of teams\n    :param espn_id: Game Id for the espn game. Only provided when live scraping\n\n    :return: Tuple - pbp & status\n    \"\"\"\n    html_df, status = html_pbp.scrape_game_live(game_id, players, teams)\n    game_df = scrape_pbp(game_id, date, roster, game_json, players, teams, espn_id=espn_id, html_df=html_df)\n    return game_df, status\n\n\ndef scrape_pbp(game_id, date, roster, game_json, players, teams, espn_id=None, html_df=None):\n    \"\"\"\n    Scrape the Pbp info. 
The HTML is always scraped.\n\n    The Json is parsed when season >= 2010 and there are plays. Otherwise ESPN is scraped to supplement\n    the HTML with coordinates.\n\n    The espn_id and the html data can be fed as keyword arguments to speed up execution. This is used by\n    the live game scraping class.\n\n    :param game_id: json game id\n    :param date: date of game\n    :param roster: list of players in pre game roster\n    :param game_json: json pbp for game\n    :param players: dict of players\n    :param teams: dict of teams\n    :param espn_id: Game Id for the espn game. Only provided when live scraping\n    :param html_df: Can provide DataFrame for html. Only done for live-scraping\n\n    :return: DataFrame with info or None if it fails\n    \"\"\"\n    # Coordinates are only available in json from 2010 onwards\n    if int(str(game_id)[:4]) >= 2010 and len(game_json['plays']) > 0:\n        json_df = json_pbp.parse_json(game_json, game_id)\n    else:\n        json_df = None\n\n    # When live scraping the html DataFrame is passed in, so only scrape it here if it wasn't provided\n    if not isinstance(html_df, pd.DataFrame):\n        html_df = html_pbp.scrape_game(game_id, players, teams)\n\n    if html_df is None or html_df.empty:\n        return None\n\n    # Check if the json is missing the plays...if it is scrape ESPN for the coordinates\n    if json_df is None or json_df.empty:\n        espn_df = espn_pbp.scrape_game(date, teams['Home'], teams['Away'], game_id=espn_id)\n        game_df = combine_espn_html_pbp(html_df, espn_df, str(game_id), date, teams['Away'], teams['Home'])\n\n        # Sometimes espn is corrupted so can't get coordinates\n        if espn_df is None or espn_df.empty:\n            missing_coords.extend([[game_id, date]])\n    else:\n        game_df = combine_html_json_pbp(json_df, html_df, str(game_id), date)\n\n    if game_df is not None:\n        game_df['Home_Coach'] = roster['head_coaches']['Home']\n        game_df['Away_Coach'] = 
roster['head_coaches']['Away']\n\n    return game_df\n\n\ndef scrape_shifts(game_id, players, date):\n    \"\"\"\n    Scrape the Shift charts (or TOI tables)\n    \n    :param game_id: json game id\n    :param players: dict of players with numbers and id's\n    :param date: date of game\n    \n    :return: DataFrame with info or None if it fails\n    \"\"\"\n    shifts_df = None\n\n    # Control for fact that shift json is only available from 2010 onwards\n    if shared.get_season(date) >= 2010:\n        shifts_df = json_shifts.scrape_game(game_id)\n\n    if shifts_df is None or shifts_df.empty:\n        shifts_df = html_shifts.scrape_game(game_id, players)\n\n        if shifts_df is None or shifts_df.empty:\n            # format() works whether game_id is passed as a str or an int\n            shared.print_error(\"Unable to scrape shifts for game {}\".format(game_id))\n            broken_shifts_games.extend([[game_id, date]])\n            return None\n\n    shifts_df['Date'] = date\n\n    return shifts_df\n\n\ndef scrape_game(game_id, date, if_scrape_shifts):\n    \"\"\"\n    This scrapes the info for the game.\n    The pbp is automatically scraped, and whether or not to scrape the shifts is left up to the user.\n    \n    :param game_id: game to scrape\n    :param date: ex: 2016-10-24\n    :param if_scrape_shifts: Boolean indicating whether to also scrape shifts \n    \n    :return: DataFrame of pbp info\n             (optional) DataFrame with shift info otherwise just None\n    \"\"\"\n    # format() works whether game_id is passed as a str or an int\n    print('Scraping Game {} {}'.format(game_id, date))\n    shifts_df = None\n\n    roster = playing_roster.scrape_roster(game_id)\n    game_json = json_pbp.get_pbp(game_id)           # Contains both player info (id's) and plays\n    players, teams = get_teams_and_players(game_json, roster, game_id)\n\n    # Game fails without any of these\n    if not roster or not game_json or not teams or not players:\n        broken_pbp_games.extend([[game_id, date]])\n        if if_scrape_shifts:\n            broken_shifts_games.extend([[game_id, date]])\n\n        return 
None, None\n\n    pbp_df = scrape_pbp(game_id, date, roster, game_json, players, teams)\n\n    # Only scrape shifts if asked and pbp is good\n    if if_scrape_shifts and pbp_df is not None:\n        shifts_df = scrape_shifts(game_id, players, date)\n\n    if pbp_df is None:\n        broken_pbp_games.extend([[game_id, date]])\n\n    return pbp_df, shifts_df\n"
  },
  {
    "path": "hockey_scraper/nhl/json_schedule.py",
    "content": "\"\"\"\nThis module contains functions to scrape the json schedule for any games or date range\n\"\"\"\nimport json\nfrom pytz import timezone\nfrom datetime import datetime, timedelta\nimport hockey_scraper.utils.shared as shared\n\nfrom tqdm import tqdm\n\n\n# TODO: Currently rescraping page each time since the status of some games may have changed\n# (e.g. Scraped on 2020-01-20 and game on 2020-01-21 was not Final...when use old page again will still think not Final)\n# Need to find a more elegant way of doing this (Metadata???)\ndef get_schedule(date):\n    \"\"\"\n    Scrapes games in date range\n    Ex: https://api-web.nhle.com/v1/schedule/2011-06-20\n    \n    :param date: scrape from this date\n    \n    :return: raw json of schedule of date range\n    \"\"\"\n    page_info = {\n        \"url\": 'https://api-web.nhle.com/v1/schedule/{a}'.format(a=date),\n        \"name\": \"Schedule_\" + date,\n        \"type\": \"json_schedule\",\n        \"season\": shared.get_season(date),\n    }\n\n    return json.loads(shared.get_file(page_info, force=True))\n\n\ndef chunk_schedule_calls(from_date, to_date):\n    \"\"\"\n    Due to new API, we have to inividually GET games by week\n\n    We filter out games not in range for the final week\n    \n    :param date_from: scrape from this date\n    :param date_to: scrape until this date\n\n    :return: raw json of schedule of date range\n    \"\"\"\n    sched = []\n    days_per_call = 7\n\n    from_date = datetime.strptime(from_date, \"%Y-%m-%d\") \n    to_date = datetime.strptime(to_date, \"%Y-%m-%d\")\n    num_days = (to_date - from_date).days + 1  # +1 since difference is looking for total number of days\n\n    for offset in tqdm(range(0, num_days, days_per_call), \"Scraping Schedule\"):\n        date_chunk = datetime.strftime(from_date + timedelta(days=offset), \"%Y-%m-%d\")\n        chunk_sched = get_schedule(date_chunk)['gameWeek']\n        sched.append(chunk_sched)\n\n    \n    return sched\n\n\ndef 
scrape_schedule(date_from, date_to, preseason=False, not_over=False):\n    \"\"\"\n    Calls chunk_schedule_calls and parses the raw schedule Json\n\n    We filter out games not in range, due to how the new schedule API works\n    \n    :param date_from: scrape from this date\n    :param date_to: scrape until this date\n    :param preseason: Boolean indicating whether to include preseason games (default is False)\n    :param not_over: Boolean indicating whether we scrape games not finished. \n                     Means we relax the requirement of checking if the game is over. \n    \n    :return: list of dicts with the info for each game\n    \"\"\"\n    print(\"Scraping the schedule between {} and {}...please give it a moment\".format(date_from, date_to))\n\n    est = timezone(\"America/New_York\")\n\n    # We need to include the timezone and cover the entire day\n    # NOTE: pytz timezones must be attached with localize(); passing tzinfo= directly uses the wrong (LMT) offset\n    fds = list(map(int, date_from.split(\"-\")))\n    fdate_est = est.localize(datetime(fds[0], fds[1], fds[2], 0, 0))\n    tds = list(map(int, date_to.split(\"-\")))\n    tdate_est = est.localize(datetime(tds[0], tds[1], tds[2], 23, 59))\n\n    schedule = []\n    schedule_json = chunk_schedule_calls(date_from, date_to)\n\n    for chunk in schedule_json:\n        for day in chunk:\n            for game in day['games']:\n                game_id = int(str(game['id'])[5:])\n                \n                # TODO: Confirm if OFF is correct\n                # Check game is over or scraping live\n                status_cond = game['gameState'] == 'OFF' or not_over\n                # No preseason or \"special\" games\n                valid_game_cond = (game_id >= 20000 or preseason) and game_id < 40000\n                # Within specified date ranges\n                game_date = datetime.strptime(game['startTimeUTC'], \"%Y-%m-%dT%H:%M:%S%z\")\n                date_cond = fdate_est <= game_date.astimezone(est) <= tdate_est\n\n                if status_cond and valid_game_cond and date_cond:\n                    schedule.append({\n              
          \"game_id\": game['id'], \n                        \"date\": day['date'], \n                        \"start_time\": datetime.strptime(game['startTimeUTC'][:-1], \"%Y-%m-%dT%H:%M:%S\"),\n                        \"venue\": game['venue'].get('default'),\n                        \"home_team\": shared.convert_tricode(game['homeTeam']['abbrev']),\n                        \"away_team\": shared.convert_tricode(game['awayTeam']['abbrev']),\n                        \"home_score\": game['homeTeam'].get(\"score\"),\n                        \"away_score\": game['awayTeam'].get(\"score\"),\n                        \"status\": game[\"gameState\"]\n                    })\n\n    return schedule\n\n\ndef get_dates(games):\n    \"\"\"\n    Given a list game_ids it returns the dates for each game.\n\n    We sort all the games and retrieve the schedule from the beginning of the season from the earliest game\n    until the end of most recent season.\n    \n    :param games: list with game_id's ex: 2016020001\n    \n    :return: list with game_id and corresponding date for all games\n    \"\"\"\n    today = datetime.today()\n\n    # Determine oldest and newest game\n    games = list(map(str, games))\n    games.sort()\n\n    date_from = shared.season_start_bound(games[0][:4])\n    year_to = int(games[-1][:4])\n\n    # If the last game is part of the ongoing season then only request the schedule until Today\n    # We get strange errors if we don't do it like this\n    if year_to == shared.get_season(datetime.strftime(today, \"%Y-%m-%d\")):\n        date_to = '-'.join([str(today.year), str(today.month), str(today.day)])\n    else:\n        date_to = datetime.strftime(shared.season_end_bound(year_to+1), \"%Y-%m-%d\")  # Newest game in sample\n\n    # TODO: Assume true is live here -> Workaround\n    schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)\n\n    # Only return games we want in range\n    games_list = []\n    for game in schedule:\n        if 
str(game['game_id']) in games:\n            games_list.extend([game])\n\n    return games_list"
  },
  {
    "path": "hockey_scraper/nhl/live_scrape.py",
    "content": "\"\"\"\nModule to scrape live game info\n\"\"\"\nimport datetime\nimport time\nimport warnings\nimport pandas as pd\nimport hockey_scraper.nhl.game_scraper as game_scraper\nimport hockey_scraper.nhl.json_schedule as json_schedule\nimport hockey_scraper.nhl.pbp.espn_pbp as espn_pbp\nimport hockey_scraper.nhl.pbp.json_pbp as json_pbp\nimport hockey_scraper.nhl.playing_roster as playing_roster\nimport hockey_scraper.utils.shared as shared\n\n\ndef set_docs_dir(user_dir):\n    \"\"\"\n    Set the docs directory\n    \n    :param user_dir: User specified directory for storing saves scraped files\n    \n    :return: None\n    \"\"\"\n    # We always want to rescrape since the files are being updated constantly\n    shared.if_rescrape(True)\n    shared.add_dir(user_dir)\n\n\ndef check_date_format(date):\n    \"\"\"\n    Verify the date format. If wrong raises a ValueError\n\n    :param date: User supplied date\n\n    :return: None\n    \"\"\"\n    try:\n        time.strptime(date, \"%Y-%m-%d\")\n    except ValueError:\n        raise ValueError(\"Error: Incorrect format given for dates. 
They must be given like 'yyyy-mm-dd' \"\n                         \"(ex: '2016-10-01').\")\n\n\n# TODO: Should I denote more member variables as private?\nclass LiveGame:\n    \"\"\" \n    This class holds all the information for a given game\n      \n    :param int game_id: The NHL game id (ex: 2018020001)\n    :param datetime start_time: The UTC time of when the game begins\n    :param str home_team: Tricode for the home team (ex: NYR)\n    :param str away_team: Tricode for the away team (ex: MTL)\n    :param int espn_id: The ESPN game id for their feed\n    :param str date: Date of the game (ex: 2018-10-30)\n    :param bool if_scrape_shifts: Whether or not you want to scrape shifts\n    :param str api_game_status: Current Status of the game - [\"Final\", \"Live\", \"Intermission\"]\n    :param str html_game_status: Current Status of the game - [\"Final\", \"Live\", \"Intermission\"]\n    :param int intermission_time_remaining: Time remaining in the intermission. 0 if not in intermission\n    :param dict players: Player info for both teams\n    :param dict head_coaches: Head coaches for both teams\n    :param DataFrame _pbp_df: Holds most recent pbp data\n    :param DataFrame _shifts_df: Holds most recent shift data\n    :param DataFrame _prev_pbp_df: Holds the previous pbp data (just in case)\n    :param DataFrame _prev_shifts_df: Holds the previous shift data (just in case)\n    \"\"\"\n\n    def __init__(self, game_id, start_time, home_team, away_team, status, espn_id, date, if_scrape_shifts):\n        \"\"\" Constructor \"\"\"\n        # Given upon creation\n        self.game_id = game_id\n        self.start_time = start_time\n        self.home_team = home_team\n        self.away_team = away_team\n        self.espn_id = espn_id\n        self.date = date\n        self.if_scrape_shifts = if_scrape_shifts\n\n        # Html pbp is behind the json (json updates faster)\n        self.api_game_status = status\n        self.html_game_status = \"Live\"\n    
    self.prev_api_game_status = status\n        self.prev_html_game_status = \"Live\"\n        self.intermission_time_remaining = 0\n\n        # We know nothing to start off\n        self.players = None\n        self.head_coaches = None\n\n        # Pbp and shift data - Will be filled in later\n        # Also hold previous pair for checking for changes\n        self._pbp_df = pd.DataFrame()\n        self._shifts_df = pd.DataFrame()\n        self._prev_pbp_df = pd.DataFrame()\n        self._prev_shifts_df = pd.DataFrame()\n\n        # Object creation message\n        print(\"The LiveGame object for game {game_id} has been created. \".format(game_id=game_id), end=\"\")\n        if self.time_until_game() <= 0:\n            print(\"The game has started.\")\n        else:\n            print(\"The game starts in {time} seconds.\".format(time=self.time_until_game()))\n\n\n    @property\n    def pbp_df(self):\n        if isinstance(self._pbp_df, pd.DataFrame):\n            return self._pbp_df\n        else:\n            return pd.DataFrame()\n\n    @property\n    def shifts_df(self):\n        if isinstance(self._shifts_df, pd.DataFrame):\n            return self._shifts_df\n        else:\n            return pd.DataFrame()\n\n    @property\n    def prev_pbp_df(self):\n        if isinstance(self._prev_pbp_df, pd.DataFrame):\n            return self._prev_pbp_df\n        else:\n            return pd.DataFrame()\n\n    @property\n    def prev_shifts_df(self):\n        if isinstance(self._prev_shifts_df, pd.DataFrame):\n            return self._prev_shifts_df\n        else:\n            return pd.DataFrame()\n\n\n    def scrape(self, force=False):\n        \"\"\"\n        Scrape the given game. Check if currently ongoing or started\n        \n        :param bool force: Whether or not to force it to scrape even if it's over\n        \n        :return: None\n        \"\"\"\n        # 1. 
force = False: If the game hasn't eclipsed the starting time or is over we don't scrape\n        # 2. force = True: We always scrape\n        if (self.time_until_game() == 0 and not self.is_game_over(prev=True)) or force:\n            self.scrape_live_game(force=force)\n\n\n    def scrape_live_game(self, force=False):\n        \"\"\"\n        Scrape the live info for a given game\n        \n        :param force: Whether to scrape no matter what (used for intermission here)\n        \n        :return: None\n        \"\"\"\n        game_json = json_pbp.get_pbp(str(self.game_id))\n\n        # When don't have json...can't do anything without it\n        if game_json is None:\n            return\n\n        # Shift Game Statuses b4 we do anything\n        self.prev_api_game_status = self.api_game_status\n        self.prev_html_game_status = self.html_game_status\n\n        # Swap old pbp & shift DataFrames\n        self._prev_pbp_df = self._pbp_df\n        self._prev_shifts_df = self._shifts_df\n\n        # If json is in intermission:\n        # Update self.api_game_status, get minutes remaining in intermission, and check if html is intermission too.\n        # If both feeds are in intermission we return, otherwise we wait for the html to catch up.\n        # \"Intermission\" is my own game status so otherwise just take whatever is in the api\n        if game_json['liveData']['linescore']['intermissionInfo']['inIntermission']:\n            self.api_game_status = \"Intermission\"\n            self.intermission_time_remaining = game_json['liveData']['linescore']['intermissionInfo'][\"intermissionTimeRemaining\"]\n\n            # If see the both says intermission and we do too, we can just safely return and not bother with scraping.\n            # This will be false if the HTML hasn't updated yet to intermission\n            # If force we scrape no matter what\n            if self.is_intermission() and not force:\n                return\n        else:\n            # Update 
API Status if NOT in intermission to whatever is there\n            self.api_game_status = game_json[\"gameData\"][\"status\"][\"abstractGameState\"]\n\n        # Leave if b4 game started\n        if game_json[\"gameData\"][\"status\"][\"abstractGameState\"] in [\"Preview\"]:\n            self.html_game_status = self.api_game_status = game_json[\"gameData\"][\"status\"][\"abstractGameState\"]\n            return\n\n        # We get this the 1st time it scrapes the info (or when it's first available)\n        # Don't bother with earlier as it may not be there or we may end up with an old version\n        if not self.players:\n            roster = playing_roster.scrape_roster(self.game_id)\n            if roster is not None:\n                self.players, _ = game_scraper.get_teams_and_players(game_json, roster, self.game_id)\n                self.head_coaches = roster['head_coaches']\n            else:\n                return  # If we try and still can't get it we leave - Termination Reason #2\n\n        # Don't bother with scraper warnings\n        with warnings.catch_warnings():\n            warnings.simplefilter(\"ignore\")\n            # pay attention to each argument\n            self._pbp_df, self.html_game_status = game_scraper.scrape_pbp_live(self.game_id, self.date,\n                                                                              {\"head_coaches\": self.head_coaches},\n                                                                              game_json, self.players,\n                                                                              {\"Home\": self.home_team, \"Away\": self.away_team},\n                                                                              espn_id=self.espn_id)\n            if self.if_scrape_shifts:\n                self._shifts_df = game_scraper.scrape_shifts(self.game_id, self.players, self.date)\n\n\n    def is_ongoing(self):\n        \"\"\"\n        Check if the game is currently being played. 
\n        \n        The logic here is that we run into an issue with intermission and the end of game. If the game is just changed \n        to Final or Intermission the end user will assume the game isn't ongoing and will not update with the most\n        recent events. They'll be delayed for intermission and won't place it at all for Final games. So we use the \n        previous event as a guide. If it's currently in intermission or Final - we check the previous status. If it's \n        the same the user already has the data. Otherwise we 'lie' and say the game is still ongoing.\n        \n        :return: Boolean\n        \"\"\"\n        # The game is currently being played\n        if self.time_until_game() == 0 and not self.is_game_over() and not self.is_intermission() and self.pbp_df.shape[0] > 0:\n            return True\n        # Since it's not being played check if game is over and if it wasn't for the previous\n        elif self.is_game_over() and not self.is_game_over(prev=True):\n            return True\n        # Check if it's in intermission and if it wasn't for the previous event\n        elif self.is_intermission() and not self.is_intermission(prev=True):\n            return True\n        else:\n            return False\n\n\n    def time_until_game(self):\n        \"\"\"\n        Return the seconds until the game starts\n\n        :return: seconds until game \n        \"\"\"\n        delta = self.start_time - datetime.datetime.utcnow()\n        if delta.days >= 0:\n            # total_seconds() includes whole days; delta.seconds alone drops them\n            return int(delta.total_seconds())\n        else:\n            return 0\n\n\n    def is_game_over(self, prev=False):\n        \"\"\"\n        Check if the game is over for both the html and json pbp. 
If prev=True check for the previous event\n        \n        :param prev: Check the game status for the previous event\n        \n        :return: Boolean - True if over\n        \"\"\"\n        if not prev:\n            return self.html_game_status == self.api_game_status == \"Final\"\n        else:\n            return self.prev_html_game_status == self.prev_api_game_status == \"Final\"\n\n\n    def is_intermission(self, prev=False):\n        \"\"\"\n        Check if in intermission for both the html and json pbp. If prev=True check for the previous event\n        \n        :param prev: Check the game status for the previous event\n\n        :return: Boolean - True if yes\n        \"\"\"\n        if not prev:\n            return self.html_game_status == self.api_game_status == \"Intermission\"\n        else:\n            return self.prev_html_game_status == self.prev_api_game_status == \"Intermission\"\n\n\n\nclass ScrapeLiveGames:\n    \"\"\"\n    Class than contains the info for all the games on a specific day\n    \n    :param str date: Date of games (ex: 2018-10-30)\n    :param bool preseason: If you want to scrape preseason games\n    :param bool if_scrape_shifts: Whether or not you want to scrape shifts\n    :param list live_games: List of LiveGame objects for \n    :param int pause: Amount to pause after each scraping call\n    \"\"\"\n\n    def __init__(self, date, preseason=False, if_scrape_shifts=False, pause=15, game_ids=list()):\n        \"\"\"\n        Initialize the ScrapeLiveGames object with games for the day\n        \n        :param date: Date \n        :param preseason: If scrape preseason\n        :param if_scrape_shifts: Whether to scrape the shifts\n        :param pause: time to pause\n        :param game_ids: If only want specific games\n        \"\"\"\n        # First check date\n        check_date_format(date)\n\n        self.user_game_ids = game_ids\n        self.date = date\n        self.preseason = preseason\n        
self.if_scrape_shifts = if_scrape_shifts\n        self.live_games = self.get_games()          # Hold list of LiveGame objects for that day\n        self.pause = pause\n\n\n    def get_games(self):\n        \"\"\"\n        Get initial game info -> Called with object creation. Includes: players, espn_ids, standard game info\n        \n        :return: Dict - LiveGame objects for all games today\n        \"\"\"\n        game_objs = []\n\n        # Get the initial schedule & espn game ids just in case\n        games = json_schedule.scrape_schedule(self.date, self.date, not_over=True, preseason=self.preseason)\n        games = self.get_espn_ids(games)\n\n        # Only keep the games we want if the user specified games\n        if self.user_game_ids:\n            games = [game for game in games if game['game_id'] in self.user_game_ids]\n\n        # Get rosters for each game\n        for game in games:\n            game_objs.append(LiveGame(game['game_id'], game['start_time'], game['home_team'], game['away_team'],\n                                      game['status'], game['espn_id'], self.date, self.if_scrape_shifts))\n\n        return game_objs\n\n\n    def get_espn_ids(self, games):\n        \"\"\"\n        Get espn game ids for all games that day\n\n        :param list games: games today\n\n        :return: Games with corresponding espn game ids\n        \"\"\"\n        # Get all espn info\n        response = espn_pbp.get_espn_date(self.date)\n        game_ids = espn_pbp.get_game_ids(response)\n        espn_games = espn_pbp.get_teams(response)\n\n        # Match up\n        for i in range(len(games)):\n            for j in range(len(espn_games)):\n                if games[i]['home_team'] in espn_games[j] or games[i]['away_team'] in espn_games[j]:\n                    games[i]['espn_id'] = game_ids[j]\n\n        return games\n\n\n    def update_live_games(self, force=False, sleep_next=False):\n        \"\"\"\n        Scrape the pbp & shifts of ongoing games\n        
\n        :param bool force: Whether or not to force it to scrape even if it's in intermission\n        :param bool sleep_next: Sleep until the next game starts\n        \n        :return: None\n        \"\"\"\n        # Check if we need to sleep\n        if sleep_next:\n            self.sleep_next_game()\n\n        for game in self.live_games:\n            game.scrape(force=force)\n\n        time.sleep(self.pause)\n\n\n    def sleep_next_game(self):\n        \"\"\"\n        Sleep until the next game starts. Otherwise just looping and doing nothing\n        \n        :return: None\n        \"\"\"\n        # Get rid of final games...we are looking at current or upcoming games\n        non_final_games = [game for game in self.live_games if not game.is_game_over()]\n\n        # Lets get all the games NOT ongoing but aren't over (so scheduled games)\n        scheduled_games = [game for game in non_final_games if game.time_until_game() > 0]\n\n        # If all the non-final games haven't started yet let's find the next game\n        # Get earliest in the bunch\n        if len(scheduled_games) == len(non_final_games):\n            min_game = min(scheduled_games, key=lambda x: x.time_until_game())\n\n            if min_game.time_until_game() > 0:\n                print(\"\\nSleeping for {} seconds until the next earliest game starts.\".format(min_game.time_until_game()))\n                time.sleep(min_game.time_until_game())\n\n\n    def finished(self):\n        \"\"\"\n        Check if done with all games\n        \n        :return: Boolean\n        \"\"\"\n        # Count finished games\n        finished_games = 0\n        for game in self.live_games:\n            if game.is_game_over():\n                finished_games += 1\n\n        # If the # of finished games == # of total games\n        return len(self.live_games) == finished_games\n\n"
  },
  {
    "path": "hockey_scraper/nhl/pbp/__init__.py",
    "content": ""
  },
  {
    "path": "hockey_scraper/nhl/pbp/espn_pbp.py",
    "content": "\"\"\"\nThis module contains code to scrape coordinates for games off of espn for any given game\n\"\"\"\n\nimport re\nimport xml.etree.ElementTree as etree\nimport pandas as pd\nfrom bs4 import BeautifulSoup\nimport hockey_scraper.utils.shared as shared\n\n\ndef event_type(play_description):\n    \"\"\"\n    Returns the event type (ex: a SHOT or a GOAL...etc) given the event description. \n    \n    :param play_description: description of play\n    \n    :return: event\n    \"\"\"\n    events = {'GOAL SCORED': 'GOAL', 'SHOT ON GOAL': 'SHOT', 'SHOT MISSED': 'MISS', 'SHOT BLOCKED': 'BLOCK',\n              'PENALTY': 'PENL', 'FACEOFF': 'FAC', 'HIT': 'HIT', 'TAKEAWAY': 'TAKE', 'GIVEAWAY': 'GIVE'}\n\n    event = [events[e] for e in events if e in play_description]\n    return event[0] if event else None\n\n\ndef get_game_ids(response):\n    \"\"\"\n    Get game_ids for date from doc\n    \n    :param response: doc\n    \n    :return: list of game_ids\n    \"\"\"\n    soup = BeautifulSoup(response, 'lxml')\n\n    sections = soup.findAll(\"section\", {\"class\": \"Scoreboard bg-clr-white flex flex-auto justify-between\"})\n    game_ids = [section['id'] for section in sections]\n\n    return game_ids\n\n\ndef get_teams(response):\n    \"\"\"\n    Extract Teams for date from doc\n\n    ul-> class = ScoreCell__Competitors\n\n    div -> class = ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db\n    \n    :param response: doc\n    \n    :return: list of teams    \n    \"\"\"\n    teams = []\n    soup = BeautifulSoup(response, 'lxml')\n\n    uls = soup.findAll('div', {'class': \"ScoreCell__Team\"})\n\n    for ul in uls:\n        actual_tm = None\n        tm = ul.find('div', {'class': \"ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db\"}).text\n        \n        # ESPN stores the name and not the city\n        for real_tm in list(shared.TEAMS.keys()):\n            if tm.upper() in real_tm:\n                actual_tm = 
shared.TEAMS[real_tm]\n\n        # If not found we'll let the user know...this may happens\n        if actual_tm is None:\n            shared.print_warning(\"The team {} in the espn pbp is unknown. We use the supplied team name\".format(tm))\n            actual_tm = tm\n\n        teams.append(actual_tm)\n        \n    # Make a list of both teams for each game\n    games = [teams[i:i + 2] for i in range(0, len(teams), 2)]\n\n    return games\n\n\ndef get_espn_date(date):\n    \"\"\"\n    Get the page that contains all the games for that day\n    \n    :param date: YYYY-MM-DD\n    \n    :return: response \n    \"\"\"\n    page_info = {\n        \"url\": 'http://www.espn.com/nhl/scoreboard/_/date/{}'.format(date.replace('-', '')),\n        \"name\": date,\n        \"type\": \"espn_scoreboard\",\n        \"season\": shared.get_season(date),\n    }\n    response = shared.get_file(page_info)\n\n    # If can't get or not there throw an exception\n    if not response:\n        raise Exception\n    else:\n        return response\n\n\ndef get_espn_game_id(date, home_team, away_team):\n    \"\"\"\n    Scrapes the day's schedule and gets the id for the given game\n    Ex: http://www.espn.com/nhl/scoreboard/_/date/20161024\n    \n    :param date: format-> YearMonthDay-> 20161024\n    :param home_team: home team\n    :param away_team: away team\n    \n    :return: 9 digit game id as a string\n    \"\"\"\n    response = get_espn_date(date)\n    game_ids = get_game_ids(response)\n    games = get_teams(response)\n\n    # Get the game id with the right team for it\n    for i in range(len(games)):\n        if home_team in games[i] or away_team in games[i]:\n            return game_ids[i]\n\n\ndef get_espn_game(date, home_team, away_team, game_id=None):\n    \"\"\"\n    Gets the ESPN pbp feed \n    Ex: http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId=400885300\n    \n    :param date: date of the game\n    :param home_team: home team\n    :param away_team: 
away team\n    :param game_id: Game id if we already have it - for live scraping. None if not there\n    \n    :return: raw xml\n    \"\"\"\n    # Get if not provided - for live games\n    if not game_id:\n        game_id = get_espn_game_id(date, home_team.upper(), away_team.upper())\n\n    file_info = {\n        \"url\": 'http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId={}'.format(game_id),\n        \"name\": game_id,\n        \"type\": \"espn_pbp\",\n        \"season\": shared.get_season(date),\n    }\n    response = shared.get_file(file_info)\n\n    # If can't get or not there throw an exception\n    if response is None:\n        raise Exception\n\n    return response\n\n\ndef parse_event(event):\n    \"\"\"\n    Parse each event. In the string each field is separated by a '~'. \n    Relevant for here: The first two are the x and y coordinates. And the 4th and 5th are the time elapsed and period.\n    \n    :param event: string with info\n    \n    :return: return dict with relevant info\n    \"\"\"\n    info = dict()\n    fields = event.split('~')\n\n    # Shootouts screw everything up so don't bother...coordinates don't matter there either way\n    if fields[4] == '5':\n        return None\n\n    info['xC'] = float(fields[0])\n    info['yC'] = float(fields[1])\n    info['time_elapsed'] = shared.convert_to_seconds(fields[3])\n    info['period'] = fields[4]\n    info['event'] = event_type(fields[8].upper())\n\n    return info\n\n\ndef parse_espn(espn_xml):\n    \"\"\"\n    Parse feed \n    \n    :param espn_xml: raw xml of feed\n    \n    :return: DataFrame with info\n    \"\"\"\n    columns = ['period', 'time_elapsed', 'event', 'xC', 'yC']\n\n    # Occasionally we get malformed XML because of the presence of \\x13 characters\n    # Let's just replace them with dashes\n    espn_xml = espn_xml.replace(u'\\x13', '-')\n\n    try:\n        tree = etree.fromstring(espn_xml)\n    except etree.ParseError as e:\n        shared.print_error(\"Espn pbp 
isn't valid xml, therefore coordinates can't be obtained for this game\")\n        return pd.DataFrame([], columns=columns)\n\n    events = tree[1]\n    plays = [parse_event(event.text) for event in events]\n    plays = [play for play in plays if play is not None]\n\n    df = pd.DataFrame(plays, columns=columns)\n    df.period = df.period.astype(int)  # Causes join issues with html later\n\n    return df\n\n\ndef scrape_game(date, home_team, away_team, game_id=None):\n    \"\"\"\n    Scrape the game\n    \n    :param date: ex: 2016-10-24\n    :param home_team: tricode\n    :param away_team: tricode\n    :param game_id: Only provided for live games.\n    \n    :return: DataFrame with info \n    \"\"\"\n    try:\n        shared.print_warning('Using espn for pbp')\n        espn_xml = get_espn_game(date, home_team, away_team, game_id)\n    except Exception as e:\n        shared.print_error(\"Espn pbp for game {a} {b} {c} is either not there or can't be obtained {d}\".format(a=date,\n                                                                                                                 b=home_team,\n                                                                                                                 c=away_team, d=e))\n        return pd.DataFrame()\n\n    try:\n        espn_df = parse_espn(espn_xml)\n    except Exception as e:\n        shared.print_error(\"Issue parsing Espn pbp for game {a} {b} {c} {d}\".format(a=date, b=home_team, c=away_team, d=e))\n        return pd.DataFrame()\n\n    if espn_df.shape[0] == 0:\n        shared.print_error(\"Espn is missing coordinates for game {a} {b} {c}\".format(a=date, b=home_team, c=away_team))\n    \n    return espn_df\n\n\n\n\n# if __name__ == \"__main__\":\n#     get_espn_game('2022-10-08', 'SJS', 'NSH')\n"
  },
  {
    "path": "hockey_scraper/nhl/pbp/html_pbp.py",
    "content": "\"\"\"\nThis module contains functions to scrape the Html Play by Play for any given game\n\"\"\"\n\nimport re\nimport pandas as pd\nfrom bs4 import BeautifulSoup, SoupStrainer\nimport hockey_scraper.utils.shared as shared\n\n\ndef cur_game_status(doc):\n    \"\"\"\n    Return the game status\n    \n    :param doc: Html text\n    \n    :return: String -> one of ['Final', 'Intermission', 'Progress']\n    \"\"\"\n    soup = BeautifulSoup(doc, \"lxml\")\n    tables = soup.find_all('table', {'id': \"GameInfo\"})\n    tds = tables[0].find_all('td')\n    status = tds[-1].text\n\n    # 'End' - in there means an Intermission\n    # 'Final' - Game is over\n    # Otherwise - It's either in progress or b4 the game started\n    if 'end' in status.lower():\n        return 'Intermission'\n    elif 'final' in status.lower():\n        return 'Final'\n    else:\n        return 'Live'\n\n\ndef get_pbp(game_id):\n    \"\"\"\n    Given a game_id it returns the raw html\n    Ex: http://www.nhl.com/scores/htmlreports/20162017/PL020475.HTM\n    \n    :param game_id: the game\n    \n    :return: raw html of game\n    \"\"\"\n    game_id = str(game_id)\n    url = 'http://www.nhl.com/scores/htmlreports/{}{}/PL{}.HTM'.format(game_id[:4], int(game_id[:4]) + 1, game_id[4:])\n\n    page_info = {\n        \"url\": url,\n        \"name\": game_id,\n        \"type\": \"html_pbp\",\n        \"season\": game_id[:4],\n    }\n\n    return shared.get_file(page_info)\n\n\n\ndef get_contents(game_html):\n    \"\"\"\n    Uses Beautiful soup to parses the html document.\n    Some parsers work for some pages but don't work for others....I'm not sure why so I just try them all here in order\n    \n    :param game_html: html doc\n    \n    :return: \"soupified\" html \n    \"\"\"\n    parsers = [\"html5lib\", \"lxml\", \"html.parser\"]\n    strainer = SoupStrainer('td', attrs={'class': re.compile(r'bborder')})\n\n    for parser in parsers:\n        # parse_only only works with lxml for some 
reason\n        if parser == \"lxml\":\n            soup = BeautifulSoup(game_html, parser, parse_only=strainer)\n        else:\n            soup = BeautifulSoup(game_html, parser)\n\n        tds = soup.find_all(\"td\", {\"class\": re.compile('.*bborder.*')})\n\n        if len(tds) > 0:\n            break\n\n    return tds\n\n\ndef strip_html_pbp(td):\n    \"\"\"\n    Strip html tags and such. (Note to Self: Don't touch this!!!) \n    \n    :param td: pbp\n    \n    :return: list of plays (which contain a list of info) stripped of html\n    \"\"\"\n    for y in range(len(td)):\n        # Get the 'br' tag for the time column...this get's us time remaining instead of elapsed and remaining combined\n        if y == 3:\n            td[y] = td[y].get_text()   # This gets us elapsed and remaining combined-< 3:0017:00\n            index = td[y].find(':')\n            td[y] = td[y][:index+3]\n        elif (y == 6 or y == 7) and td[0] != '#':\n            # 6 & 7-> These are the player 1 ice one's\n            # The second statement controls for when it's just a header\n            baz = td[y].find_all('td')\n            bar = [baz[z] for z in range(len(baz)) if z % 4 != 0]  # Because of previous step we get repeats...delete some\n\n            # The setup in the list is now: Name/Number->Position->Blank...and repeat\n            # Now strip all the html\n            players = []\n            for i in range(len(bar)):\n                if i % 3 == 0:\n                    try:\n                        name = return_name_html(bar[i].find('font')['title'])\n                        number = bar[i].get_text().strip('\\n')  # Get number and strip leading/trailing newlines\n                    except KeyError:\n                        name = ''\n                        number = ''\n                elif i % 3 == 1:\n                    if name != '':\n                        position = bar[i].get_text()\n                        players.append([name, number, position])\n\n            
td[y] = players\n        else:\n            td[y] = td[y].get_text()\n\n    return td\n\n\ndef clean_html_pbp(html):\n    \"\"\"\n    Get rid of html and format the data\n    \n    :param html: the requested url\n    \n    :return: a list with all the info\n    \"\"\"\n    soup = get_contents(html)\n\n    # Create a list of lists (each length 8)...corresponds to 8 columns in html pbp\n    td = [soup[i:i + 8] for i in range(0, len(soup), 8)]\n\n    cleaned_html = [strip_html_pbp(x) for x in td]\n\n    return cleaned_html\n\n\ndef add_home_zone(event_dict, home_team):\n    \"\"\"\n    Determines the zone relative to the home team and add it to event.\n    \n    Keep in mind that the 'ev_zone' recorded is the zone relative to the event team. And for blocks the NHL counts\n    the ev_team as the blocking team (I like counting the shooting team for blocks). Therefore, when it's the home team\n    the zone only gets flipped when it's a block. For away teams it's the opposite.\n    \n    :param event_dict: dict of event info\n    :param home_team: home team\n    \n    :return: None\n    \"\"\"\n    ev_team = event_dict['Ev_Team']\n    ev_zone = event_dict['Ev_Zone']\n    event = event_dict['Event']\n\n    # Return if we got nothing in there\n    # Also just make the home zone nothing too\n    if ev_zone == '':\n        event_dict['Home_Zone'] = ''\n        return\n\n    # When it's either: The away team and not a block or the home team and a block\n    if (ev_team != home_team and event != 'BLOCK') or (ev_team == home_team and event == 'BLOCK'):\n        if ev_zone == 'Off':\n            event_dict['Home_Zone'] = 'Def'\n        elif ev_zone == 'Def':\n            event_dict['Home_Zone'] = 'Off'\n        else:\n            event_dict['Home_Zone'] = ev_zone\n    else:\n        event_dict['Home_Zone'] = ev_zone\n\n\ndef add_zone(event_dict, play_description):\n    \"\"\"\n    Determine which zone the play occurred in (unless one isn't listed) and add it to dict\n    \n    
:param event_dict: dict of event info\n    :param play_description: the zone would be included here\n    \n    :return: Off, Def, Neu, or NA\n    \"\"\"\n    s = [x.strip() for x in play_description.split(',')]  # Split by comma's into a list\n    zone = [x for x in s if 'Zone' in x]  # Find if list contains which zone\n\n    if not zone:\n        event_dict['Ev_Zone'] = None\n    elif zone[0].find(\"Off\") != -1:\n        event_dict['Ev_Zone'] = 'Off'\n    elif zone[0].find(\"Neu\") != -1:\n        event_dict['Ev_Zone'] = 'Neu'\n    elif zone[0].find(\"Def\") != -1:\n        event_dict['Ev_Zone'] = 'Def'\n\n\ndef add_type(event_dict, event, players, home_team):\n    \"\"\"\n    Add \"type\" for event -> either a penalty or a shot type\n    \n    :param event_dict: dict of event info\n    :param event: list with parsed event info\n    :param players: dict of home and away players in game\n    :param home_team: home team for game\n    \n    :return: None\n    \"\"\"\n    if 'PENL' in event[4]:\n        event_dict['Type'] = get_penalty(event[5], players, home_team)\n    else:\n        event_dict['Type'] = shot_type(event[5]).upper()\n\n\ndef add_strength(event_dict, home_players, away_players):\n    \"\"\"\n    Get strength for event -> It's home then away\n    \n    :param event_dict: dict of event info\n    :param home_players: list of players for home team\n    :param away_players: list of players for away team\n    \n    :return: None\n    \"\"\"\n    try:\n        home_skaters = event_dict['Home_Players'] - 1 if event_dict['Home_Goalie'] != '' else len(home_players)\n        away_skaters = event_dict['Away_Players'] - 1 if event_dict['Away_Goalie'] != '' else len(away_players)\n    except KeyError:\n        # Getting a key error here means that home/away goalie isn't there...which means home/away players are empty\n        home_skaters = 0\n        away_skaters = 0\n\n    event_dict['Strength'] = 'x'.join([str(home_skaters), str(away_skaters)])\n\n\ndef 
add_event_team(event_dict, event):\n    \"\"\"\n    Add event team for event. \n\n    Always first thing in description \n    \n    :param event_dict: dict of event info\n    :param event: list with parsed event info\n    \n    :return: None\n    \"\"\"\n    if event_dict['Event'] in ['GOAL', 'SHOT', 'MISS', 'BLOCK', 'PENL', 'FAC', 'HIT', 'TAKE', 'GIVE']:\n        event_dict['Ev_Team'] = shared.convert_tricode(event[5].split()[0])\n    else:\n        event_dict['Ev_Team'] = ''\n\n\ndef add_period(event_dict, event):\n    \"\"\"\n    Add period for event \n    \n    :param event_dict: dict of event info\n    :param event: list with parsed event info\n    \n    :return: None\n    \"\"\"\n    try:\n        event_dict['Period'] = int(event[1])\n    except ValueError:\n        event_dict['Period'] = 0\n\n\ndef add_time(event_dict, event):\n    \"\"\"\n    Fill in time and seconds elapsed\n    \n    :param event_dict: dict of parsed event stuff\n    :param event: event info from pbp\n    \n    :return: None\n    \"\"\"\n    event_dict['Time_Elapsed'] = str(event[3])\n\n    if event[3] != '':\n        event_dict['Seconds_Elapsed'] = shared.convert_to_seconds(event[3])\n    else:\n        event_dict['Seconds_Elapsed'] = 0.0\n\n\ndef add_score(event_dict, event, current_score, home_team):\n    \"\"\"\n    Change if someone scored...also change current score\n    \n    :param event_dict: dict of parsed event stuff\n    :param event: event info from pbp\n    :param current_score: current score in game\n    :param home_team: home team for game\n    \n    :return: None\n    \"\"\"\n    event_dict['Home_Score'] = current_score['Home']\n    event_dict['Away_Score'] = current_score['Away']\n    event_dict['score_diff'] = current_score['Home'] - current_score['Away']\n\n    # If it's a goal change the score\n    if event[4] == 'GOAL':\n        if event_dict['Ev_Team'] == home_team:\n            current_score['Home'] += 1\n        else:\n            current_score['Away'] += 
1\n\n\ndef get_penalty(play_description, players, home_team):\n    \"\"\"\n    Get the penalty info\n    \n    :param play_description: description of play field\n    :param players: all players with info\n    :param home_team: home team for game\n    \n    :return: penalty info\n    \"\"\"\n    # First check if it's a bench\n    if \"bench\" in play_description or \"TEAM\" in play_description:\n        beg_penalty_index = play_description.find(\"TEAM\") + 5\n        return play_description[beg_penalty_index: play_description.find(')') + 1]\n    else:\n        # If it's not a bench penl we look for the player who took the penalty\n        # Get Number, and name for player who took the penalty\n        num_regex = re.compile(r'#(\\d+)')\n        numbers = num_regex.findall(play_description)\n\n        # If they don't have any players listed, then the description if fucked up and we got nothing\n        if not numbers:\n            return ''\n        else:\n            player = get_player_name(numbers[0], players, play_description[:3], home_team)\n\n        # Check if the number and player match up\n        if player['last_name'] is not None and player['last_name'] in play_description:\n            # beg_penalty_index is right after the penalty taker's last name (+1 for whitespace)\n            # Then we take from after his last name to right after the parentheses\n            beg_penalty_index = play_description.find(player['last_name']) + len(player['last_name']) + 1\n            return play_description[beg_penalty_index: play_description.find(')')+1]\n        else:\n            # This uses my old method...it falls apart for players like \"Del Zotto\"\n            pen_regex = re.compile(r'.{3}\\s+#\\d+\\s+\\w+\\s+(.*)\\)')\n            penalty = pen_regex.findall(play_description)\n            return penalty[0] + ')' if penalty else ''\n\n\ndef get_player_name(number, players, team, home_team):\n    \"\"\"\n    This function is used for the description field in the 
html. Given a team and a number it returns the player's \n    full name and id. Done by searching in players for the team until we find him (then just break)\n    \n    :param number: player's number\n    :param players: all players with info\n    :param team: team of player listed in html\n    :param home_team: home team defined beforehand (from json)\n    \n    :return: dict with full name, id, and last name\n    \"\"\"\n    player = None\n    team = shared.convert_tricode(team) # Needed to convert from new format to old\n    venue = \"Home\" if team == home_team else \"Away\"\n\n    for name in players[venue]:\n        if players[venue][name]['number'] == number:\n            player = {\n                'name': name, \n                'id': players[venue][name]['id'], \n                'last_name': players[venue][name]['last_name']\n            }\n            break\n\n    # Control for when the name can't be found\n    if not player:\n        player = {'name': None, 'id': None, 'last_name': None}\n\n    return player\n\n\ndef if_valid_event(event):\n    \"\"\"\n    Checks if it's a valid event ('#' is meaningless and I don't like those other ones) to parse\n    \n    Don't remember what 'GOFF' is but 'EGT' is for emergency goaltender. The reason I get rid of it is because it's not\n    in the json and there's another 'EGPID' that can be found in both (not sure why 'EGT' exists then).\n    \n    Events 'PGSTR', 'PGEND', and 'ANTHEM' have been included at the start of each game for the 2017 season...I have no\n    idea why. 
\n     \n    :param event: list of stuff in pbp\n    \n    :return: boolean \n    \"\"\"\n    return event[0] != '#' and event[4] not in ['GOFF', 'EGT', 'PGSTR', 'PGEND', 'ANTHEM']\n\n\ndef return_name_html(info):\n    \"\"\"\n    In the PBP html the name is in a format like: 'Center - MIKE RICHARDS'\n    Some also have a hyphen in their last name so can't just split by '-'\n    \n    :param info: position and name\n    \n    :return: name\n    \"\"\"\n    s = info.index('-')  # Find first hyphen\n    return info[s + 1:].strip(' ')  # The name should be after the first hyphen\n\n\ndef shot_type(play_description):\n    \"\"\"\n    Determine the shot type for the play (unless one isn't listed)\n    \n    :param play_description: the type would be in here\n    \n    :return: the type if it's there (otherwise an empty string)\n    \"\"\"\n    types = ['wrist', 'snap', 'slap', 'deflected', 'tip-in', 'backhand', 'wrap-around']\n\n    play_description = [x.strip() for x in play_description.split(',')]  # Strip leading and trailing whitespace\n    play_description = [i.lower() for i in play_description]  # Convert to lowercase\n\n    for p in play_description:\n        if p in types:\n            if p == 'wrist' or p == 'slap' or p == 'snap':\n                return ' '.join([p, 'shot'])\n            else:\n                return p\n\n    return ''\n\n\ndef parse_fac(description, players, ev_team, home_team):\n    \"\"\"\n    Parse the description field for a face-off\n    MTL won Neu. 
Zone - MTL #11 GOMEZ vs TOR #37 BRENT\n    \n    :param description: Play Description \n    :param players: players in game\n    :param ev_team: Event Team\n    :param home_team: Home Team for game\n    \n    :return: Dict with info\n    \"\"\"\n    event_info = {}\n\n    regex = re.compile(r'(.{3})\\s+#(\\d+)')\n    desc = regex.findall(description)  # [[Team, num], [Team, num]]\n\n    if ev_team == desc[0][0]:\n        p1 = get_player_name(desc[0][1], players, desc[0][0], home_team)\n        p2 = get_player_name(desc[1][1], players, desc[1][0], home_team)\n    else:\n        p1 = get_player_name(desc[1][1], players, desc[1][0], home_team)\n        p2 = get_player_name(desc[0][1], players, desc[0][0], home_team)\n\n    event_info['p1_name'] = p1['name']\n    event_info['p1_ID'] = p1['id']\n    event_info['p2_name'] = p2['name']\n    event_info['p2_ID'] = p2['id']\n\n    return event_info\n\n\ndef parse_shot_miss_take_give(description, players, ev_team, home_team):\n    \"\"\"\n    Parse the description field for a: SHOT, MISS, TAKE, GIVE\n    \n    MTL ONGOAL - #81 ELLER, Wrist, Off. Zone, 11 ft.\n    ANA #23 BEAUCHEMIN, Slap, Wide of Net, Off. Zone, 42 ft.\n    TOR GIVEAWAY - #35 GIGUERE, Def. Zone\n    TOR TAKEAWAY - #9 ARMSTRONG, Off. Zone\n    \n    :param description: Play Description \n    :param players: players in game\n    :param ev_team: Event Team\n    :param home_team: Home Team for game\n    \n    :return: Dict with info\n    \"\"\"\n    event_info = {}\n\n    regex = re.compile(r'(\\d+)')\n    desc = regex.search(description).groups()  # num\n\n    p = get_player_name(desc[0], players, ev_team, home_team)\n    event_info['p1_name'] = p['name']\n    event_info['p1_ID'] = p['id']\n\n    return event_info\n\n\ndef parse_hit(description, players, home_team):\n    \"\"\"\n    Parse the description field for a HIT\n\n    MTL #20 O'BYRNE HIT TOR #18 BROWN, Def. 
Zone\n\n    :param description: Play Description \n    :param players: players in game\n    :param home_team: Home Team for game\n\n    :return: Dict with info\n    \"\"\"\n    event_info = {}\n\n    regex = re.compile(r'(.{3})\\s+#(\\d+)')\n    desc = regex.findall(description)  # [[Team, num], [Team, num]]\n\n    p1 = get_player_name(desc[0][1], players, desc[0][0], home_team)\n    event_info['p1_name'] = p1['name']\n    event_info['p1_ID'] = p1['id']\n\n    if len(desc) > 1:\n        p2 = get_player_name(desc[1][1], players, desc[1][0], home_team)\n        event_info['p2_name'] = p2['name']\n        event_info['p2_ID'] = p2['id']\n\n    return event_info\n\n\ndef parse_block(description, players, home_team):\n    \"\"\"\n    Parse the description field for a BLOCK\n    \n    MTL #76 SUBBAN BLOCKED BY TOR #2 SCHENN, Wrist, Def. Zone\n\n    :param description: Play Description \n    :param players: players in game\n    :param home_team: Home Team for game\n\n    :return: Dict with info\n    \"\"\"\n    event_info = {}\n\n    regex = re.compile(r'(.{3})\\s+#(\\d+)')\n    desc = regex.findall(description)  # [[Team, num], [Team, num]]\n\n    if len(desc) == 0:\n        event_info['p1_name'] = event_info['p2_name'] = event_info['p1_ID'] = event_info['p2_ID'] = None\n    else:\n        p1 = get_player_name(desc[len(desc) - 1][1], players, desc[len(desc) - 1][0], home_team)\n        event_info['p1_name'] = p1['name']\n        event_info['p1_ID'] = p1['id']\n\n        if len(desc) > 1:\n            p2 = get_player_name(desc[0][1], players, desc[0][0], home_team)\n            event_info['p2_name'] = p2['name']\n            event_info['p2_ID'] = p2['id']\n\n    return event_info\n\n\ndef parse_goal(description, players, ev_team, home_team):\n    \"\"\"\n    Parse the description field for a GOAL\n    \n    TOR #81 KESSEL(1), Wrist, Off. Zone, 14 ft. 
Assists: #42 BOZAK(1); #8 KOMISAREK(1)\n    \n    :param description: Play Description \n    :param players: players in game\n    :param ev_team: Event Team\n    :param home_team: Home Team for game\n\n    :return: Dict with info\n    \"\"\"\n    event_info = {}\n\n    regex = re.compile(r'#(\\d+)\\s+')\n    desc = regex.findall(description)  # [num, ?, ?] -> ranging from 1 to 3 indices\n\n    p1 = get_player_name(desc[0], players, ev_team, home_team)\n    event_info['p1_name'] = p1['name']\n    event_info['p1_ID'] = p1['id']\n\n    if len(desc) >= 2:\n        p2 = get_player_name(desc[1], players, ev_team, home_team)\n        event_info['p2_name'] = p2['name']\n        event_info['p2_ID'] = p2['id']\n\n        if len(desc) == 3:\n            p3 = get_player_name(desc[2], players, ev_team, home_team)\n            event_info['p3_name'] = p3['name']\n            event_info['p3_ID'] = p3['id']\n\n    return event_info\n\n\ndef parse_penalty(description, players, home_team):\n    \"\"\"\n    Parse the description field for a Penalty\n\n    MTL #81 ELLER Hooking(2 min), Def. 
Zone Drawn By: TOR #11 SJOSTROM\n\n    :param description: Play Description \n    :param players: players in game\n    :param home_team: Home Team for game\n\n    :return: Dict with info\n    \"\"\"\n    event_info = {}\n\n    # Check if it's a Bench/Team Penalties\n    if \"bench\" in description or \"TEAM\" in description:\n        event_info['p1_name'] = 'Team'\n    else:\n        # Standard Penalty\n        regex = re.compile(r'(.{3})\\s+#(\\d+)')\n        desc = regex.findall(description)  # [[team, num], ?[team, num]] -> Either one to three indices\n\n        if desc:\n            p1 = get_player_name(desc[0][1], players, desc[0][0], home_team)\n            event_info['p1_name'] = p1['name']\n            event_info['p1_ID'] = p1['id']\n\n            # When there are three the penalty was served by someone else\n            # The Person who served the penalty is placed as the 3rd event player\n            if len(desc) == 3:\n                p3 = get_player_name(desc[1][1], players, desc[0][0], home_team)\n                event_info['p3_name'] = p3['name']\n                event_info['p3_ID'] = p3['id']\n\n                p2 = get_player_name(desc[2][1], players, desc[2][0], home_team)\n                event_info['p2_name'] = p2['name']\n                event_info['p2_ID'] = p2['id']\n            elif len(desc) == 2:\n                p2 = get_player_name(desc[1][1], players, desc[1][0], home_team)\n                event_info['p2_name'] = p2['name']\n                event_info['p2_ID'] = p2['id']\n\n    return event_info\n\n\ndef add_event_players(event_dict, event, players, home_team):\n    \"\"\"\n    Add players involved in the event to event_dict\n    \n    :param event_dict: dict of parsed event stuff\n    :param event: fixed up html\n    :param players: dict of players and id's\n    :param home_team: home team\n    \n    :return: None\n    \"\"\"\n    event_info = {}\n    description = event[5].strip()\n    ev_team = 
shared.convert_tricode(description.split()[0])\n\n    if event[4] == 'FAC':\n        event_info = parse_fac(description, players, ev_team, home_team)\n    elif event[4] in ['SHOT', 'MISS', 'GIVE', 'TAKE']:\n        event_info = parse_shot_miss_take_give(description, players, ev_team, home_team)\n    elif event[4] == 'HIT':\n        event_info = parse_hit(description, players, home_team)\n    elif event[4] == 'BLOCK':\n        event_info = parse_block(description, players, home_team)\n    elif event[4] == 'GOAL':\n        event_info = parse_goal(description, players, ev_team, home_team)\n    elif event[4] == 'PENL':\n        event_info = parse_penalty(description, players, home_team)\n\n    # Transfer info over\n    for key in event_info:\n        event_dict[key] = event_info[key]\n\n\ndef populate_players(event_dict, players, away_players, home_players):\n    \"\"\"\n    Populate away and home player info (and num skaters on each side).\n\n    These include:\n        1. HomePlayer & AwayPlayers fields from 1-6 for name/id\n        2. 
Home & Away Goalie Fields for name/id\n    \n    :param event_dict: dict with event info\n    :param players: all players in game and info\n    :param away_players: players for away team\n    :param home_players: players for home team\n    \n    :return: None\n    \"\"\"\n    for venue in ['Home', 'Away']:\n        for j in range(6):\n            # Deal with the Home & Away Player Fields\n            try:\n                ven_player = home_players[j] if venue == \"Home\" else away_players[j]\n                name = shared.fix_name(ven_player[0])\n                event_dict['{}Player{}'.format(venue.lower(), j + 1)] = name\n                event_dict['{}Player{}_id'.format(venue.lower(), j + 1)] = players[venue][name]['id']\n            except KeyError:\n                event_dict['{}Player{}_id'.format(venue.lower(), j + 1)] = None\n            except IndexError:\n                event_dict['{}Player{}'.format(venue.lower(), j + 1)] = None\n                event_dict['{}Player{}_id'.format(venue.lower(), j + 1)] = None\n                continue\n\n            # If the player is a goalie we try filling that field\n            if ven_player[2] == \"G\":\n                try:\n                    event_dict['{}_Goalie'.format(venue)] = name\n                    event_dict['{}_Goalie_Id'.format(venue)] = players[venue][name]['id']\n                except KeyError:\n                    pass\n\n        # Control for when no goalies present\n        if '{}_Goalie'.format(venue) not in event_dict:\n            event_dict['{}_Goalie'.format(venue)] = None\n        if '{}_Goalie_Id'.format(venue) not in event_dict:\n            event_dict['{}_Goalie_Id'.format(venue)] = None\n\n\n    event_dict['Away_Players'] = len(away_players)\n    event_dict['Home_Players'] = len(home_players)\n\n\ndef parse_event(event, players, home_team, current_score):\n    \"\"\"\n    Receives an event and parses it\n    \n    :param event: event type\n    :param players: players in game\n    :param 
home_team: home team\n    :param current_score: current score for both teams\n    \n    :return: dict with info\n    \"\"\"\n    event_dict = dict()\n\n    away_players = event[6]\n    home_players = event[7]\n\n    event_dict['Description'] = event[5]\n    event_dict['Event'] = str(event[4])\n\n    add_period(event_dict, event)\n    add_time(event_dict, event)\n    add_event_team(event_dict, event)\n    add_score(event_dict, event, current_score, home_team)\n    populate_players(event_dict, players, away_players, home_players)\n    add_strength(event_dict, home_players, away_players)\n    add_type(event_dict, event, players, home_team)\n    add_zone(event_dict, event[5])\n    add_home_zone(event_dict, home_team)\n\n    # Sometimes it's empty...(they seem to sometimes/always have a whitespace char)\n    if len(event_dict['Description']) > 1:\n        add_event_players(event_dict, event, players, home_team)\n\n    return event_dict\n\n\ndef parse_html(html, players, teams):\n    \"\"\"\n    Parse html game pbp\n    \n    :param html: raw html\n    :param players: players in the game (from json pbp)\n    :param teams: dict with home and away teams\n    \n    :return: DataFrame with info\n    \"\"\"\n    columns = ['Period', 'Event', 'Description', 'Time_Elapsed', 'Seconds_Elapsed', 'Strength', 'Ev_Zone', 'Type',\n               'Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID', 'p3_name',\n               'p3_ID', 'awayPlayer1', 'awayPlayer1_id', 'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3', 'awayPlayer3_id',\n               'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id', 'awayPlayer6', 'awayPlayer6_id',\n               'homePlayer1', 'homePlayer1_id', 'homePlayer2', 'homePlayer2_id', 'homePlayer3', 'homePlayer3_id',\n               'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id', 'homePlayer6', 'homePlayer6_id',\n               'Away_Goalie', 'Away_Goalie_Id', 'Home_Goalie', 'Home_Goalie_Id', 
'Away_Players', 'Home_Players',\n               'Away_Score', 'Home_Score']\n\n    current_score = {'Home': 0, 'Away': 0}\n    events = [parse_event(event, players, teams['Home'], current_score) for event in html if if_valid_event(event)]\n\n    df = pd.DataFrame(list(events), columns=columns)\n\n    # This is seen sometimes...it's a duplicate row\n    df.drop(df[df.Time_Elapsed == '-16:0-'].index, inplace=True)\n\n    df['p1_ID'] = df['p1_ID'].astype(\"float64\")\n    df['Away_Team'] = teams['Away']\n    df['Home_Team'] = teams['Home']\n\n    return df\n\n\ndef scrape_pbp(game_html, game_id, players, teams):\n    \"\"\"\n    Scrape the data for the pbp\n\n    :param game_html: Html doc for the game\n    :param game_id: game to scrape\n    :param players: dict with player info\n    :param teams: dict with home and away teams\n\n    :return: DataFrame of game info or None if it fails\n    \"\"\"\n    if not game_html:\n        shared.print_error(\"Html pbp for game {} is either not there or can't be obtained\".format(game_id))\n        return None\n\n    cleaned_html = clean_html_pbp(game_html)\n\n    if len(cleaned_html) == 0:\n        shared.print_error(\"Html pbp contains no plays, this game can't be scraped\")\n        return None\n\n    try:\n        game_df = parse_html(cleaned_html, players, teams)\n    except Exception as e:\n        shared.print_error('Error parsing Html pbp for game {} {}'.format(game_id, e))\n        return None\n\n    # These sometimes end up as objects\n    game_df.Period = game_df.Period.astype(int)\n    game_df.Seconds_Elapsed = game_df.Seconds_Elapsed.astype(float)\n\n    return game_df\n\n\ndef scrape_game_live(game_id, players, teams):\n    \"\"\"\n    Scrape the data for the game when it's live\n    \n    :param game_id: game to scrape\n    :param players: dict with player info\n    :param teams: dict with home and away teams\n    \n    :return: Tuple - get_pbp(), cur_game_status()\n    \"\"\"\n    game_html = get_pbp(game_id)\n   
 return scrape_pbp(game_html, game_id, players, teams), cur_game_status(game_html)\n\n\ndef scrape_game(game_id, players, teams):\n    \"\"\" \n    Scrape the data for the game when not live\n    \n    :param game_id: game to scrape\n    :param players: dict with player info\n    :param teams: dict with home and away teams\n    \n    :return: DataFrame of game info or None if it fails\n    \"\"\"\n    game_html = get_pbp(game_id)\n    return scrape_pbp(game_html, game_id, players, teams)\n\n\n"
  },
  {
    "path": "hockey_scraper/nhl/pbp/json_pbp.py",
    "content": "\"\"\"\nThis module contains functions to scrape the Json Play by Play for any given game\n\"\"\"\n\nimport json\nimport pandas as pd\nfrom operator import itemgetter\nimport hockey_scraper.utils.shared as shared\n\n\ndef get_pbp(game_id):\n    \"\"\"\n    Given a game_id it returns the raw json\n    Ex: https://api-web.nhle.com/v1/gamecenter/2023010044/play-by-play\n    \n    :param game_id: string - the game\n    \n    :return: raw json of game or None if couldn't get game\n    \"\"\"\n    page_info = {\n        \"url\": 'https://api-web.nhle.com/v1/gamecenter/{}/play-by-play'.format(game_id),\n        \"name\": game_id,\n        \"type\": \"json_pbp\",\n        \"season\": game_id[:4],\n    }\n    response = shared.get_file(page_info)\n\n    if not response:\n        shared.print_error(\"Json pbp for game {} is either not there or can't be obtained\".format(game_id))\n        return {}\n    else:\n        return json.loads(response)\n\n\ndef get_teams(pbp_json):\n    \"\"\"\n    Get teams \n\n    :param pbp_json: raw play by play json\n\n    :return: dict with home and away\n    \"\"\"\n    return {\n<<<<<<< HEAD\n        'Home': pbp_json['homeTeam']['abbrev'],\n        'Away': pbp_json['awayTeam']['abbrev']\n=======\n        'Home': shared.convert_tricode(pbp_json['homeTeam']['abbrev']),\n        'Away': shared.convert_tricode(pbp_json['awayTeam']['abbrev'])\n>>>>>>> 1029299054fbe671c3ca9c5d413cdfd102416853\n    }\n\n\ndef change_event_name(event):\n    \"\"\"\n    Change event names from json style to html (ex: BLOCKED_SHOT to BLOCK). 
\n    \n    :param event: event type\n    \n    :return: fixed event type\n    \"\"\"\n    event_types = {\n        'PERIOD-START': 'PSTR',\n        'FACEOFF': 'FAC',\n        'BLOCKED-SHOT': 'BLOCK',\n        'GAME-END': 'GEND',\n        'GIVEAWAY': 'GIVE',\n        'GOAL': 'GOAL',\n        'HIT': 'HIT',\n        'MISSED SHOT': 'MISS',\n        'PERIOD-END': 'PEND',\n        'SHOT-ON-GOAL': 'SHOT',\n        'STOPPAGE': 'STOP',\n        'TAKEAWAY': 'TAKE',\n        'PENALTY': 'PENL',\n        'EARLY INT START': 'EISTR',\n        'EARLY INT END': 'EIEND',\n        'SHOOTOUT COMPLETE': 'SOC',\n        'CHALLENGE': 'CHL',\n        'EMERGENCY GOALTENDER': 'EGPID'\n    }\n\n    return event_types.get(event.upper(), event)\n\n\ndef parse_event(event):\n    \"\"\"\n    Parses a single event when the info is in a json format\n    \n    :param event: json of event \n    \n    :return: dictionary with the info\n    \"\"\"\n    play = dict()\n\n    play['event_id'] = event['eventId']\n    play['period'] = event['periodDescriptor']['number']\n    play['event'] = str(change_event_name(event['typeDescKey'].upper()))\n    play['seconds_elapsed'] = shared.convert_to_seconds(event['timeInPeriod'])\n    \n    play['p1_name'], play['p2_name'], play['p3_name'] = '', '', ''\n    if 'details' in event.keys():\n        details = event['details'].keys()\n        # If there's a players key that means an event occurred on the play.\n\n        if 'scoringPlayerId' in details:\n            play['p1_ID'] = event['details']['scoringPlayerId']\n\n        if 'shootingPlayerId' in details:\n            play['p1_ID'] = event['details']['shootingPlayerId']\n\n        if 'assist1PlayerId' in details:\n            play['p2_ID'] = event['details']['assist1PlayerId']\n            \n        if 'assist2PlayerId' in details:\n            play['p3_ID'] = event['details']['assist2PlayerId']\n\n      
  if 'blockingPlayerId' in details:\n            play['p2_ID'] = event['details']['blockingPlayerId']\n\n        if 'xCoord' in details:\n            play['xC'] = event['details']['xCoord']\n            play['yC'] = event['details']['yCoord']\n        \n\n    return play\n\n\ndef parse_json(game_json, game_id):\n    \"\"\"\n    Scrape the json for a game\n    \n    :param game_json: raw json\n    :param game_id: game id for game\n    \n    :return: Either a DataFrame with info for the game or None when fail\n    \"\"\"\n    columns = ['period', 'event', 'seconds_elapsed', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID', 'p3_name', 'p3_ID', 'xC', 'yC']\n\n    # 'PERIOD READY' & 'PERIOD OFFICIAL'..etc aren't found in html...so get rid of them\n    events_to_ignore = ['PERIOD READY', 'PERIOD OFFICIAL', 'GAME READY', 'GAME OFFICIAL', 'GAME SCHEDULED']\n\n    try:\n        plays = game_json['plays']\n        events = [parse_event(play) for play in plays if play['typeDescKey'].upper() not in events_to_ignore]\n    except Exception as e:\n        shared.print_error('Error parsing Json pbp for game {} {}'.format(game_id, e))\n        return None\n\n    # Sort by event id.\n    # Sometimes it's not in order of the assigned id in the json. 
Like, 156...155 (not sure how this happens).\n    sorted_events = sorted(events, key=itemgetter('event_id'))\n\n    return pd.DataFrame(sorted_events, columns=columns)\n\n\ndef scrape_game(game_id):\n    \"\"\"\n    **Used for debugging** \n\n    HTML depends on json so can't follow this structure\n    \n    :param game_id: game to scrape\n    \n    :return: DataFrame of game info or None if it fails\n    \"\"\"\n    game_json = get_pbp(game_id)\n\n    if not game_json:\n        shared.print_error(\"Json pbp for game {} is either not there or can't be obtained\".format(game_id))\n        return None\n\n    try:\n        game_df = parse_json(game_json, game_id)\n    except Exception as e:\n        shared.print_error('Error parsing Json pbp for game {} {}'.format(game_id, e))\n        return None\n\n    return game_df\n"
  },
  {
    "path": "hockey_scraper/nhl/playing_roster.py",
    "content": "\"\"\"\nThis module contains functions to scrape the Html game roster for any given game\n\"\"\"\n\nfrom bs4 import BeautifulSoup\nimport hockey_scraper.utils.shared as shared\n\n\ndef get_roster(game_id):\n    \"\"\"\n    Given a game_id it returns the raw html\n    Ex: http://www.nhl.com/scores/htmlreports/20162017/RO020475.HTM\n    \n    :param game_id: the game\n    \n    :return: raw html of game\n    \"\"\"\n    game_id = str(game_id)\n\n    page_info = {\n        \"url\": 'http://www.nhl.com/scores/htmlreports/{}{}/RO{}.HTM'.format(game_id[:4], int(game_id[:4]) + 1, game_id[4:]),\n        \"name\": game_id,\n        \"type\": \"html_roster\",\n        \"season\": game_id[:4],\n    }\n\n    return shared.get_file(page_info)\n\n\ndef get_content(roster):\n    \"\"\"\n    Uses Beautiful soup to parses the html document.\n    Some parsers work for some pages but don't work for others....I'm not sure why so I just try them all here in order\n    \n    :param roster: doc\n    \n    :return: players and coaches\n    \"\"\"\n    parsers = [\"lxml\", \"html.parser\", \"html5lib\"]\n\n    for parser in parsers:\n        soup = BeautifulSoup(roster, \"lxml\")\n        players = get_players(soup)\n        head_coaches = get_coaches(soup)\n\n        if len(players) > 0:\n            break\n\n    return players, head_coaches\n\n\ndef fix_name(player):\n    \"\"\"\n    Get rid of (A) or (C) when a player has it attached to their name\n    \n    :param player: list of player info -> [number, position, name]\n    \n    :return: fixed list\n    \"\"\"\n    if player[2].find('(A)') != -1:\n        player[2] = player[2][:player[2].find('(A)')].strip()\n    elif player[2].find('(C)') != -1:\n        player[2] = player[2][:player[2].find('(C)')].strip()\n\n    return player\n\n\ndef get_coaches(soup):\n    \"\"\"\n    scrape head coaches\n    \n    :param soup: html\n    \n    :return: dict of coaches for game\n    \"\"\"\n    coaches = soup.find_all('tr', {'id': 
\"HeadCoaches\"})\n\n    # If it picks up nothing just return the empty list\n    if not coaches:\n        return coaches\n\n    coaches = coaches[0].find_all('td')\n\n    return {\n        'Away': coaches[1].get_text(),\n        'Home': coaches[3].get_text()\n    }\n\n\ndef get_players(soup):\n    \"\"\"\n    scrape roster for players \n    \n    :param soup: html\n    \n    :return: dict for home and away players\n    \"\"\"\n    tables = soup.findAll('table', {'align': 'center', 'border': '0', 'cellpadding': '0', 'cellspacing': '0', 'width': '100%'})\n\n    # If it picks up nothing just return the empty list\n    if not tables:\n        return tables\n\n    \"\"\"\n    There are 5 tables which correspond to the above criteria.\n    tables[0] is nothing\n    tables[1] is away starters\n    tables[2] is home starters\n    tables[3] is away scratches\n    tables[4] is home scratches\n    \"\"\"\n\n    del tables[0]\n    player_info = [table.find_all('td') for table in tables]\n\n    player_info = [[x.get_text() for x in group] for group in player_info]\n\n    # Make list of list of 3 each. 
The three are: number, position, name (in that order)\n    player_info = [[group[i:i+3] for i in range(0, len(group), 3)] for group in player_info]\n\n    # Get rid of header row\n    player_info = [[player for player in group if player[0] != '#'] for group in player_info]\n\n    # Add whether the player was a scratch\n    # 2 and 3 hold the scratches\n    for i in range(len(player_info)):\n        for j in range(len(player_info[i])):\n            if i == 2 or i == 3:\n                player_info[i][j].append(True)\n            else:\n                player_info[i][j].append(False)\n\n    players = {'Away': player_info[0], 'Home': player_info[1]}\n\n    # Scratches aren't always included\n    if len(player_info) == 4:\n        players['Away'] += player_info[2]\n        players['Home'] += player_info[3]\n\n    # For those with (A) or (C) in name field get rid of it\n    # First condition is to control when we get whitespace as one of the indices\n    players['Away'] = [fix_name(i) if i[0] != u'\\xa0' else i for i in players['Away']]\n    players['Home'] = [fix_name(i) if i[0] != u'\\xa0' else i for i in players['Home']]\n\n    # Drop entries that are just whitespace\n    players['Away'] = [i for i in players['Away'] if i[0] != u'\\xa0']\n    players['Home'] = [i for i in players['Home'] if i[0] != u'\\xa0']\n\n    return players\n\n\ndef scrape_roster(game_id):\n    \"\"\"\n    For a given game scrapes the roster\n    \n    :param game_id: id for game\n    \n    :return: dict of players (home and away) and a dict of both head coaches\n    \"\"\"\n    roster = get_roster(game_id)\n\n    if not roster:\n        shared.print_error(\"Roster for game {} is either not there or can't be obtained\".format(game_id))\n        return None\n\n    try:\n        players, head_coaches = get_content(roster)\n    except Exception as e:\n        shared.print_error('Error parsing Roster for game {} {}'.format(game_id, e))\n        return None\n\n    return {'players': players, 
'head_coaches': head_coaches}\n"
  },
  {
    "path": "hockey_scraper/nhl/scrape_functions.py",
    "content": "\"\"\"\nFunctions to scrape by season, games, and date range\n\"\"\"\n\nimport time\nimport pandas as pd\nfrom datetime import datetime\nimport hockey_scraper.nhl.game_scraper as game_scraper\nimport hockey_scraper.nhl.json_schedule as json_schedule\nimport hockey_scraper.utils.shared as shared\n\n\ndef print_errors(detailed=True):\n    \"\"\"\n    Print errors with scraping.\n\n    Detailed parameter controls if certain errors should be *re-printed* after scraping all games.\n    For example if the pbp for a game is broken it's always printed immediately after that game.\n    But a summary of broken games will be printed if over 25 games are scraped. The logic is that\n    it'll be easier when you've scraped a lot of games to see all the errors at the end than scrolling\n    though all the output and potentially missing it.\n\n    :param detailed: When False only print player IDs otherwise all\n    \n    :return: None\n    \"\"\"\n    print(\"\")\n\n    if game_scraper.broken_pbp_games and detailed:\n        print('Broken pbp:')\n        for x in game_scraper.broken_pbp_games:\n            print(\"  -\", x[0], x[1])\n        print(\"\")\n\n    if game_scraper.broken_shifts_games and detailed:\n        print('Broken shifts:')\n        for x in game_scraper.broken_shifts_games:\n            print(\"  -\", x[0], x[1])\n        print(\"\")\n\n    if game_scraper.missing_coords and detailed:\n        print('Games missing coordinates:')\n        for x in game_scraper.missing_coords:\n            print(\"  -\", x[0], x[1])\n        print(\"\")\n\n    if game_scraper.players_missing_ids:\n        print(\"Players missing IDs:\")\n        for x in game_scraper.players_missing_ids:\n            print(\"  -\", x[0], x[1])\n        print(\"\")\n\n    # Clear them all out for the next call\n    game_scraper.broken_shifts_games = []\n    game_scraper.broken_pbp_games = []\n    game_scraper.players_missing_ids = []\n    game_scraper.missing_coords = []\n\n\ndef 
scrape_list_of_games(games, if_scrape_shifts, verbose=False):\n    \"\"\"\n    Given a list of game_id's (and a date for each game) it scrapes them\n    \n    :param games: list of dicts, each with a \"game_id\" and a \"date\" key\n    :param if_scrape_shifts: Boolean indicating whether to also scrape shifts\n    :param verbose: Verbosity when printing errors. Defaults to False\n    \n    :return: DataFrame of pbp info, also shifts if specified\n    \"\"\"\n    pbp_dfs = []\n    shifts_dfs = []\n\n    for game in games:\n        pbp_df, shifts_df = game_scraper.scrape_game(str(game[\"game_id\"]), game[\"date\"], if_scrape_shifts)\n        if pbp_df is not None:\n            pbp_dfs.append(pbp_df)\n        if shifts_df is not None:\n            shifts_dfs.append(shifts_df)\n\n    # Check if any games...if not let's get out of here\n    if len(pbp_dfs) == 0:\n        return None, None\n    else:\n        pbp_df = pd.concat(pbp_dfs)\n        pbp_df = pbp_df.reset_index(drop=True)\n        pbp_df.apply(lambda row: game_scraper.check_goalie(row), axis=1)\n\n    # pd.concat raises on an empty list, so only combine when we actually have shifts\n    if if_scrape_shifts and shifts_dfs:\n        shifts_df = pd.concat(shifts_dfs).reset_index(drop=True)\n    else:\n        shifts_df = None\n\n    # Only print full details when # games >= 25 or verbose=True\n    error_verbosity = verbose or len(games) >= 25\n    print_errors(error_verbosity)\n\n    return pbp_df, shifts_df\n\n\ndef scrape_schedule(from_date, to_date, data_format='pandas', rescrape=False, docs_dir=False):\n    \"\"\"\n    Scrape the games schedule in a given range.\n    \n    :param from_date: date you want to scrape from\n    :param to_date: date you want to scrape to \n    :param data_format: format you want data in - csv or pandas (pandas is default)\n    :param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default is False)\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping. 
When True it'll refer to (or if needed create) such a repository in the home\n                     directory. When provided a string it'll try to use that. Here it must be a valid directory otherwise\n                     it won't work (I won't make it for you). When False the files won't be saved.\n    \n    :return: DataFrame or None\n    \"\"\"\n    cols = [\"game_id\", \"date\", \"venue\", \"home_team\", \"away_team\", \"start_time\", \"home_score\", \"away_score\", \"status\"]\n\n    shared.check_data_format(data_format)\n    shared.check_valid_dates(from_date, to_date)\n\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    # not_over=True allows us to scrape games that aren't final\n    sched = json_schedule.scrape_schedule(from_date, to_date, preseason=True, not_over=True)\n    sched_df = pd.DataFrame(sched, columns=cols)\n\n    if data_format.lower() == 'csv':\n        shared.to_csv(from_date + '--' + to_date, sched_df, \"nhl\", \"schedule\")\n    else:\n        return sched_df\n\n\ndef scrape_date_range(from_date, to_date, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False, verbose=False):\n    \"\"\"\n    Scrape games in given date range\n    \n    :param from_date: date you want to scrape from\n    :param to_date: date you want to scrape to\n    :param if_scrape_shifts: Boolean indicating whether to also scrape shifts \n    :param data_format: format you want data in - csv or pandas (csv is default)\n    :param preseason: Boolean indicating whether to include preseason games (default is False)\n                      This may or may not work!\n    :param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default is False)\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping. 
When True it'll refer to (or if needed create) such a repository in the home\n                     directory. When provided a string it'll try to use that. Here it must be a valid directory otherwise\n                     it won't work (I won't make it for you). When False the files won't be saved.\n    :param verbose: Override default verbosity when printing errors\n\n    :return: Dictionary with DataFrames, or None when data_format is csv\n    \"\"\"\n    shared.check_data_format(data_format)\n    shared.check_valid_dates(from_date, to_date)\n\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    games = json_schedule.scrape_schedule(from_date, to_date, preseason)\n    pbp_df, shifts_df = scrape_list_of_games(games, if_scrape_shifts, verbose)\n\n    if data_format.lower() == 'csv':\n        shared.to_csv(from_date + '--' + to_date, pbp_df, \"nhl\", \"pbp\")\n        shared.to_csv(from_date + '--' + to_date, shifts_df, \"nhl\", \"shifts\")\n    else:\n        return {\"pbp\": pbp_df, \"shifts\": shifts_df} if if_scrape_shifts else {\"pbp\": pbp_df}\n\n\ndef scrape_seasons(seasons, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False, verbose=False):\n    \"\"\"\n    Given list of seasons it scrapes all the seasons \n    \n    :param seasons: list of seasons\n    :param if_scrape_shifts: Boolean indicating whether to also scrape shifts \n    :param data_format: format you want data in - csv or pandas (csv is default)\n    :param preseason: Boolean indicating whether to include preseason games (default is False)\n                      This may or may not work!\n    :param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir.\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping. 
When True it'll refer to (or if needed create) such a repository in the home\n                     directory. When provided a string it'll try to use that. Here it must be a valid directory otherwise\n                     it won't work (I won't make it for you). When False the files won't be saved.\n    :param verbose: Override default verbosity when printing errors\n\n    :return: Dictionary with DataFrames, or None when data_format is csv\n    \"\"\"\n    shared.check_data_format(data_format)\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    # Holds all seasons scraped (if not csv)\n    master_pbps, master_shifts = [], []\n\n    for season in seasons:\n        from_date = shared.season_start_bound(season)\n        to_date = datetime.strftime(shared.season_end_bound(str(int(season) + 1)), \"%Y-%m-%d\")\n\n        games = json_schedule.scrape_schedule(from_date, to_date, preseason)\n        pbp_df, shifts_df = scrape_list_of_games(games, if_scrape_shifts, verbose)\n\n        if data_format.lower() == 'csv':\n            # int(season) so the file name also works when the season is given as a string\n            shared.to_csv(str(season) + str(int(season) + 1), pbp_df, \"nhl\", \"pbp\")\n            shared.to_csv(str(season) + str(int(season) + 1), shifts_df, \"nhl\", \"shifts\")\n        elif pbp_df is not None:\n            master_pbps.append(pbp_df)\n            master_shifts.append(shifts_df)\n\n    if data_format.lower() == 'pandas' and master_pbps:\n        if if_scrape_shifts:\n            return {\"pbp\": pd.concat(master_pbps), \"shifts\": pd.concat(master_shifts)}\n        else:\n            return {\"pbp\": pd.concat(master_pbps)}\n\n\ndef scrape_games(games, if_scrape_shifts, data_format='csv', rescrape=False, docs_dir=False, verbose=False):\n    \"\"\"\n    Scrape a list of games\n    \n    :param games: list of game_ids\n    :param if_scrape_shifts: Boolean indicating whether to also scrape shifts \n    :param data_format: format you want data in - csv or pandas (csv is default)\n    :param rescrape: If you want to rescrape pages already 
scraped. Only applies if you supply a docs dir.\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping. When True it'll refer to (or if needed create) such a repository in the home\n                     directory. When provided a string it'll try to use that. Here it must be a valid directory otherwise\n                     it won't work (I won't make it for you). When False the files won't be saved. \n    :param verbose: Override default verbosity when printing errors\n\n    :return: Dictionary with DataFrames, or None when data_format is csv\n    \"\"\"\n    shared.check_data_format(data_format)\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    # Create List of game_id's and dates\n    games_list = json_schedule.get_dates(games)\n\n    # Scrape pbp and shifts\n    pbp_df, shifts_df = scrape_list_of_games(games_list, if_scrape_shifts, verbose)\n\n    if data_format.lower() == 'csv':\n        # Take the timestamp once so both files share the same name\n        timestamp = str(int(time.time()))\n        shared.to_csv(timestamp, pbp_df, \"nhl\", \"pbp\")\n        shared.to_csv(timestamp, shifts_df, \"nhl\", \"shifts\")\n    else:\n        return {\"pbp\": pbp_df, \"shifts\": shifts_df} if if_scrape_shifts else {\"pbp\": pbp_df}\n"
  },
  {
    "path": "hockey_scraper/nhl/shifts/__init__.py",
    "content": ""
  },
  {
    "path": "hockey_scraper/nhl/shifts/html_shifts.py",
    "content": "\"\"\"\nThis module contains functions to scrape the Html Toi Tables (or shifts) for any given game\n\"\"\"\n\nimport re\nimport pandas as pd\nfrom bs4 import BeautifulSoup\nimport hockey_scraper.utils.shared as shared\n\n\ndef get_shifts(game_id):\n    \"\"\"\n    Given a game_id it returns a the shifts for both teams\n    Ex: http://www.nhl.com/scores/htmlreports/20162017/TV020971.HTM\n    \n    :param game_id: the game\n    \n    :return: Shifts or None\n    \"\"\"\n    game_id = str(game_id)\n    venue_pgs = tuple()\n\n    for venue in [\"home\", \"away\"]:\n        venue_tag = \"H\" if venue == \"home\" else \"V\"\n        venue_url = 'http://www.nhl.com/scores/htmlreports/{}{}/T{}{}.HTM'.format(game_id[:4], int(game_id[:4])+1, venue_tag, game_id[4:])\n  \n        page_info = {\n            \"url\": venue_url,\n            \"name\": game_id,\n            \"type\": \"html_shifts_{}\".format(venue),\n            \"season\": game_id[:4],\n        }\n\n        venue_pgs += (shared.get_file(page_info), )\n\n    return venue_pgs\n\n\ndef get_soup(shifts_html):\n    \"\"\"\n    Uses Beautiful soup to parses the html document.\n    Some parsers work for some pages but don't work for others....I'm not sure why so I just try them all here in order\n    \n    :param shifts_html: html doc\n    \n    :return: \"soupified\" html and player_shifts portion of html (it's a bunch of td tags)\n    \"\"\"\n    parsers = [\"lxml\", \"html.parser\", \"html5lib\"]\n\n    for parser in parsers:\n        soup = BeautifulSoup(shifts_html, parser)\n        td = soup.findAll(True, {'class': ['playerHeading + border', 'lborder + bborder']})\n\n        if len(td) > 0:\n            break\n\n    return td, get_teams(soup)\n\n\ndef get_teams(soup):\n    \"\"\"\n    Return the team for the TOI tables and the home team\n    \n    :param soup: souped up html\n    \n    :return: list with team and home team\n    \"\"\"\n    team = soup.find('td', class_='teamHeading + border')  # 
Team for shifts\n    team = team.get_text()\n\n    # Get Home Team\n    teams = soup.find_all('td', {'align': 'center', 'style': 'font-size: 10px;font-weight:bold'})\n    regex = re.compile(r'>(.*)<br/?>')\n    home_team = regex.findall(str(teams[7]))\n\n    return [team, home_team[0]]\n\n\ndef analyze_shifts(shift, name, team, home_team, player_ids):\n    \"\"\"\n    Analyze a single shift for a player.\n    Prior to this each player (in a dictionary) has a list with each entry being a shift.\n\n    :param shift: info on shift\n    :param name: player name\n    :param team: given team\n    :param home_team: home team for given game\n    :param player_ids: dict with info on players\n    \n    :return: dict with info for shift\n    \"\"\"\n    shifts = dict()\n\n    shifts['Player'] = name.upper()\n    shifts['Period'] = '4' if shift[1] == 'OT' else shift[1]\n    shifts['Team'] = shared.get_team(team.strip(' '))\n    shifts['Start'] = shared.convert_to_seconds(shift[2].split('/')[0])\n    shifts['Duration'] = shared.convert_to_seconds(shift[4].split('/')[0])\n\n    # I've had problems with this one...if there are no digits the end time is malformed,\n    # so fall back to start + duration\n    if re.compile(r'\\d+').findall(shift[3].split('/')[0]):\n        shifts['End'] = shared.convert_to_seconds(shift[3].split('/')[0])\n    else:\n        shifts['End'] = shifts['Start'] + shifts['Duration']\n\n    try:\n        if home_team == team:\n            shifts['Player_Id'] = player_ids['Home'][name.upper()]['id']\n        else:\n            shifts['Player_Id'] = player_ids['Away'][name.upper()]['id']\n    except KeyError:\n        shifts['Player_Id'] = None\n\n    return shifts\n\n\ndef parse_html(html, player_ids, game_id):\n    \"\"\"\n    Parse the html\n    \n    Note: Change this with care. I'm not exactly sure how or why but it works. 
\n    \n    :param html: cleaned up html\n    :param player_ids: dict of home and away players\n    :param game_id: id for game\n    \n    :return: DataFrame with info\n    \"\"\"\n    all_shifts = []\n    columns = ['Game_Id', 'Player', 'Player_Id', 'Period', 'Team', 'Start', 'End', 'Duration']\n\n    td, teams = get_soup(html)\n\n    team = teams[0]\n    home_team = teams[1]\n    players = dict()\n\n    # The list 'td' is laid out with player name followed by every component of each shift. Each shift contains:\n    # shift #, Period, begin, end, and duration. The shift event isn't included.\n    for t in td:\n        t = t.get_text()\n        if ',' in t:     # If it has a comma in it we know it's a player's name...so add player to dict\n            # Just format the name normally...it's coded as: 'num last_name, first_name'\n            name_parts = t.split(',')\n            # Grab the number before the name is rebuilt (indexing the rebuilt string only gave its first character)\n            number = name_parts[0][:2].strip()\n            name = ' '.join([name_parts[1].strip(' '), name_parts[0][2:].strip(' ')])\n            name = shared.fix_name(name)\n            players[name] = dict()\n            players[name]['number'] = number\n            players[name]['Shifts'] = []\n        else:\n            # Here we add all the shifts to whatever player we are up to\n            players[name]['Shifts'].append(t)\n\n    for key in players.keys():\n        # Create a list of lists (each length 5)...corresponds to 5 columns in html shifts\n        players[key]['Shifts'] = [players[key]['Shifts'][i:i + 5] for i in range(0, len(players[key]['Shifts']), 5)]\n\n        # Parse each shift\n        shifts = [analyze_shifts(shift, key, team, home_team, player_ids) for shift in players[key]['Shifts']]\n        all_shifts.extend(shifts)\n\n    df = pd.DataFrame(all_shifts)\n    df['Game_Id'] = str(game_id)[5:]\n    \n    return df[columns]\n\n\ndef scrape_game(game_id, players):\n    \"\"\"\n    Scrape the game. 
\n    \n    :param game_id: id for game\n    :param players: list of players\n    \n    :return: DataFrame with info for the game\n    \"\"\"\n    columns = ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration']\n\n    home_html, away_html = get_shifts(game_id)\n\n    if home_html is None or away_html is None:\n        shared.print_error(\"Html shifts for game {} is either not there or can't be obtained\".format(game_id))\n        return pd.DataFrame()\n\n    try:\n        away_df = parse_html(away_html, players, game_id)\n        home_df = parse_html(home_html, players, game_id)\n    except Exception as e:\n        shared.print_error('Error parsing Html shifts for game {} {}'.format(game_id, e))\n        return pd.DataFrame()\n\n    # Combine the two\n    game_df = pd.concat([away_df, home_df], ignore_index=True)\n    game_df = pd.DataFrame(game_df, columns=columns)\n    game_df = game_df.sort_values(by=['Period', 'Start'], ascending=[True, True])\n\n    return game_df.reset_index(drop=True)\n"
  },
  {
    "path": "hockey_scraper/nhl/shifts/json_shifts.py",
    "content": "\"\"\"\nThis module contains functions to scrape the Json toi/shifts for any given game\n\"\"\"\n\nimport json\nimport pandas as pd\nimport hockey_scraper.utils.shared as shared\n\n\ndef get_shifts(game_id):\n    \"\"\"\n    Given a game_id it returns the raw json\n    Ex: https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId=2019020001\n    \n    :param game_id: the game\n    \n    :return: json or None\n    \"\"\"\n    page_info = {\n        \"url\": 'https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId={}'.format(game_id),\n        \"name\": str(game_id),\n        \"type\": \"json_shifts\",\n        \"season\": str(game_id)[:4],\n    }\n\n    response = shared.get_file(page_info)\n\n    # Return empty dict if can't get page\n    if not response:\n        return {}\n    else:\n        return json.loads(response)\n\n\ndef fix_team_tricode(tricode):\n    \"\"\"\n    Some of the tricodes are different than how I want them\n    \n    :param tricode: 3 letter team name - ex: NYR\n    \n    :return: fixed tricode\n    \"\"\"\n    fixed_tricodes = {\n        'TBL':  'T.B',\n        'LAK': 'L.A',\n        'NJD': 'N.J',\n        'SJS': 'S.J'\n    }\n\n    return fixed_tricodes.get(tricode.upper(), tricode)\n\n\ndef parse_shift(shift):\n    \"\"\"\n    Parse shift for json\n    \n    :param shift: json for shift\n    \n    :return: dict with shift info\n    \"\"\"\n    shift_dict = dict()\n\n    # At the end of the json they list when all the goal events happened. 
We don't want them...\n    # They are the only ones whose eventDescription is not null\n    if shift['eventDescription'] is not None:\n        return {}\n\n    name = shared.fix_name(' '.join([shift['firstName'].strip(' '), shift['lastName'].strip(' ')]))\n\n    shift_dict['Player'] = name\n    shift_dict['Player_Id'] = shift['playerId']\n    shift_dict['Period'] = shift['period']\n    shift_dict['Team'] = fix_team_tricode(shift['teamAbbrev'])\n    shift_dict['Start'] = shared.convert_to_seconds(shift['startTime'])\n    shift_dict['End'] = shared.convert_to_seconds(shift['endTime'])\n    shift_dict['Duration'] = shared.convert_to_seconds(shift['duration'])\n\n    return shift_dict\n\n\ndef parse_json(shift_json, game_id):\n    \"\"\"\n    Parse the json\n    \n    :param shift_json: raw json\n    :param game_id: id of game\n    \n    :return: DataFrame with info\n    \"\"\"\n    columns = ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration']\n\n    shifts = [parse_shift(shift) for shift in shift_json['data']]  # Go through the shifts\n    shifts = [shift for shift in shifts if shift != {}]            # Get rid of null shifts (which happen at end)\n\n    df = pd.DataFrame(shifts, columns=columns)\n    df['Game_Id'] = str(game_id)[5:]\n    df = df.sort_values(by=['Period', 'Start'], ascending=[True, True])\n\n    return df.reset_index(drop=True)\n\n\ndef scrape_game(game_id):\n    \"\"\"\n    Scrape the game. 
\n    \n    :param game_id: game\n    \n    :return: DataFrame with info for the game\n    \"\"\"\n    shifts_json = get_shifts(game_id)\n\n    if not shifts_json:\n        #shared.print_error(\"Json shifts for game {} is either not there or can't be obtained\".format(game_id))\n        return pd.DataFrame()\n\n    try:\n        game_df = parse_json(shifts_json, game_id)\n    except Exception as e:\n        shared.print_error('Error parsing Json shifts for game {} {}'.format(game_id, e))\n        return pd.DataFrame()\n\n    return game_df\n\n"
  },
  {
    "path": "hockey_scraper/nwhl/__init__.py",
    "content": ""
  },
  {
    "path": "hockey_scraper/nwhl/game_pbp.py",
    "content": "\"\"\"\nScrape the PBP info for a given game\n\"\"\"\nimport json\nimport time\nimport datetime\nimport pandas as pd\nfrom bs4 import BeautifulSoup\nimport hockey_scraper.utils.shared as shared\nimport hockey_scraper.utils.save_pages as sp\n\nfrom selenium import webdriver\nfrom selenium.webdriver.chrome.options import Options\nfrom selenium.common.exceptions import WebDriverException\nfrom selenium.webdriver.common.by import By\nfrom selenium.webdriver.support import expected_conditions as EC\nfrom selenium.common.exceptions import TimeoutException, ElementNotVisibleException, WebDriverException\nfrom selenium.webdriver.support.ui import WebDriverWait\n\nfrom selenium.webdriver.common.action_chains import ActionChains\n\n\noptions = Options()\noptions.add_argument(\"--headless\")\n\n\n\ndef scrape_page(url):\n    \"\"\"\n\n    :param url: Game pbp url\n\n    :return n pages - each have period info\n    \"\"\"\n    driver = webdriver.Firefox()\n    wait = WebDriverWait(driver, 10)\n\n    driver.get(url)\n    time.sleep(8)\n\n    \"\"\"\n\n    for _ in range(5):\n        driver.execute_script(\"window.scrollTo(0,document.body.scrollHeight)\")\n        time.sleep(.2)\n    #['SO', 'OT', 'OT1', 'OT2', '3', '2', '1']:\n\n\n    for period in ['3', '2', '1']: \n        btn = '//a[@ng-click=\"ctrl.period = \\'{}\\'\"]'.format(period)\n\n        try:\n            wait.until(EC.element_to_be_clickable((By.XPATH, btn)))\n            #driver.find_element_by_xpath(btn).click()\n            btn_elem = driver.find_element_by_xpath(btn)\n            ActionChains(driver).move_to_element(btn_elem).click().perform()\n        except (TimeoutException, ElementNotVisibleException, WebDriverException) as e:\n            print(e)\n\n        ### This just print the last row in the list to see if we are correctly toggling between periods\n        soup = BeautifulSoup(driver.page_source, \"lxml\")\n        plays_table = soup.find(\"table\", {\"class\": \"play-by-play\"})\n    
    plays = plays_table.find_all(\"tr\")\n        print(plays[-1])\n\n    \"\"\"\n\n    pg = driver.page_source\n    driver.close()\n\n    return pg\n\n\n\n\ndef get_pbp(game_id):\n    \"\"\"\n    Get the response for a game (e.g. https://www.nwhl.zone/stats#/100/game/268087/play-by-play)\n    \n    :param game_id: Given Game id (e.g. 268087)\n    \n    :return: \n    \"\"\"\n    file_info = {\n        \"url\": 'https://www.nwhl.zone/stats#/100/game/{}/play-by-play'.format(game_id),\n        \"name\": str(game_id),\n        \"type\": \"nwhl_json_pbp\",\n        \"season\": \"nwhl\",\n        'dir': shared.docs_dir\n    }\n    \n    # Saved pages logic is here bec. of button logic in scrape_pbp\n    if shared.docs_dir and sp.check_file_exists(file_info) and not shared.re_scrape:\n        # TODO: Regex matching game_id\n        pgs = sp.get_page(file_info)\n    else:\n        pgs = scrape_page(file_info['url'])\n\n        # We have to save each individually\n        #for i, pg in enumerate(pgs):\n        i=1\n        file_info['name'] += \"_{}\".format(i)\n        sp.save_page(pgs, file_info)\n\n    return pgs\n\n\n\n\n\n\ndef parse_event(event, score, teams, date, game_id, players):\n    \"\"\"\n    Parses a single event when the info is in a json format\n\n    :param event: json of event \n    :param score: Current score of the game\n    :param teams: Teams dict (id -> name)\n    :param date: date of the game\n    :param game_id: game id for game\n    :param players: Dict of player ids to player names\n    \n    :return: dictionary with the info\n    \"\"\"\n    play = dict()\n\n    \n\n\ndef parse_json(game_json, game_id,):\n    \"\"\"\n    Scrape the json for a game\n    \n    plus, minus players\n\n    :param game_json: raw json\n    :param game_id: game id for game\n\n    :return: Either a DataFrame with info for the game \n    \"\"\"\n    cols = ['game_id', 'date', 'season', 'period', 'seconds_elapsed', 'event', 'ev_team', 'home_team', 'away_team',\n           
 'p1_name', 'p1_id', 'p2_name', 'p2_id', 'p3_name', 'p3_id',\n            \"homePlayer1\", \"homePlayer1_id\", \"homePlayer2\", \"homePlayer2_id\", \"homePlayer3\", \"homePlayer3_id\",\n            \"homePlayer4\", \"homePlayer4_id\", \"homePlayer5\", \"homePlayer5_id\", \"homePlayer6\", \"homePlayer6_id\",\n            \"awayPlayer1\", \"awayPlayer1_id\", \"awayPlayer2\", \"awayPlayer2_id\", \"awayPlayer3\", \"awayPlayer3_id\",\n            \"awayPlayer4\", \"awayPlayer4_id\", \"awayPlayer5\", \"awayPlayer5_id\", \"awayPlayer6\", \"awayPlayer6_id\",\n            'home_goalie', 'home_goalie_id', 'away_goalie', 'away_goalie_id', 'details', 'home_score', 'away_score',\n            'xC', 'yC', 'play_index']\n\n    # B4 anything - if there are no plays we leave\n    if len(game_json['plays']) == 0:\n        shared.print_error(\"The Json pbp for game {} contains no plays and therefore can't be parsed\".format(game_id))\n        return pd.DataFrame()\n\n    # Get all the players in the game\n    players = get_roster(game_json)\n\n    # Initialize & Update as we go along\n    score = {\"home\": 0, \"away\": 0}\n    teams = {\"home\": {\"id\": game_json['game']['home_team'], \"name\": game_json['team_instance'][0]['abbrev']},\n             \"away\": {\"id\": game_json['game']['away_team'], \"name\": game_json['team_instance'][1]['abbrev']}\n             }\n\n    # Get date from UTC timestamp\n    date = game_json['plays'][0]['created_at']\n    date = datetime.datetime.strptime(date[:date.rfind(\"-\")], \"%Y-%m-%dT%H:%M:%S\").strftime(\"%Y-%m-%d\")\n\n    try:\n        events = [parse_event(play, score, teams, date, game_id, players) for play in game_json['plays']]\n    except Exception as e:\n        shared.print_error('Error parsing Json pbp for game {} {}'.format(game_id, e))\n        return pd.DataFrame()\n\n    df = pd.DataFrame(events, columns=cols)\n\n    # Get rid of null events and order by play index\n    df = df[(~pd.isnull(df['event'])) & (df['event'] != 
\"\")]\n    df = df.sort_values(by=['play_index'])\n    df = df.drop(['play_index'], axis=1)\n\n    return df.reset_index(drop=True)\n\n\ndef scrape_pbp(game_id):\n    \"\"\"\n    Scrape the pbp data for a given game\n    \n    :param game_id: Given Game id (e.g. 18507472)\n    \n    :return: DataFrame with pbp info\n    \"\"\"\n    game_json = get_pbp(game_id)\n\n    if not game_json:\n        shared.print_error(\"Pbp for game {} is either not there or can't be obtained\".format(game_id))\n        return pd.DataFrame()\n\n    try:\n        game_df = parse_json(game_json, game_id)\n    except Exception as e:\n        shared.print_error('Error parsing the Pbp for game {} {}'.format(game_id, e))\n        return pd.DataFrame()\n\n    return game_df\n\n"
  },
  {
    "path": "hockey_scraper/nwhl/scrape_functions.py",
    "content": "\"\"\"\nFunctions to scrape by season, games, and date range\n\"\"\"\nimport random\nimport pandas as pd\n\n#from . import html_schedule, json_pbp\nimport hockey_scraper.utils.shared as shared\n\n# All columns for the pbp\ncols = ['game_id', 'date', 'season', 'period', 'seconds_elapsed', 'event', 'ev_team', 'home_team', 'away_team',\n        'p1_name', 'p1_id', 'p2_name', 'p2_id', 'p3_name', 'p3_id',\n        \"homePlayer1\", \"homePlayer1_id\", \"homePlayer2\", \"homePlayer2_id\", \"homePlayer3\", \"homePlayer3_id\",\n        \"homePlayer4\", \"homePlayer4_id\", \"homePlayer5\", \"homePlayer5_id\", \"homePlayer6\", \"homePlayer6_id\",\n        \"awayPlayer1\", \"awayPlayer1_id\", \"awayPlayer2\", \"awayPlayer2_id\", \"awayPlayer3\", \"awayPlayer3_id\",\n        \"awayPlayer4\", \"awayPlayer4_id\", \"awayPlayer5\", \"awayPlayer5_id\", \"awayPlayer6\", \"awayPlayer6_id\",\n        'home_goalie', 'home_goalie_id', 'away_goalie', 'away_goalie_id', 'details', 'home_score', 'away_score',\n        'xC', 'yC']\n\n# Hold any games we didn't scrape for any reason\nbroken_games = []\n\n\ndef print_errors():\n    \"\"\"\n    Print any scraping errors.\n    \n    :return: None\n    \"\"\"\n    global broken_games\n    if broken_games:\n        print('\\nBroken pbp:')\n        for x in broken_games:\n            print(x)\n\n    broken_games = []\n\n\ndef scrape_list_of_games(games):\n    \"\"\"\n    Scrape an arbitrary list of games given the game id's\n    \n    :param games: List of game_id's to scrape\n    \n    :return: DataFrame of pbp info \n    \"\"\"\n    pbp_dfs = []\n    for game in games:\n        print(' '.join(['Scraping NWHL Game ', str(game)]))\n        pbp_df = json_pbp.scrape_pbp(game)\n        if not pbp_df.empty:\n            pbp_dfs.append(pbp_df)\n        else:\n            broken_games.append(game)\n\n    # If not empty...\n    if pbp_dfs:\n        return pd.concat(pbp_dfs, sort=True).reset_index(drop=True)[cols]\n    return None\n\n\ndef 
scrape_games(games, data_format='csv', rescrape=False, docs_dir=None):\n    \"\"\"\n    Scrape a list of games\n\n    :param games: list of game_ids\n    :param data_format: format you want data in - csv or pandas (csv is default)\n    :param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir.\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping\n\n    :return: DataFrame of pbp info if data_format is 'pandas', otherwise None\n    \"\"\"\n    # First check if the inputs are good\n    shared.check_data_format(data_format)\n\n    # Check on the docs_dir and re_scrape\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    pbp_df = scrape_list_of_games(games)\n    print_errors()\n\n    if data_format.lower() == 'csv':\n        shared.to_csv(str(random.randint(1, 101)), pbp_df, None, \"nwhl\")\n    else:\n        return pbp_df\n\n\ndef scrape_date_range(from_date, to_date, data_format='csv', rescrape=False, docs_dir=None):\n    \"\"\"\n    Scrape games in given date range\n\n    :param from_date: date you want to scrape from\n    :param to_date: date you want to scrape to\n    :param data_format: format you want data in - csv or pandas (csv is default)\n    :param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default is False)\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping. 
(default is None)\n\n    :return: DataFrame of pbp info if data_format is 'pandas', otherwise None\n    \"\"\"\n    # First check if the inputs are good\n    shared.check_data_format(data_format)\n    shared.check_valid_dates(from_date, to_date)\n\n    # Check on the docs_dir and re_scrape\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    # Get dates and convert to just a list of game ids\n    games = html_schedule.scrape_dates(from_date, to_date)\n    game_ids = [game['game_id'] for game in games]\n\n    # Scrape all PBP\n    pbp_df = scrape_list_of_games(game_ids)\n\n    # Merge in subtype (skipped when nothing was scraped and pbp_df is None)\n    if pbp_df is not None:\n        pbp_df = pd.merge(pbp_df, pd.DataFrame(games, columns=['game_id', 'sub_type']), on=\"game_id\", how=\"left\")\n\n    print_errors()\n    if data_format.lower() == 'csv':\n        shared.to_csv(from_date + '--' + to_date, pbp_df, None, \"nwhl\")\n    else:\n        return pbp_df\n\n\ndef scrape_seasons(seasons, data_format='csv', rescrape=False, docs_dir=None):\n    \"\"\"\n    Scrape every season in the given list of seasons\n\n    :param seasons: list of seasons\n    :param data_format: format you want data in - csv or pandas (csv is default)\n    :param rescrape: If you want to rescrape pages already scraped. 
Only applies if you supply a docs dir.\n    :param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited \n                     in after scraping\n\n    :return: DataFrame of pbp info for all seasons if data_format is 'pandas', otherwise None\n    \"\"\"\n    # First check if the inputs are good\n    shared.check_data_format(data_format)\n\n    # Check on the docs_dir and re_scrape\n    shared.add_dir(docs_dir)\n    shared.if_rescrape(rescrape)\n\n    # Holds all seasons scraped (if not csv)\n    master_pbps = []\n\n    for season in seasons:\n        games = html_schedule.scrape_seasons(season)\n        game_ids = [game['game_id'] for game in games]\n\n        # Scrape all PBP\n        pbp_df = scrape_list_of_games(game_ids)\n\n        # Merge in subtype (skipped when nothing was scraped and pbp_df is None)\n        if pbp_df is not None:\n            pbp_df = pd.merge(pbp_df, pd.DataFrame(games, columns=['game_id', 'sub_type']), on=\"game_id\", how=\"left\")\n\n        if data_format.lower() == 'csv':\n            shared.to_csv(str(season) + str(season + 1), pbp_df, None, \"nwhl\")\n        else:\n            master_pbps.append(pbp_df)\n\n    print_errors()\n    if data_format.lower() == 'pandas':\n        return pd.concat(master_pbps, sort=True)\n\n"
  },
  {
    "path": "hockey_scraper/nwhl/scrape_schedule.py",
    "content": "\"\"\"\nScrape the schedule info for nwhl games\n\"\"\"\nimport time\nfrom datetime import datetime\nimport re\nfrom bs4 import BeautifulSoup\nimport hockey_scraper.utils.shared as shared\nimport hockey_scraper.utils.save_pages as sp\n\nfrom selenium import webdriver\nfrom selenium.webdriver.chrome.options import Options\n\noptions = Options()\noptions.add_argument(\"--headless\")\n\n\n\ndef scrape_dynamic(url):\n    \"\"\"\n    Dynamically scrape a given url and scroll down.\n\n    :param url: Page to get\n\n    :return source page\n    \"\"\"\n    browser = webdriver.Chrome(chrome_options=options)\n    browser.get(url)\n    time.sleep(5)\n\n    # Scroll down to get all the games - Do it a few times to make sure\n    for _ in range(5):\n        browser.execute_script(\"window.scrollTo(0,document.body.scrollHeight)\")\n        time.sleep(.2)\n\n    pg = browser.page_source\n    browser.close()\n\n    return pg\n\n\ndef get_schedule(url, name):\n    \"\"\"\n    Given a url it returns the raw html\n\n    :param url: url for page\n    :param name: Name for saved file\n\n    :return: raw html of game\n    \"\"\"\n    file_info = {\n        \"url\": url,\n        \"name\": str(name) + \"_schedule\",\n        \"type\": \"html_schedule_nwhl\",\n        \"season\": \"nwhl\",\n        'dir': shared.docs_dir\n    }\n\n    # Done manually due to custom scraping logic\n    if shared.docs_dir and sp.check_file_exists(file_info) and not shared.re_scrape:\n        pg =  sp.get_page(file_info)\n    else:\n        pg = scrape_dynamic(file_info['url'])\n        sp.save_page(pg, file_info)\n\n    return pg\n\n\ndef get_season_codes():\n    \"\"\"\n    They use fucked up codes instead of actual years to represent seasons in the url.\n\n    e.g. 
For 2019 - https://www.nwhl.zone/stats#/100/schedule?all&season_id=1950\n\n    Instead of hardcoding it I just ping the base page and get the codes\n\n    :return: dict - season -> season_code\n    \"\"\"\n    seed_url = 'https://www.nwhl.zone/stats#/100/schedule'\n\n    pg = get_schedule(seed_url, \"seed\")\n    soup = BeautifulSoup(pg, \"lxml\")\n\n    # This grabs the options in the season dropdown\n    filters = soup.find_all(\"div\", {\"class\": \"filters d\"})\n    selects = filters[0].find_all(\"select\")\n    options = selects[0].find_all(\"option\")\n\n    # Parses out the season and associated code for each\n    season_codes = {}\n    for o in options:\n        season_codes[o['label'][:4]] = o['value'][o['value'].find(\":\")+1:]\n\n    return season_codes\n\n\ndef parse_game(game, season):\n    \"\"\"\n    Given a soup object for a given game parse out the info.\n\n    Skip over all-star game\n\n    :param game: Soup object\n    :param season: nwhl season\n\n    :return: dict of info\n    \"\"\"\n    parsed_game = {}\n\n    # Team info\n    teams = game.find_all(\"span\", {\"class\": \"team-inline\"})\n\n    if \"All-Star\" in teams[0].text:\n        return parsed_game\n\n    parsed_game['away_team'] = teams[0].find(\"span\").text\n    parsed_game['home_team'] = teams[1].find(\"span\").text\n\n    # Scores\n    # If we don't pick anything up the game hasn't occurred yet\n    scores = game.find(\"td\", {\"class\": \"center\"})\n    team_scores = scores.find_all(\"span\", {\"class\": re.compile(\"ng-binding.*\")})\n    parsed_game['away_score'] = None if not team_scores else team_scores[0].text\n    parsed_game['home_score'] = None if not team_scores else team_scores[1].text\n\n    # Date & Time\n    dt_time = game.find_all(\"td\", {\"class\": \"center ng-binding\"})\n    date = dt_time[0].text\n    parsed_game['game_time'] = dt_time[1].text\n\n    # Date is structured as 'Sat Mar 5'\n    # So we pull month and day and infer year from the month and season 
(use July as cutoff)\n    num_month = time.strptime(date[4:7], '%b').tm_mon\n    year = season if num_month >= 7 else season + 1\n    parsed_game['date'] = f\"{year}-{num_month}-{date[8:]}\"\n\n    # Game id - link is structured like '#/100/game/274708'\n    links = game.find(\"a\")\n    parsed_game['game_id'] = re.findall(r\".*/(\\d+)$\", links['href'])[0]\n\n    return parsed_game\n\n\ndef get_season_games(season, season_code):\n    \"\"\"\n    For a given season get the schedule page and parse out the info for each game\n\n    :param season: Season we are scraping\n    :param season_code: season_id code for url param\n\n    :return: list of dicts with game info\n    \"\"\"\n    parsed_games = []\n\n    url = \"https://www.nwhl.zone/stats#/100/schedule?season_id={}&all\".format(season_code)\n    pg = get_schedule(url, season)\n    soup = BeautifulSoup(pg, \"lxml\")\n\n    # Each <tr> is a game\n    sched = soup.find_all(\"table\", {\"class\": \"schedule\"})[0]\n    games = sched.find_all(\"tr\", {\"class\": re.compile(\"^ng-scope\")})\n\n    for game in games:\n        g = parse_game(game, season)\n        if g:\n            parsed_games.append(g)\n\n    return parsed_games\n\n\ndef scrape_dates(from_date, to_date):\n    \"\"\"\n    Get all the games between two dates. 
We scrape the schedule for each season in the \n    range and then pick out the correct ones by date.\n    \n    :param from_date: Date to scrape from\n    :param to_date: Date to scrape to\n    \n    :return: List of all games\n    \"\"\"\n    games = []\n\n    season_codes = get_season_codes()\n    first_season = shared.get_season(from_date)\n    last_season = shared.get_season(to_date)\n\n    # Convert to datetime to easily compare to game dates\n    from_datetime = datetime.strptime(from_date, \"%Y-%m-%d\")\n    to_datetime = datetime.strptime(to_date, \"%Y-%m-%d\")\n\n    for season in range(first_season, last_season + 1):\n        for game in get_season_games(season, season_codes[str(season)]):\n\n            game_date = datetime.strptime(game['date'], \"%Y-%m-%d\")\n            if from_datetime <= game_date <= to_datetime:\n                games.append(game)\n\n    return games\n\n\ndef scrape_season(season):\n    \"\"\"\n    Scrape the games for a given season\n\n    :param season: e.g. 2017\n\n    :return: list of dicts with game info\n    \"\"\"\n    season_codes = get_season_codes()\n    return get_season_games(season, season_codes[str(season)])\n\n\n\n"
  },
  {
    "path": "hockey_scraper/utils/__init__.py",
    "content": "from .merge_pbp_shifts import merge"
  },
  {
    "path": "hockey_scraper/utils/config.py",
    "content": "\"\"\"\nBasic configurations\n\"\"\"\n\n# Directory where to save pages\n# When True assumes ~/hockey_scraper_data\n# Otherwise can take str to `existing` directory\nDOCS_DIR = False\n\n# Boolean that tells us whether or not we should re-scrape a given page if it's already saved\nRESCRAPE = False\n\n# Whether to log verbose errors to log file\nLOG = False"
  },
  {
    "path": "hockey_scraper/utils/merge_pbp_shifts.py",
    "content": "import pandas as pd\n\n\ndef label_priority(row):\n    \"\"\"\n    Priority for sorting\n    \n    Courtesy of Matt Barlowe (pre-NHL days)\n    \n    :param row: given event\n    \n    :return: given priority for that event\n    \"\"\"\n    if row.Event in ['TAKE', 'GIVE', 'MISS', 'HIT', 'SHOT', 'BLOCK']:\n        return 1\n    elif row.Event == \"GOAL\":\n        return 2\n    elif row.Event == \"STOP\":\n        return 3\n    elif row.Event == \"PENL\":\n        return 4\n    elif row.Event == \"OFF\":\n        return 5\n    elif row.Event == 'ON':\n        return 6\n    elif row.Event == 'FAC':\n        return 7\n    elif row.Event == \"PEND\":\n        return 8\n    else:\n        return 0\n\n\ndef group_shifts_cols(shifts, type_group_cols):\n    \"\"\"\n    Group into columns for players by some column subset\n\n    :param shifts: DataFrame of shifts\n    :param type_group_cols: Some columns -> Either for On or Off\n\n    :return: Grouped DataFrame\n    \"\"\"\n    # Group both by player and player id get a new columns with a list of the group\n    # The \"Player\" and \"Player_Id\" column contain a list of the grouped up players/player_ids\n    grouped_df_player = shifts.groupby(by=type_group_cols, as_index=False)['Player'].apply(list).reset_index()\n    grouped_df_playerid = shifts.groupby(by=type_group_cols, as_index=False)['Player_Id'].apply(list).reset_index()\n\n    # Rename from nothing to something\n    grouped_df_player = grouped_df_player.rename(index=str, columns={0: 'player'})\n    grouped_df_playerid = grouped_df_playerid.rename(index=str, columns={0: 'player_Id'})\n\n    # Player and Player Id are done separately above bec. 
they wouldn't work together\n    # So just did both and slid over the relevant columns here\n    grouped_df_player['player_Id'] = grouped_df_playerid['player_Id']\n\n    # Rename either Start or End to Seconds Elapsed\n    grouped_df_player = grouped_df_player.rename(index=str, columns={type_group_cols[-1]: \"Seconds_Elapsed\"})\n    grouped_df_player['Event'] = 'ON' if type_group_cols[-1] == \"Start\" else \"OFF\"\n\n    return grouped_df_player\n\n\ndef group_shifts_type(shifts, player_cols, player_id_cols):\n    \"\"\"\n    Groups rows by players getting \"On\" and players getting \"Off\"\n\n    :param shifts: Shifts_df\n    :param player_cols: Columns for players (see previous function)\n    :param player_id_cols: Columns for player ids (see previous function)\n\n    :return: Shifts DataFrame grouped by players on and off every second\n    \"\"\"\n    # To subset for On and Off shifts\n    group_cols_start = ['Game_Id', 'Period', 'Team', 'Home_Team', 'Away_Team', 'Date', 'Start']\n    group_cols_end = ['Game_Id', 'Period', 'Team', 'Home_Team', 'Away_Team', 'Date', 'End']\n\n    # Group by each of the two column lists above and then combine the two\n    # Now have rows for On and rows for Off\n    grouped_df_on = group_shifts_cols(shifts, group_cols_start)\n    grouped_df_off = group_shifts_cols(shifts, group_cols_end)\n    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0\n    grouped_df = pd.concat([grouped_df_on, grouped_df_off])\n\n    # Convert the columns which contain a list to the appropriate columns for both player and player_id\n    players = pd.DataFrame(grouped_df.player.values.tolist(), index=grouped_df.index).rename(\n        columns=lambda x: 'Player{}'.format(x + 1))\n    player_ids = pd.DataFrame(grouped_df.player_Id.values.tolist(), index=grouped_df.index).rename(\n        columns=lambda x: 'Player{}_id'.format(x + 1))\n\n    # There are sometimes more than 6 players coming on at a time...it's not my problem (it's rare enough)\n    grouped_df[player_cols] = players[['Player1', 'Player2', 'Player3', 
'Player4', 'Player5', 'Player6']]\n    grouped_df[player_id_cols] = player_ids[['Player1_id', 'Player2_id', 'Player3_id', 'Player4_id', 'Player5_id', 'Player6_id']]\n\n    # Not needed anymore since we converted to new columns\n    grouped_df = grouped_df.drop(['player', 'player_Id'], axis=1)\n\n    return grouped_df.reset_index(drop=True)\n\n\ndef group_shifts(games_df, shifts):\n    \"\"\"\n    As of now the shifts are 1 player per row. This groups by team by type (on/off) by second. So at the beginning of \n    the game we'll have one row with 6 players coming on for the home team and the same row for the away team.\n\n    :param games_df: DataFrame containing Game_Id, Home_Team, and Away_Team -> Shifts_df doesn't contains home/away\n    :param shifts: DataFrame of Shifts\n\n    :return: Grouped Shifts DataFrame\n    \"\"\"\n    # Up to 6 players on and off any time\n    player_cols = [''.join(['Player', str(num)]) for num in range(1, 7)]\n    player_id_cols = [''.join(['Player', str(num), '_id']) for num in range(1, 7)]\n\n    # Merge in Home/Away Teams\n    shifts = pd.merge(shifts, games_df, on=['Game_Id'])\n\n    # Groups into on and off shift rows\n    grouped_df = group_shifts_type(shifts, player_cols, player_id_cols)\n\n    # Separate home and away for the purpose of the player columns (read below for more info)\n    grouped_df_home = grouped_df[grouped_df.Team == grouped_df.Home_Team]\n    grouped_df_away = grouped_df[grouped_df.Team == grouped_df.Away_Team]\n\n    # Rename Players columns into both home and away\n    # As on now it's player1, player1_id...etc.\n    # To merge into the pbp we need to append home and away for the appropriate players\n    # So we separate them and rename them with a \"home\" for the home teams and \"away\" for away teams\n    grouped_df_home = grouped_df_home.rename(index=str, columns={col: 'home' + col for col in player_cols})\n    grouped_df_home = grouped_df_home.rename(index=str, columns={col: 'home' + col for col in 
player_id_cols})\n    grouped_df_away = grouped_df_away.rename(index=str, columns={col: 'away' + col for col in player_cols})\n    grouped_df_away = grouped_df_away.rename(index=str, columns={col: 'away' + col for col in player_id_cols})\n\n    # Group home/away shifts at same time on the same line\n    df = pd.merge(grouped_df_home, grouped_df_away, on=['Game_Id', 'Period', 'Date', 'Event', 'Seconds_Elapsed'], \n                  how=\"outer\", sort=True)\n\n    df = df.rename(index=str, columns={\"Home_Team_x\": \"Home_Team\", \"Away_Team_x\": \"Away_Team\"})\n\n    return df.reset_index(drop=True)\n\n\ndef merge(pbp_df, shifts_df):\n    \"\"\"\n    Merge the shifts_df into the pbp_df.\n\n    :param pbp_df: Play by Play DataFrame\n    :param shifts_df: Shift Tables DataFrame\n\n    :return: Play by Play DataFrame with shift info embedded\n    \"\"\"\n    # To get the final pbp columns in the \"correct\" order\n    pbp_columns = pbp_df.columns\n\n    shifts_df['Player_Id'] = shifts_df['Player_Id'].astype(int)\n\n    # Get unique game_id -> teams pair for placing in Shifts_df\n    pbp_unique = pbp_df.drop_duplicates(subset=['Game_Id', 'Home_Team', 'Away_Team'])[['Game_Id', 'Home_Team', 'Away_Team']]\n\n    # Group up shifts that start/end at the same time\n    new_shifts = group_shifts(pbp_unique, shifts_df)\n    new_shifts = new_shifts.where((pd.notnull(new_shifts)), None)\n\n    # Add in & order rows (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)\n    new_pbp = pd.concat([pbp_df, new_shifts]).reset_index(drop=True)\n    new_pbp['Priority'] = new_pbp.apply(label_priority, axis=1)\n    new_pbp = new_pbp.sort_values(by=['Game_Id', 'Period', 'Seconds_Elapsed', 'Priority'])\n\n    return new_pbp[pbp_columns]\n"
  },
  {
    "path": "hockey_scraper/utils/player_name_fixes.json",
    "content": "{\n    \"_description\": \"Fixes some of the mistakes made with player names (converts to 'correct' name)\",\n    \"_comment\": \"A majority of this is courtesy of Muneeb Alam (https://github.com/muneebalam/Hockey/blob/master/NHL/Core/GetPbP.py)\",\n\n    \"fixes\": {\n        \"n/a\": \"n/a\",\n        \"ALEXANDER OVECHKIN\": \"Alex Ovechkin\",\n        \"TOBY ENSTROM\": \"Tobias Enstrom\",\n        \"JAMIE MCGINN\": \"Jamie McGinn\",\n        \"CODY MCLEOD\": \"Cody McLeod\",\n        \"MARC-EDOUARD VLASIC\": \"Marc-Edouard Vlasic\",\n        \"RYAN MCDONAGH\": \"Ryan McDonagh\",\n        \"CHRIS TANEV\": \"Christopher Tanev\",\n        \"JARED MCCANN\": \"Jared McCann\",\n        \"P.K. SUBBAN\": \"PK Subban\",\n        \"DEVANTE SMITH-PELLY\": \"Devante Smith-Pelly\",\n        \"MIKE MCKENNA\": \"Mike McKenna\",\n        \"MICHAEL MCCARRON\": \"Michael McCarron\",\n        \"T.J. BRENNAN\": \"TJ Brennan\",\n        \"BRAYDEN MCNABB\": \"Brayden McNabb\",\n        \"PIERRE-ALEXANDRE PARENTEAU\": \"PA Parenteau\",\n        \"JAMES VAN RIEMSDYK\": \"James van Riemsdyk\",\n        \"OLIVER EKMAN-LARSSON\": \"Oliver Ekman-Larsson\",\n        \"TJ OSHIE\": \"TJ Oshie\",\n        \"J P DUMONT\": \"JP Dumont\",\n        \"J.T. MILLER\": \"JT Miller\",\n        \"R.J UMBERGER\": \"RJ Umberger\",\n        \"PA PARENTEAU\": \"PA Parenteau\",\n        \"PER-JOHAN AXELSSON\": \"P.J. 
Axelsson\",\n        \"MAXIME TALBOT\": \"Max Talbot\",\n        \"JOHN-MICHAEL LILES\": \"John-Michael Liles\",\n        \"DANIEL GIRARDI\": \"Dan Girardi\",\n        \"DANIEL CLEARY\": \"Dan Cleary\",\n        \"NIKLAS KRONVALL\": \"Niklas Kronwall\",\n        \"SIARHEI KASTSITSYN\": \"Sergei Kostitsyn\",\n        \"ANDREI KASTSITSYN\": \"Andrei Kostitsyn\",\n        \"ALEXEI KOVALEV\": \"Alex Kovalev\",\n        \"DAVID JOHNNY ODUYA\": \"Johnny Oduya\",\n        \"EDWARD PURCELL\": \"Teddy Purcell\",\n        \"NICKLAS GROSSMAN\": \"Nicklas Grossmann\",\n        \"PERNELL KARL SUBBAN\": \"PK Subban\",\n        \"VOJTEK VOLSKI\": \"Wojtek Wolski\",\n        \"VYACHESLAV VOYNOV\": \"Slava Voynov\",\n        \"FREDDY MODIN\": \"Fredrik Modin\",\n        \"VACLAV PROSPAL\": \"Vinny Prospal\",\n        \"KRISTOPHER LETANG\": \"Kris Letang\",\n        \"PIERRE ALEXANDRE PARENTEAU\": \"PA Parenteau\",\n        \"T.J. OSHIE\": \"TJ Oshie\",\n        \"JOHN HILLEN III\": \"Jack Hillen\",\n        \"BRANDON CROMBEEN\": \"B.J. 
Crombeen\",\n        \"JEAN-PIERRE DUMONT\": \"JP Dumont\",\n        \"RYAN NUGENT-HOPKINS\": \"Ryan Nugent-Hopkins\",\n        \"CONNOR MCDAVID\": \"Connor McDavid\",\n        \"TREVOR VAN RIEMSDYK\": \"Trevor van Riemsdyk\",\n        \"CALVIN DE HAAN\": \"Calvin de Haan\",\n        \"GREG MCKEGG\": \"Greg McKegg\",\n        \"NATHAN MACKINNON\": \"Nathan MacKinnon\",\n        \"KYLE MCLAREN\": \"Kyle McLaren\",\n        \"ADAM MCQUAID\": \"Adam McQuaid\",\n        \"DYLAN MCILRATH\": \"Dylan McIlrath\",\n        \"DANNY DEKEYSER\": \"Danny DeKeyser\",\n        \"JAKE MCCABE\": \"Jake McCabe\",\n        \"JAMIE MCBAIN\": \"Jamie McBain\",\n        \"PIERRE-MARC BOUCHARD\": \"Pierre-Marc Bouchard\",\n        \"JEAN-FRANCOIS JACQUES\": \"JF Jacques\",\n        \"OLE-KRISTIAN TOLLEFSEN\": \"Ole-Kristian Tollefsen\",\n        \"MARC-ANDRE BERGERON\": \"Marc-Andre Bergeron\",\n        \"MARC-ANTOINE POULIOT\": \"Marc-Antoine Pouliot\",\n        \"MARC-ANDRE GRAGNANI\": \"Marc-Andre Gragnani\",\n        \"JORDAN LAVALLEE-SMOTHERMAN\": \"Jordan Lavallee-Smotherman\",\n        \"PIERRE-LUC LETOURNEAU-LEBLOND\": \"Pierre Leblond\",\n        \"J-F JACQUES\": \"JF Jacques\",\n        \"JP DUMONT\": \"JP Dumont\",\n        \"MARC-ANDRE CLICHE\": \"Marc-Andre Cliche\",\n        \"J-P DUMONT\": \"JP Dumont\",\n        \"JOSHUA BAILEY\": \"Josh Bailey\",\n        \"OLIVIER MAGNAN-GRENIER\": \"Olivier Magnan-Grenier\",\n        \"FRÉDÉRIC ST-DENIS\": \"Frederic St-Denis\",\n        \"MARC-ANDRE BOURDON\": \"Marc-Andre Bourdon\",\n        \"PIERRE-CEDRIC LABRIE\": \"Pierre-Cedric Labrie\",\n        \"JONATHAN AUDY-MARCHESSAULT\": \"Jonathan Marchessault\",\n        \"JEAN-GABRIEL PAGEAU\": \"Jean-Gabriel Pageau\",\n        \"JEAN-PHILIPPE COTE\": \"Jean-Philippe Cote\",\n        \"PIERRE-EDOUARD BELLEMARE\": \"Pierre-Edouard Bellemare\",\n        \"COLIN (JOHN) WHITE\": \"Colin White\",\n        \"BATES (JON) BATTAGLIA\": \"Bates Battaglia\",\n        \"MATHEW DUBMA\": \"Matt 
Dumba\",\n        \"NIKOLAI ANTROPOV\": \"Nik Antropov\",\n        \"KRYS BARCH\": \"Krystofer Barch\",\n        \"CAMERON BARKER\": \"Cam Barker\",\n        \"NICKLAS BERGFORS\": \"Niclas Bergfors\",\n        \"ROBERT BLAKE\": \"Rob Blake\",\n        \"MICHAEL BLUNDEN\": \"Mike Blunden\",\n        \"CHRISTOPHER BOURQUE\": \"Chris Bourque\",\n        \"MICHÃ«L BOURNIVAL\": \"Michael Bournival\",\n        \"NICHOLAS BOYNTON\": \"Nick Boynton\",\n        \"TJ BRENNAN\": \"TJ Brennan\",\n        \"DANIEL BRIERE\": \"Danny Briere\",\n        \"TJ BRODIE\": \"TJ Brodie\",\n        \"J.T. BROWN\": \"JT Brown\",\n        \"ALEXANDRE BURROWS\": \"Alex Burrows\",\n        \"MICHAEL CAMMALLERI\": \"Mike Cammalleri\",\n        \"DANIEL CARCILLO\": \"Dan Carcillo\",\n        \"MATTHEW CARLE\": \"Matt Carle\",\n        \"DANNY CLEARY\": \"Dan Cleary\",\n        \"JOSEPH CORVO\": \"Joe Corvo\",\n        \"JOSEPH CRABB\": \"Joey Crabb\",\n        \"BJ CROMBEEN\": \"B.J. Crombeen\",\n        \"EVGENII DADONOV\": \"Evgeny Dadonov\",\n        \"CHRIS VANDE VELDE\": \"Chris VandeVelde\",\n        \"JACOB DE LA ROSE\": \"Jacob de la Rose\",\n        \"JOE DIPENTA\": \"Joe DiPenta\",\n        \"JON DISALVATORE\": \"Jon DiSalvatore\",\n        \"JACOB DOWELL\": \"Jake Dowell\",\n        \"NICHOLAS DRAZENOVIC\": \"Nick Drazenovic\",\n        \"ROBERT EARL\": \"Robbie Earl\",\n        \"ALEXANDER FROLOV\": \"Alex Frolov\",\n        \"T.J. GALIARDI\": \"TJ Galiardi\",\n        \"TJ GALIARDI\": \"TJ Galiardi\",\n        \"ANDREW GREENE\": \"Andy Greene\",\n        \"MICHAEL GRIER\": \"Mike Grier\",\n        \"NATHAN GUENIN\": \"Nate Guenin\",\n        \"MARTY HAVLAT\": \"Martin Havlat\",\n        \"JOSHUA HENNESSY\": \"Josh Hennessy\",\n        \"T.J. HENSICK\": \"TJ Hensick\",\n        \"TJ Hensick\": \"TJ Hensick\",\n        \"CHRISTOPHER HIGGINS\": \"Chris Higgins\",\n        \"ROBERT HOLIK\": \"Bobby Holik\",\n        \"MATTHEW IRWIN\": \"Matt Irwin\",\n        \"P. J. 
AXELSSON\": \"P.J. Axelsson\",\n        \"PER JOHAN AXELSSON\": \"P.J. Axelsson\",\n        \"JONATHON KALINSKI\": \"Jon Kalinski\",\n        \"ALEXANDER KHOKHLACHEV\": \"Alex Khokhlachev\",\n        \"DJ KING\": \"DJ King\",\n        \"DWAYNE KING\": \"DJ King\",\n        \"MICHAEL KNUBLE\": \"Mike Knuble\",\n        \"KRYSTOFER KOLANOS\": \"Krys Kolanos\",\n        \"MICHAEL KOMISAREK\": \"Mike Komisarek\",\n        \"STAFFAN KRONVALL\": \"Staffan Kronwall\",\n        \"NIKOLAY KULEMIN\": \"Nikolai Kulemin\",\n        \"CLARKE MACARTHUR\": \"Clarke MacArthur\",\n        \"LANE MACDERMID\": \"Lane MacDermid\",\n        \"ANDREW MACDONALD\": \"Andrew MacDonald\",\n        \"RAYMOND MACIAS\": \"Ray Macias\",\n        \"CRAIG MACDONALD\": \"Craig MacDonald\",\n        \"STEVE MACINTYRE\": \"Steve MacIntyre\",\n        \"MAKSIM MAYOROV\": \"Maxim Mayorov\",\n        \"AARON MACKENZIE\": \"Aaron MacKenzie\",\n        \"DEREK MACKENZIE\": \"Derek MacKenzie\",\n        \"RODNEY PELLEY\": \"Rod Pelley\",\n        \"BRETT MACLEAN\": \"Brett MacLean\",\n        \"ANDREW MACWILLIAM\": \"Andrew MacWilliam\",\n        \"BRYAN MCCABE\": \"Bryan McCabe\",\n        \"OLIVIER MAGNAN\": \"Olivier Magnan-Grenier\",\n        \"DEAN MCAMMOND\": \"Dean McAmmond\",\n        \"KENNDAL MCARDLE\": \"Kenndal McArdle\",\n        \"ANDY MCDONALD\": \"Andy McDonald\",\n        \"COLIN MCDONALD\": \"Colin McDonald\",\n        \"JOHN MCCARTHY\": \"John McCarthy\",\n        \"STEVE MCCARTHY\": \"Steve McCarthy\",\n        \"DARREN MCCARTY\": \"Darren McCarty\",\n        \"JAY MCCLEMENT\": \"Jay McClement\",\n        \"CODY MCCORMICK\": \"Cody McCormick\",\n        \"MAX MCCORMICK\": \"Max McCormick\",\n        \"BROCK MCGINN\": \"Brock McGinn\",\n        \"TYE MCGINN\": \"Tye McGinn\",\n        \"BRIAN MCGRATTAN\": \"Brian McGrattan\",\n        \"DAVID MCINTYRE\": \"David McIntyre\",\n        \"NATHAN MCIVER\": \"Nathan McIver\",\n        \"JAY MCKEE\": \"Jay McKee\",\n        \"CURTIS 
MCKENZIE\": \"Curtis McKenzie\",\n        \"FRAZER MCLAREN\": \"Frazer McLaren\",\n        \"BRETT MCLEAN\": \"Brett McLean\",\n        \"BRANDON MCMILLAN\": \"Brandon McMillan\",\n        \"CARSON MCMILLAN\": \"Carson McMillan\",\n        \"PHILIP MCRAE\": \"Philip McRae\",\n        \"FREDERICK MEYER IV\": \"Freddy Meyer\",\n        \"MICHAEL MODANO\": \"Mike Modano\",\n        \"CHRISTOPHER NEIL\": \"Chris Neil\",\n        \"MATTHEW NIETO\": \"Matt Nieto\",\n        \"JOHN ODUYA\": \"Johnny Oduya\",\n        \"PIERRE PARENTEAU\": \"PA Parenteau\",\n        \"MARC POULIOT\": \"Marc-Antoine Pouliot\",\n        \"MAXWELL REINHART\": \"Max Reinhart\",\n        \"MICHAEL RUPP\": \"Mike Rupp\",\n        \"ROBERT SCUDERI\": \"Rob Scuderi\",\n        \"TOMMY SESTITO\": \"Tom Sestito\",\n        \"MICHAEL SILLINGER\": \"Mike Sillinger\",\n        \"JONATHAN SIM\": \"Jon Sim\",\n        \"MARTIN ST LOUIS\": \"Martin St. Louis\",\n        \"MATTHEW STAJAN\": \"Matt Stajan\",\n        \"ZACHERY STORTINI\": \"Zack Stortini\",\n        \"PK SUBBAN\": \"PK Subban\",\n        \"WILLIAM THOMAS\": \"Bill Thomas\",\n        \"R.J. UMBERGER\": \"RJ Umberger\",\n        \"RJ UMBERGER\": \"RJ Umberger\",\n        \"MARK VAN GUILDER\": \"Mark van Guilder\",\n        \"BRYCE VAN BRABANT\": \"Bryce van Brabant\",\n        \"DAVID VAN DER GULIK\": \"David van der Gulik\",\n        \"MIKE VAN RYN\": \"Mike van Ryn\",\n        \"ANDREW WOZNIEWSKI\": \"Andy Wozniewski\",\n        \"JAMES WYMAN\": \"JT Wyman\",\n        \"JT WYMAN\": \"JT Wyman\",\n        \"NIKOLAY ZHERDEV\": \"Nikolai Zherdev\",\n        \"HARRISON ZOLNIERCZYK\": \"Harry Zolnierczyk\",\n        \"MARTIN ST PIERRE\": \"Martin St. Pierre\",\n        \"B.J CROMBEEN\": \"B.J. Crombeen\",\n        \"DENIS GAUTHIER JR.\": \"DENIS GAUTHIER\",\n        \"DENIS JR. 
GAUTHIER\": \"DENIS GAUTHIER\",\n        \"MARC-ANDRE FLEURY\": \"Marc-Andre Fleury\",\n        \"DAN LACOUTURE\": \"Dan LaCouture\",\n        \"RICK DIPIETRO\": \"Rick DiPietro\",\n        \"JOEY MACDONALD\": \"Joey MacDonald\",\n        \"TIMOTHY JR. THOMAS\": \"Tim Thomas\",\n        \"ILJA BRYZGALOV\": \"Ilya Bryzgalov\",\n        \"MATHEW DUMBA\": \"Matt Dumba\",\n        \"MICHAËL BOURNIVAL\": \"Michael Bournival\",\n        \"MATTHEW BENNING\": \"Matt Benning\",\n        \"ZACHARY SANFORD\": \"Zach Sanford\",\n        \"AJ GREER\": \"A.J. Greer\",\n        \"JT COMPHER\": \"J.T. Compher\",\n        \"NICOLAS PETAN\": \"Nic Petan\",\n        \"VINCENT HINOSTROZA\": \"Vinnie Hinostroza\",\n        \"PHILIP VARONE\": \"Phil Varone\",\n        \"JOSHUA MORRISSEY\": \"Josh Morrissey\",\n        \"Mathew Bodie\": \"Mat Bodie\",\n        \"MICHAEL FERLAND\": \"Micheal Ferland\",\n        \"MICHAEL SANTORELLI\": \"Mike Santorelli\",\n        \"CHRISTOPHER BREEN\": \"Chris Breen\",\n        \"BRYCE VAN BRABRANT\": \"Bryce Van Brabant\",\n        \"ALEXANDER KILLORN\": \"Alex Killorn\",\n        \"JOSEPH MORROW\": \"Joe Morrow\",\n        \"ALEX STEEN\": \"Alexander Steen\",\n        \"BRADLEY MILLS\": \"Brad Mills\",\n        \"MICHAEL SISLO\": \"Mike Sislo\",\n        \"MICHAEL VERNACE\": \"Mike Vernace\",\n        \"STEVEN REINPRECHT\": \"Steve Reinprecht\",\n        \"MATTHEW MURRAY\": \"Matt Murray\",\n        \"THOMAS MCCOLLUM\": \"TOM MCCOLLUM\",\n        \"MICHAEL MATHESON\": \"MIKE MATHESON\",\n        \"BOO NIEVES\": \"CRISTOVAL NIEVES\",\n        \"J.F. 
BERUBE\": \"JEAN-FRANCOIS BERUBE\",\n        \"TONY DEANGELO\": \"ANTHONY DEANGELO\",\n        \"JEFFREY HAMILTON\": \"JEFF HAMILTON\",\n        \"JAMES VANDERMEER\": \"JIM VANDERMEER\",\n        \"MICHAEL YORK\": \"MIKE YORK\",\n        \"EMMANUEL LEGACE\": \"MANNY LEGACE\",\n        \"JAMES DOWD\": \"JIM DOWD\",\n        \"ANDREW MILLER\": \"DREW MILLER\",\n        \"JOHN PEVERLEY\": \"RICH PEVERLEY\",\n        \"ILJA ZUBOV\": \"ILYA ZUBOV\",\n        \"CHRISTOPHER MINARD\": \"CHRIS MINARD\",\n        \"BENJAMIN ONDRUS\": \"BEN ONDRUS\",\n        \"ZACH FITZGERALD\": \"ZACK FITZGERALD\",\n        \"STEPHEN VALIQUETTE\": \"STEVE VALIQUETTE\",\n        \"OLAF KOLZIG\": \"OLIE KOLZIG\",\n        \"J-SEBASTIEN AUBIN\": \"JEAN-SEBASTIEN AUBIN\",\n        \"ALEXANDER AULD\": \"ALEX AULD\",\n        \"JAMES HOWARD\": \"JIMMY HOWARD\",\n        \"JEFF DROUIN-DESLAURIERS\": \"JEFF DESLAURIERS\",\n        \"SIMEON VARLAMOV\": \"SEMYON VARLAMOV\",\n        \"ALEXANDER PECHURSKI\": \"Alexander Pechurskiy\",\n        \"JEFFREY PENNER\": \"JEFF PENNER\",\n        \"EMMANUEL FERNANDEZ\": \"Manny FERNANDEZ\",\n        \"ALEXANDER PETROVIC\": \"ALEX PETROVIC\",\n        \"ZACHARY ASTON-REESE\": \"ZACH ASTON-REESE\",\n        \"J-F BERUBE\": \"JEAN-FRANCOIS BERUBE\",\n        \"DANNY O'REGAN\": \"DANIEL O'REGAN\",\n        \"PATRICK MAROON\": \"PAT MAROON\",\n        \"LEE  STEMPNIAK\": \"LEE STEMPNIAK\",\n        \"JAMES REIMER ,\": \"JAMES REIMER\",\n        \"CALVIN PETERSEN ,\": \"CALVIN PETERSEN\",\n        \"CAL PETERSEN\": \"CALVIN PETERSEN\",\n        \"ALEXANDER NYLANDER\": \"ALEX NYLANDER\",\n        \"CHRISTOPHER WAGNER\": \"CHRIS WAGNER\",\n        \"EGOR SHARANGOVICH\": \"Yegor Sharangovich\",\n        \"ALEXIS LAFRENIERE\": \"Alexis Lafrenière\",\n        \"ALEXIS LAFRENI?RE\": \"Alexis Lafrenière\",\n        \"CALLAN FOOTE\": \"Cal Foote\",\n        \"MATTIAS JANMARK-NYLEN\": \"Mattias Janmark\",\n        \"JOSHUA DUNNE\": \"Josh Dunne\",\n        \"TIM STUTZLE\": 
\"Tim Stützle\",\n        \"WILLIAM BORGEN\": \"Will Borgen\",\n        \"CHASE DE LEO\": \"Chase DeLeo\",\n        \"JANIS MOSER\": \"J.J. Moser\"\n    }\n}"
  },
  {
    "path": "hockey_scraper/utils/save_pages.py",
    "content": "\"\"\"\nSaves the scraped docs so you don't have to re-scrape them every time you want to parse the docs. \n\n\\**** Don't mess with this unless you know what you're doing \\****\n\"\"\"\nimport os\nimport gzip\n\n\ndef create_base_file_path(file_info):\n    \"\"\"\n    Creates the base file path for a given file\n    \n    :param file_info: Dictionary containing the info on the file. Includes the name, season, file type, and the dir\n                      we want to deposit any data in.\n    \n    :return: path \n    \"\"\"\n    # Shitty fix for when you already have it saved but don't have nwhl folders\n    if 'nwhl' in file_info['type']:\n        if not os.path.isdir(os.path.join(file_info['dir'], 'docs', str(file_info['season']), file_info['type'])):\n            os.mkdir(os.path.join(file_info['dir'], 'docs', str(file_info['season']), file_info['type']))\n\n    return os.path.join(file_info['dir'], 'docs', str(file_info['season']), file_info['type'], file_info['name'] + \".txt\")\n\n\ndef is_compressed(file_info):\n    \"\"\"\n    Check if stored file is compressed as we used to not save them as compressed.\n\n    :param file_info: Dictionary containing the info on the file. 
Includes the name, season, file type, and the dir\n                      we want to deposit any data in.\n    \n    :return: Boolean\n    \"\"\"\n    return os.path.isfile(create_base_file_path(file_info) + \".gz\")\n\n\ndef create_dir_structure(dir_name):\n    \"\"\"\n    Create the basic directory structure for docs_dir if not done yet.\n    Creates the docs and csvs subdirs if they don't exist\n\n    :param dir_name: Name of dir to create\n\n    :return: None\n    \"\"\"\n    if not os.path.isdir(os.path.join(dir_name, 'docs')):\n        os.mkdir(os.path.join(dir_name, 'docs'))\n\n    if not os.path.isdir(os.path.join(dir_name, 'csvs')): \n        os.mkdir(os.path.join(dir_name, 'csvs'))\n\n\n\ndef create_season_dirs(file_info):\n    \"\"\"\n    Creates the infrastructure to hold all the scraped docs for a season\n    \n    :param file_info: Dictionary containing the info on the file. Includes the name, season, file type, and the dir\n                      we want to deposit any data in.\n                        \n    :return: None\n    \"\"\"\n    sub_folders = [\"html_pbp\", \"json_pbp\", \"espn_pbp\", \"html_shifts_home\", \"html_shifts_away\", \n                   \"json_shifts\", \"html_roster\", \"json_schedule\", \"espn_scoreboard\"]\n\n    season_path = os.path.join(file_info['dir'], 'docs', str(file_info['season']))\n    os.mkdir(season_path)\n\n    for sub_f in sub_folders:\n        os.mkdir(os.path.join(season_path, sub_f))\n\n\ndef check_file_exists(file_info):\n    \"\"\"\n    Checks if the file exists. Also checks if the structure for holding the scraped file exists too. If not, it creates it. \n\n    :param file_info: Dictionary containing the info on the file. 
Includes the name, season, file type, and the dir\n                      we want to deposit any data in.\n\n    :return: Boolean - True if it exists\n    \"\"\"\n    create_dir_structure(file_info['dir'])\n\n    # Check if the folder for the season for the given game was created yet...if not create it\n    if not os.path.isdir(os.path.join(file_info['dir'], 'docs', str(file_info['season']))):\n        create_season_dirs(file_info)\n\n    # May or may not be compressed due to file saved under older versions\n    non_compressed_file = os.path.isfile(create_base_file_path(file_info)) \n    compressed_file = is_compressed(file_info)\n\n    return compressed_file or non_compressed_file\n\n\ndef get_page(file_info):\n    \"\"\"\n    Get the file so we don't need to re-scrape.\n\n    Try both compressed and non-compressed for backwards compatibility (files were formerly saved non-compressed)\n\n    :param file_info: Dictionary containing the info on the file. Includes the name, season, file type, and the dir\n                      we want to deposit any data in.\n\n    :return: Contents of the saved page as a str\n    \"\"\"\n    base_file = create_base_file_path(file_info)\n\n    if is_compressed(file_info):\n        with gzip.open(base_file + \".gz\", 'rb') as my_file:\n            return my_file.read().decode(\"utf-8\").replace('\\n', '')\n    else:\n        with open(base_file, 'r') as my_file:\n            return my_file.read().replace('\\n', '')\n\n\ndef save_page(page, file_info):\n    \"\"\"\n    Save the page we just scraped.\n    \n    Note: It'll only get saved if the directory already exists. This function doesn't check that the path is valid\n    or create it for you, so make sure you get it right.\n\n    :param page: File scraped\n    :param file_info: Dictionary containing the info on the file. 
Includes the name, season, file type, and the dir\n                      we want to deposit any data in.\n\n    :return: None\n    \"\"\"\n    if file_info['dir'] and page is not None and page != '':\n        with gzip.open(create_base_file_path(file_info) + \".gz\", 'wb') as file:\n            file.write(page.encode()) \n"
  },
  {
    "path": "hockey_scraper/utils/shared.py",
    "content": "\"\"\"\nThis file is a bunch of the shared functions or just general stuff used by the different scrapers in the package.\n\"\"\"\nimport os\nimport time\nimport json\nimport logging\nimport warnings\nimport requests\nfrom datetime import datetime, timedelta\nfrom requests.adapters import HTTPAdapter\nfrom requests.packages.urllib3.util.retry import Retry\nfrom . import save_pages as sp\nfrom . import config\nimport inspect\n\n# Directory where this file lives\nFILE_DIR = os.path.dirname(os.path.realpath(__file__))\n\n# Name and Team fixes used \nwith open(os.path.join(FILE_DIR, \"player_name_fixes.json\"), \"r\" ,encoding=\"utf-8\") as f:\n    Names = json.load(f)['fixes']\n\nwith open(os.path.join(FILE_DIR, \"team_tri_codes.json\"), \"r\" ,encoding=\"utf-8\") as f:\n    TEAMS = json.load(f)['teams']\n\nwith open(os.path.join(FILE_DIR, \"tri_code_conversion.json\"), \"r\" ,encoding=\"utf-8\") as f:\n    TRI_CODES = json.load(f)['tri_codes']\n\n\ndef fix_name(name):\n    \"\"\"\n    Check if a name falls under those that need fixing. If it does...fix it.\n\n    :param name: name in pbp\n\n    :return: Either the given parameter or the fixed name\n    \"\"\"\n    return Names.get(name.upper(), name.upper()).upper()\n\n\ndef get_team(team):\n    \"\"\"\n    Get the fucking team\n    \"\"\"\n    return TEAMS.get(team.upper(), team.upper()).upper()\n\n\ndef convert_tricode(tri):\n    \"\"\"\n    Convert the tri-code if found in 'tri_code_conversion.json'\n\n    :return Fixed tri-code or original\n    \"\"\"\n    return TRI_CODES.get(tri.upper(), tri.upper()).upper()\n    \n\ndef custom_formatwarning(msg, *args, **kwargs): \n    \"\"\"\n    Override format for standard wanings\n    \"\"\"\n    ansi_no_color = '\\033[0m'\n    return \"{msg}\\n{no_color}\".format(no_color=ansi_no_color, msg=msg)\n    \nwarnings.formatwarning = custom_formatwarning\n\n\ndef print_error(msg):\n    \"\"\"\n    Implement own custom error using warning module. 
Prints in red\n\n    The reason I still use warnings for errors is so I can choose to ignore them (e.g. live_scrape line 200).\n\n    :param msg: Str to print\n\n    :return: None\n    \"\"\"\n    ansi_red_code = '\\033[0;31m'\n    warning_msg = \"{}Error: {}\".format(ansi_red_code, msg)\n\n    # if config.LOG:\n    #     caller_file = os.path.basename(inspect.stack()[1].filename)\n    #     get_logger(caller_file).error(msg + \" \" + verbose)\n\n    warnings.warn(warning_msg) \n\n\ndef print_warning(msg):\n    \"\"\"\n    Implement own custom warning using warning module. Prints in yellow.\n\n    :param msg: Str to print\n\n    :return: None\n    \"\"\"\n    ansi_yellow_code = '\\033[0;33m'\n    warning_msg = \"{}Warning: {}\".format(ansi_yellow_code, msg)\n\n    warnings.warn(warning_msg)\n\n\ndef get_logger(python_file):\n    \"\"\"\n    Create a basic logger to a log file\n\n    :param python_file: File that instantiates the logger instance\n    \n    :return: logger \n    \"\"\"\n    base_py_file = os.path.basename(python_file)\n\n    # If already exists we don't try to recreate it\n    if base_py_file in logging.Logger.manager.loggerDict.keys():\n        return logging.getLogger(base_py_file)\n\n    logger = logging.getLogger(base_py_file)\n    logger.setLevel(logging.INFO)  \n    \n    fh = logging.FileHandler(\"hockey_scraper_errors_{}.log\".format(datetime.now().strftime(\"%Y-%m-%dT%H:%M:%S\"))) \n    # Use the 24-hour clock (%H) so timestamps aren't ambiguous without an AM/PM marker\n    fh.setFormatter(logging.Formatter('%(asctime)s\\t%(name)s\\t%(levelname)s\\t%(message)s', datefmt='%Y-%m-%d %H:%M:%S'))\n    logger.addHandler(fh)\n\n    return logger\n\n\ndef log_error(err, py_file):\n    \"\"\"\n    Log error when Logging is specified\n\n    :param err: Error to log\n    :param py_file: File that instantiates the logger instance\n\n    :return: None\n    \"\"\"\n    if config.LOG:\n        get_logger(py_file).error(err)\n\n\ndef get_season(date):\n    \"\"\"\n    Get Season based on from_date\n\n    There is an exception for 
the 2019-2020 pandemic season. According to the below url:\n        -  2019-2020 season ends in Oct. 2020\n        -  2020-2021 season begins in November 2020\n        -  https://nhl.nbcsports.com/2020/07/10/new-nhl-critical-dates-calendar-means-an-october-free-agent-frenzy/\n\n    :param date: date\n\n    :return: season -> ex: 2016 for 2016-2017 season\n    \"\"\"\n    year = date[:4]\n    date = datetime.strptime(date, \"%Y-%m-%d\")\n    initial_bound = datetime.strptime('-'.join([year, '01-01']), \"%Y-%m-%d\")\n\n    # End bound for year1-year2 season is later for pandemic year\n    if initial_bound <= date <= season_end_bound(year):\n        return int(year) - 1\n\n    return int(year)\n\n\ndef season_start_bound(year):\n    \"\"\"\n    Get start bound for a season.\n\n    Notes:\n     - There is a bug in the schedule API for 2016 that causes the pushback to 09-30\n     - Pandemic season started in January\n\n    :param year: str of year for given date\n\n    :return: str of first date in season\n    \"\"\"\n    if int(year) == 2016:\n        return \"2016-09-30\"\n        \n    if int(year) == 2020:\n        return '2021-01-01'\n\n    return \"{}-09-01\".format(str(year))\n\n\n\ndef season_end_bound(year):\n    \"\"\"\n    Determine the end bound of a given season. 
Changes depending on if it's the pandemic season or not\n\n    :param year: str of year for given date\n\n    :return: Datetime obj of last date in season\n    \"\"\"\n    normal_end_bound = datetime.strptime('-'.join([str(year), '08-31']), \"%Y-%m-%d\")\n    pandemic_end_bound = datetime.strptime('-'.join([str(year), '10-31']), \"%Y-%m-%d\")\n\n    if int(year) == 2020:\n        return pandemic_end_bound\n\n    return normal_end_bound\n\n\ndef convert_to_seconds(minutes):\n    \"\"\"\n    Return minutes elapsed in time format to seconds elapsed\n\n    :param minutes: time elapsed\n\n    :return: time elapsed in seconds\n    \"\"\"\n    if minutes == '-16:0-':\n        return '1200'      # Sometimes in the html at the end of the game the time is -16:0-\n\n    # If the time is junk not much i can do\n    try:\n        x = time.strptime(minutes.strip(' '), '%M:%S')\n    except ValueError:\n        return None\n\n    return timedelta(hours=x.tm_hour, minutes=x.tm_min, seconds=x.tm_sec).total_seconds()\n\n\ndef if_rescrape(user_rescrape):\n    \"\"\"\n    If you want to re_scrape. If someone is a dumbass and feeds it a non-boolean it terminates the program\n\n    Note: Only matters when you have a directory specified\n\n    :param user_rescrape: Boolean\n\n    :return: None\n    \"\"\"\n    if isinstance(user_rescrape, bool):\n        config.RESCRAPE = user_rescrape\n    else:\n        raise ValueError(\"Error: 'if_rescrape' must be a boolean. Not a {}\".format(type(user_rescrape)))\n\n\ndef add_dir(user_dir):\n    \"\"\"\n    Add directory to store scraped docs if valid. 
Or create in the home dir\n\n    NOTE: After this function runs, docs_dir is either False or a valid directory\n\n    :param user_dir: If bool=True create in the home dir, otherwise a user provided directory on their machine\n\n    :return: None\n    \"\"\"\n    # False so they don't want it\n    if not user_dir:\n        config.DOCS_DIR = False\n        return\n\n    # Something was given\n    # Either True or string to directory\n    # If boolean refer to the home directory\n    if isinstance(user_dir, bool):\n        config.DOCS_DIR = os.path.join(os.path.expanduser('~'), \"hockey_scraper_data\")\n        # Create if needed\n        if not os.path.isdir(config.DOCS_DIR):\n            print_warning(\"Creating the hockey_scraper_data directory in the home directory\")\n            os.mkdir(config.DOCS_DIR)\n    elif isinstance(user_dir, str) and os.path.isdir(user_dir):\n        config.DOCS_DIR = user_dir\n    elif not isinstance(user_dir, str):\n        # Neither a bool nor a str, so the argument itself is invalid\n        config.DOCS_DIR = False\n        print_error(\"The docs_dir argument provided is invalid\")\n    else:\n        config.DOCS_DIR = False\n        print_error(\"The directory specified for the saving of scraped docs doesn't exist. Therefore:\"\n              \"\\n1. All specified games will be scraped from their appropriate sources (NHL or ESPN).\"\n              \"\\n2. All scraped files will NOT be saved at all. 
Please either create the directory you want them to be \"\n              \"deposited in or recheck the directory you typed in and start again.\\n\")\n\n\ndef scrape_page(url):\n    \"\"\"\n    Scrape a given url\n\n    :param url: url for page\n\n    :return: text of the page, or None if the request failed\n    \"\"\"\n    session = requests.Session()\n    retries = Retry(total=10, backoff_factor=.1)\n\n    # Mount the retry adapter for both schemes so https urls are retried as well\n    session.mount('http://', HTTPAdapter(max_retries=retries))\n    session.mount('https://', HTTPAdapter(max_retries=retries))\n\n    try:\n        response = session.get(url, timeout=5)\n        response.raise_for_status()\n        page = response.text\n    except (requests.exceptions.HTTPError, requests.exceptions.ConnectionError):\n        page = None\n    except requests.exceptions.ReadTimeout:\n        # If it times out and it's the schedule raise an exception...otherwise just make the page = None\n        if \"schedule\" in url:\n            raise Exception(\"Timeout Error: The NHL API took too long to respond to our request. \"\n                                \"Please Try Again (you may need to try a few times before it works). \")\n        else:\n            print_error(\"Timeout Error: The server took too long to respond to our request.\")\n            page = None\n\n    # Pause for 1 second - make it more if you want\n    time.sleep(1)\n\n    return page\n\n\n\ndef get_file(file_info, force=False):\n    \"\"\"\n    Get the specified file.\n\n    If a docs_dir is provided we check if it exists. If it does we see if it contains that page (and saves if it\n    doesn't). If the docs_dir doesn't exist we just scrape from the source and not save.\n\n    :param file_info: Dictionary containing the info for the file.\n                      Contains the url, name, type, and season\n    :param force: Force a rescrape. 
Default is False\n\n    :return: page\n    \"\"\"\n    file_info['dir'] = config.DOCS_DIR\n\n    # If everything checks out we'll retrieve it, otherwise we scrape it\n    if file_info['dir'] and sp.check_file_exists(file_info) and not config.RESCRAPE and not force:\n        page = sp.get_page(file_info)\n    else:\n        page = scrape_page(file_info['url'])\n        sp.save_page(page, file_info)\n\n    return page\n\n\ndef check_data_format(data_format):\n    \"\"\"\n    Checks that the data_format specified is either 'Csv' or 'Pandas' (case-insensitive).\n    Raises a ValueError if it isn't.\n\n    :param data_format: data_format provided \n\n    :return: None\n    \"\"\"\n    if not data_format or data_format.lower() not in ['csv', 'pandas']:\n        raise ValueError('{} is an unspecified data format. The two options are Csv and Pandas '\n                            '(Csv is default)\\n'.format(data_format))\n\n\ndef check_valid_dates(from_date, to_date):\n    \"\"\"\n    Check if it's a valid date range\n\n    :param from_date: date should scrape from\n    :param to_date: date should scrape to\n\n    :return: None\n    \"\"\"\n    try:\n        from_time = time.strptime(from_date, \"%Y-%m-%d\")\n        to_time = time.strptime(to_date, \"%Y-%m-%d\")\n    except ValueError:\n        raise ValueError(\"Error: Incorrect format given for dates. 
They must be given like 'yyyy-mm-dd' \"\n                            \"(ex: '2016-10-01').\")\n\n    # Compare outside the try block so this error isn't caught and replaced by the format error above\n    if to_time < from_time:\n        raise ValueError(\"Error: The second date input is earlier than the first one\")\n\n\ndef to_csv(base_file_name, df, league, file_type):\n    \"\"\"\n    Write DataFrame to csv file\n\n    :param base_file_name: name of file\n    :param df: DataFrame\n    :param league: nhl or nwhl\n    :param file_type: type of file being deposited\n\n    :return: None\n    \"\"\"\n    docs_dir = config.DOCS_DIR\n\n    # This was a late addition so we add support here\n    if isinstance(docs_dir, str) and not os.path.isdir(os.path.join(docs_dir, \"csvs\")):\n        os.mkdir(os.path.join(docs_dir, \"csvs\"))\n\n    if df is not None:\n        if isinstance(docs_dir, str):\n            file_name = os.path.join(docs_dir, \"csvs\", '{}_{}_{}.csv'.format(league, file_type, base_file_name))\n        else:\n            file_name = '{}_{}_{}.csv'.format(league, file_type, base_file_name)\n\n        print(\"---> {} {} data deposited in file - {}\".format(league, file_type, file_name))\n        df.to_csv(file_name, sep=',', encoding='utf-8')\n\n"
  },
  {
    "path": "hockey_scraper/utils/team_tri_codes.json",
    "content": "{\n    \"_descrition\": \"# All the corresponding tri-codes for team names\",\n\n    \"teams\": {\n        \"ANAHEIM DUCKS\": \"ANA\",\n        \"ARIZONA COYOTES\": \"ARI\",\n        \"ATLANTA THRASHERS\": \"ATL\",\n        \"BOSTON BRUINS\": \"BOS\",\n        \"BUFFALO SABRES\": \"BUF\",\n        \"CAROLINA HURRICANES\": \"CAR\",\n        \"COLUMBUS BLUE JACKETS\": \"CBJ\",\n        \"CALGARY FLAMES\": \"CGY\",\n        \"CHICAGO BLACKHAWKS\": \"CHI\",\n        \"COLORADO AVALANCHE\": \"COL\",\n        \"DALLAS STARS\": \"DAL\",\n        \"DETROIT RED WINGS\": \"DET\",\n        \"EDMONTON OILERS\": \"EDM\",\n        \"FLORIDA PANTHERS\": \"FLA\",\n        \"LOS ANGELES KINGS\": \"L.A\",\n        \"MINNESOTA WILD\": \"MIN\",\n        \"MONTREAL CANADIENS\": \"MTL\",\n        \"MONTRÉAL CANADIENS\": \"MTL\",\n        \"CANADIENS MONTREAL\": \"MTL\",\n        \"NEW JERSEY DEVILS\": \"N.J\",\n        \"NASHVILLE PREDATORS\": \"NSH\",\n        \"NEW YORK ISLANDERS\": \"NYI\",\n        \"NEW YORK RANGERS\": \"NYR\",\n        \"OTTAWA SENATORS\": \"OTT\",\n        \"PHILADELPHIA FLYERS\": \"PHI\",\n        \"PHOENIX COYOTES\": \"PHX\",\n        \"PITTSBURGH PENGUINS\": \"PIT\",\n        \"SAN JOSE SHARKS\": \"S.J\",\n        \"SEATTLE KRAKEN\": \"SEA\",\n        \"ST. LOUIS BLUES\": \"STL\",\n        \"TAMPA BAY LIGHTNING\": \"T.B\",\n        \"TORONTO MAPLE LEAFS\": \"TOR\",\n        \"VANCOUVER CANUCKS\": \"VAN\",\n        \"VEGAS GOLDEN KNIGHTS\": \"VGK\",\n        \"WINNIPEG JETS\": \"WPG\",\n        \"WASHINGTON CAPITALS\": \"WSH\",\n        \"BERN SC BERN\": \"BSB\",\n        \"KOLN HAIE\": \"KHI\"\n    }\n}"
  },
  {
    "path": "hockey_scraper/utils/tri_code_conversion.json",
    "content": "{\n    \"_description\": \"Conversion of new tri-code to old\",\n\n    \"tri_codes\": {  \n        \"NJD\": \"N.J\",\n        \"TBL\": \"T.B\",\n        \"LAK\": \"L.A\",\n        \"SJS\": \"S.J\"\n    }\n}"
  },
  {
    "path": "readthedocs.yml",
    "content": "# .readthedocs.yml\n\nbuild:\n  os: ubuntu-20.04  # <- add this line\n  tools:\n    python: \"3.9\"\n\npython:\n  version: 3.9\n  setup_py_install: true\n"
  },
  {
    "path": "requirements.txt",
    "content": "BeautifulSoup4>=4.5.3\nrequests>=2.14.2\nlxml>=3.7.2\nhtml5lib>=0.999999999\npandas>=0.23.4\nsphinx>=1.5.1\npytest>=3.0.5\npytz\ntqdm"
  },
  {
    "path": "setup.py",
    "content": "import os\nfrom setuptools import setup, find_packages\n\n\ndef read():\n    return open(os.path.join(os.path.dirname(__file__), 'README.rst')).read()\n\nsetup(\n    name='hockey_scraper',\n    version='1.40.2',\n    description=\"\"\"Python Package for scraping NHL Play-by-Play and Shift data.\"\"\",\n    long_description=read(),\n    classifiers=[\n        \"Development Status :: 5 - Production/Stable\",\n        'Programming Language :: Python :: 3',\n        \"Programming Language :: Python :: 2\",\n    ],\n    keywords='NHL',\n    url='https://github.com/HarryShomer/Hockey-Scraper',\n    author='Harry Shomer',\n    author_email='Harryshomer@gmail.com',\n    license='GNU General Public License v3 (GPLv3)',\n    packages=find_packages(),\n    install_requires=['BeautifulSoup4', 'requests', 'lxml', 'html5lib', 'pandas', 'pytest', 'pytz', 'tqdm'],\n    zip_safe=False,\n\n    package_data={\n        \"\": [\"*.json\"],\n    }\n\n    # entry_points={\n    #     'console_scripts': [\n    #         'hockey-scraper = hockey_scraper.cli:main',\n    #     ],\n    # }\n)\n\n"
  },
  {
    "path": "tests/__init__.py",
    "content": ""
  },
  {
    "path": "tests/test_espn_pbp.py",
    "content": "# \"\"\" Tests for 'espn_pbp.py' \"\"\"\n\n# import pandas as pd\n# import pytest\n\n# from hockey_scraper.nhl.pbp import espn_pbp\n\n\n# @pytest.fixture\n# def game_date():\n#     return '2015-10-24'\n\n\n# @pytest.fixture\n# def teams():\n#     return {'Home': 'PHI', 'Away': 'NYR'}\n\n\n# @pytest.fixture\n# def game_response(game_date, teams):\n#     \"\"\" Page for that game\"\"\"\n#     return espn_pbp.get_espn_game(game_date, teams['Home'], teams['Away'])\n\n\n# @pytest.fixture\n# def date_response(game_date):\n#     \"\"\" Page that details all the games that day\"\"\"\n#     return espn_pbp.get_espn_date(game_date)\n\n\n# def test_get_teams(date_response):\n#     \"\"\" Check to make sure we get a list of both teams for every game that day\"\"\"\n\n#     # Games for that date\n#     date_games = [['ANA', 'MIN'], ['N.J', 'BUF'], ['TOR', 'MTL'], ['PHX', 'OTT'], ['NYR', 'PHI'], ['NYI', 'STL'],\n#                   ['PIT', 'NSH'], ['FLA', 'DAL'], ['T.B', 'CHI'], ['CBJ', 'COL'], ['DET', 'VAN'], ['CAR', 'S.J']]\n\n#     assert espn_pbp.get_teams(date_response) == date_games\n\n\n# def test_get_game_ids(date_response):\n#     \"\"\" Check to see that all the espn game id's for that day are correct\"\"\"\n#     game_ids = ['400814970', '400814971', '400814972', '400814973', '400814974', '400814975', '400814976',\n#                 '400814977', '400814978', '400814979', '400814980', '400814981']\n\n#     assert espn_pbp.get_game_ids(date_response) == game_ids\n\n\n# def test_get_espn_game_id(game_date, teams):\n#     \"\"\" \n#     Test to see for a given game (identified by the game_id and both teams) it return the correct game_id number\n#     for the Espn api.\n#     \"\"\"\n#     assert espn_pbp.get_espn_game_id(game_date, teams['Home'], teams['Away']) == '400814974'\n\n\n# def test_get_espn_game(game_response):\n#     \"\"\" Test to see if we get anything back whn requesting a game. 
Should be a non-empty string\"\"\"\n#     assert type(game_response) == str\n#     assert len(game_response) > 0\n\n\n# def test_parse_event():\n#     \"\"\" Checks to see that it correctly parses a game event\"\"\"\n#     event = \"-35~12~505~4:48~2~3506~5767~5833~Power Play Goal Scored by Derick Brassard(Slapshot 56 ft)assisted by \" \\\n#             \"Kevin Hayes and Chris Kreider~0~2~2~0~702~13~801~901~2~3~4\"\n#     parsed_event = {\"xC\": -35, \"yC\": 12, \"time_elapsed\": 288, 'period': '2', 'event': \"GOAL\"}\n\n#     assert espn_pbp.parse_event(event) == parsed_event\n\n\n# def test_parse_espn(game_response):\n#     \"\"\" Checks if the espn game response is parsed correctly. Specifically: \n#         1. Is a DataFrame?\n#         2. Contains the correct amount of events?\n#         3. Contains the correct columns?\n#     \"\"\"\n#     scraped_game = espn_pbp.parse_espn(game_response)\n#     espn_columns = ['period', 'time_elapsed', 'event', 'xC', 'yC']\n\n#     assert isinstance(scraped_game, pd.DataFrame)\n#     assert scraped_game.shape[0] == 379\n#     assert list(scraped_game.columns) == espn_columns\n"
  },
  {
    "path": "tests/test_game_scraper.py",
    "content": "\"\"\" Tests for 'game_scraper.py' \"\"\"\n\nimport pandas as pd\nimport pytest\n\nfrom hockey_scraper.nhl import game_scraper, playing_roster\nfrom hockey_scraper.nhl.pbp import json_pbp\n\n\n@pytest.fixture\ndef players():\n    return {'Home': \n            {'NOAH HANIFIN': {'id': 8478396, 'number': '5', 'last_name': 'HANIFIN'}, \n             'KLAS DAHLBECK': {'id': 8476403, 'number': '6', 'last_name': 'DAHLBECK'}, \n             'DEREK RYAN': {'id': 8478585, 'number': '7', 'last_name': 'RYAN'}, \n             'JORDAN STAAL': {'id': 8473533, 'number': '11', 'last_name': 'STAAL'}, \n             'JUSTIN WILLIAMS': {'id': 8468508, 'number': '14', 'last_name': 'WILLIAMS'}, \n             'SEBASTIAN AHO': {'id': 8478427, 'number': '20', 'last_name': 'AHO'}, \n             'LEE STEMPNIAK': {'id': 8470740, 'number': '21', 'last_name': 'STEMPNIAK'}, \n             'BRETT PESCE': {'id': 8477488, 'number': '22', 'last_name': 'PESCE'}, \n             'BROCK MCGINN': {'id': 8476934, 'number': '23', 'last_name': 'MCGINN'}, \n             'JUSTIN FAULK': {'id': 8475753, 'number': '27', 'last_name': 'FAULK'}, \n             'ELIAS LINDHOLM': {'id': 8477496, 'number': '28', 'last_name': 'LINDHOLM'}, \n             'PHILLIP DI GIUSEPPE': {'id': 8476858, 'number': '34', 'last_name': 'DI GIUSEPPE'}, \n             'JOAKIM NORDSTROM': {'id': 8475807, 'number': '42', 'last_name': 'NORDSTROM'}, \n             'VICTOR RASK': {'id': 8476437, 'number': '49', 'last_name': 'RASK'}, \n             'JEFF SKINNER': {'id': 8475784, 'number': '53', 'last_name': 'SKINNER'}, \n             'TREVOR VAN RIEMSDYK': {'id': 8477845, 'number': '57', 'last_name': 'VAN RIEMSDYK'}, \n             'JACCOB SLAVIN': {'id': 8476958, 'number': '74', 'last_name': 'SLAVIN'}, \n             'TEUVO TERAVAINEN': {'id': 8476882, 'number': '86', 'last_name': 'TERAVAINEN'}, \n             'CAM WARD': {'id': 8470320, 'number': '30', 'last_name': 'WARD'}, \n             'SCOTT DARLING': {'id': 8474152, 
'number': '33', 'last_name': 'DARLING'}\n             }, \n            'Away': \n            {'NICK LEDDY': {'id': 8475181, 'number': '2', 'last_name': 'LEDDY'}, \n             'RYAN PULOCK': {'id': 8477506, 'number': '6', 'last_name': 'PULOCK'}, \n             'JORDAN EBERLE': {'id': 8474586, 'number': '7', 'last_name': 'EBERLE'}, \n             'JOSH BAILEY': {'id': 8474573, 'number': '12', 'last_name': 'BAILEY'}, \n             'MATHEW BARZAL': {'id': 8478445, 'number': '13', 'last_name': 'BARZAL'}, \n             'THOMAS HICKEY': {'id': 8474066, 'number': '14', 'last_name': 'HICKEY'}, \n             'CAL CLUTTERBUCK': {'id': 8473504, 'number': '15', 'last_name': 'CLUTTERBUCK'}, \n             'ANDREW LADD': {'id': 8471217, 'number': '16', 'last_name': 'LADD'}, \n             'ANDERS LEE': {'id': 8475314, 'number': '27', 'last_name': 'LEE'}, \n             'SEBASTIAN AHO': {'id': 8480222, 'number': '28', 'last_name': 'AHO'}, \n             'BROCK NELSON': {'id': 8475754, 'number': '29', 'last_name': 'NELSON'}, \n             'ADAM PELECH': {'id': 8476917, 'number': '50', 'last_name': 'PELECH'}, \n             'ROSS JOHNSTON': {'id': 8477527, 'number': '52', 'last_name': 'JOHNSTON'}, \n             'CASEY CIZIKAS': {'id': 8475231, 'number': '53', 'last_name': 'CIZIKAS'}, \n             'JOHNNY BOYCHUK': {'id': 8470187, 'number': '55', 'last_name': 'BOYCHUK'}, \n             'TANNER FRITZ': {'id': 8479206, 'number': '56', 'last_name': 'FRITZ'}, \n             'ANTHONY BEAUVILLIER': {'id': 8478463, 'number': '72', 'last_name': 'BEAUVILLIER'}, \n             'JOHN TAVARES': {'id': 8475166, 'number': '91', 'last_name': 'TAVARES'}, \n             'THOMAS GREISS': {'id': 8471306, 'number': '1', 'last_name': 'GREISS'}, \n             'JAROSLAV HALAK': {'id': 8470860, 'number': '41', 'last_name': 'HALAK'}\n            }\n        }\n\n\n@pytest.fixture\ndef pbp_columns():\n    return ['Game_Id', 'Date', 'Period', 'Event', 'Description', 'Time_Elapsed', 'Seconds_Elapsed', 
'Strength',\n            'Ev_Zone', 'Type', 'Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID',\n            'p3_name', 'p3_ID', 'awayPlayer1', 'awayPlayer1_id', 'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3',\n            'awayPlayer3_id', 'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id', 'awayPlayer6',\n            'awayPlayer6_id', 'homePlayer1', 'homePlayer1_id', 'homePlayer2', 'homePlayer2_id', 'homePlayer3',\n            'homePlayer3_id', 'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id', 'homePlayer6',\n            'homePlayer6_id', 'Away_Players', 'Home_Players', 'Away_Score', 'Home_Score', 'Away_Goalie',\n            'Away_Goalie_Id', 'Home_Goalie', 'Home_Goalie_Id', 'xC', 'yC', 'Home_Coach', 'Away_Coach'\n            ]\n\n\n@pytest.fixture\ndef shifts_columns():\n    return ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration', 'Date']\n\n\ndef test_scrape_game(pbp_columns, shifts_columns):\n    \"\"\" Tests if scrape pbp and shifts for game correctly with and without shifts.\n        Check:\n            1. Returns either a DataFrame or None (for shifts when False)\n            2. The number of rows is correct\n            3. The columns are correct\n     \"\"\"\n\n    # 1. Try first without shifts\n    pbp, shifts = game_scraper.scrape_game(\"2016020475\", \"2016-12-18\", False)\n    assert isinstance(pbp, pd.DataFrame)\n    assert shifts is None\n    assert pbp.shape[0] == 326\n    assert list(pbp.columns) == pbp_columns\n\n    # 2. 
Try with shifts\n    pbp, shifts = game_scraper.scrape_game(\"2023010008\", \"2023-09-24\", True)\n    assert isinstance(pbp, pd.DataFrame)\n    assert isinstance(shifts, pd.DataFrame)\n    #assert pbp.shape[0] == 248\n    #assert shifts.shape[0] == 726\n    assert list(pbp.columns) == pbp_columns\n    assert list(shifts.columns) == shifts_columns\n\n\ndef test_combine_players_lists(players):\n    \"\"\" Check that it combines the list of players from the json pbp and the html roster correctly \"\"\"\n    game_id = \"2017020891\"\n    json_players = game_scraper.get_players_json(json_pbp.get_pbp(game_id))\n    roster = playing_roster.scrape_roster(game_id)['players']\n\n    assert players == game_scraper.combine_players_lists(json_players, roster, game_id)\n"
  },
  {
    "path": "tests/test_html_pbp.py",
    "content": "\"\"\" Tests for 'html_pbp.py' \"\"\"\n\nimport pandas as pd\nimport pytest\n\nfrom hockey_scraper.nhl.pbp import html_pbp\n\n\n# TODO: Fill out the rest of the test here and in the file (the important ones there)\n\n\n@pytest.fixture\ndef game_id():\n    return \"2017020516\"\n\n\n@pytest.fixture\ndef cleaned_html(game_id):\n    return html_pbp.clean_html_pbp(html_pbp.get_pbp(game_id))\n\n\n@pytest.fixture\ndef pbp_cols():\n    return ['Period', 'Event', 'Description', 'Time_Elapsed', 'Seconds_Elapsed', 'Strength', 'Ev_Zone', 'Type',\n            'Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID', 'p3_name',\n            'p3_ID', 'awayPlayer1', 'awayPlayer1_id', 'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3', 'awayPlayer3_id',\n            'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id', 'awayPlayer6', 'awayPlayer6_id',\n            'homePlayer1', 'homePlayer1_id', 'homePlayer2', 'homePlayer2_id', 'homePlayer3', 'homePlayer3_id',\n            'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id', 'homePlayer6', 'homePlayer6_id',\n            'Away_Goalie', 'Away_Goalie_Id', 'Home_Goalie', 'Home_Goalie_Id', 'Away_Players', 'Home_Players',\n            'Away_Score', 'Home_Score'\n            ]\n\n\n@pytest.fixture\ndef event():\n    return ['112', '1', 'EV', '15:59', 'PENL', 'TOR #25 VAN RIEMSDYK\\xa0Slashing(2 min), Off. 
Zone Drawn By: CAR #49 RASK',\n            [['VICTOR RASK', '49', 'C'], ['JEFF SKINNER', '53', 'C'], ['TEUVO TERAVAINEN', '86', 'L'],\n             ['NOAH HANIFIN', '5', 'D'], ['BRETT PESCE', '22', 'D'], ['SCOTT DARLING', '33', 'G']],\n            [['MITCHELL MARNER', '16', 'C'], ['TYLER BOZAK', '42', 'C'], ['JAMES VAN RIEMSDYK', '25', 'L'],\n             ['ROMAN POLAK', '46', 'D'], ['JAKE GARDINER', '51', 'D'], ['FREDERIK ANDERSEN', '31', 'G']]\n            ]\n\n@pytest.fixture\ndef players():\n    return {'Home':\n                {'RON HAINSEY': {'id': 8468493, 'number': '2', 'last_name': 'HAINSEY'},\n                 'CONNOR CARRICK': {'id': 8476941, 'number': '8', 'last_name': 'CARRICK'},\n                 'ZACH HYMAN': {'id': 8475786, 'number': '11', 'last_name': 'HYMAN'},\n                 'PATRICK MARLEAU': {'id': 8466139, 'number': '12', 'last_name': 'MARLEAU'},\n                 'MATT MARTIN': {'id': 8474709, 'number': '15', 'last_name': 'MARTIN'},\n                 'MITCHELL MARNER': {'id': 8478483, 'number': '16', 'last_name': 'MARNER'},\n                 'DOMINIC MOORE': {'id': 8468575, 'number': '20', 'last_name': 'MOORE'},\n                 'KASPERI KAPANEN': {'id': 8477953, 'number': '24', 'last_name': 'KAPANEN'},\n                 'JAMES VAN RIEMSDYK': {'id': 8474037, 'number': '25', 'last_name': 'VAN RIEMSDYK'},\n                 'CONNOR BROWN': {'id': 8477015, 'number': '28', 'last_name': 'BROWN'},\n                 'WILLIAM NYLANDER': {'id': 8477939, 'number': '29', 'last_name': 'NYLANDER'},\n                 'TYLER BOZAK': {'id': 8475098, 'number': '42', 'last_name': 'BOZAK'},\n                 'NAZEM KADRI': {'id': 8475172, 'number': '43', 'last_name': 'KADRI'},\n                 'MORGAN RIELLY': {'id': 8476853, 'number': '44', 'last_name': 'RIELLY'},\n                 'ROMAN POLAK': {'id': 8471392, 'number': '46', 'last_name': 'POLAK'},\n                 'LEO KOMAROV': {'id': 8473463, 'number': '47', 'last_name': 'KOMAROV'},\n                 
'JAKE GARDINER': {'id': 8474581, 'number': '51', 'last_name': 'GARDINER'},\n                 'ANDREAS BORGMAN': {'id': 8480158, 'number': '55', 'last_name': 'BORGMAN'},\n                 'FREDERIK ANDERSEN': {'id': 8475883, 'number': '31', 'last_name': 'ANDERSEN'},\n                 'JOSH LEIVO': {'id': 8476410, 'number': '32', 'last_name': 'LEIVO'},\n                 'AUSTON MATTHEWS': {'id': 8479318, 'number': '34', 'last_name': 'MATTHEWS'},\n                 'MARTIN MARINCIN': {'id': 8475716, 'number': '52', 'last_name': 'MARINCIN'}\n                 },\n            'Away':\n                {'HAYDN FLEURY': {'id': 8477938, 'number': '4', 'last_name': 'FLEURY'},\n                 'NOAH HANIFIN': {'id': 8478396, 'number': '5', 'last_name': 'HANIFIN'},\n                 'DEREK RYAN': {'id': 8478585, 'number': '7', 'last_name': 'RYAN'},\n                 'JORDAN STAAL': {'id': 8473533, 'number': '11', 'last_name': 'STAAL'},\n                 'JUSTIN WILLIAMS': {'id': 8468508, 'number': '14', 'last_name': 'WILLIAMS'},\n                 'MARCUS KRUGER': {'id': 8475323, 'number': '16', 'last_name': 'KRUGER'},\n                 'JOSH JOORIS': {'id': 8477591, 'number': '19', 'last_name': 'JOORIS'},\n                 'SEBASTIAN AHO': {'id': 8478427, 'number': '20', 'last_name': 'AHO'},\n                 'BRETT PESCE': {'id': 8477488, 'number': '22', 'last_name': 'PESCE'},\n                 'BROCK MCGINN': {'id': 8476934, 'number': '23', 'last_name': 'MCGINN'},\n                 'JUSTIN FAULK': {'id': 8475753, 'number': '27', 'last_name': 'FAULK'},\n                 'ELIAS LINDHOLM': {'id': 8477496, 'number': '28', 'last_name': 'LINDHOLM'},\n                 'JOAKIM NORDSTROM': {'id': 8475807, 'number': '42', 'last_name': 'NORDSTROM'},\n                 'VICTOR RASK': {'id': 8476437, 'number': '49', 'last_name': 'RASK'},\n                 'JEFF SKINNER': {'id': 8475784, 'number': '53', 'last_name': 'SKINNER'},\n                 'TREVOR VAN RIEMSDYK': {'id': 8477845, 
'number': '57', 'last_name': 'VAN RIEMSDYK'},\n                 'JACCOB SLAVIN': {'id': 8476958, 'number': '74', 'last_name': 'SLAVIN'},\n                 'TEUVO TERAVAINEN': {'id': 8476882, 'number': '86', 'last_name': 'TERAVAINEN'},\n                 'SCOTT DARLING': {'id': 8474152, 'number': '33', 'last_name': 'DARLING'},\n                 'KLAS DAHLBECK': {'id': 8476403, 'number': '6', 'last_name': 'DAHLBECK'},\n                 'PHILLIP DI GIUSEPPE': {'id': 8476858, 'number': '34', 'last_name': 'DI GIUSEPPE'}\n                 }\n            }\n\n\n@pytest.fixture\ndef teams():\n    return {'Home': 'TOR', 'Away': 'CAR'}\n\n\n@pytest.fixture\ndef current_score():\n    return {'Home': 4, 'Away': 1}\n\n\ndef test_parse_event(event, players, teams, current_score):\n    \"\"\" Checks that it parses an event correctly \"\"\"\n    parsed_event = {\n        'Description': 'TOR #25 VAN RIEMSDYK\\xa0Slashing(2 min), Off. Zone Drawn By: CAR #49 RASK',\n        'Event': 'PENL', 'Period': 1, 'Time_Elapsed': '15:59', 'Seconds_Elapsed': 959.0, 'Ev_Team': 'TOR',\n        'Home_Score': 4, 'Away_Score': 1, 'score_diff': 3, 'homePlayer1': 'MITCHELL MARNER', 'homePlayer1_id': 8478483,\n        'homePlayer2': 'TYLER BOZAK', 'homePlayer2_id': 8475098, 'homePlayer3': 'JAMES VAN RIEMSDYK',\n        'homePlayer3_id': 8474037, 'homePlayer4': 'ROMAN POLAK', 'homePlayer4_id': 8471392,\n        'homePlayer5': 'JAKE GARDINER', 'homePlayer5_id': 8474581, 'homePlayer6': 'FREDERIK ANDERSEN',\n        'homePlayer6_id': 8475883, 'awayPlayer1': 'VICTOR RASK', 'awayPlayer1_id': 8476437,\n        'awayPlayer2': 'JEFF SKINNER', 'awayPlayer2_id': 8475784, 'awayPlayer3': 'TEUVO TERAVAINEN',\n        'awayPlayer3_id': 8476882, 'awayPlayer4': 'NOAH HANIFIN', 'awayPlayer4_id': 8478396,\n        'awayPlayer5': 'BRETT PESCE', 'awayPlayer5_id': 8477488, 'awayPlayer6': 'SCOTT DARLING',\n        'awayPlayer6_id': 8474152, 'Away_Goalie': 'SCOTT DARLING', 'Away_Goalie_Id': 8474152,\n        'Home_Goalie': 
'FREDERIK ANDERSEN', 'Home_Goalie_Id': 8475883, 'Away_Players': 6, 'Home_Players': 6,\n        'Strength': '5x5', 'Type': 'Slashing(2 min)', 'Ev_Zone': 'Off', 'Home_Zone': 'Off',\n        'p1_name': 'JAMES VAN RIEMSDYK', 'p1_ID': 8474037, 'p2_name': 'VICTOR RASK', 'p2_ID': 8476437\n    }\n\n    assert parsed_event == html_pbp.parse_event(event, players, teams['Home'], current_score)\n\n\ndef test_parse_html(pbp_cols, players, teams, cleaned_html):\n    \"\"\" Check that it parses the entirety of the html correctly\"\"\"\n    game_df = html_pbp.parse_html(cleaned_html, players, teams)\n\n    assert isinstance(game_df, pd.DataFrame)\n    assert game_df.shape[0] == 331\n    assert list(game_df.columns) == pbp_cols\n\n\ndef test_get_pbp():\n    \"\"\" Test getting the html pbp\"\"\"\n    pass\n\n\ndef test_get_soup():\n    \"\"\" Test 'soupifying' the html doc \"\"\"\n    pass\n\n\ndef test_strip_html_pbp():\n    \"\"\" Strip the html tags and such and make a list of lists for all plays\"\"\"\n    pass\n\n\ndef test_clean_html_pbp():\n    \"\"\" Get rid of html and format the data \"\"\"\n    pass\n"
  },
  {
    "path": "tests/test_html_shifts.py",
    "content": "\"\"\" Tests for 'html_shifts.py' \"\"\"\n\nimport bs4\nimport pandas as pd\nimport pytest\n\nfrom hockey_scraper.nhl.shifts import html_shifts\n\n\n@pytest.fixture\ndef shift_cols():\n    return ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration']\n\n\n@pytest.fixture\ndef game_id():\n    return '2009020001'\n\n@pytest.fixture\ndef player_ids():\n    return {\n        'Home': {\n                  'DENNIS WIDEMAN': {'id': 8469770, 'number': '6', 'last_name': 'WIDEMAN'},\n                  'CHUCK KOBASEW': {'id': 8469467, 'number': '12', 'last_name': 'KOBASEW'},\n                  'MARCO STURM': {'id': 8464979, 'number': '16', 'last_name': 'STURM'},\n                  'MILAN LUCIC': {'id': 8473473, 'number': '17', 'last_name': 'LUCIC'},\n                  'ANDREW FERENCE': {'id': 8466333, 'number': '21', 'last_name': 'FERENCE'},\n                  'SHAWN THORNTON': {'id': 8465978, 'number': '22', 'last_name': 'THORNTON'},\n                  'BLAKE WHEELER': {'id': 8471218, 'number': '26', 'last_name': 'WHEELER'},\n                  'STEVE BEGIN': {'id': 8464994, 'number': '27', 'last_name': 'BEGIN'},\n                  'MARK RECCHI': {'id': 8450725, 'number': '28', 'last_name': 'RECCHI'},\n                  'ZDENO CHARA': {'id': 8465009, 'number': '33', 'last_name': 'CHARA'},\n                  'PATRICE BERGERON': {'id': 8470638, 'number': '37', 'last_name': 'BERGERON'},\n                  'MARK STUART': {'id': 8470614, 'number': '45', 'last_name': 'STUART'},\n                  'DAVID KREJCI': {'id': 8471276, 'number': '46', 'last_name': 'KREJCI'},\n                  'MATT HUNWICK': {'id': 8471436, 'number': '48', 'last_name': 'HUNWICK'},\n                  'DEREK MORRIS': {'id': 8464966, 'number': '53', 'last_name': 'MORRIS'},\n                  'BYRON BITZ': {'id': 8470700, 'number': '61', 'last_name': 'BITZ'},\n                  'MICHAEL RYDER': {'id': 8467545, 'number': '73', 'last_name': 'RYDER'},\n                  
'MARC SAVARD': {'id': 8462118, 'number': '91', 'last_name': 'SAVARD'},\n                  'TIM THOMAS': {'id': 8460703, 'number': '30', 'last_name': 'THOMAS'},\n                  'JOHNNY BOYCHUK': {'id': 8470187, 'number': '55', 'last_name': 'BOYCHUK'}\n                  },\n        'Away': {\n                  'BRIAN POTHIER': {'id': 8468427, 'number': '2', 'last_name': 'POTHIER'},\n                  'TOM POTI': {'id': 8465012, 'number': '3', 'last_name': 'POTI'},\n                  'JOHN ERSKINE': {'id': 8467365, 'number': '4', 'last_name': 'ERSKINE'},\n                  'ALEX OVECHKIN': {'id': 8471214, 'number': '8', 'last_name': 'OVECHKIN'},\n                  'BRENDAN MORRISON': {'id': 8459461, 'number': '9', 'last_name': 'MORRISON'},\n                  'MATT BRADLEY': {'id': 8465059, 'number': '10', 'last_name': 'BRADLEY'},\n                  'BOYD GORDON': {'id': 8470159, 'number': '15', 'last_name': 'GORDON'},\n                  'CHRIS CLARK': {'id': 8460567, 'number': '17', 'last_name': 'CLARK'},\n                  'NICKLAS BACKSTROM': {'id': 8473563, 'number': '19', 'last_name': 'BACKSTROM'},\n                  'BROOKS LAICH': {'id': 8469639, 'number': '21', 'last_name': 'LAICH'},\n                  'MIKE KNUBLE': {'id': 8458590, 'number': '22', 'last_name': 'KNUBLE'},\n                  'MILAN JURCINA': {'id': 8469684, 'number': '23', 'last_name': 'JURCINA'},\n                  'SHAONE MORRISONN': {'id': 8469472, 'number': '26', 'last_name': 'MORRISONN'},\n                  'ALEXANDER SEMIN': {'id': 8470120, 'number': '28', 'last_name': 'SEMIN'},\n                  'BOYD KANE': {'id': 8465028, 'number': '34', 'last_name': 'KANE'},\n                  'DAVID STECKEL': {'id': 8469483, 'number': '39', 'last_name': 'STECKEL'},\n                  'MIKE GREEN': {'id': 8471242, 'number': '52', 'last_name': 'GREEN'},\n                  'QUINTIN LAING': {'id': 8466232, 'number': '53', 'last_name': 'LAING'},\n                  'JOSE THEODORE': {'id': 8460535, 
'number': '60', 'last_name': 'THEODORE'},\n                  'JEFF SCHULTZ': {'id': 8471240, 'number': '55', 'last_name': 'SCHULTZ'},\n                  'TYLER SLOAN': {'id': 8468846, 'number': '89', 'last_name': 'SLOAN'},\n                  'MICHAEL NYLANDER': {'id': 8458573, 'number': '92', 'last_name': 'NYLANDER'}\n        }\n    }\n\n\n@pytest.fixture\ndef shifts_html():\n    home, away = html_shifts.get_shifts(\"2009020001\")\n    return {'home': home, 'away': away}\n\n\n@pytest.fixture\ndef shifts_dfs(shifts_html, player_ids, game_id):\n    home_df = html_shifts.parse_html(shifts_html['home'], player_ids, game_id)\n    away_df = html_shifts.parse_html(shifts_html['away'], player_ids, game_id)\n\n    return {'home': home_df, 'away': away_df}\n\n\ndef test_get_shifts(shifts_html):\n    \"\"\" Test getting both shifts pages \"\"\"\n    assert isinstance(shifts_html['home'], str)\n    assert len(shifts_html['home']) > 0\n\n    assert isinstance(shifts_html['away'], str)\n    assert len(shifts_html['away']) > 0\n\n\ndef test_get_soup(shifts_html):\n    \"\"\" Test get soup -> Returns td tags for pbp and a list of both teams\"\"\"\n\n    # Home\n    td, teams = html_shifts.get_soup(shifts_html['home'])\n    assert isinstance(td, bs4.element.ResultSet)\n    assert len(td) > 100   # If it's greater than 100 it's fine\n    assert isinstance(teams, list)\n    assert len(teams) == 2\n\n    # Away\n    td, teams = html_shifts.get_soup(shifts_html['away'])\n    assert isinstance(td, bs4.element.ResultSet)\n    assert len(td) > 100  # If it's greater than 100 it's fine\n    assert isinstance(teams, list)\n    assert len(teams) == 2\n\n\ndef test_analyze_shifts(player_ids):\n    \"\"\" Test analyzing a single shift. 
See if it parses it correctly\"\"\"\n    # Get html (one td) and see if it spits out the correct info\n    shift = ['30', '3', '17:05 / 2:55', '18:03 / 1:57', '00:58']\n    name = \"DENNIS WIDEMAN\"\n    team = \"BOSTON BRUINS\"\n    home_team = \"BOSTON BRUINS\"\n    parsed_shift = {'Player': 'DENNIS WIDEMAN', 'Period': '3', 'Team': 'BOS', 'Start': 1025.0, 'Duration': 58.0,\n                    'End': 1083.0, 'Player_Id': 8469770}\n\n    assert parsed_shift == html_shifts.analyze_shifts(shift, name, team, home_team, player_ids)\n\n\ndef test_parse_html(shifts_dfs, shift_cols):\n    \"\"\" Should return a DataFrame of all the shifts for a given team\n        Here we make sure it works for both the home and away teams\n    \"\"\"\n    # Check they are both DataFrames\n    assert isinstance(shifts_dfs['home'], pd.DataFrame)\n    assert isinstance(shifts_dfs['away'], pd.DataFrame)\n\n    # Check correct amount of rows for each\n    assert shifts_dfs['home'].shape[0] == 360\n    assert shifts_dfs['away'].shape[0] == 375\n\n    # Correct columns for each\n    # Note: list.sort() returns None, so compare sorted copies instead\n    assert sorted(shifts_dfs['home'].columns) == sorted(shift_cols)\n    assert sorted(shifts_dfs['away'].columns) == sorted(shift_cols)\n\n\ndef test_scrape_game(game_id, player_ids, shifts_dfs, shift_cols):\n    \"\"\" Should return DataFrame with both home and away shifts \"\"\"\n    # Return Df\n    # Correct amount of lines  -- Confirm equals to home + away\n    # Correct columns\n    game_df = html_shifts.scrape_game(game_id, player_ids)\n\n    # Check if it's the right datatype\n    assert isinstance(game_df, pd.DataFrame)\n\n    # Same number of rows as correct value and addition of home and away dfs\n    assert game_df.shape[0] == (shifts_dfs['home'].shape[0] + shifts_dfs['away'].shape[0]) == 735\n\n    # Correct columns (checked on the combined DataFrame)\n    assert sorted(game_df.columns) == sorted(shift_cols)\n"
  },
  {
    "path": "tests/test_json_pbp.py",
    "content": "\"\"\"Tests for 'json_pbp.py'\"\"\"\n\nimport pandas as pd\n\nfrom hockey_scraper.nhl.pbp import json_pbp\n\n\ndef test_get_pbp():\n    \"\"\"Tests to see we get something when scraping. We want it to return a dictionary\"\"\"\n    assert isinstance(json_pbp.get_pbp(\"2016020001\"), dict)\n    assert isinstance(json_pbp.get_pbp(\"2008020768\"), dict)\n\n\ndef test_get_teams():\n    \"\"\"Tests how extracting home and away teams from json\"\"\"\n    assert json_pbp.get_teams(json_pbp.get_pbp(\"2014020001\")) == {\"Home\": 'TOR', \"Away\": 'MTL'}\n\n\ndef test_parse_json():\n    \"\"\" Tests how the pbp for one game is stored\n        1. We want it to return a pandas DataFrame\n        2. Checks to see if the proper game scraped is the right amount of events\n        3. Checks if the right columns are included\n    \"\"\"\n    scraped_game = json_pbp.scrape_game(\"2016020001\")\n    pbp_columns = ['period', 'event', 'seconds_elapsed', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID', 'p3_name', 'p3_ID',\n                   'xC', 'yC']\n\n    assert isinstance(scraped_game, pd.DataFrame)\n    assert scraped_game.shape[0] == 349\n    assert list(scraped_game.columns) == pbp_columns\n\n\ndef test_parse_event():\n    \"\"\" Test to see that it parses an event correctly\"\"\"\n\n    event = {\n            \"eventId\": 201,\n            \"period\": 1,\n            \"periodDescriptor\": {\n                \"number\": 1,\n                \"periodType\": \"REG\"\n            },\n            \"timeInPeriod\": \"00:32\",\n            \"timeRemaining\": \"19:28\",\n            \"situationCode\": \"1551\",\n            \"homeTeamDefendingSide\": \"right\",\n            \"typeCode\": 505,\n            \"typeDescKey\": \"goal\",\n            \"sortOrder\": 20,\n            \"details\": {\n                \"xCoord\": -84,\n                \"yCoord\": 6,\n                \"zoneCode\": \"O\",\n                \"shotType\": \"wrist\",\n                \"scoringPlayerId\": 
8478864,\n                \"assist1PlayerId\": 8482122,\n                \"assist2PlayerId\": 8475692,\n                \"eventOwnerTeamId\": 30,\n                \"goalieInNetId\": 8481020,\n                \"awayScore\": 0,\n                \"homeScore\": 1\n            }\n        }\n\n    parsed_event = {\n        'event_id': 201, 'period': 1, 'event': 'GOAL', 'seconds_elapsed': 32.0, 'p1_name': '',\n        'p1_ID': 8478864, 'p2_name': '', 'p2_ID': 8482122, 'p3_name': '', 'p3_ID': 8475692,\n        'xC': -84.0, 'yC': 6.0\n    }\n\n    assert json_pbp.parse_event(event) == parsed_event\n"
  },
  {
    "path": "tests/test_json_schedule.py",
    "content": "\"\"\"Tests for 'json_schedule.py'\"\"\"\nimport datetime\n\nfrom hockey_scraper.nhl import json_schedule\n\n\ndef test_get_schedule():\n    \"\"\"Tests to see we get something when scraping. We want it to return a dictionary\"\"\"\n    assert isinstance(json_schedule.get_schedule(\"2018-03-28\"), dict)\n\n\ndef test_scrape_schedule():\n    \"\"\"Test to see if successfully get the correct number of games between two dates\"\"\"\n    assert len(json_schedule.scrape_schedule(\"2017-08-01\", \"2017-09-01\")) == 0\n    assert len(json_schedule.scrape_schedule(\"2017-09-01\", \"2017-11-15\")) == 277\n\n\ndef test_get_dates():\n    \"\"\"Test to see that it returns the correct dates for given game id's\"\"\"\n    assert json_schedule.get_dates([2015010002])[0] == {'game_id': 2015010002, 'date': '2015-09-20', \n                                                        'start_time': datetime.datetime(2015, 9, 20, 20, 30), \n                                                        'venue': 'Bridgestone Arena', 'home_team': 'NSH', \n                                                        'away_team': 'FLA', 'home_score': 5, 'away_score': 2, \n                                                        'status': 'FINAL'}\n    assert json_schedule.get_dates([2017020275])[0] == {'game_id': 2017020275, 'date': '2017-11-15', \n                                                        'start_time': datetime.datetime(2017, 11, 16, 0, 30), \n                                                        'venue': 'Little Caesars Arena', 'home_team': 'DET', \n                                                        'away_team': 'CGY', 'home_score': 8, 'away_score': 2, \n                                                        'status': 'OFF'}\n    assert json_schedule.get_dates([2014030416])[0] == {'game_id': 2014030416, 'date': '2015-06-15', \n                                                        'start_time': datetime.datetime(2015, 6, 16, 0, 0), \n                                  
                      'venue': 'United Center', 'home_team': 'CHI',\n                                                        'away_team': 'T.B', 'home_score': 2, 'away_score': 0,\n                                                        'status': 'OFF'}\n\n\n# def test_chunk_schedule_calls():\n#     \"\"\"\n#     Test that we appropriately chunk calls in a range. Do so by checking # of days in each chunk.\n\n#     chunk_size = 30\n\n#     Note: Won't always go to total days in interval as some days don't have games\n#     \"\"\"\n#     # 1 day\n#     x = json_schedule.chunk_schedule_calls('2019-10-10', '2019-10-10')\n#     assert [len(chunk) for chunk in x] == [7]\n\n#     # > 50\n#     x = json_schedule.chunk_schedule_calls('2018-10-10', '2019-04-10')\n#     assert [len(chunk) for chunk in x] == [7, 7, 7, 7, 7, 7, 1]\n\n#     # 1 < x < 50\n#     x = json_schedule.chunk_schedule_calls('2018-10-10', '2018-11-15')\n#     assert [len(chunk) for chunk in x] == [7, 7, 7, 7, 7, 1]\n"
  },
  {
    "path": "tests/test_json_shifts.py",
    "content": "\"\"\"Tests for 'json_shifts.py'\"\"\"\n\nimport pandas as pd\n\nfrom hockey_scraper.nhl.shifts import json_shifts\n\n\ndef test_get_shifts():\n    \"\"\"Tests to see we get something when scraping. We want it to return a dictionary\"\"\"\n    assert isinstance(json_shifts.get_shifts(\"2016020001\"), dict)\n    assert isinstance(json_shifts.get_shifts(\"2008020768\"), dict)\n\n\ndef test_scrape_shifts():\n    \"\"\" Tests scraping the json shifts. \n        1. We either want a pandas df or None.\n        2. Checks to see if the proper game scraped is the right amount of shifts\n        3. Checks if right columns are included\n    \"\"\"\n    scraped_shifts = json_shifts.scrape_game(\"2016020001\")\n\n    assert isinstance(scraped_shifts, pd.DataFrame)\n    assert scraped_shifts.shape[0] == 850\n\n    shift_columns = ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration']\n    assert list(scraped_shifts.columns) == shift_columns\n\n\ndef test_parse_shift():\n    \"\"\" Test to see that it parses a shift correctly\"\"\"\n    shift = {\n        \"detailCode\": 0, \"duration\": \"00:46\", \"endTime\": \"04:12\", \"eventDescription\": None, \"eventDetails\": None,\n        \"eventNumber\": 327, \"firstName\": \"Leo\", \"gameId\": 2016020001, \"hexValue\": \"#00205B\", \"lastName\": \"Komarov\",\n        \"period\": 2, \"playerId\": 8473463, \"shiftNumber\": 12, \"startTime\": \"03:26\", \"teamAbbrev\": \"TOR\", \"teamId\": 10,\n        \"teamName\": \"Toronto Maple Leafs\", \"typeCode\": 517\n    }\n\n    parsed_shift = {'Player': 'LEO KOMAROV', 'Player_Id': 8473463, 'Period': 2, 'Team': 'TOR', 'Start': 206.0,\n                    'End': 252.0, 'Duration': 46.0}\n\n    assert json_shifts.parse_shift(shift) == parsed_shift\n"
  },
  {
    "path": "tests/test_nwhl.py",
    "content": "import pandas as pd\nimport pytest\n\n\n\n"
  },
  {
    "path": "tests/test_playing_roster.py",
    "content": "\"\"\"Test for 'playing_roster.py'\"\"\"\n\nimport pytest\n\nfrom hockey_scraper.nhl import playing_roster\n\n\n@pytest.fixture\ndef scraped_roster():\n    return playing_roster.scrape_roster(\"2016020475\")\n\n\ndef test_fix_name():\n    \"\"\" Tests to see that it takes the assistant captain and regular captain ('(A)' and '(C)') out of player's names.\"\"\"\n    assert playing_roster.fix_name(['27', 'D', 'RYAN MCDONAGH  (C)', False]) == ['27', 'D', 'RYAN MCDONAGH', False]\n    assert playing_roster.fix_name(['5', 'D', 'DAN GIRARDI  (A)', False]) == ['5', 'D', 'DAN GIRARDI', False]\n    assert playing_roster.fix_name(['13', 'R', 'KEVIN HAYES', False]) == ['13', 'R', 'KEVIN HAYES', False]\n\n\ndef test_get_players(scraped_roster):\n    \"\"\" Tests if it get the correct players for both teams in the correct format. \"\"\"\n    home_roster = [\n        ['5', 'D', 'DAN GIRARDI', False],\n        ['8', 'D', 'KEVIN KLEIN', False],\n        ['10', 'C', 'J.T. MILLER', False],\n        ['12', 'L', 'MATT PUEMPEL', False],\n        ['13', 'R', 'KEVIN HAYES', False],\n        ['18', 'D', 'MARC STAAL', False],\n        ['19', 'R', 'JESPER FAST', False],\n        ['20', 'C', 'CHRIS KREIDER', False],\n        ['21', 'C', 'DEREK STEPAN', False],\n        ['22', 'D', 'NICK HOLDEN', False],\n        ['24', 'C', 'OSCAR LINDBERG', False],\n        ['26', 'L', 'JIMMY VESEY', False],\n        ['27', 'D', 'RYAN MCDONAGH', False],\n        ['36', 'C', 'MATS ZUCCARELLO', False],\n        ['40', 'R', 'MICHAEL GRABNER', False],\n        ['46', 'L', 'MAREK HRIVIK', False],\n        ['61', 'L', 'RICK NASH', False],\n        ['76', 'D', 'BRADY SKJEI', False],\n        ['30', 'G', 'HENRIK LUNDQVIST', False],\n        ['32', 'G', 'ANTTI RAANTA', False],\n        ['4', 'D', 'ADAM CLENDENING', True],\n        ['73', 'C', 'BRANDON PIRRI', True]\n    ]\n    away_roster = [\n        ['2', 'D', 'JOHN MOORE', False],\n        ['6', 'D', 'ANDY GREENE', False],\n        ['7', 'D', 'JON 
MERRILL', False],\n        ['8', 'R', 'BEAU BENNETT', False],\n        ['9', 'L', 'TAYLOR HALL', False],\n        ['11', 'R', 'PA PARENTEAU', False],\n        ['12', 'D', 'BEN LOVEJOY', False],\n        ['13', 'L', 'MICHAEL CAMMALLERI', False],\n        ['14', 'C', 'ADAM HENRIQUE', False],\n        ['19', 'C', 'TRAVIS ZAJAC', False],\n        ['21', 'C', 'KYLE PALMIERI', False],\n        ['22', 'D', 'KYLE QUINCEY', False],\n        ['28', 'D', 'DAMON SEVERSON', False],\n        ['36', 'R', 'NICK LAPPIN', False],\n        ['37', 'C', 'PAVEL ZACHA', False],\n        ['38', 'L', 'VERNON FIDDLER', False],\n        ['44', 'L', 'MILES WOOD', False],\n        ['51', 'C', 'SERGEY KALININ', False],\n        ['1', 'G', 'KEITH KINKAID', False],\n        ['35', 'G', 'CORY SCHNEIDER', False],\n        ['16', 'C', 'JACOB JOSEFSON', True],\n        ['20', 'L', 'LUKE GAZDIC', True],\n        ['25', 'R', 'DEVANTE SMITH-PELLY', True]\n    ]\n\n    assert 'Home' in scraped_roster['players']\n    assert 'Away' in scraped_roster['players']\n\n    assert scraped_roster['players']['Home'] == home_roster\n    assert scraped_roster['players']['Away'] == away_roster\n\n\ndef test_get_coaches(scraped_roster):\n    \"\"\" Tests if it get the correct coaches for both teams in the correct format. \"\"\"\n    assert 'Home' in scraped_roster['head_coaches']\n    assert 'Away' in scraped_roster['head_coaches']\n\n    assert scraped_roster['head_coaches']['Home'] == 'ALAIN VIGNEAULT'\n    assert scraped_roster['head_coaches']['Away'] == 'JOHN HYNES'\n\n\ndef test_scrape_roster(scraped_roster):\n    \"\"\" Test scraping all the roster info \"\"\"\n    assert 'players' in scraped_roster\n    assert 'head_coaches' in scraped_roster\n"
  },
  {
    "path": "tests/test_scrape_functions.py",
    "content": "\"\"\" Tests for 'scrape_functions.py' \"\"\"\n\nimport pandas as pd\n\nfrom hockey_scraper.nhl import scrape_functions\n\n\ndef test_scrape_list_of_games():\n    \"\"\" Tests that it correctly scraped a given list of [game_id, date]\"\"\"\n    games = [\n        {'game_id': '2017020001', 'date': '2017-10-04', 'status': 'Final'},\n        {'game_id': '2017020746', 'date': '2018-01-25', 'status': 'Final'},\n        {'game_id': '2017020450', 'date': '2017-12-09', 'status': 'Final'},\n        {'game_id': '2017030311', 'date': '2018-05-11', 'status': 'Final'}\n    ]\n\n    # First try without shifts\n    pbp, shifts = scrape_functions.scrape_list_of_games(games[:2], False)\n    assert isinstance(pbp, pd.DataFrame)\n    assert shifts is None\n    assert pbp.shape[0] == 614\n\n    # Second we try with shifts\n    pbp, shifts = scrape_functions.scrape_list_of_games(games[2:], True)\n    assert isinstance(pbp, pd.DataFrame)\n    assert isinstance(shifts, pd.DataFrame)\n    assert pbp.shape[0] == 602\n    assert shifts.shape[0] == 1531\n\n    # Third we feed it a bullshit game_id and see what happens\n    pbp, shifts = scrape_functions.scrape_list_of_games([{\"game_id\": \"2016022000\", \"date\": \"\", \"status\": \"\"}], True)\n    assert pbp is None\n    assert shifts is None\n"
  },
  {
    "path": "tests/test_shared.py",
    "content": "\"\"\" Tests for 'shared.py' \"\"\"\n\nimport os\nimport shutil\nimport pytest\n\nfrom hockey_scraper.utils import shared, config\n\n\n@pytest.fixture\ndef file_info():\n    return {\n        \"url\": 'https://api-web.nhle.com/v1/schedule/{}'.format(\"2017-10-05\"),\n        \"name\": str(2017020001),\n        \"type\": \"json_pbp\",\n        \"season\": 2017,\n    }\n\n\ndef test_check_data_format():\n    \"\"\" Test if it recognized the correct formats allowed\"\"\"\n    # These both are fine\n    shared.check_data_format(\"Csv\")\n    shared.check_data_format(\"pandaS\")\n\n    # Should raise an exception\n    with pytest.raises(ValueError):\n        shared.check_data_format(\"txt\")\n\n\ndef test_check_valid_dates():\n    \"\"\" Test if given valid date range\"\"\"\n    shared.check_valid_dates(\"2017-10-01\", \"2018-01-05\")\n\n    with pytest.raises(ValueError):\n        shared.check_valid_dates(\"2017-12-01\", \"2017-11-30\")\n\n\ndef test_convert_to_seconds():\n    \"\"\" Tests if it correctly converts minutes remaining to seconds elapsed\"\"\"\n    assert shared.convert_to_seconds(\"8:33\") == 513\n    assert shared.convert_to_seconds(\"-16:0-\") == \"1200\"\n\n\ndef test_get_season():\n    \"\"\" Tests that this function returns the correct season for a given date\"\"\"\n    assert shared.get_season(\"2017-10-01\") == 2017\n    assert shared.get_season(\"2016-06-01\") == 2015\n    assert shared.get_season(\"2020-08-29\") == 2019\n    assert shared.get_season(\"2020-10-03\") == 2019\n    assert shared.get_season(\"2020-11-15\") == 2020\n\n\ndef test_scrape_page(file_info):\n    \"\"\" Test scraping from the source is good\"\"\"\n    file = shared.scrape_page(file_info['url'])\n\n    assert type(file) == str\n    assert len(file) > 0\n\n\ndef test_get_file(file_info):\n    \"\"\" Test getting the file...it's either scraped or loaded from a file \"\"\"\n    original_path = os.getcwd()\n\n    # When there is either no directory specified or it 
doesn't exist\n    file = shared.get_file(file_info)\n    assert isinstance(file, str)\n    assert len(file) > 0\n    assert original_path == os.getcwd()\n\n    # When the directory exists\n    # Here I just use the directory of this file to make things easy\n    shared.add_dir(os.path.dirname(os.path.realpath(__file__)))\n    file = shared.get_file(file_info)\n    assert isinstance(file, str)\n    assert len(file) > 0\n    assert original_path == os.getcwd()\n\n    # Some cleanup: remove the folders created in this file's directory and move back\n    os.chdir(os.path.dirname(os.path.realpath(__file__)))\n    shutil.rmtree(\"docs\")\n    shutil.rmtree(\"csvs\")\n    os.chdir(original_path)\n\n\ndef test_add_dir():\n    \"\"\" Test if this function correctly tells if a directory exists on the machine\"\"\"\n\n    # Check when it does exist (will always be good for this file)\n    user_dir = os.path.dirname(os.path.realpath(__file__))\n    shared.add_dir(user_dir)\n    assert config.DOCS_DIR is not None\n\n    # Checks when it doesn't exist\n    user_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), \"hopefully_this_path_doesnt_exist\")\n    shared.add_dir(user_dir)\n    assert config.DOCS_DIR is False"
  }
]