Showing preview only (289K chars total). Download the full file or copy to clipboard to get everything.
Repository: HarryShomer/Hockey-Scraper
Branch: master
Commit: 983ffc4640fa
Files: 57
Total size: 272.4 KB
Directory structure:
gitextract_48coyrer/
├── .gitignore
├── .vscode/
│ └── settings.json
├── CHANGELOG.rst
├── LICENSE.txt
├── MANIFEST.in
├── README.rst
├── docs/
│ ├── Makefile
│ ├── make.bat
│ └── source/
│ ├── cli.rst
│ ├── conf.py
│ ├── index.rst
│ ├── license_link.rst
│ ├── live_scrape.rst
│ ├── nhl_scrape_functions.rst
│ └── nwhl_scrape_functions.rst
├── hockey_scraper/
│ ├── __init__.py
│ ├── cli.py
│ ├── nhl/
│ │ ├── __init__.py
│ │ ├── game_scraper.py
│ │ ├── json_schedule.py
│ │ ├── live_scrape.py
│ │ ├── pbp/
│ │ │ ├── __init__.py
│ │ │ ├── espn_pbp.py
│ │ │ ├── html_pbp.py
│ │ │ └── json_pbp.py
│ │ ├── playing_roster.py
│ │ ├── scrape_functions.py
│ │ └── shifts/
│ │ ├── __init__.py
│ │ ├── html_shifts.py
│ │ └── json_shifts.py
│ ├── nwhl/
│ │ ├── __init__.py
│ │ ├── game_pbp.py
│ │ ├── scrape_functions.py
│ │ └── scrape_schedule.py
│ └── utils/
│ ├── __init__.py
│ ├── config.py
│ ├── merge_pbp_shifts.py
│ ├── player_name_fixes.json
│ ├── save_pages.py
│ ├── shared.py
│ ├── team_tri_codes.json
│ └── tri_code_conversion.json
├── readthedocs.yml
├── requirements.txt
├── setup.py
└── tests/
├── __init__.py
├── test_espn_pbp.py
├── test_game_scraper.py
├── test_html_pbp.py
├── test_html_shifts.py
├── test_json_pbp.py
├── test_json_schedule.py
├── test_json_shifts.py
├── test_nwhl.py
├── test_playing_roster.py
├── test_scrape_functions.py
└── test_shared.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.DS_Store
.idea/
.csv/
tests.py
notes.txt
build/
.pytest_cache
update_season_data.py
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib/hockey_scraper
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
.static_storage/
.media/
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
*.csv
================================================
FILE: .vscode/settings.json
================================================
{
}
================================================
FILE: CHANGELOG.rst
================================================
v1.2.6
------
* Added test coverage for most modules using pytest
* Refactored large portion of 'html_pbp.py' and corrected minor parser fixes in regards to penalties
* Added the module 'save_pages.py' which allows one to saves scraped files
* Added keyword arguments 'rescrape' and 'docs_dir' to the three main scraping functions. Specifying a valid directory using 'docs_dir' will make us check if a file was already scraped and saved before getting it from the source. It will also provide a location for us to save it if we don't have it yet. 'rescrape' only applies when a valid directory is provided with 'docs_dir'. Setting 'rescrape' equal to True will have us scrape the file from the source even if it's saved and save this new one.
v1.2.7
------
* Added functionality to easier scrape live games
* Fixed user warnings
v1.3
----
* Added functionality to scrape NWHL data
v1.31
-----
* Added functionality to automatically create docs_dir
* Added folder to store csv files
v1.33
-----
* Fixed bug with nhl changing contents of eventTypeId
* Updated ESPN scraping after they changed the layout of the pages
v1.34
-----
* Reflected change in url for ESPN scoreboard
* Deprecated NWHL usage due as pbp parser isn't applicable due to UI changes (new source unknown)
v1.35
-----
* Added nhl.scrape_function.scrape_schedule function
* Now chunk calls to the nhl schedule api
* Fixed nhl shift json endpoint
v1.36
-----
* Refactored and cleaned up code across modules
* Added names to utils.shared.Names
* Changed errors/warning to print red in the console
v1.37
-----
* Now saves scraped pages in docs_dir as a GZIP
* Only print full error summary when the number of games scraped is >= 25
* Remove hardcoded exception for Sebastian Aho. Updated process to work without it.
* Always rescrape schedule pages
v1.38
------
* Convert tri-codes from new format to old in Html PBP. Mappings stored in utils/tri_code_conversion.json.
* Added verbose option to top-level scrape functions
* Replaced default parser for HTML PBP with "html5lib" over "lxml". lxml was having issues with older games.
* Reduced chunk size in nhl.json_schedule.chunk_schedule_calls to 30 from 50. Was having some issues during tests.
v1.39
------
* Changed API endpoints
================================================
FILE: LICENSE.txt
================================================
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<https://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<https://www.gnu.org/licenses/why-not-lgpl.html>.
================================================
FILE: MANIFEST.in
================================================
include README.rst
================================================
FILE: README.rst
================================================
This repository is no longer maintained. Feel free to fork it.
==================================================================================
.. .. image:: https://badge.fury.io/py/hockey-scraper.svg
.. :target: https://badge.fury.io/py/hockey-scraper
.. .. image:: https://readthedocs.org/projects/hockey-scraper/badge/?version=latest
.. :target: https://readthedocs.org/projects/hockey-scraper/?badge=latest
.. :alt: Documentation Status
Hockey-Scraper
==============
.. inclusion-marker-for-sphinx
Purpose
-------
Scrape NHL data off the NHL API and website. This includes the Play by Play and Shift data for each game and the schedule information.
It currently supports all preseason, regular season, and playoff games from the 2007-2008 season onwards.
Prerequisites
-------------
You are going to need to have python installed for this. This should work for both python 2.7 and 3. I recommend having
from at least version 3.6.0 but earlier versions should be fine.
Installation
------------
To install all you need to do is open up your terminal and run:
::
pip install hockey_scraper
NHL Usage
---------
The full documentation can be found `here <http://hockey-scraper.readthedocs.io/en/latest/>`_.
Standard Scrape Functions
~~~~~~~~~~~~~~~~~~~~~~~~~
Scrape data on a season by season level:
::
import hockey_scraper
# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
hockey_scraper.scrape_seasons([2015, 2016], True)
# Scrapes the 2008 season without shifts and returns a dictionary containing the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')
Scrape a list of games:
::
import hockey_scraper
# Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)
# Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a Dictionary with the Pandas DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')
Scrape all games in a given date range:
::
import hockey_scraper
# Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a Csv file
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a Dictionary with the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')
The dictionary returned by setting the default argument "data_format" equal to "Pandas" is structured like:
::
{
# Both of these are always included
'pbp': pbp_df,
# This is only included when the argument 'if_scrape_shifts' is set equal to True
'shifts': shifts_df
}
Schedule
~~~~~~~~
The schedule for any past or future games can be scraped as follows:
::
import hockey_scraper
# As oppossed to the other calls the default format is 'Pandas' which returns a DataFrame
sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")
The columns returned are: `['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']`
Persistent Data
~~~~~~~~~~~~~~~
All the raw game data files retrieved can also be saved to your disk. This allows for faster rescraping (we don't need to re-retrieve them)
and the ability to parse the data yourself.
This is achieved by setting the keyword argument `docs_dir=True`. This will store the data in a directory called `~/hockey_scraper_data`.
You can provide your own directory where you want everything to be stored (it must exist beforehand). By default `docs_dir=False`.
For example, let's say we are scraping the JSON PBP data for game `2019020001 <http://statsapi.web.nhl.com/api/v1/game/2019020001/feed/live>`_.
If `docs_dir` isn't `False` it will first check if the data is already in the directory. If so, it will load in the data from that file and not make a GET
request to the NHL API. However if it doesn't exist, it will make a GET request and then save the output to the directory.
This will ensure that next time you are requesting that data it can load it from a file.
Here are some examples.
The default saving location is `~/hockey_scraper_data`.
::
# Create or try to refer to a directory in the home directory
# Will create a directory called 'hockey_scraper_data' in the home directory (if it doesn't exist)
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=True)
User defined directory
::
USER_PATH = "/...."
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)
You can override the existing files by specifying `rescrape=True`. It will retrieve all the files from source and save the newer versions to `docs_dir`.
::
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)
Live Scraping
~~~~~~~~~~~~~
Here is a simple example of a way to setup live scraping. I strongly suggest checking out
`this section <https://hockey-scraper.readthedocs.io/en/latest/live_scrape.html>`_ of the docs if you plan on using this.
::
import hockey_scraper as hs
def to_csv(game):
"""
Store each game DataFrame in a file
:param game: LiveGame object
:return: None
"""
# If the game:
# 1. Started - We recorded at least one event
# 2. Not in Intermission
# 3. Not Over
if game.is_ongoing():
# Print the description of the last event
print(game.game_id, "->", game.pbp_df.iloc[-1]['Description'])
# Store in CSV files
game.pbp_df.to_csv(f"../hockey_scraper_data/{game.game_id}_pbp.csv", sep=',')
game.shifts_df.to_csv(f"../hockey_scraper_data/{game.game_id}_shifts.csv", sep=',')
if __name__ == "__main__":
# B4 we start set the directory to store the files
# You don't have to do this but I recommend it
hs.live_scrape.set_docs_dir("../hockey_scraper_data")
# Scrape the info for all the games on 2018-11-15
games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=20)
# While all the games aren't finished
while not games.finished():
# Update for all the games currently being played
games.update_live_games(sleep_next=True)
# Go through every LiveGame object and apply some function
# You can of course do whatever you want here.
for game in games.live_games:
to_csv(game)
Contact
-------
Please contact me for any issues or suggestions. For any bugs or anything related to the code please open an issue.
Otherwise you can email me at Harryshomer@gmail.com.
Copyright
---------
::
Copyright (C) 2019-2022 Harry Shomer
This file is part of hockey_scraper
hockey_scraper is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = hockey_scraper
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
set SPHINXPROJ=hockey_scraper
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd
================================================
FILE: docs/source/cli.rst
================================================
Command Line Interface
======================
There also exists a cli tool called `hockey-scraper` which can be used to pull data. Users may find this more convenient than using python directly for simple queries.
The usage for the tool can be found below:
.. code-block:: console
usage: hockey-scraper [-h] [-t REPORTTYPE] [--shifts] [-d DATERANGE [DATERANGE ...]] [-s SEASONS [SEASONS ...]]
[-g GAMES [GAMES ...]] [-f FILEDIR] [-r] [-p]
CLI tool for the hockey_scraper project
optional arguments:
-h, --help show this help message and exit
-t REPORTTYPE, --reportType REPORTTYPE
Type of report to scrape. Either game or schedule.
--shifts Whether to include shifts.
-d DATERANGE [DATERANGE ...], --dateRange DATERANGE [DATERANGE ...]
Date range to scrape between.
-s SEASONS [SEASONS ...], --seasons SEASONS [SEASONS ...]
Seasons to scrape.
-g GAMES [GAMES ...], --games GAMES [GAMES ...]
Game IDs to scrape.
-f FILEDIR, --fileDir FILEDIR
Whether to store scraped files. If the flag is specified and no argument is passed, a directory is created
in the root. If an argument is passed with the flag the files are stored there (assuming the directory
exists).
-r, --rescrape Whether to re-scrape pages already scraped and stored in --fileDir.
-p, --preseason Whether to scrape preseason data.
CLI
~~~
.. automodule:: hockey_scraper.cli
:members:
================================================
FILE: docs/source/conf.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# hockey_scraper documentation build configuration file, created by
# sphinx-quickstart on Sun Dec 3 03:00:09 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc']
autodoc_mock_imports = ['BeautifulSoup4', 'requests', 'lxml', 'html5lib', 'pandas', 'pytest', 'pytz', 'tqdm']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = 'hockey_scraper'
copyright = '2023, Harry Shomer'
author = 'Harry Shomer'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '1.40'
# The full version, including alpha/beta/rc tags.
release = '1.40'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'nature'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# -- Options for HTMLHelp output ------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'hockey_scraperdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'hockey_scraper.tex', 'hockey\\_scraper Documentation',
'Harry Shomer', 'manual'),
]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'hockey_scraper', 'hockey_scraper Documentation',
[author], 1)
]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'hockey_scraper', 'hockey_scraper Documentation',
author, 'hockey_scraper', 'One line description of project.',
'Miscellaneous'),
]
================================================
FILE: docs/source/index.rst
================================================
Hockey-Scraper
==============
Contents
--------
.. toctree::
:maxdepth: 1
nhl_scrape_functions
live_scrape
license_link
.. include:: ../../README.rst
:start-after: inclusion-marker-for-sphinx
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/source/license_link.rst
================================================
License
=======
.. include:: ../../LICENSE.txt
================================================
FILE: docs/source/live_scrape.rst
================================================
Live Scraping
=============
Standard Usage
--------------
To get all the info for every game on a specific day we create a ScrapeLiveGames object.
::
import hockey_scraper as hs
todays_games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=15)
Once created this object will contain an attribute called 'live_games' that holds a list of LiveGame objects for that
day. LiveGame objects hold all the pertinent game information for each game. This includes the most recent
pbp and shift data for that game. Here are all the attributes for the LiveGame class:
::
class LiveGame:
"""
This is a class holds all the information for a given game
:param int game_id: The NHL game id (ex: 2018020001)
:param datetime start_time: The UTC time of when the game begins
:param str home_team: Tricode for the home team (ex: NYR)
:param str away_team: Tricode for the home team (ex: MTL)
:param int espn_id: The ESPN game id for their feed
:param str date: Date of the game (ex: 2018-10-30)
:param bool if_scrape_shifts: Whether or not you want to scrape shifts
:param str api_game_status: Current Status of the game - ["Final", "Live", "Intermission]
:param str html_game_status: Current Status of the game - ["Final", "Live", "Intermission"]
:param int intermission_time_remaining: Time remaining in the intermission. 0 if not in intermission
:param dict players: Player info for both teams
:param dict head_coaches: Head coaches for both teams
:param DataFrame _pbp_df: Holds most recent pbp data
:param DataFrame _shifts_df: Holds most recent shift data
:param DataFrame _prev_pbp_df: Holds the previous pbp data (for just in case)
:param DataFrame _prev_shifts_df: Holds the previous shift data (for just in case)
"""
Here's a simple example of scraping the games continuously for a single date. This will run until every game is finished:
::
import hockey_scraper as hs
def to_csv(game):
"""
Store each game DataFrame in a file
:param game: LiveGame object
:return: None
"""
# If the game:
# 1. Started - We recorded at least one event
# 2. Not in Intermission
# 3. Not Over
if game.is_ongoing():
# Print the description of the last event
print(game.game_id, "->", game.pbp_df.iloc[-1]['Description'])
# Store in CSV files
game.pbp_df.to_csv(f"../hockey_scraper_data/{game.game_id}_pbp.csv", sep=',')
game.shifts_df.to_csv(f"../hockey_scraper_data/{game.game_id}_shifts.csv", sep=',')
if __name__ == "__main__":
# B4 we start set the directory to store the files
hs.live_scrape.set_docs_dir("../hockey_scraper_data")
# Scrape the info for all the games on 2018-11-15
games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=20)
# While all the games aren't finished
while not games.finished():
# Update for all the games currently being played
games.update_live_games(sleep_next=True)
# Go through every LiveGame object and apply some function
# You can of course do whatever you want here.
for game in games.live_games:
to_csv(game)
In the above example, we set a directory to store the most recent version of every scraped file. We then grab the
initial game info for each game for that day. We decide we want to include shifts and to pause 15 seconds after updating
all the games. We then enter a loop that will be terminated once every game is finished. Once in the loop we first
scrape the new info for every game and then pause for the specified time (default is 15).
Once we process the new data we then, presumably, want to do something with it. Here, I decided to merely print the last
event in the game and store the newer data in files. We do this by iterating through each LiveGame object in the 'live_games'
attribute and calling the function 'to_csv'. In 'to_csv', before doing anything we check if the game is 'ongoing'.
This checks whether the game is either over or in intermission. If it is there isn't a whole lot to update. If it's
neither we print the last event and then store the data for both the pbp & shifts.
Another option we have is for the program to sleep until the first game starts. Unless you want to start this yourself
everyday, you'll probably be scheduling it to start at some time every day. This means from when you start the program
to when the first game starts may be a significant amount of time (fwiw, it will just loop and not scrape anything). But
you can set it to sleep until the first game is scheduled to start. This can be done by setting the keyword 'sleep_next'
to True. This check to see if the only games left are scheduled games yet to start. If so it sleeps until the next
earliest game starts.
::
# Causes the program to sleep until the first game starts
games.update_live_games(sleep_next=True)
You can also specify which games you want to scrape for that day (maybe you only care about one game), by setting the
keyword 'game_ids' equal to a list of NHL Game ID's of the games you want when instantiating a ScrapeLiveGame object. You
can of course to choose to filter it however you want as the list of LiveGame objects is a attribute of the object. Either
way I strongly suggest creating a ScrapeLiveGames object and then either extracting the game you want or filtering it
rather than instantiating a LiveGame object (you will be on the hook for a lot of information)
::
# Only want those those two games.
games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=15, game_ids=[2018020280, 2018020281])
Further Usage
-------------
If you would like more control over what you are doing then you should be dealing directly with LiveGame objects. As
mentioned previously, still use ScrapeLiveGames to get the game info but you can then just extract the list of games
and do as you please.
Using the previous example here we are scraping each game individually:
::
# Scrape the info for all the games on 2018-11-15
games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=15)
while not games.finished():
# Go through every LiveGame object
for game in games.live_games:
# Scrape each game individually
game.scrape()
# Apply some function to every game
to_csv(game)
# Pause after each scraping chunk
time.sleep(15)
If you don't trust when I choose to not scrape (when the game is over or in intermission), you can make the keyword
'force' equal to True. This will re-scrape it as long as the game already started.
::
game.scrape(force=True)
This will override everything and will attempt to scrape the game no matter what. This means you are have to be on top
of when to stop scraping the game. You are also on the hook for any potential errors.
You may also want to handle things like the start time of games yourself. As mentioned using ScrapeLiveGames we can
set 'sleep_next' equal to True to sleep until the next game starts if no game is going on. You can also use the keyword
'start_time' for a LiveGame object that will give you a datetime object with the scheduled starting time for a given
game in UTC time. Lastly, you can also use the function 'time_until_game()' that will return how many seconds until the
game starts.
::
>>> games = hs.ScrapeLiveGames("2018-11-09", if_scrape_shifts=True, pause=15)
>>> games.live_games[0].start_time
datetime.datetime(2018, 11, 10, 0, 0)
>>> games.live_games[0].time_until_game()
64599
You can use this how you please. For example, you may want to create a separate thread for each game and have it sleep
until the game starts. Or maybe you want to use it another way. Either way it's there.
There are also a few methods that return the give information about the current status of the game. The first two return
whether the game is in intermission or whether it's over.
::
def is_game_over(self, prev=False):
"""
Check if the game is over for both the html and json pbp. If prev=True check for the previous event
:param prev: Check the game status for the previous event
:return: Boolean - True if over
"""
if not prev:
return self.html_game_status == self.api_game_status == "Final"
else:
return self.prev_html_game_status == self.prev_api_game_status == "Final"
def is_intermission(self, prev=False):
"""
Check if in intermission for both the html and json pbp. If prev=True check for the previous event
:param prev: Check the game status for the previous event
:return: Boolean - True if yes
"""
if not prev:
return self.html_game_status == self.api_game_status == "Intermission"
else:
return self.prev_html_game_status == self.prev_api_game_status == "Intermission"
Two things probably stand out is the option to check the status for the previous event (why do we care what it was earlier?)
and the fact that two statuses exist.
First let's talk about the two status. There are currently two pages that always need to be scraped for for data for
the Play-By-Play. One is an html file and one is the json api. The issue is that the api updates faster than the html.
So the api may say the game is over but the html version is still missing a few events. For this reason we need to check
that both are aligned.
The 'prev' keyword for both comes into play when we consider the last method 'is_ongoing'. This checks whether the game
is currently being played. Which means the game: Started, is not in intermission, and is not over. Here's the method:
::
def is_ongoing(self):
"""
Check if the game is currently being played.
The logic here is that we run into an issue with intermission and the end of game. If the game is just changed
to Final or Intermission the end user will assume the game isn't ongoing and will not update with the most
recent events. They'll be delayed for intermission and won't place it at all for Final games. So we use the
previous event as a guide. If it's currently in intermission or Final - we check the previous status. If it's
the same the user already has the data. Otherwise we 'lie' and say the game is still ongoing.
:return: Boolean
"""
# The game is currently being played
if self.time_until_game() == 0 and not self.is_game_over() and not self.is_intermission() and self.pbp_df.shape[0] > 0:
return True
# Since it's not being played check if game is over and if it wasn't for the previous
elif self.is_game_over() and not self.is_game_over(prev=True):
return True
# Check if it's in intermission and the if it was for the previous event
elif self.is_intermission() and not self.is_intermission(prev=True):
return True
else:
return False
I recommend looking at the function definition written above. Basically checking the previous event makes sure we got
the most recent event if the game is over or in intermission. So if the last status was intermission and this one is
too we know we don't need to scrape. But if the last status wasn't the means we are missing some information (presumably
something happened between the last event and the end of the period).
Live Scrape
~~~~~~~~~~~
.. automodule:: hockey_scraper.nhl.live_scrape
:members:
================================================
FILE: docs/source/nhl_scrape_functions.rst
================================================
NHL Scraping Functions
======================
Scraping
--------
There are three ways to scrape games:
\1. *Scrape by Season*:
Scrape games on a season by season level (Note: A given season is referred to by the first of the two years it spans.
So you would refer to the 2016-2017 season as 2016).
::
import hockey_scraper
# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file (both are equivalent!!!)
hockey_scraper.scrape_seasons([2015, 2016], True)
hockey_scraper.scrape_seasons([2015, 2016], True, data_format='Csv')
# Scrapes the 2008 season without shifts and returns a dictionary with the DataFrame
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')
# Scrapes 2014 season without shifts including preseason games
hockey_scraper.scrape_seasons([2014], False, preseason=True)
\2. *Scrape by Game*:
Scrape a list of games provided. All game ID's can be found using `this link
<https://statsapi.web.nhl.com/api/v1/schedule?startDate=2016-10-03&endDate=2017-06-20>`_
(you need to play around with the dates in the url).
::
import hockey_scraper
# Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)
# Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a a dictionary with the DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')
\3. *Scrape by Date Range*:
Scrape all games between a specified date range. All dates must be written in a "yyyy-mm-dd" format.
::
import hockey_scraper
# Scrapes all games between date range without shifts and stores the data in a Csv file (both are equivalent!!!)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False, preseason=False)
# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a a dictionary with the DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')
# Scrapes all games from 2014-09-15 to 2014-11-01 with shifts including preseason games
hockey_scraper.scrape_date_range('2014-09-15', '2014-11-01', True, preseason=True)
\4. *Scrape Schedule*
Scrape the schedule between any given date range for past and future games. All dates must be written in a "yyyy-mm-dd" format. The default data_format is equal to 'Pandas'. This returns a DataFrame and not a dictionary like others. The columns returned are: ['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']
::
import hockey_scraper
sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")
**Persistent Data**
The option also exists to save the scraped files in another directory. This would speed up re-scraping any games since
we already have the docs needed for it. It would also be useful if you want to grab any extra information from them
as some of them contain a lot more information. In order to do this you can use the 'docs_dir' keyword. One can specify
the boolean value True to either create or refer (to an already created) directory in the home directory called
hockey_scraper data. Or you can specify the directory with the string of the path. If this is a valid directory,
when scraping each page it would first check if it was already scraped (therefore saving us the time of scraping it).
If it hasn't been scraped yet, it will then grab it from the source and save it in the given directory.
Sometimes you may have already scraped and saved a file but you want to re-scrape it from the source and save it again
(this may seem strange but the NHL frequently fixes mistakes so you may want to update what you have). This can be done
by setting the keyword argument rescrape equal to True.
::
import hockey_scraper
# Path to the given directory
# Can also be True if you want the scraper to take care of it
USER_PATH = "/...."
# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
# Also includes a path for an existing directory for the scraped files to be placed in or retrieved from.
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)
# Once could chose to re-scrape previously saved files by making the keyword argument rescrape=True
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)
**Additional Notes**:
\1. For all three functions you must specify if you want to also scrape shifts (TOI tables) with a boolean. The Play by
Play is automatically scraped.
\2. When scraping by date range or by season, preseason games aren't scraped unless otherwise specified. Also preseason
games are scraped at your own risk. There is no guarantee it will work or that the files are even there!!!
\3. For all three functions the scraped data is deposited into a Csv file unless it's specified to return the DataFrames
\4. The Dictionary with the DataFrames (and scraping errors) returned by setting data_format='Pandas' is structured like:
::
{
# Both of these are always included
'pbp': pbp_df,
# This is only included when the argument 'if_scrape_shifts' is set equal to True
'shifts': shifts_df
}
\5. When including a directory, it must be a valid directory. It will not create it for you. You'll get an error message
but otherwise it will scrape as if no directory was provided.
Scrape Functions
~~~~~~~~~~~~~~~~
.. automodule:: hockey_scraper.nhl.scrape_functions
:members:
Game Scraper
~~~~~~~~~~~~
.. automodule:: hockey_scraper.nhl.game_scraper
:members:
Html PBP
~~~~~~~~
.. automodule:: hockey_scraper.nhl.pbp.html_pbp
:members:
Json PBP
~~~~~~~~
.. automodule:: hockey_scraper.nhl.pbp.json_pbp
:members:
Espn PBP
~~~~~~~~
.. automodule:: hockey_scraper.nhl.pbp.espn_pbp
:members:
Json Shifts
~~~~~~~~~~~
.. automodule:: hockey_scraper.nhl.shifts.json_shifts
:members:
Html Shifts
~~~~~~~~~~~
.. automodule:: hockey_scraper.nhl.shifts.html_shifts
:members:
Schedule
~~~~~~~~
.. automodule:: hockey_scraper.nhl.json_schedule
:members:
Playing Roster
~~~~~~~~~~~~~~
.. automodule:: hockey_scraper.nhl.playing_roster
:members:
Save Pages
~~~~~~~~~~
.. automodule:: hockey_scraper.utils.save_pages
:members:
Shared Functions
~~~~~~~~~~~~~~~~
.. automodule:: hockey_scraper.utils.shared
:members:
================================================
FILE: docs/source/nwhl_scrape_functions.rst
================================================
NWHL Scraping Functions
=======================
Scraping
--------
There are three ways to scrape games:
\1. *Scrape by Season*:
Scrape games on a season by season level (Note: A given season is referred to by the first of the two years it spans.
So you would refer to the 2016-2017 season as 2016).
::
import hockey_scraper
# Scrapes the 2015 & 2016 season and stores the data in a Csv file (both are equivalent!!!)
hockey_scraper.nwhl.scrape_seasons([2015, 2016])
hockey_scraper.nwhl.scrape_seasons([2015, 2016], data_format='Csv')
# Scrapes the 2008 season and returns a Pandas DataFrame
scraped_data = hockey_scraper.nwhl.scrape_seasons([2017], data_format='Pandas')
\2. *Scrape by Game*:
Scrape a list of games provided.
::
import hockey_scraper
# Scrapes games and store in a Csv file
hockey_scraper.nwhl.scrape_games([14694271, 14814946, 14689491], True)
# Scrapes games and return DataFrame with data
scraped_data = hockey_scraper.nwhl.scrape_games([14689624, 18507470, 20575219, 22207005], data_format='Pandas')
\3. *Scrape by Date Range*:
Scrape all games between a specified date range. All dates must be written in a "yyyy-mm-dd" format.
::
import hockey_scraper
# Scrapes all games between 2016-10-10 and 2017-01-01 and returns a Pandas DataFrame containing the pbp
hockey_scraper.nwhl.scrape_date_range('2016-10-10', '2017-01-01', data_format='pandas')
Scrape Functions
~~~~~~~~~~~~~~~~
.. automodule:: hockey_scraper.nwhl.scrape_functions
:members:
Html Schedule
~~~~~~~~~~~~~
.. automodule:: hockey_scraper.nwhl.html_schedule
:members:
Json PBP
~~~~~~~~
.. automodule:: hockey_scraper.nwhl.json_pbp
:members:
================================================
FILE: hockey_scraper/__init__.py
================================================
from .nhl.live_scrape import ScrapeLiveGames, LiveGame
from .nhl.scrape_functions import scrape_games, scrape_date_range, scrape_seasons, scrape_schedule
from .nhl import live_scrape
from .utils import shared
from . import utils
#from .nwhl import scrape_schedule as nwhl_scrape_schedule
================================================
FILE: hockey_scraper/cli.py
================================================
"""
Interface for running cli commands
"""
import sys
import argparse
from .utils.shared import print_error
from .nhl.scrape_functions import scrape_games, scrape_date_range, scrape_seasons, scrape_schedule
def validate_args(user_args):
"""
Validate that the passed args are sufficient enough to call the corresponding scraping function.
Detailed checks are done later by the packcage after the specific function is called.
:param user_args: ArgumentParser object
:return: Boolean indicating if args are good
"""
if user_args.reportType.lower() not in ['game', 'schedule']:
print_error("Invalid parameter passed for -t/--reportType. Must be either `game` or `schedule`")
return False
# One of 3 not empty
if not any([user_args.dateRange, user_args.seasons, user_args.games]):
print_error("Must supply one of the following args: -d/--dateRange, -g/--games, or -s/--seasons. You passed none.")
return False
# Date range - should only pass two
# Whether or not they are valid is assessed later after calling one of the functions
if user_args.dateRange and len(user_args.dateRange) != 2:
print_error("Only 2 parameters should be passed for -d/--dateRange. You passed {}.".format(len(user_args.dateRange)))
return False
### Everything else should be handled by just calling the functions
return True
def run_cmd(user_args):
"""
Run the appropriate command. Args already validated by this point.
:param user_args: ArgumentParser object
:return: None
"""
if user_args.reportType.lower() == 'schedule':
scrape_schedule(user_args.dateRange[0], user_args.dateRange[1], rescrape=user_args.rescrape, docs_dir=user_args.fileDir, data_format='csv')
else:
if user_args.dateRange:
scrape_date_range(user_args.dateRange[0], user_args.dateRange[1], user_args.shifts,
docs_dir=user_args.fileDir, rescrape=user_args.rescrape, preseason=user_args.preseason)
elif user_args.seasons:
scrape_seasons(user_args.seasons, user_args.shifts, docs_dir=user_args.fileDir, rescrape=user_args.rescrape, preseason=user_args.preseason)
else:
scrape_games(user_args.games, user_args.shifts, docs_dir=user_args.fileDir, rescrape=user_args.rescrape)
def main():
parser = argparse.ArgumentParser(description='CLI tool for the hockey_scraper project')
### Default to scraping games without shifts
parser.add_argument('-t', "--reportType", help='Type of report to scrape. Either game or schedule.', default='game', type=str, required=False)
parser.add_argument("--shifts", help='Whether to include shifts.', action='store_true', default=False, required=False)
### Must pass one of these
parser.add_argument('-d', "--dateRange", help='Date range to scrape between.', nargs='+', type=str, required=False, default=[])
parser.add_argument('-s', "--seasons", help='Seasons to scrape.', nargs='+', type=int, required=False, default=[])
parser.add_argument('-g', "--games", help='Game IDs to scrape.', nargs='+', type=str, required=False, default=[])
### Additonal optional args
parser.add_argument('-f', "--fileDir",
help='''Whether to store scraped files. If the flag is specified and no argument is passed, a directory is created in the root.
If an argument is passed with the flag the files are stored there (assuming the directory exists).
''',
default=None, type=str, required=False)
parser.add_argument("-r", "--rescrape", help='Whether to re-scrape pages already scraped and stored in --fileDir.',
action='store_true', default=False, required=False)
parser.add_argument("-p", "--preseason", help='Whether to scrape preseason data.', action='store_true', default=False, required=False)
args = parser.parse_args()
if validate_args(args):
run_cmd(args)
if __name__ == "__main__":
main()
================================================
FILE: hockey_scraper/nhl/__init__.py
================================================
================================================
FILE: hockey_scraper/nhl/game_scraper.py
================================================
"""
This module contains code to scrape data for a single game
"""
import pandas as pd
import hockey_scraper.nhl.pbp.espn_pbp as espn_pbp
import hockey_scraper.nhl.pbp.html_pbp as html_pbp
import hockey_scraper.nhl.pbp.json_pbp as json_pbp
import hockey_scraper.nhl.playing_roster as playing_roster
import hockey_scraper.nhl.shifts.html_shifts as html_shifts
import hockey_scraper.nhl.shifts.json_shifts as json_shifts
import hockey_scraper.utils.shared as shared
broken_shifts_games = []
broken_pbp_games = []
players_missing_ids = []
missing_coords = []
pbp_columns = [
'Game_Id', 'Date', 'Period', 'Event', 'Description', 'Time_Elapsed', 'Seconds_Elapsed', 'Strength',
'Ev_Zone', 'Type', 'Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID',
'p3_name', 'p3_ID', 'awayPlayer1', 'awayPlayer1_id', 'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3',
'awayPlayer3_id', 'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id', 'awayPlayer6',
'awayPlayer6_id', 'homePlayer1', 'homePlayer1_id', 'homePlayer2', 'homePlayer2_id', 'homePlayer3',
'homePlayer3_id', 'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id', 'homePlayer6',
'homePlayer6_id', 'Away_Players', 'Home_Players', 'Away_Score', 'Home_Score', 'Away_Goalie',
'Away_Goalie_Id', 'Home_Goalie', 'Home_Goalie_Id', 'xC', 'yC', 'Home_Coach', 'Away_Coach'
]
def check_goalie(row):
"""
Checks for bad goalie names (you can tell by them having no player id)
:param row: df row
:return: None
"""
if row['Away_Goalie'] != '' and row['Away_Goalie_Id'] is None:
if [row['Away_Goalie'], row['Game_Id']] not in players_missing_ids:
players_missing_ids.extend([[row['Away_Goalie'], row['Game_Id']]])
if row['Home_Goalie'] != '' and row['Home_Goalie_Id'] is None:
if [row['Home_Goalie'], row['Game_Id']] not in players_missing_ids:
players_missing_ids.extend([[row['Home_Goalie'], row['Game_Id']]])
def get_players_json(game_json):
"""
Return dict of players for that game by team
:param players_json: players section of json
:return: {team -> players}
"""
players = {"Home": {}, "Away": {}}
homeid = game_json['homeTeam']['id']
awayid = game_json['awayTeam']['id']
for player in game_json['rosterSpots']:
<<<<<<< HEAD
if player['teamId'] == homeid:
players["Home"][str(player["firstName"] + " " + player["lastName"]).upper()] = {
"id": player['playerId'],
"last_name": player["lastName"].upper()
}
if player['teamId'] == awayid:
players["Away"][str(player["firstName"] + " " + player["lastName"]).upper()] = {
"id": player['playerId'],
"last_name": player["lastName"].upper()
=======
# print(player)
if player['teamId'] == homeid:
players["Home"][str(player["firstName"]['default'] + " " + player["lastName"]['default']).upper()] = {
"id": player['playerId'],
"last_name": player["lastName"]['default'].upper()
}
if player['teamId'] == awayid:
players["Away"][str(player["firstName"]['default'] + " " + player["lastName"]['default']).upper()] = {
"id": player['playerId'],
"last_name": player["lastName"]['default'].upper()
>>>>>>> 1029299054fbe671c3ca9c5d413cdfd102416853
}
return players
# TODO: Assumes no two players on the same team can have the same name
# Could potentially differentiate by number or position
def combine_players_lists(json_players, roster_players, game_id):
"""
Combine the json list of players (which contains id's) with the list in the roster html
:param json_players: dict of all players with id's
:param roster_players: dict with home and and away keys for players
:param game_id: id of game
:return: dict containing home and away keys -> which contains list of info on each player
"""
players = {'Home': dict(), 'Away': dict()}
for venue in players:
for player in roster_players[venue]:
try:
name = shared.fix_name(player[2])
player_id = json_players[venue][name]['id']
players[venue][name] = {'id': player_id, 'number': player[0], 'last_name': json_players[venue][name]['last_name']}
except KeyError as e:
# If he was listed as a scratch and not a goalie (check_goalie deals with goalies)
# As a whole the scratch list shouldn't be trusted but if a player is missing an id # and is on the
# scratch list I'm willing to assume that he didn't play
if not player[3] and player[1] != 'G':
player.extend([game_id])
players_missing_ids.extend([[player[2], player[4]]])
players[venue][name] = {'id': None, 'number': player[0], 'last_name': ''}
return players
def get_teams_and_players(game_json, roster, game_id):
"""
Get list of players and teams for game
:param game_json: json pbp for game
:param roster: players from roster html
:param game_id: id for game
:return: dict for both - players and teams
"""
try:
teams = json_pbp.get_teams(game_json)
player_ids = get_players_json(game_json)
players = combine_players_lists(player_ids, roster['players'], game_id)
except Exception as e:
shared.print_error('Problem with getting the teams or players')
return None, None
return players, teams
def combine_html_json_pbp(json_df, html_df, game_id, date):
"""
Join both data sources. First try merging on event id (which is the DataFrame index) if both DataFrames have the
same number of rows. If they don't have the same number of rows, merge on: Period', Event, Seconds_Elapsed, p1_ID.
:param json_df: json pbp DataFrame
:param html_df: html pbp DataFrame
:param game_id: id of game
:param date: date of game
:return: finished pbp
"""
# Don't need those columns to merge in
json_df = json_df.drop(['p1_name', 'p2_name', 'p2_ID', 'p3_name', 'p3_ID'], axis=1)
try:
# If they aren't equal it's usually due to the HTML containing a challenge event
if html_df.shape[0] == json_df.shape[0]:
json_df = json_df[['period', 'event', 'seconds_elapsed', 'xC', 'yC']]
game_df = pd.merge(html_df, json_df, left_index=True, right_index=True, how='left')
else:
# We always merge if they aren't equal but we check if it's due to a challenge so we can print out a better
# warning message for the user.
# NOTE: May be slightly incorrect. It's possible for there to be a challenge and another issue for one game.
if 'CHL' in list(html_df.Event):
shared.print_warning("The number of rows in the Html and Json pbp are different because the"
" Json pbp, for some reason, does not include challenges. Will instead merge on "
"Period, Event, Time, and p1_id.")
else:
shared.print_warning("The number of rows in the Html and json pbp are different because "
"someone fucked up. Will instead merge on Period, Event, Time, and p1_id.")
# Actual Merging
game_df = pd.merge(html_df, json_df, left_on=['Period', 'Event', 'Seconds_Elapsed', 'p1_ID'],
right_on=['period', 'event', 'seconds_elapsed', 'p1_ID'], how='left')
# This is always done - because merge doesn't work well with shootouts
game_df = game_df.drop_duplicates(subset=['Period', 'Event', 'Description', 'Seconds_Elapsed'])
except Exception as e:
shared.print_error('Problem combining Html Json pbp for game {}'.format(game_id))
return
game_df['Game_Id'] = game_id[-5:]
game_df['Date'] = date
return pd.DataFrame(game_df, columns=pbp_columns)
def combine_espn_html_pbp(html_df, espn_df, game_id, date, away_team, home_team):
"""
Merge the coordinate from the espn feed into the html DataFrame
Can't join here on event_id because the plays are often out of order and pre-2009 are often missing events.
:param html_df: DataFrame with info from html pbp
:param espn_df: DataFrame with info from espn pbp
:param game_id: json game id
:param date: ex: 2016-10-24
:param away_team: away team
:param home_team: home team
:return: merged DataFrame
"""
if espn_df is not None and not espn_df.empty:
try:
game_df = pd.merge(html_df, espn_df, left_on=['Period', 'Seconds_Elapsed', 'Event'],
right_on=['period', 'time_elapsed', 'event'], how='left')
# Shit happens
game_df = game_df.drop_duplicates(subset=['Period', 'Event', 'Description', 'Seconds_Elapsed'])
df = game_df.drop(['period', 'time_elapsed', 'event'], axis=1)
except Exception as e:
shared.print_error('Error combining espn and html pbp for game {}'.format(game_id))
return None
else:
df = html_df
df['Game_Id'] = game_id[-5:]
df['Date'] = date
df['Away_Team'] = away_team
df['Home_Team'] = home_team
return pd.DataFrame(df, columns=pbp_columns)
def scrape_pbp_live(game_id, date, roster, game_json, players, teams, espn_id=None):
"""
Wrapper for scraping the live pbp
:param game_id: json game id
:param date: date of game
:param roster: list of players in pre game roster
:param game_json: json pbp for game
:param players: dict of players
:param teams: dict of teams
:param espn_id: Game Id for the espn game. Only provided when live scraping
:return: Tuple - pbp & status
"""
html_df, status = html_pbp.scrape_game_live(game_id, players, teams)
game_df = scrape_pbp(game_id, date, roster, game_json, players, teams, espn_id=espn_id, html_df=html_df)
return game_df, status
def scrape_pbp(game_id, date, roster, game_json, players, teams, espn_id=None, html_df=None):
"""
Scrape the Pbp info. The HTML is always scraped.
The Json is parse whe season >= 2010 and there are plays. Otherwise ESPN is gotten to supplement
the HTML with coordinate.
The espn_id and the html data can be fed as keyword argument to speed up execution. This is used by
the live game scraping class.
:param game_id: json game id
:param date: date of game
:param roster: list of players in pre game roster
:param game_json: json pbp for game
:param players: dict of players
:param teams: dict of teams
:param espn_id: Game Id for the espn game. Only provided when live scraping
:param html_df: Can provide DataFrame for html. Only done for live-scraping
:return: DataFrame with info or None if it fails
"""
# Coordinates are only available in json from 2010 onwards
if int(str(game_id)[:4]) >= 2010 and len(game_json['plays']) > 0:
json_df = json_pbp.parse_json(game_json, game_id)
else:
json_df = None
# For live sometimes the json lags the html so if given we don't bother
if not isinstance(html_df, pd.DataFrame):
html_df = html_pbp.scrape_game(game_id, players, teams)
if html_df is None or html_df.empty:
return None
# Check if the json is missing the plays...if it is scrape ESPN for the coordinates
if json_df is None or json_df.empty:
espn_df = espn_pbp.scrape_game(date, teams['Home'], teams['Away'], game_id=espn_id)
game_df = combine_espn_html_pbp(html_df, espn_df, str(game_id), date, teams['Away'], teams['Home'])
# Sometimes espn is corrupted so can't get coordinates
if espn_df is None or espn_df.empty:
missing_coords.extend([[game_id, date]])
else:
game_df = combine_html_json_pbp(json_df, html_df, str(game_id), date)
if game_df is not None:
game_df['Home_Coach'] = roster['head_coaches']['Home']
game_df['Away_Coach'] = roster['head_coaches']['Away']
return game_df
def scrape_shifts(game_id, players, date):
"""
Scrape the Shift charts (or TOI tables)
:param game_id: json game id
:param players: dict of players with numbers and id's
:param date: date of game
:return: DataFrame with info or None if it fails
"""
shifts_df = None
# Control for fact that shift json is only available from 2010 onwards
if shared.get_season(date) >= 2010:
shifts_df = json_shifts.scrape_game(game_id)
if shifts_df is None or shifts_df.empty:
shifts_df = html_shifts.scrape_game(game_id, players)
if shifts_df is None or shifts_df.empty:
shared.print_error("Unable to scrape shifts for game " + game_id)
broken_shifts_games.extend([[game_id, date]])
return None
shifts_df['Date'] = date
return shifts_df
def scrape_game(game_id, date, if_scrape_shifts):
"""
This scrapes the info for the game.
The pbp is automatically scraped, and the whether or not to scrape the shifts is left up to the user.
:param game_id: game to scrap
:param date: ex: 2016-10-24
:param if_scrape_shifts: Boolean indicating whether to also scrape shifts
:return: DataFrame of pbp info
(optional) DataFrame with shift info otherwise just None
"""
print(' '.join(['Scraping Game ', game_id, date]))
shifts_df = None
roster = playing_roster.scrape_roster(game_id)
game_json = json_pbp.get_pbp(game_id) # Contains both player info (id's) and plays
players, teams = get_teams_and_players(game_json, roster, game_id)
# Game fails without any of these
if not roster or not game_json or not teams or not players:
broken_pbp_games.extend([[game_id, date]])
if if_scrape_shifts:
broken_shifts_games.extend([[game_id, date]])
return None, None
pbp_df = scrape_pbp(game_id, date, roster, game_json, players, teams)
# Only scrape shifts if asked and pbp is good
if if_scrape_shifts and pbp_df is not None:
shifts_df = scrape_shifts(game_id, players, date)
if pbp_df is None:
broken_pbp_games.extend([[game_id, date]])
return pbp_df, shifts_df
================================================
FILE: hockey_scraper/nhl/json_schedule.py
================================================
"""
This module contains functions to scrape the json schedule for any games or date range
"""
import json
from pytz import timezone
from datetime import datetime, timedelta
import hockey_scraper.utils.shared as shared
from tqdm import tqdm
# TODO: Currently rescraping page each time since the status of some games may have changed
# (e.g. Scraped on 2020-01-20 and game on 2020-01-21 was not Final...when use old page again will still think not Final)
# Need to find a more elegant way of doing this (Metadata???)
def get_schedule(date):
"""
Scrapes games in date range
Ex: https://api-web.nhle.com/v1/schedule/2011-06-20
:param date: scrape from this date
:return: raw json of schedule of date range
"""
page_info = {
"url": 'https://api-web.nhle.com/v1/schedule/{a}'.format(a=date),
"name": "Schedule_" + date,
"type": "json_schedule",
"season": shared.get_season(date),
}
return json.loads(shared.get_file(page_info, force=True))
def chunk_schedule_calls(from_date, to_date):
"""
Due to new API, we have to inividually GET games by week
We filter out games not in range for the final week
:param date_from: scrape from this date
:param date_to: scrape until this date
:return: raw json of schedule of date range
"""
sched = []
days_per_call = 7
from_date = datetime.strptime(from_date, "%Y-%m-%d")
to_date = datetime.strptime(to_date, "%Y-%m-%d")
num_days = (to_date - from_date).days + 1 # +1 since difference is looking for total number of days
for offset in tqdm(range(0, num_days, days_per_call), "Scraping Schedule"):
date_chunk = datetime.strftime(from_date + timedelta(days=offset), "%Y-%m-%d")
chunk_sched = get_schedule(date_chunk)['gameWeek']
sched.append(chunk_sched)
return sched
def scrape_schedule(date_from, date_to, preseason=False, not_over=False):
"""
Calls getSchedule and scrapes the raw schedule Json
We filter out games not in range. Due to how new schedule API works
:param date_from: scrape from this date
:param date_to: scrape until this date
:param preseason: Boolean indicating whether include preseason games (default if False)
:param not_over: Boolean indicating whether we scrape games not finished.
Means we relax the requirement of checking if the game is over.
:return: list with all the game id's
"""
print("Scraping the schedule between {} and {}...please give it a moment".format(date_from, date_to))
est = timezone("America/New_York")
# We need to include the timezone and cover the entire day
fds = list(map(int, date_from.split("-")))
fdate_est = datetime(fds[0], fds[1], fds[2], 0, 0, tzinfo=est)
tds = list(map(int, date_to.split("-")))
tdate_est = datetime(tds[0], tds[1], tds[2], 23, 59, tzinfo=est)
schedule = []
schedule_json = chunk_schedule_calls(date_from, date_to)
for chunk in schedule_json:
for day in chunk:
for game in day['games']:
game_id = int(str(game['id'])[5:])
# TODO: Confirm if OFF is correct
# Check game is over or scraping live
status_cond = game['gameState'] == 'OFF' or not_over
# No preseason or "special" games
valid_game_cond = (game_id >= 20000 or preseason) and game_id < 40000
# Within specified date ranges
game_date = datetime.strptime(game['startTimeUTC'], "%Y-%m-%dT%H:%M:%S%z")
date_cond = fdate_est <= game_date.astimezone(est) <= tdate_est
if status_cond and valid_game_cond and date_cond:
schedule.append({
"game_id": game['id'],
"date": day['date'],
"start_time": datetime.strptime(game['startTimeUTC'][:-1], "%Y-%m-%dT%H:%M:%S"),
"venue": game['venue'].get('default'),
"home_team": shared.convert_tricode(game['homeTeam']['abbrev']),
"away_team": shared.convert_tricode(game['awayTeam']['abbrev']),
"home_score": game['homeTeam'].get("score"),
"away_score": game['awayTeam'].get("score"),
"status": game["gameState"]
})
return schedule
def get_dates(games):
"""
Given a list game_ids it returns the dates for each game.
We sort all the games and retrieve the schedule from the beginning of the season from the earliest game
until the end of most recent season.
:param games: list with game_id's ex: 2016020001
:return: list with game_id and corresponding date for all games
"""
today = datetime.today()
# Determine oldest and newest game
games = list(map(str, games))
games.sort()
date_from = shared.season_start_bound(games[0][:4])
year_to = int(games[-1][:4])
# If the last game is part of the ongoing season then only request the schedule until Today
# We get strange errors if we don't do it like this
if year_to == shared.get_season(datetime.strftime(today, "%Y-%m-%d")):
date_to = '-'.join([str(today.year), str(today.month), str(today.day)])
else:
date_to = datetime.strftime(shared.season_end_bound(year_to+1), "%Y-%m-%d") # Newest game in sample
# TODO: Assume true is live here -> Workaround
schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)
# Only return games we want in range
games_list = []
for game in schedule:
if str(game['game_id']) in games:
games_list.extend([game])
return games_list
================================================
FILE: hockey_scraper/nhl/live_scrape.py
================================================
"""
Module to scrape live game info
"""
import datetime
import time
import warnings
import pandas as pd
import hockey_scraper.nhl.game_scraper as game_scraper
import hockey_scraper.nhl.json_schedule as json_schedule
import hockey_scraper.nhl.pbp.espn_pbp as espn_pbp
import hockey_scraper.nhl.pbp.json_pbp as json_pbp
import hockey_scraper.nhl.playing_roster as playing_roster
import hockey_scraper.utils.shared as shared
def set_docs_dir(user_dir):
"""
Set the docs directory
:param user_dir: User specified directory for storing saves scraped files
:return: None
"""
# We always want to rescrape since the files are being updated constantly
shared.if_rescrape(True)
shared.add_dir(user_dir)
def check_date_format(date):
"""
Verify the date format. If wrong raises a ValueError
:param date: User supplied date
:return: None
"""
try:
time.strptime(date, "%Y-%m-%d")
except ValueError:
raise ValueError("Error: Incorrect format given for dates. They must be given like 'yyyy-mm-dd' "
"(ex: '2016-10-01').")
# TODO: Should I denote more member variables as private?
class LiveGame:
"""
This is a class holds all the information for a given game
:param int game_id: The NHL game id (ex: 2018020001)
:param datetime start_time: The UTC time of when the game begins
:param str home_team: Tricode for the home team (ex: NYR)
:param str away_team: Tricode for the home team (ex: MTL)
:param int espn_id: The ESPN game id for their feed
:param str date: Date of the game (ex: 2018-10-30)
:param bool if_scrape_shifts: Whether or not you want to scrape shifts
:param str api_game_status: Current Status of the game - ["Final", "Live", "Intermission]
:param str html_game_status: Current Status of the game - ["Final", "Live", "Intermission"]
:param int intermission_time_remaining: Time remaining in the intermission. 0 if not in intermission
:param dict players: Player info for both teams
:param dict head_coaches: Head coaches for both teams
:param DataFrame _pbp_df: Holds most recent pbp data
:param DataFrame _shifts_df: Holds most recent shift data
:param DataFrame _prev_pbp_df: Holds the previous pbp data (for just in case)
:param DataFrame _prev_shifts_df: Holds the previous shift data (for just in case)
"""
def __init__(self, game_id, start_time, home_team, away_team, status, espn_id, date, if_scrape_shifts):
""" Constructor """
# Given upon creation
self.game_id = game_id
self.start_time = start_time
self.home_team = home_team
self.away_team = away_team
self.espn_id = espn_id
self.date = date
self.if_scrape_shifts = if_scrape_shifts
# Html pbp is behind the json (json updates faster)
self.api_game_status = status
self.html_game_status = "Live"
self.prev_api_game_status = status
self.prev_html_game_status = "Live"
self.intermission_time_remaining = 0
# We know nothing to start off
self.players = None
self.head_coaches = None
# Pbp and shift data - Will be filled in later
# Also hold previous pair for checking for changes
self._pbp_df = pd.DataFrame()
self._shifts_df = pd.DataFrame()
self._prev_pbp_df = pd.DataFrame()
self._prev_shifts_df = pd.DataFrame()
# Object creation message
print("The LiveGame object for game {game_id} has been created. ".format(game_id=game_id), end="")
if self.time_until_game() <= 0:
print("The game has started.")
else:
print("The game starts in {time} seconds.".format(time=self.time_until_game()))
@property
def pbp_df(self):
if isinstance(self._pbp_df, pd.DataFrame):
return self._pbp_df
else:
return pd.DataFrame()
@property
def shifts_df(self):
if isinstance(self._shifts_df, pd.DataFrame):
return self._shifts_df
else:
return pd.DataFrame()
@property
def prev_pbp_df(self):
if isinstance(self._prev_pbp_df, pd.DataFrame):
return self._prev_pbp_df
else:
return pd.DataFrame()
@property
def prev_pbp_df(self):
if isinstance(self._prev_shifts_df, pd.DataFrame):
return self._prev_shifts_df
else:
return pd.DataFrame()
def scrape(self, force=False):
"""
Scrape the given game. Check if currently ongoing or started
:param bool force: Whether or not to force it to scrape even if it's over
:return: None
"""
# 1. force = False: If the game hasn't eclipsed the starting time or is over we don't scrape
# 2. force = True: We always scrape
if (self.time_until_game() == 0 and not self.is_game_over(prev=True)) or force:
self.scrape_live_game(force=force)
def scrape_live_game(self, force=False):
"""
Scrape the live info for a given game
:param force: Whether to scrape no matter what (used for intermission here)
:return: None
"""
game_json = json_pbp.get_pbp(str(self.game_id))
# When don't have json...can't do anything without it
if game_json is None:
return
# Shift Game Statuses b4 we do anything
self.prev_api_game_status = self.api_game_status
self.prev_html_game_status = self.html_game_status
# Swap old pbp & shift DataFrames
self._prev_pbp_df = self._pbp_df
self._prev_shifts_df = self._shifts_df
# If json is in intermission:
# Update self.api_game_status, get minutes remaining in intermission, and check if html is intermission too.
# If both feeds are in intermission we return, otherwise we wait for the html to catch up.
# "Intermission" is my own game status so otherwise just take whatever is in the api
if game_json['liveData']['linescore']['intermissionInfo']['inIntermission']:
self.api_game_status = "Intermission"
self.intermission_time_remaining = game_json['liveData']['linescore']['intermissionInfo']["intermissionTimeRemaining"]
# If see the both says intermission and we do too, we can just safely return and not bother with scraping.
# This will be false if the HTML hasn't updated yet to intermission
# If force we scrape no matter what
if self.is_intermission() and not force:
return
else:
# Update API Status if NOT in intermission to whatever is there
self.api_game_status = game_json["gameData"]["status"]["abstractGameState"]
# Leave if b4 game started
if game_json["gameData"]["status"]["abstractGameState"] in ["Preview"]:
self.html_game_status = self.api_game_status = game_json["gameData"]["status"]["abstractGameState"]
return
# We get this the 1st time it scrapes the info (or when it's first available)
# Don't bother with earlier as it may not be there or we may end up with an old version
if not self.players:
roster = playing_roster.scrape_roster(self.game_id)
if roster is not None:
self.players, _ = game_scraper.get_teams_and_players(game_json, roster, self.game_id)
self.head_coaches = roster['head_coaches']
else:
return # If we try and still can't get it we leave - Termination Reason #2
# Don't bother with scraper warnings
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# pay attention to each argument
self._pbp_df, self.html_game_status = game_scraper.scrape_pbp_live(self.game_id, self.date,
{"head_coaches": self.head_coaches},
game_json, self.players,
{"Home": self.home_team, "Away": self.away_team},
espn_id=self.espn_id)
if self.if_scrape_shifts:
self._shifts_df = game_scraper.scrape_shifts(self.game_id, self.players, self.date)
def is_ongoing(self):
"""
Check if the game is currently being played.
The logic here is that we run into an issue with intermission and the end of game. If the game is just changed
to Final or Intermission the end user will assume the game isn't ongoing and will not update with the most
recent events. They'll be delayed for intermission and won't place it at all for Final games. So we use the
previous event as a guide. If it's currently in intermission or Final - we check the previous status. If it's
the same the user already has the data. Otherwise we 'lie' and say the game is still ongoing.
:return: Boolean
"""
# The game is currently being played
if self.time_until_game() == 0 and not self.is_game_over() and not self.is_intermission() and self.pbp_df.shape[0] > 0:
return True
# Since it's not being played check if game is over and if it wasn't for the previous
elif self.is_game_over() and not self.is_game_over(prev=True):
return True
# Check if it's in intermission and the if it was for the previous event
elif self.is_intermission() and not self.is_intermission(prev=True):
return True
else:
return False
def time_until_game(self):
"""
Return the seconds until the game starts
:return: seconds until game
"""
delta = self.start_time - datetime.datetime.utcnow()
if delta.days >= 0:
return delta.seconds
else:
return 0
def is_game_over(self, prev=False):
"""
Check if the game is over for both the html and json pbp. If prev=True check for the previous event
:param prev: Check the game status for the previous event
:return: Boolean - True if over
"""
if not prev:
return self.html_game_status == self.api_game_status == "Final"
else:
return self.prev_html_game_status == self.prev_api_game_status == "Final"
def is_intermission(self, prev=False):
"""
Check if in intermission for both the html and json pbp. If prev=True check for the previous event
:param prev: Check the game status for the previous event
:return: Boolean - True if yes
"""
if not prev:
return self.html_game_status == self.api_game_status == "Intermission"
else:
return self.prev_html_game_status == self.prev_api_game_status == "Intermission"
class ScrapeLiveGames:
"""
Class than contains the info for all the games on a specific day
:param str date: Date of games (ex: 2018-10-30)
:param bool preseason: If you want to scrape preseason games
:param bool if_scrape_shifts: Whether or not you want to scrape shifts
:param list live_games: List of LiveGame objects for
:param int pause: Amount to pause after each scraping call
"""
def __init__(self, date, preseason=False, if_scrape_shifts=False, pause=15, game_ids=list()):
"""
Initialize the ScrapeLiveGames object with games for the day
:param date: Date
:param preseason: If scrape preseason
:param if_scrape_shifts: Whether to scrape the shifts
:param pause: time to pause
:param game_ids: If only want specific games
"""
# First check date
check_date_format(date)
self.user_game_ids = game_ids
self.date = date
self.preseason = preseason
self.if_scrape_shifts = if_scrape_shifts
self.live_games = self.get_games() # Hold list of LiveGame objects for that day
self.pause = pause
def get_games(self):
"""
Get initial game info -> Called with object creation. Includes: players, espn_ids, standard game info
:return: Dict - LiveGame objects for all games today
"""
game_objs = []
# Get the initial schedule & espn game ids just in case
games = json_schedule.scrape_schedule(self.date, self.date, not_over=True, preseason=self.preseason)
games = self.get_espn_ids(games)
# Only keep the games we want if the user specified games
if self.user_game_ids:
games = [game for game in games if game['game_id'] in self.user_game_ids]
# Get rosters for each game
for game in games:
game_objs.append(LiveGame(game['game_id'], game['start_time'], game['home_team'], game['away_team'],
game['status'], game['espn_id'], self.date, self.if_scrape_shifts))
return game_objs
def get_espn_ids(self, games):
"""
Get espn game ids for all games that day
:param list games: games today
:return: Games with corresponding espn game ids
"""
# Get all espn info
response = espn_pbp.get_espn_date(self.date)
game_ids = espn_pbp.get_game_ids(response)
espn_games = espn_pbp.get_teams(response)
# Match up
for i in range(len(games)):
for j in range(len(espn_games)):
if games[i]['home_team'] in espn_games[j] or games[i]['away_team'] in espn_games[j]:
games[i]['espn_id'] = game_ids[j]
return games
def update_live_games(self, force=False, sleep_next=False):
"""
Scrape the pbp & shifts of ongoing games
:param bool force: Whether or not to force it to scrape even if it's in intermission
:param bool sleep_next: Sleep until the next game starts
:return: None
"""
# Check if we need to sleep
if sleep_next:
self.sleep_next_game()
for game in self.live_games:
game.scrape(force=force)
time.sleep(self.pause)
def sleep_next_game(self):
"""
Sleep until the next game starts. Otherwise just looping and doing nothing
:return: None
"""
# Get rid of final games...we are looking at current or upcoming games
non_final_games = [game for game in self.live_games if not game.is_game_over()]
# Lets get all the games NOT ongoing but aren't over (so scheduled games)
scheduled_games = [game for game in non_final_games if game.time_until_game() > 0]
# If all the non-final games haven't started yet let's find the next game
# Get earliest in the bunch
if len(scheduled_games) == len(non_final_games):
min_game = min(scheduled_games, key=lambda x: x.time_until_game())
if min_game.time_until_game() > 0:
print("\nSleeping for {} seconds until the next earliest game starts.".format(min_game.time_until_game()))
time.sleep(min_game.time_until_game())
def finished(self):
"""
Check if done with all games
:return: Boolean
"""
# Count finished games
finished_games = 0
for game in self.live_games:
if game.is_game_over():
finished_games += 1
# If the # of finished games == # of total games
return len(self.live_games) == finished_games
================================================
FILE: hockey_scraper/nhl/pbp/__init__.py
================================================
================================================
FILE: hockey_scraper/nhl/pbp/espn_pbp.py
================================================
"""
This module contains code to scrape coordinates for games off of espn for any given game
"""
import re
import xml.etree.ElementTree as etree
import pandas as pd
from bs4 import BeautifulSoup
import hockey_scraper.utils.shared as shared
def event_type(play_description):
"""
Returns the event type (ex: a SHOT or a GOAL...etc) given the event description.
:param play_description: description of play
:return: event
"""
events = {'GOAL SCORED': 'GOAL', 'SHOT ON GOAL': 'SHOT', 'SHOT MISSED': 'MISS', 'SHOT BLOCKED': 'BLOCK',
'PENALTY': 'PENL', 'FACEOFF': 'FAC', 'HIT': 'HIT', 'TAKEAWAY': 'TAKE', 'GIVEAWAY': 'GIVE'}
event = [events[e] for e in events if e in play_description]
return event[0] if event else None
def get_game_ids(response):
"""
Get game_ids for date from doc
:param response: doc
:return: list of game_ids
"""
soup = BeautifulSoup(response, 'lxml')
sections = soup.findAll("section", {"class": "Scoreboard bg-clr-white flex flex-auto justify-between"})
game_ids = [section['id'] for section in sections]
return game_ids
def get_teams(response):
"""
Extract Teams for date from doc
ul-> class = ScoreCell__Competitors
div -> class = ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db
:param response: doc
:return: list of teams
"""
teams = []
soup = BeautifulSoup(response, 'lxml')
uls = soup.findAll('div', {'class': "ScoreCell__Team"})
for ul in uls:
actual_tm = None
tm = ul.find('div', {'class': "ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db"}).text
# ESPN stores the name and not the city
for real_tm in list(shared.TEAMS.keys()):
if tm.upper() in real_tm:
actual_tm = shared.TEAMS[real_tm]
# If not found we'll let the user know...this may happens
if actual_tm is None:
shared.print_warning("The team {} in the espn pbp is unknown. We use the supplied team name".format(tm))
actual_tm = tm
teams.append(actual_tm)
# Make a list of both teams for each game
games = [teams[i:i + 2] for i in range(0, len(teams), 2)]
return games
def get_espn_date(date):
"""
Get the page that contains all the games for that day
:param date: YYYY-MM-DD
:return: response
"""
page_info = {
"url": 'http://www.espn.com/nhl/scoreboard/_/date/{}'.format(date.replace('-', '')),
"name": date,
"type": "espn_scoreboard",
"season": shared.get_season(date),
}
response = shared.get_file(page_info)
# If can't get or not there throw an exception
if not response:
raise Exception
else:
return response
def get_espn_game_id(date, home_team, away_team):
"""
Scrapes the day's schedule and gets the id for the given game
Ex: http://www.espn.com/nhl/scoreboard/_/date/20161024
:param date: format-> YearMonthDay-> 20161024
:param home_team: home team
:param away_team: away team
:return: 9 digit game id as a string
"""
response = get_espn_date(date)
game_ids = get_game_ids(response)
games = get_teams(response)
# Get the game id with the right team for it
for i in range(len(games)):
if home_team in games[i] or away_team in games[i]:
return game_ids[i]
def get_espn_game(date, home_team, away_team, game_id=None):
"""
Gets the ESPN pbp feed
Ex: http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId=400885300
:param date: date of the game
:param home_team: home team
:param away_team: away team
:param game_id: Game id of we already have it - for live scraping. None if not there
:return: raw xml
"""
# Get if not provided - for live games
if not game_id:
game_id = get_espn_game_id(date, home_team.upper(), away_team.upper())
file_info = {
"url": 'http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId={}'.format(game_id),
"name": game_id,
"type": "espn_pbp",
"season": shared.get_season(date),
}
response = shared.get_file(file_info)
print(file_info)
## Needed?
if response is None:
raise Exception
return response
def parse_event(event):
"""
Parse each event. In the string each field is separated by a '~'.
Relevant for here: The first two are the x and y coordinates. And the 4th and 5th are the time elapsed and period.
:param event: string with info
:return: return dict with relevant info
"""
info = dict()
fields = event.split('~')
# Shootouts screw everything up so don't bother...coordinates don't matter there either way
if fields[4] == '5':
return None
info['xC'] = float(fields[0])
info['yC'] = float(fields[1])
info['time_elapsed'] = shared.convert_to_seconds(fields[3])
info['period'] = fields[4]
info['event'] = event_type(fields[8].upper())
return info
def parse_espn(espn_xml):
"""
Parse feed
:param espn_xml: raw xml of feed
:return: DataFrame with info
"""
columns = ['period', 'time_elapsed', 'event', 'xC', 'yC']
# Occasionally we get malformed XML because of the presence of \x13 characters
# Let's just replace them with dashes
espn_xml = espn_xml.replace(u'\x13', '-')
try:
tree = etree.fromstring(espn_xml)
except etree.ParseError as e:
shared.print_error("Espn pbp isn't valid xml, therefore coordinates can't be obtained for this game")
return pd.DataFrame([], columns=columns)
events = tree[1]
plays = [parse_event(event.text) for event in events]
plays = [play for play in plays if play is not None]
df = pd.DataFrame(plays, columns=columns)
df.period = df.period.astype(int) # Causes join issues with html later
return df
def scrape_game(date, home_team, away_team, game_id=None):
"""
Scrape the game
:param date: ex: 2016-20-24
:param home_team: tricode
:param away_team: tricode
:param game_id: Only provided for live games.
:return: DataFrame with info
"""
try:
shared.print_warning('Using espn for pbp')
espn_xml = get_espn_game(date, home_team, away_team, game_id)
except Exception as e:
shared.print_error("Espn pbp for game {a} {b} {c} is either not there or can't be obtained {d}".format(a=date,
b=home_team,
c=away_team, d=e))
return pd.DataFrame()
try:
espn_df = parse_espn(espn_xml)
except Exception as e:
shared.print_error("Issue parsing Espn pbp for game {a} {b} {c} {d}".format(a=date, b=home_team, c=away_team, d=e))
return pd.DataFrame()
if espn_df.shape[0] == 0:
shared.print_error("Espn is missing coordinates for game {a} {b} {c}".format(a=date, b=home_team, c=away_team))
return espn_df
# if __name__ == "__main__":
# get_espn_game('2022-10-08', 'SJS', 'NSH')
================================================
FILE: hockey_scraper/nhl/pbp/html_pbp.py
================================================
"""
This module contains functions to scrape the Html Play by Play for any given game
"""
import re
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import hockey_scraper.utils.shared as shared
def cur_game_status(doc):
"""
Return the game status
:param doc: Html text
:return: String -> one of ['Final', 'Intermission', 'Progress']
"""
soup = BeautifulSoup(doc, "lxml")
tables = soup.find_all('table', {'id': "GameInfo"})
tds = tables[0].find_all('td')
status = tds[-1].text
# 'End' - in there means an Intermission
# 'Final' - Game is over
# Otherwise - It's either in progress or b4 the game started
if 'end' in status.lower():
return 'Intermission'
elif 'final' in status.lower():
return 'Final'
else:
return 'Live'
def get_pbp(game_id):
"""
Given a game_id it returns the raw html
Ex: http://www.nhl.com/scores/htmlreports/20162017/PL020475.HTM
:param game_id: the game
:return: raw html of game
"""
game_id = str(game_id)
url = 'http://www.nhl.com/scores/htmlreports/{}{}/PL{}.HTM'.format(game_id[:4], int(game_id[:4]) + 1, game_id[4:])
page_info = {
"url": url,
"name": game_id,
"type": "html_pbp",
"season": game_id[:4],
}
return shared.get_file(page_info)
def get_contents(game_html):
"""
Uses Beautiful soup to parses the html document.
Some parsers work for some pages but don't work for others....I'm not sure why so I just try them all here in order
:param game_html: html doc
:return: "soupified" html
"""
parsers = ["html5lib", "lxml", "html.parser"]
strainer = SoupStrainer('td', attrs={'class': re.compile(r'bborder')})
for parser in parsers:
# parse_only only works with lxml for some reason
if parser == "lxml":
soup = BeautifulSoup(game_html, parser, parse_only=strainer)
else:
soup = BeautifulSoup(game_html, parser)
tds = soup.find_all("td", {"class": re.compile('.*bborder.*')})
if len(tds) > 0:
break
return tds
def strip_html_pbp(td):
"""
Strip html tags and such. (Note to Self: Don't touch this!!!)
:param td: pbp
:return: list of plays (which contain a list of info) stripped of html
"""
for y in range(len(td)):
# Get the 'br' tag for the time column...this get's us time remaining instead of elapsed and remaining combined
if y == 3:
td[y] = td[y].get_text() # This gets us elapsed and remaining combined-< 3:0017:00
index = td[y].find(':')
td[y] = td[y][:index+3]
elif (y == 6 or y == 7) and td[0] != '#':
# 6 & 7-> These are the player 1 ice one's
# The second statement controls for when it's just a header
baz = td[y].find_all('td')
bar = [baz[z] for z in range(len(baz)) if z % 4 != 0] # Because of previous step we get repeats...delete some
# The setup in the list is now: Name/Number->Position->Blank...and repeat
# Now strip all the html
players = []
for i in range(len(bar)):
if i % 3 == 0:
try:
name = return_name_html(bar[i].find('font')['title'])
number = bar[i].get_text().strip('\n') # Get number and strip leading/trailing newlines
except KeyError:
name = ''
number = ''
elif i % 3 == 1:
if name != '':
position = bar[i].get_text()
players.append([name, number, position])
td[y] = players
else:
td[y] = td[y].get_text()
return td
def clean_html_pbp(html):
"""
Get rid of html and format the data
:param html: the requested url
:return: a list with all the info
"""
soup = get_contents(html)
# Create a list of lists (each length 8)...corresponds to 8 columns in html pbp
td = [soup[i:i + 8] for i in range(0, len(soup), 8)]
cleaned_html = [strip_html_pbp(x) for x in td]
return cleaned_html
def add_home_zone(event_dict, home_team):
"""
Determines the zone relative to the home team and add it to event.
Keep in mind that the 'ev_zone' recorded is the zone relative to the event team. And for blocks the NHL counts
the ev_team as the blocking team (I like counting the shooting team for blocks). Therefore, when it's the home team
the zone only gets flipped when it's a block. For away teams it's the opposite.
:param event_dict: dict of event info
:param home_team: home team
:return: None
"""
ev_team = event_dict['Ev_Team']
ev_zone = event_dict['Ev_Zone']
event = event_dict['Event']
# Return if we got nothing in there
# Also just make the home zone nothing too
if ev_zone == '':
event_dict['Home_Zone'] = ''
return
# When it's either: The away team and not a block or the home team and a block
if (ev_team != home_team and event != 'BLOCK') or (ev_team == home_team and event == 'BLOCK'):
if ev_zone == 'Off':
event_dict['Home_Zone'] = 'Def'
elif ev_zone == 'Def':
event_dict['Home_Zone'] = 'Off'
else:
event_dict['Home_Zone'] = ev_zone
else:
event_dict['Home_Zone'] = ev_zone
def add_zone(event_dict, play_description):
"""
Determine which zone the play occurred in (unless one isn't listed) and add it to dict
:param event_dict: dict of event info
:param play_description: the zone would be included here
:return: Off, Def, Neu, or NA
"""
s = [x.strip() for x in play_description.split(',')] # Split by comma's into a list
zone = [x for x in s if 'Zone' in x] # Find if list contains which zone
if not zone:
event_dict['Ev_Zone'] = None
elif zone[0].find("Off") != -1:
event_dict['Ev_Zone'] = 'Off'
elif zone[0].find("Neu") != -1:
event_dict['Ev_Zone'] = 'Neu'
elif zone[0].find("Def") != -1:
event_dict['Ev_Zone'] = 'Def'
def add_type(event_dict, event, players, home_team):
"""
Add "type" for event -> either a penalty or a shot type
:param event_dict: dict of event info
:param event: list with parsed event info
:param players: dict of home and away players in game
:param home_team: home team for game
:return: None
"""
if 'PENL' in event[4]:
event_dict['Type'] = get_penalty(event[5], players, home_team)
else:
event_dict['Type'] = shot_type(event[5]).upper()
def add_strength(event_dict, home_players, away_players):
"""
Get strength for event -> It's home then away
:param event_dict: dict of event info
:param home_players: list of players for home team
:param away_players: list of players for away team
:return: None
"""
try:
home_skaters = event_dict['Home_Players'] - 1 if event_dict['Home_Goalie'] != '' else len(home_players)
away_skaters = event_dict['Away_Players'] - 1 if event_dict['Away_Goalie'] != '' else len(away_players)
except KeyError:
# Getting a key error here means that home/away goalie isn't there...which means home/away players are empty
home_skaters = 0
away_skaters = 0
event_dict['Strength'] = 'x'.join([str(home_skaters), str(away_skaters)])
def add_event_team(event_dict, event):
"""
Add event team for event.
Always first thing in description
:param event_dict: dict of event info
:param event: list with parsed event info
:return: None
"""
if event_dict['Event'] in ['GOAL', 'SHOT', 'MISS', 'BLOCK', 'PENL', 'FAC', 'HIT', 'TAKE', 'GIVE']:
event_dict['Ev_Team'] = shared.convert_tricode(event[5].split()[0])
else:
event_dict['Ev_Team'] = ''
def add_period(event_dict, event):
"""
Add period for event
:param event_dict: dict of event info
:param event: list with parsed event info
:return: None
"""
try:
event_dict['Period'] = int(event[1])
except ValueError:
event_dict['Period'] = 0
def add_time(event_dict, event):
"""
Fill in time and seconds elapsed
:param event_dict: dict of parsed event stuff
:param event: event info from pbp
:return: None
"""
event_dict['Time_Elapsed'] = str(event[3])
if event[3] != '':
event_dict['Seconds_Elapsed'] = shared.convert_to_seconds(event[3])
else:
event_dict['Seconds_Elapsed'] = 0.0
def add_score(event_dict, event, current_score, home_team):
"""
Change if someone scored...also change current score
:param event_dict: dict of parsed event stuff
:param event: event info from pbp
:param current_score: current score in game
:param home_team: home team for game
:return: None
"""
event_dict['Home_Score'] = current_score['Home']
event_dict['Away_Score'] = current_score['Away']
event_dict['score_diff'] = current_score['Home'] - current_score['Away']
# If it's a goal change the score
if event[4] == 'GOAL':
if event_dict['Ev_Team'] == home_team:
current_score['Home'] += 1
else:
current_score['Away'] += 1
def get_penalty(play_description, players, home_team):
"""
Get the penalty info
:param play_description: description of play field
:param players: all players with info
:param home_team: home team for game
:return: penalty info
"""
# First check if it's a bench
if "bench" in play_description or "TEAM" in play_description:
beg_penalty_index = play_description.find("TEAM") + 5
return play_description[beg_penalty_index: play_description.find(')') + 1]
else:
# If it's not a bench penl we look for the player who took the penalty
# Get Number, and name for player who took the penalty
num_regex = re.compile(r'#(\d+)')
numbers = num_regex.findall(play_description)
# If they don't have any players listed, then the description if fucked up and we got nothing
if not numbers:
return ''
else:
player = get_player_name(numbers[0], players, play_description[:3], home_team)
# Check if the number and player match up
if player['last_name'] is not None and player['last_name'] in play_description:
# beg_penalty_index is right after the penalty taker's last name (+1 for whitespace)
# Then we take from after his last name to right after the parentheses
beg_penalty_index = play_description.find(player['last_name']) + len(player['last_name']) + 1
return play_description[beg_penalty_index: play_description.find(')')+1]
else:
# This uses my old method...it falls apart for players like "Del Zotto"
pen_regex = re.compile(r'.{3}\s+#\d+\s+\w+\s+(.*)\)')
penalty = pen_regex.findall(play_description)
return penalty[0] + ')' if penalty else ''
def get_player_name(number, players, team, home_team):
"""
This function is used for the description field in the html. Given a last name and a number it return the player's
full name and id. Done by searching in players for the team until we find him (then just break)
:param number: player's number
:param players: all players with info
:param team: team of player listed in html
:param home_team: home team defined b4 hand (from json)
:return: dict with full and and id
"""
player = None
team = shared.convert_tricode(team) # Needed to convert from new format to old
venue = "Home" if team == home_team else "Away"
for name in players[venue]:
if players[venue][name]['number'] == number:
player = {
'name': name,
'id': players[venue][name]['id'],
'last_name': players[venue][name]['last_name']
}
break
# Control for when the name can't be found
if not player:
player = {'name': None, 'id': None, 'last_name': None}
return player
def if_valid_event(event):
"""
Checks if it's a valid event ('#' is meaningless and I don't like those other one's) to parse
Don't remember what 'GOFF' is but 'EGT' is for emergency goaltender. The reason I get rid of it is because it's not
in the json and there's another 'EGPID' that can be found in both (not sure why 'EGT' exists then).
Events 'PGSTR', 'PGEND', and 'ANTHEM' have been included at the start of each game for the 2017 season...I have no
idea why.
:param event: list of stuff in pbp
:return: boolean
"""
return event[0] != '#' and event[4] not in ['GOFF', 'EGT', 'PGSTR', 'PGEND', 'ANTHEM']
def return_name_html(info):
"""
In the PBP html the name is in a format like: 'Center - MIKE RICHARDS'
Some also have a hyphen in their last name so can't just split by '-'
:param info: position and name
:return: name
"""
s = info.index('-') # Find first hyphen
return info[s + 1:].strip(' ') # The name should be after the first hyphen
def shot_type(play_description):
"""
Determine which zone the play occurred in (unless one isn't listed)
:param play_description: the type would be in here
:return: the type if it's there (otherwise just NA)
"""
types = ['wrist', 'snap', 'slap', 'deflected', 'tip-in', 'backhand', 'wrap-around']
play_description = [x.strip() for x in play_description.split(',')] # Strip leading and trailing whitespace
play_description = [i.lower() for i in play_description] # Convert to lowercase
for p in play_description:
if p in types:
if p == 'wrist' or p == 'slap' or p == 'snap':
return ' '.join([p, 'shot'])
else:
return p
return ''
def parse_fac(description, players, ev_team, home_team):
"""
Parse the description field for a face-off
MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 BRENT
:param description: Play Description
:param players: players in game
:param ev_team: Event Team
:param home_team: Home Team for game
:return: Dict with info
"""
event_info = {}
regex = re.compile(r'(.{3})\s+#(\d+)')
desc = regex.findall(description) # [[Team, num], [Team, num]]
if ev_team == desc[0][0]:
p1 = get_player_name(desc[0][1], players, desc[0][0], home_team)
p2 = get_player_name(desc[1][1], players, desc[1][0], home_team)
else:
p1 = get_player_name(desc[1][1], players, desc[1][0], home_team)
p2 = get_player_name(desc[0][1], players, desc[0][0], home_team)
event_info['p1_name'] = p1['name']
event_info['p1_ID'] = p1['id']
event_info['p2_name'] = p2['name']
event_info['p2_ID'] = p2['id']
return event_info
def parse_shot_miss_take_give(description, players, ev_team, home_team):
"""
Parse the description field for a: SHOT, MISS, TAKE, GIVE
MTL ONGOAL - #81 ELLER, Wrist, Off. Zone, 11 ft.
ANA #23 BEAUCHEMIN, Slap, Wide of Net, Off. Zone, 42 ft.
TOR GIVEAWAY - #35 GIGUERE, Def. Zone
TOR TAKEAWAY - #9 ARMSTRONG, Off. Zone
:param description: Play Description
:param players: players in game
:param ev_team: Event Team
:param home_team: Home Team for game
:return: Dict with info
"""
event_info = {}
regex = re.compile(r'(\d+)')
desc = regex.search(description).groups() # num
p = get_player_name(desc[0], players, ev_team, home_team)
event_info['p1_name'] = p['name']
event_info['p1_ID'] = p['id']
return event_info
def parse_hit(description, players, home_team):
"""
Parse the description field for a HIT
MTL #20 O'BYRNE HIT TOR #18 BROWN, Def. Zone
:param description: Play Description
:param players: players in game
:param home_team: Home Team for game
:return: Dict with info
"""
event_info = {}
regex = re.compile(r'(.{3})\s+#(\d+)')
desc = regex.findall(description) # [[Team, num], [Team, num]]
p1 = get_player_name(desc[0][1], players, desc[0][0], home_team)
event_info['p1_name'] = p1['name']
event_info['p1_ID'] = p1['id']
if len(desc) > 1:
p2 = get_player_name(desc[1][1], players, desc[1][0], home_team)
event_info['p2_name'] = p2['name']
event_info['p2_ID'] = p2['id']
return event_info
def parse_block(description, players, home_team):
"""
Parse the description field for a BLOCK
MTL #76 SUBBAN BLOCKED BY TOR #2 SCHENN, Wrist, Def. Zone
:param description: Play Description
:param players: players in game
:param home_team: Home Team for game
:return: Dict with info
"""
event_info = {}
regex = re.compile(r'(.{3})\s+#(\d+)')
desc = regex.findall(description) # [[Team, num], [Team, num]]
if len(desc) == 0:
event_info['p1_name'] = event_info['p2_name'] = event_info['p1_ID'] = event_info['p2_ID'] = None
else:
p1 = get_player_name(desc[len(desc) - 1][1], players, desc[len(desc) - 1][0], home_team)
event_info['p1_name'] = p1['name']
event_info['p1_ID'] = p1['id']
if len(desc) > 1:
p2 = get_player_name(desc[0][1], players, desc[0][0], home_team)
event_info['p2_name'] = p2['name']
event_info['p2_ID'] = p2['id']
return event_info
def parse_goal(description, players, ev_team, home_team):
"""
Parse the description field for a GOAL
TOR #81 KESSEL(1), Wrist, Off. Zone, 14 ft. Assists: #42 BOZAK(1); #8 KOMISAREK(1)
:param description: Play Description
:param players: players in game
:param ev_team: Event Team
:param home_team: Home Team for game
:return: Dict with info
"""
event_info = {}
regex = re.compile(r'#(\d+)\s+')
desc = regex.findall(description) # [num, ?, ?] -> ranging from 1 to 3 indices
p1 = get_player_name(desc[0], players, ev_team, home_team)
event_info['p1_name'] = p1['name']
event_info['p1_ID'] = p1['id']
if len(desc) >= 2:
p2 = get_player_name(desc[1], players, ev_team, home_team)
event_info['p2_name'] = p2['name']
event_info['p2_ID'] = p2['id']
if len(desc) == 3:
p3 = get_player_name(desc[2], players, ev_team, home_team)
event_info['p3_name'] = p3['name']
event_info['p3_ID'] = p3['id']
return event_info
def parse_penalty(description, players, home_team):
"""
Parse the description field for a Penalty
MTL #81 ELLER Hooking(2 min), Def. Zone Drawn By: TOR #11 SJOSTROM
:param description: Play Description
:param players: players in game
:param home_team: Home Team for game
:return: Dict with info
"""
event_info = {}
# Check if it's a Bench/Team Penalties
if "bench" in description or "TEAM" in description:
event_info['p1_name'] = 'Team'
else:
# Standard Penalty
regex = re.compile(r'(.{3})\s+#(\d+)')
desc = regex.findall(description) # [[team, num], ?[team, num]] -> Either one to three indices
if desc:
p1 = get_player_name(desc[0][1], players, desc[0][0], home_team)
event_info['p1_name'] = p1['name']
event_info['p1_ID'] = p1['id']
# When there are three the penalty was served by someone else
# The Person who served the penalty is placed as the 3rd event player
if len(desc) == 3:
p3 = get_player_name(desc[1][1], players, desc[0][0], home_team)
event_info['p3_name'] = p3['name']
event_info['p3_ID'] = p3['id']
p2 = get_player_name(desc[2][1], players, desc[2][0], home_team)
event_info['p2_name'] = p2['name']
event_info['p2_ID'] = p2['id']
elif len(desc) == 2:
p2 = get_player_name(desc[1][1], players, desc[1][0], home_team)
event_info['p2_name'] = p2['name']
event_info['p2_ID'] = p2['id']
return event_info
def add_event_players(event_dict, event, players, home_team):
"""
Add players involved in the event to event_dict
:param event_dict: dict of parsed event stuff
:param event: fixed up html
:param players: dict of players and id's
:param home_team: home team
:return: None
"""
event_info = {}
description = event[5].strip()
ev_team = shared.convert_tricode(description.split()[0])
if event[4] == 'FAC':
event_info = parse_fac(description, players, ev_team, home_team)
elif event[4] in ['SHOT', 'MISS', 'GIVE', 'TAKE']:
event_info = parse_shot_miss_take_give(description, players, ev_team, home_team)
elif event[4] == 'HIT':
event_info = parse_hit(description, players, home_team)
elif event[4] == 'BLOCK':
event_info = parse_block(description, players, home_team)
elif event[4] == 'GOAL':
event_info = parse_goal(description, players, ev_team, home_team)
elif event[4] == 'PENL':
event_info = parse_penalty(description, players, home_team)
# Transfer info over
for key in event_info:
event_dict[key] = event_info[key]
def populate_players(event_dict, players, away_players, home_players):
"""
Populate away and home player info (and num skaters on each side).
These include:
1. HomePlayer & AwayPlayers fields from 1-6 for name/id
2. Home & Away Goalie Fields for name/id
:param event_dict: dict with event info
:param players: all players in game and info
:param away_players: players for away team
:param home_players: players for home team
:return: None
"""
for venue in ['Home', 'Away']:
for j in range(6):
# Deal with the Home & Away Player Fields
try:
ven_player = home_players[j] if venue == "Home" else away_players[j]
name = shared.fix_name(ven_player[0])
event_dict['{}Player{}'.format(venue.lower(), j + 1)] = name
event_dict['{}Player{}_id'.format(venue.lower(), j + 1)] = players[venue][name]['id']
except KeyError:
event_dict['{}Player{}_id'.format(venue.lower(), j + 1)] = None
except IndexError:
event_dict['{}Player{}'.format(venue.lower(), j + 1)] = None
event_dict['{}Player{}_id'.format(venue.lower(), j + 1)] = None
continue
# If the player is a goalie we try filling that field
if ven_player[2] == "G":
try:
event_dict['{}_Goalie'.format(venue)] = name
event_dict['{}_Goalie_Id'.format(venue)] = players[venue][name]['id']
except KeyError:
pass
# Control for when no goalies present
if '{}_Goalie'.format(venue) not in event_dict:
event_dict['{}_Goalie'.format(venue)] = None
if '{}_Goalie_Id'.format(venue) not in event_dict:
event_dict['{}_Goalie_Id'.format(venue)] = None
event_dict['Away_Players'] = len(away_players)
event_dict['Home_Players'] = len(home_players)
def parse_event(event, players, home_team, current_score):
"""
Receives an event and parses it
:param event: event type
:param players: players in game
:param home_team: home team
:param current_score: current score for both teams
:return: dict with info
"""
event_dict = dict()
away_players = event[6]
home_players = event[7]
event_dict['Description'] = event[5]
event_dict['Event'] = str(event[4])
add_period(event_dict, event)
add_time(event_dict, event)
add_event_team(event_dict, event)
add_score(event_dict, event, current_score, home_team)
populate_players(event_dict, players, away_players, home_players)
add_strength(event_dict, home_players, away_players)
add_type(event_dict, event, players, home_team)
add_zone(event_dict, event[5])
add_home_zone(event_dict, home_team)
# Sometimes it's empty...(they seem to sometimes/always have a whitespace char)
if len(event_dict['Description']) > 1:
add_event_players(event_dict, event, players, home_team)
return event_dict
def parse_html(html, players, teams):
"""
Parse html game pbp
:param html: raw html
:param players: players in the game (from json pbp)
:param teams: dict with home and away teams
:return: DataFrame with info
"""
columns = ['Period', 'Event', 'Description', 'Time_Elapsed', 'Seconds_Elapsed', 'Strength', 'Ev_Zone', 'Type',
'Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID', 'p3_name',
'p3_ID', 'awayPlayer1', 'awayPlayer1_id', 'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3', 'awayPlayer3_id',
'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id', 'awayPlayer6', 'awayPlayer6_id',
'homePlayer1', 'homePlayer1_id', 'homePlayer2', 'homePlayer2_id', 'homePlayer3', 'homePlayer3_id',
'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id', 'homePlayer6', 'homePlayer6_id',
'Away_Goalie', 'Away_Goalie_Id', 'Home_Goalie', 'Home_Goalie_Id', 'Away_Players', 'Home_Players',
'Away_Score', 'Home_Score']
current_score = {'Home': 0, 'Away': 0}
events = [parse_event(event, players, teams['Home'], current_score) for event in html if if_valid_event(event)]
df = pd.DataFrame(list(events), columns=columns)
# This is seen sometimes...it's a duplicate row
df.drop(df[df.Time_Elapsed == '-16:0-'].index, inplace=True)
df['p1_ID'] = df['p1_ID'].astype("float64")
df['Away_Team'] = teams['Away']
df['Home_Team'] = teams['Home']
return df
def scrape_pbp(game_html, game_id, players, teams):
"""
Scrape the data for the pbp
:param game_html: Html doc for the game
:param game_id: game to scrape
:param players: dict with player info
:param teams: dict with home and away teams
:return: DataFrame of game info or None if it fails
"""
if not game_html:
shared.print_error("Html pbp for game {} is either not there or can't be obtained".format(game_id))
return None
cleaned_html = clean_html_pbp(game_html)
if len(cleaned_html) == 0:
shared.print_error("Html pbp contains no plays, this game can't be scraped")
return None
try:
game_df = parse_html(cleaned_html, players, teams)
except Exception as e:
shared.print_error('Error parsing Html pbp for game {} {}'.format(game_id, e))
return None
# These sometimes end up as objects
game_df.Period = game_df.Period.astype(int)
game_df.Seconds_Elapsed = game_df.Seconds_Elapsed.astype(float)
return game_df
def scrape_game_live(game_id, players, teams):
"""
Scrape the data for the game when it's live
:param game_id: game to scrape
:param players: dict with player info
:param teams: dict with home and away teams
:return: Tuple - get_pbp(), cur_game_status()
"""
game_html = get_pbp(game_id)
return scrape_pbp(game_html, game_id, players, teams), cur_game_status(game_html)
def scrape_game(game_id, players, teams):
"""
Scrape the data for the game when not live
:param game_id: game to scrape
:param players: dict with player info
:param teams: dict with home and away teams
:return: DataFrame of game info or None if it fails
"""
game_html = get_pbp(game_id)
return scrape_pbp(game_html, game_id, players, teams)
================================================
FILE: hockey_scraper/nhl/pbp/json_pbp.py
================================================
"""
This module contains functions to scrape the Json Play by Play for any given game
"""
import json
import pandas as pd
from operator import itemgetter
import hockey_scraper.utils.shared as shared
def get_pbp(game_id):
"""
Given a game_id it returns the raw json
Ex: https://api-web.nhle.com/v1/gamecenter/2023010044/play-by-play
:param game_id: string - the game
:return: raw json of game or None if couldn't get game
"""
page_info = {
"url": 'https://api-web.nhle.com/v1/gamecenter/{}/play-by-play'.format(game_id),
"name": game_id,
"type": "json_pbp",
"season": game_id[:4],
}
response = shared.get_file(page_info)
if not response:
shared.print_error("Json pbp for game {} is either not there or can't be obtained".format(game_id))
return {}
else:
return json.loads(response)
def get_teams(pbp_json):
"""
Get teams
:param pbp_json: raw play by play json
:return: dict with home and away
"""
return {
<<<<<<< HEAD
'Home': pbp_json['homeTeam']['abbrev'],
'Away': pbp_json['awayTeam']['abbrev']
=======
'Home': shared.convert_tricode(pbp_json['homeTeam']['abbrev']),
'Away': shared.convert_tricode(pbp_json['awayTeam']['abbrev'])
>>>>>>> 1029299054fbe671c3ca9c5d413cdfd102416853
}
def change_event_name(event):
"""
Change event names from json style to html (ex: BLOCKED_SHOT to BLOCK).
:param event: event type
:return: fixed event type
"""
event_types = {
'PERIOD-START': 'PSTR',
'FACEOFF': 'FAC',
'BLOCKED-SHOT': 'BLOCK',
'GAME-END': 'GEND',
'GIVEAWAY': 'GIVE',
'GOAL': 'GOAL',
'HIT': 'HIT',
'MISSED SHOT': 'MISS',
'PERIOD-END': 'PEND',
'SHOT-ON-GOAL': 'SHOT',
'STOPPAGE': 'STOP',
'TAKEAWAY': 'TAKE',
'PENALTY': 'PENL',
'EARLY INT START': 'EISTR',
'EARLY INT END': 'EIEND',
'SHOOTOUT COMPLETE': 'SOC',
'CHALLENGE': 'CHL',
'EMERGENCY GOALTENDER': 'EGPID'
}
return event_types.get(event.upper(), event)
def parse_event(event):
"""
Parses a single event when the info is in a json format
:param event: json of event
:return: dictionary with the info
"""
play = dict()
play['event_id'] = event['eventId']
<<<<<<< HEAD
play['period'] = event['period']
=======
play['period'] = event['periodDescriptor']['number']
>>>>>>> 1029299054fbe671c3ca9c5d413cdfd102416853
play['event'] = str(change_event_name(event['typeDescKey'].upper()))
play['seconds_elapsed'] = shared.convert_to_seconds(event['timeInPeriod'])
play['p1_name'], play['p2_name'], play['p3_name'] = '', '', ''
if 'details' in event.keys():
details = event['details'].keys()
# If there's a players key that means an event occurred on the play.
if 'scoringPlayerId' in details:
play['p1_ID'] = event['details']['scoringPlayerId']
if 'shootingPlayerId' in details:
play['p1_ID'] = event['details']['shootingPlayerId']
if 'assist1PlayerId' in details:
play['p2_ID'] = event['details']['assist1PlayerId']
if 'assist2PlayerId' in details:
play['p3_ID'] = event['details']['assist2PlayerId']
if 'blockingPlayerId' in details:
play['p2_ID'] = event['details']['blockingPlayerId']
if 'xCoord' in details:
play['xC'] = event['details']['xCoord']
play['yC'] = event['details']['yCoord']
return play
def parse_json(game_json, game_id):
"""
Scrape the json for a game
:param game_json: raw json
:param game_id: game id for game
:return: Either a DataFrame with info for the game or None when fail
"""
columns = ['period', 'event', 'seconds_elapsed', 'p1_name', 'p1_ID', 'p2_name', 'p2_ID', 'p3_name', 'p3_ID', 'xC', 'yC']
# 'PERIOD READY' & 'PERIOD OFFICIAL'..etc aren't found in html...so get rid of them
events_to_ignore = ['PERIOD READY', 'PERIOD OFFICIAL', 'GAME READY', 'GAME OFFICIAL', 'GAME SCHEDULED']
try:
plays = game_json['plays']
events = [parse_event(play) for play in plays if play['typeDescKey'].upper() not in events_to_ignore]
except Exception as e:
shared.print_error('Error parsing Json pbp for game {} {}'.format(game_id, e))
return None
# Sort by event id.
# Sometimes it's not in order of the assigned id in the json. Like, 156...155 (not sure how this happens).
sorted_events = sorted(events, key=itemgetter('event_id'))
return pd.DataFrame(sorted_events, columns=columns)
def scrape_game(game_id):
"""
**Used for debugging**
HTML depends on json so can't follow this structure
:param game_id: game to scrape
:return: DataFrame of game info
"""
game_json = get_pbp(game_id)
if not game_json:
shared.print_error("Json pbp for game {} is not either not there or can't be obtained".format(game_id))
return None
try:
game_df = parse_json(game_json, game_id)
except Exception as e:
shared.print_error('Error parsing Json pbp for game {} {}'.format(game_id, e))
return None
return game_df
================================================
FILE: hockey_scraper/nhl/playing_roster.py
================================================
"""
This module contains functions to scrape the Html game roster for any given game
"""
from bs4 import BeautifulSoup
import hockey_scraper.utils.shared as shared
def get_roster(game_id):
"""
Given a game_id it returns the raw html
Ex: http://www.nhl.com/scores/htmlreports/20162017/RO020475.HTM
:param game_id: the game
:return: raw html of game
"""
game_id = str(game_id)
page_info = {
"url": 'http://www.nhl.com/scores/htmlreports/{}{}/RO{}.HTM'.format(game_id[:4], int(game_id[:4]) + 1, game_id[4:]),
"name": game_id,
"type": "html_roster",
"season": game_id[:4],
}
return shared.get_file(page_info)
def get_content(roster):
"""
Uses Beautiful soup to parses the html document.
Some parsers work for some pages but don't work for others....I'm not sure why so I just try them all here in order
:param roster: doc
:return: players and coaches
"""
parsers = ["lxml", "html.parser", "html5lib"]
for parser in parsers:
soup = BeautifulSoup(roster, "lxml")
players = get_players(soup)
head_coaches = get_coaches(soup)
if len(players) > 0:
break
return players, head_coaches
def fix_name(player):
"""
Get rid of (A) or (C) when a player has it attached to their name
:param player: list of player info -> [number, position, name]
:return: fixed list
"""
if player[2].find('(A)') != -1:
player[2] = player[2][:player[2].find('(A)')].strip()
elif player[2].find('(C)') != -1:
player[2] = player[2][:player[2].find('(C)')].strip()
return player
def get_coaches(soup):
"""
scrape head coaches
:param soup: html
:return: dict of coaches for game
"""
coaches = soup.find_all('tr', {'id': "HeadCoaches"})
# If it picks up nothing just return the empty list
if not coaches:
return coaches
coaches = coaches[0].find_all('td')
return {
'Away': coaches[1].get_text(),
'Home': coaches[3].get_text()
}
def get_players(soup):
"""
scrape roster for players
:param soup: html
:return: dict for home and away players
"""
tables = soup.findAll('table', {'align': 'center', 'border': '0', 'cellpadding': '0', 'cellspacing': '0', 'width': '100%'})
# If it picks up nothing just return the empty list
if not tables:
return tables
"""
There are 5 tables which correspond to the above criteria.
tables[0] is nothing
tables[1] is away starters
tables[2] is home starters
tables[3] is away scratches
tables[4] is home scratches
"""
del tables[0]
player_info = [table.find_all('td') for table in tables]
player_info = [[x.get_text() for x in group] for group in player_info]
# Make list of list of 3 each. The three are: number, position, name (in that order)
player_info = [[group[i:i+3] for i in range(0, len(group), 3)] for group in player_info]
# Get rid of header column
player_info = [[player for player in group if player[0] != '#'] for group in player_info]
# Add whether the player was a scratch
# 2 and 3 hold the scratches
for i in range(len(player_info)):
for j in range(len(player_info[i])):
if i == 2 or i == 3:
player_info[i][j].append(True)
else:
player_info[i][j].append(False)
players = {'Away': player_info[0], 'Home': player_info[1]}
# Scratches aren't always included
if len(player_info) == 4:
players['Away'] += player_info[2]
players['Home'] += player_info[3]
# For those with (A) or (C) in name field get rid of it
# First condition is to control when we get whitespace as one of the indices
players['Away'] = [fix_name(i) if i[0] != u'\xa0' else i for i in players['Away']]
players['Home'] = [fix_name(i) if i[0] != u'\xa0' else i for i in players['Home']]
# Get rid when just whitespace
players['Away'] = [i for i in players['Away'] if i[0] != u'\xa0']
players['Home'] = [i for i in players['Home'] if i[0] != u'\xa0']
return players
def scrape_roster(game_id):
"""
For a given game scrapes the roster
:param game_id: id for game
:return: dict of players (home and away) an dict for both head coaches
"""
roster = get_roster(game_id)
if not roster:
shared.print_error("Roster for game {} is either not there or can't be obtained".format(game_id))
return None
try:
players, head_coaches = get_content(roster)
except Exception as e:
shared.print_error('Error parsing Roster for game {} {}'.format(game_id, e))
return None
return {'players': players, 'head_coaches': head_coaches}
================================================
FILE: hockey_scraper/nhl/scrape_functions.py
================================================
"""
Functions to scrape by season, games, and date range
"""
import time
import pandas as pd
from datetime import datetime
import hockey_scraper.nhl.game_scraper as game_scraper
import hockey_scraper.nhl.json_schedule as json_schedule
import hockey_scraper.utils.shared as shared
def print_errors(detailed=True):
"""
Print errors with scraping.
Detailed parameter controls if certain errors should be *re-printed* after scraping all games.
For example if the pbp for a game is broken it's always printed immediately after that game.
But a summary of broken games will be printed if over 25 games are scraped. The logic is that
it'll be easier when you've scraped a lot of games to see all the errors at the end than scrolling
though all the output and potentially missing it.
:param detailed: When False only print player IDs otherwise all
:return: None
"""
print("")
if game_scraper.broken_pbp_games and detailed:
print('Broken pbp:')
for x in game_scraper.broken_pbp_games:
print(" -", x[0], x[1])
print("")
if game_scraper.broken_shifts_games and detailed:
print('Broken shifts:')
for x in game_scraper.broken_shifts_games:
print(" -", x[0], x[1])
print("")
if game_scraper.missing_coords and detailed:
print('Games missing coordinates:')
for x in game_scraper.missing_coords:
print(" -", x[0], x[1])
print("")
if game_scraper.players_missing_ids:
print("Players missing IDs:")
for x in game_scraper.players_missing_ids:
print(" -", x[0], x[1])
print("")
# Clear them all out for the next call
game_scraper.broken_shifts_games = []
game_scraper.broken_pbp_games = []
game_scraper.players_missing_ids = []
game_scraper.missing_coords = []
def scrape_list_of_games(games, if_scrape_shifts, verbose=False):
"""
Given a list of game_id's (and a date for each game) it scrapes them
:param games: list of [game_id, date]
:param if_scrape_shifts: Boolean indicating whether to also scrape shifts
:params verbose: Verbosity when printing errors. Defaults to False
:return: DataFrame of pbp info, also shifts if specified
"""
pbp_dfs = []
shifts_dfs = []
for game in games:
pbp_df, shifts_df = game_scraper.scrape_game(str(game["game_id"]), game["date"], if_scrape_shifts)
if pbp_df is not None:
pbp_dfs.extend([pbp_df])
if shifts_df is not None:
shifts_dfs.extend([shifts_df])
# Check if any games...if not let's get out of here
if len(pbp_dfs) == 0:
return None, None
else:
pbp_df = pd.concat(pbp_dfs)
pbp_df = pbp_df.reset_index(drop=True)
pbp_df.apply(lambda row: game_scraper.check_goalie(row), axis=1)
if if_scrape_shifts:
shifts_df = pd.concat(shifts_dfs).reset_index(drop=True)
else:
shifts_df = None
# Only print full details when # games > 25 or verbose=True
error_verbosity = verbose or len(games) >= 25
print_errors(error_verbosity)
return pbp_df, shifts_df
def scrape_schedule(from_date, to_date, data_format='pandas', rescrape=False, docs_dir=False):
"""
Scrape the games schedule in a given range.
:param from_date: date you want to scrape from
:param to_date: date you want to scrape to
:param data_format: format you want data in - csv or pandas (pandas is default)
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (def. = None)
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping. When True it'll refer to (or if needed create) such a repository in the home
directory. When provided a string it'll try to use that. Here it must be a valid directory otheriwse
it won't work (I won't make it for you). When False the files won't be saved.
:return: DataFrame of None
"""
cols = ["game_id", "date", "venue", "home_team", "away_team", "start_time", "home_score", "away_score", "status"]
shared.check_data_format(data_format)
shared.check_valid_dates(from_date, to_date)
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
# live = True allows us to scrape games that aren't final
sched = json_schedule.scrape_schedule(from_date, to_date, preseason=True, not_over=True)
sched_df = pd.DataFrame(sched, columns=cols)
if data_format.lower() == 'csv':
shared.to_csv(from_date + '--' + to_date, sched_df, "nhl", "schedule")
else:
return sched_df
def scrape_date_range(from_date, to_date, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False, verbose=False):
"""
Scrape games in given date range
:param from_date: date you want to scrape from
:param to_date: date you want to scrape to
:param if_scrape_shifts: Boolean indicating whether to also scrape shifts
:param data_format: format you want data in - csv or pandas (csv is default)
:param preseason: Boolean indicating whether to include preseason games (default if False)
This is may or may not work!!! I don't give a shit.
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (def. = None)
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping. When True it'll refer to (or if needed create) such a repository in the home
directory. When provided a string it'll try to use that. Here it must be a valid directory otheriwse
it won't work (I won't make it for you). When False the files won't be saved.
:params verbose: Override default verbosity when printing errors
:return: Dictionary with DataFrames and errors or None
"""
shared.check_data_format(data_format)
shared.check_valid_dates(from_date, to_date)
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
games = json_schedule.scrape_schedule(from_date, to_date, preseason)
pbp_df, shifts_df = scrape_list_of_games(games, if_scrape_shifts, verbose)
if data_format.lower() == 'csv':
shared.to_csv(from_date + '--' + to_date, pbp_df, "nhl", "pbp")
shared.to_csv(from_date + '--' + to_date, shifts_df, "nhl", "shifts")
else:
return {"pbp": pbp_df, "shifts": shifts_df} if if_scrape_shifts else {"pbp": pbp_df}
def scrape_seasons(seasons, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False, verbose=False):
"""
Given list of seasons it scrapes all the seasons
:param seasons: list of seasons
:param if_scrape_shifts: Boolean indicating whether to also scrape shifts
:param data_format: format you want data in - csv or pandas (csv is default)
:param preseason: Boolean indicating whether to include preseason games (default if False)
This is may or may not work!!! I don't give a shit.
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping. When True it'll refer to (or if needed create) such a repository in the home
directory. When provided a string it'll try to use that. Here it must be a valid directory otheriwse
it won't work (I won't make it for you). When False the files won't be saved.
:params verbose: Override default verbosity when printing errors
:return: Dictionary with DataFrames and errors or None
"""
shared.check_data_format(data_format)
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
# Holds all seasons scraped (if not csv)
master_pbps, master_shifts = [], []
for season in seasons:
from_date = shared.season_start_bound(season)
to_date = datetime.strftime(shared.season_end_bound(str(int(season) + 1)), "%Y-%m-%d")
games = json_schedule.scrape_schedule(from_date, to_date, preseason)
pbp_df, shifts_df = scrape_list_of_games(games, if_scrape_shifts, verbose)
if data_format.lower() == 'csv':
shared.to_csv(str(season) + str(season + 1), pbp_df, "nhl", "pbp")
shared.to_csv(str(season) + str(season + 1), shifts_df, "nhl", "shifts")
elif pbp_df is not None:
master_pbps.append(pbp_df)
master_shifts.append(shifts_df)
if data_format.lower() == 'pandas' and master_pbps:
if if_scrape_shifts:
return {"pbp": pd.concat(master_pbps), "shifts": pd.concat(master_shifts)}
else:
return {"pbp": pd.concat(master_pbps)}
def scrape_games(games, if_scrape_shifts, data_format='csv', rescrape=False, docs_dir=False, verbose=False):
"""
Scrape a list of games
:param games: list of game_ids
:param if_scrape_shifts: Boolean indicating whether to also scrape shifts
:param data_format: format you want data in - csv or pandas (csv is default)
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping. When True it'll refer to (or if needed create) such a repository in the home
directory. When provided a string it'll try to use that. Here it must be a valid directory otheriwse
it won't work (I won't make it for you). When False the files won't be saved.
:params verbose: Override default verbosity when printing errors
:return: Dictionary with DataFrames and errors or None
"""
shared.check_data_format(data_format)
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
# Create List of game_id's and dates
games_list = json_schedule.get_dates(games)
# Scrape pbp and shifts
pbp_df, shifts_df = scrape_list_of_games(games_list, if_scrape_shifts, verbose)
if data_format.lower() == 'csv':
shared.to_csv(str(int(time.time())), pbp_df, "nhl", "pbp")
shared.to_csv(str(int(time.time())), shifts_df, "nhl", "shifts")
else:
return {"pbp": pbp_df, "shifts": shifts_df} if if_scrape_shifts else {"pbp": pbp_df}
================================================
FILE: hockey_scraper/nhl/shifts/__init__.py
================================================
================================================
FILE: hockey_scraper/nhl/shifts/html_shifts.py
================================================
"""
This module contains functions to scrape the Html Toi Tables (or shifts) for any given game
"""
import re
import pandas as pd
from bs4 import BeautifulSoup
import hockey_scraper.utils.shared as shared
def get_shifts(game_id):
"""
Given a game_id it returns a the shifts for both teams
Ex: http://www.nhl.com/scores/htmlreports/20162017/TV020971.HTM
:param game_id: the game
:return: Shifts or None
"""
game_id = str(game_id)
venue_pgs = tuple()
for venue in ["home", "away"]:
venue_tag = "H" if venue == "home" else "V"
venue_url = 'http://www.nhl.com/scores/htmlreports/{}{}/T{}{}.HTM'.format(game_id[:4], int(game_id[:4])+1, venue_tag, game_id[4:])
page_info = {
"url": venue_url,
"name": game_id,
"type": "html_shifts_{}".format(venue),
"season": game_id[:4],
}
venue_pgs += (shared.get_file(page_info), )
return venue_pgs
def get_soup(shifts_html):
"""
Uses Beautiful soup to parses the html document.
Some parsers work for some pages but don't work for others....I'm not sure why so I just try them all here in order
:param shifts_html: html doc
:return: "soupified" html and player_shifts portion of html (it's a bunch of td tags)
"""
parsers = ["lxml", "html.parser", "html5lib"]
for parser in parsers:
soup = BeautifulSoup(shifts_html, parser)
td = soup.findAll(True, {'class': ['playerHeading + border', 'lborder + bborder']})
if len(td) > 0:
break
return td, get_teams(soup)
def get_teams(soup):
"""
Return the team for the TOI tables and the home team
:param soup: souped up html
:return: list with team and home team
"""
team = soup.find('td', class_='teamHeading + border') # Team for shifts
team = team.get_text()
# Get Home Team
teams = soup.find_all('td', {'align': 'center', 'style': 'font-size: 10px;font-weight:bold'})
regex = re.compile(r'>(.*)<br/?>')
home_team = regex.findall(str(teams[7]))
return [team, home_team[0]]
def analyze_shifts(shift, name, team, home_team, player_ids):
"""
Analyze shifts for each player when using.
Prior to this each player (in a dictionary) has a list with each entry being a shift.
:param shift: info on shift
:param name: player name
:param team: given team
:param home_team: home team for given game
:param player_ids: dict with info on players
:return: dict with info for shift
"""
shifts = dict()
shifts['Player'] = name.upper()
shifts['Period'] = '4' if shift[1] == 'OT' else shift[1]
shifts['Team'] = shared.get_team(team.strip(' '))
shifts['Start'] = shared.convert_to_seconds(shift[2].split('/')[0])
shifts['Duration'] = shared.convert_to_seconds(shift[4].split('/')[0])
# I've had problems with this one...if there are no digits the time is fucked up
if re.compile('\d+').findall(shift[3].split('/')[0]):
shifts['End'] = shared.convert_to_seconds(shift[3].split('/')[0])
else:
shifts['End'] = shifts['Start'] + shifts['Duration']
try:
if home_team == team:
shifts['Player_Id'] = player_ids['Home'][name.upper()]['id']
else:
shifts['Player_Id'] = player_ids['Away'][name.upper()]['id']
except KeyError:
shifts['Player_Id'] = None
return shifts
def parse_html(html, player_ids, game_id):
"""
Parse the html
Note: Don't fuck with this!!! I'm not exactly sure how or why but it works.
:param html: cleaned up html
:param player_ids: dict of home and away players
:param game_id: id for game
:return: DataFrame with info
"""
all_shifts = []
columns = ['Game_Id', 'Player', 'Player_Id', 'Period', 'Team', 'Start', 'End', 'Duration']
td, teams = get_soup(html)
team = teams[0]
home_team = teams[1]
players = dict()
# The list 'td' is laid out with player name followed by every component of each shift. Each shift contains:
# shift #, Period, begin, end, and duration. The shift event isn't included.
for t in td:
t = t.get_text()
if ',' in t: # If it has a comma in it we know it's a player's name...so add player to dict
name = t
# Just format the name normally...it's coded as: 'num last_name, first_name'
name = name.split(',')
name = ' '.join([name[1].strip(' '), name[0][2:].strip(' ')])
name = shared.fix_name(name)
players[name] = dict()
players[name]['number'] = name[0][:2].strip()
players[name]['Shifts'] = []
else:
# Here we add all the shifts to whatever player we are up to
players[name]['Shifts'].extend([t])
for key in players.keys():
# Create a list of lists (each length 5)...corresponds to 5 columns in html shifts
players[key]['Shifts'] = [players[key]['Shifts'][i:i + 5] for i in range(0, len(players[key]['Shifts']), 5)]
# Parse each shift
shifts = [analyze_shifts(shift, key, team, home_team, player_ids) for shift in players[key]['Shifts']]
all_shifts.extend(shifts)
df = pd.DataFrame(all_shifts)
df['Game_Id'] = str(game_id)[5:]
return df[columns]
def scrape_game(game_id, players):
"""
Scrape the game.
:param game_id: id for game
:param players: list of players
:return: DataFrame with info for the game
"""
columns = ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration']
home_html, away_html = get_shifts(game_id)
if home_html is None or away_html is None:
shared.print_error("Html shifts for game {} is either not there or can't be obtained".format(game_id))
return pd.DataFrame()
try:
away_df = parse_html(away_html, players, game_id)
home_df = parse_html(home_html, players, game_id)
except Exception as e:
shared.print_error('Error parsing Html shifts for game {} {}'.format(game_id, e))
return pd.DataFrame()
# Combine the two
game_df = pd.concat([away_df, home_df], ignore_index=True)
game_df = pd.DataFrame(game_df, columns=columns)
game_df = game_df.sort_values(by=['Period', 'Start'], ascending=[True, True])
return game_df.reset_index(drop=True)
================================================
FILE: hockey_scraper/nhl/shifts/json_shifts.py
================================================
"""
This module contains functions to scrape the Json toi/shifts for any given game
"""
import json
import pandas as pd
import hockey_scraper.utils.shared as shared
def get_shifts(game_id):
"""
Given a game_id it returns the raw json
Ex: https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId=2019020001
:param game_id: the game
:return: json or None
"""
page_info = {
"url": 'https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId={}'.format(game_id),
"name": str(game_id),
"type": "json_shifts",
"season": str(game_id)[:4],
}
response = shared.get_file(page_info)
# Return empty dict if can't get page
if not response:
return {}
else:
return json.loads(response)
def fix_team_tricode(tricode):
"""
Some of the tricodes are different than how I want them
:param tricode: 3 letter team name - ex: NYR
:return: fixed tricode
"""
fixed_tricodes = {
'TBL': 'T.B',
'LAK': 'L.A',
'NJD': 'N.J',
'SJS': 'S.J'
}
return fixed_tricodes.get(tricode.upper(), tricode)
def parse_shift(shift):
"""
Parse shift for json
:param shift: json for shift
:return: dict with shift info
"""
shift_dict = dict()
# At the end of the json they list when all the goal events happened. We don't want them...
# They are the only one's which have their eventDescription be not null
if shift['eventDescription'] is not None:
return {}
name = shared.fix_name(' '.join([shift['firstName'].strip(' '), shift['lastName'].strip(' ')]))
shift_dict['Player'] = name
shift_dict['Player_Id'] = shift['playerId']
shift_dict['Period'] = shift['period']
shift_dict['Team'] = fix_team_tricode(shift['teamAbbrev'])
shift_dict['Start'] = shared.convert_to_seconds(shift['startTime'])
shift_dict['End'] = shared.convert_to_seconds(shift['endTime'])
shift_dict['Duration'] = shared.convert_to_seconds(shift['duration'])
return shift_dict
def parse_json(shift_json, game_id):
"""
Parse the json
:param shift_json: raw json
:param game_id: if of game
:return: DataFrame with info
"""
columns = ['Game_Id', 'Period', 'Team', 'Player', 'Player_Id', 'Start', 'End', 'Duration']
shifts = [parse_shift(shift) for shift in shift_json['data']] # Go through the shifts
shifts = [shift for shift in shifts if shift != {}] # Get rid of null shifts (which happen at end)
df = pd.DataFrame(shifts, columns=columns)
df['Game_Id'] = str(game_id)[5:]
df = df.sort_values(by=['Period', 'Start'], ascending=[True, True])
return df.reset_index(drop=True)
def scrape_game(game_id):
"""
Scrape the game.
:param game_id: game
:return: DataFrame with info for the game
"""
shifts_json = get_shifts(game_id)
if not shifts_json:
#shared.print_error("Json shifts for game {} is either not there or can't be obtained".format(game_id))
return pd.DataFrame()
try:
game_df = parse_json(shifts_json, game_id)
except Exception as e:
shared.print_error('Error parsing Json shifts for game {} {}'.format(game_id, e))
return pd.DataFrame()
return game_df
================================================
FILE: hockey_scraper/nwhl/__init__.py
================================================
================================================
FILE: hockey_scraper/nwhl/game_pbp.py
================================================
"""
Scrape the PBP info for a given game
"""
import json
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
import hockey_scraper.utils.shared as shared
import hockey_scraper.utils.save_pages as sp
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, ElementNotVisibleException, WebDriverException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
options = Options()
options.add_argument("--headless")
def scrape_page(url):
"""
:param url: Game pbp url
:return n pages - each have period info
"""
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)
driver.get(url)
time.sleep(8)
"""
for _ in range(5):
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
time.sleep(.2)
#['SO', 'OT', 'OT1', 'OT2', '3', '2', '1']:
for period in ['3', '2', '1']:
btn = '//a[@ng-click="ctrl.period = \'{}\'"]'.format(period)
try:
wait.until(EC.element_to_be_clickable((By.XPATH, btn)))
#driver.find_element_by_xpath(btn).click()
btn_elem = driver.find_element_by_xpath(btn)
ActionChains(driver).move_to_element(btn_elem).click().perform()
except (TimeoutException, ElementNotVisibleException, WebDriverException) as e:
print(e)
### This just print the last row in the list to see if we are correctly toggling between periods
soup = BeautifulSoup(driver.page_source, "lxml")
plays_table = soup.find("table", {"class": "play-by-play"})
plays = plays_table.find_all("tr")
print(plays[-1])
"""
pg = driver.page_source
driver.close()
return pg
def get_pbp(game_id):
"""
Get the response for a game (e.g. https://www.nwhl.zone/stats#/100/game/268087/play-by-play)
:param game_id: Given Game id (e.g. 268087)
:return:
"""
file_info = {
"url": 'https://www.nwhl.zone/stats#/100/game/{}/play-by-play'.format(game_id),
"name": str(game_id),
"type": "nwhl_json_pbp",
"season": "nwhl",
'dir': shared.docs_dir
}
# Saved pages logic is here bec. of button logic in scrape_pbp
if shared.docs_dir and sp.check_file_exists(file_info) and not shared.re_scrape:
# TODO: Regex matching game_id
pgs = sp.get_page(file_info)
else:
pgs = scrape_page(file_info['url'])
# We have to save each individually
#for i, pg in enumerate(pgs):
i=1
file_info['name'] += "_{}".format(i)
sp.save_page(pgs, file_info)
return pgs
def parse_event(event, score, teams, date, game_id, players):
"""
Parses a single event when the info is in a json format
:param event: json of event
:param score: Current score of the game
:param teams: Teams dict (id -> name)
:param date: date of the game
:param game_id: game id for game
:param players: Dict of player ids to player names
:return: dictionary with the info
"""
play = dict()
def parse_json(game_json, game_id,):
"""
Scrape the json for a game
plus, minus players
:param game_json: raw json
:param game_id: game id for game
:return: Either a DataFrame with info for the game
"""
cols = ['game_id', 'date', 'season', 'period', 'seconds_elapsed', 'event', 'ev_team', 'home_team', 'away_team',
'p1_name', 'p1_id', 'p2_name', 'p2_id', 'p3_name', 'p3_id',
"homePlayer1", "homePlayer1_id", "homePlayer2", "homePlayer2_id", "homePlayer3", "homePlayer3_id",
"homePlayer4", "homePlayer4_id", "homePlayer5", "homePlayer5_id", "homePlayer6", "homePlayer6_id",
"awayPlayer1", "awayPlayer1_id", "awayPlayer2", "awayPlayer2_id", "awayPlayer3", "awayPlayer3_id",
"awayPlayer4", "awayPlayer4_id", "awayPlayer5", "awayPlayer5_id", "awayPlayer6", "awayPlayer6_id",
'home_goalie', 'home_goalie_id', 'away_goalie', 'away_goalie_id', 'details', 'home_score', 'away_score',
'xC', 'yC', 'play_index']
# B4 anything - if there are no plays we leave
if len(game_json['plays']) == 0:
shared.print_error("The Json pbp for game {} contains no plays and therefore can't be parsed".format(game_id))
return pd.DataFrame()
# Get all the players in the game
players = get_roster(game_json)
# Initialize & Update as we go along
score = {"home": 0, "away": 0}
teams = {"home": {"id": game_json['game']['home_team'], "name": game_json['team_instance'][0]['abbrev']},
"away": {"id": game_json['game']['away_team'], "name": game_json['team_instance'][1]['abbrev']}
}
# Get date from UTC timestamp
date = game_json['plays'][0]['created_at']
date = datetime.datetime.strptime(date[:date.rfind("-")], "%Y-%m-%dT%H:%M:%S").strftime("%Y-%m-%d")
try:
events = [parse_event(play, score, teams, date, game_id, players) for play in game_json['plays']]
except Exception as e:
shared.print_error('Error parsing Json pbp for game {} {}'.format(game_id, e))
return pd.DataFrame()
df = pd.DataFrame(events, columns=cols)
# Get rid of null events and order by play index
df = df[(~pd.isnull(df['event'])) & (df['event'] != "")]
df = df.sort_values(by=['play_index'])
df = df.drop(['play_index'], axis=1)
return df.reset_index(drop=True)
def scrape_pbp(game_id):
"""
Scrape the pbp data for a given game
:param game_id: Given Game id (e.g. 18507472)
:return: DataFrame with pbp info
"""
game_json = get_pbp(game_id)
if not game_json:
shared.print_error("Pbp for game {} is not either not there or can't be obtained".format(game_id))
return pd.DataFrame()
try:
game_df = parse_json(game_json, game_id)
except Exception as e:
shared.print_error('Error parsing the Pbp for game {} {}'.format(game_id, e))
return pd.DataFrame()
return game_df
================================================
FILE: hockey_scraper/nwhl/scrape_functions.py
================================================
"""
Functions to scrape by season, games, and date range
"""
import random
import pandas as pd
#from . import html_schedule, json_pbp
import hockey_scraper.utils.shared as shared
# All columns for the pbp
cols = ['game_id', 'date', 'season', 'period', 'seconds_elapsed', 'event', 'ev_team', 'home_team', 'away_team',
'p1_name', 'p1_id', 'p2_name', 'p2_id', 'p3_name', 'p3_id',
"homePlayer1", "homePlayer1_id", "homePlayer2", "homePlayer2_id", "homePlayer3", "homePlayer3_id",
"homePlayer4", "homePlayer4_id", "homePlayer5", "homePlayer5_id", "homePlayer6", "homePlayer6_id",
"awayPlayer1", "awayPlayer1_id", "awayPlayer2", "awayPlayer2_id", "awayPlayer3", "awayPlayer3_id",
"awayPlayer4", "awayPlayer4_id", "awayPlayer5", "awayPlayer5_id", "awayPlayer6", "awayPlayer6_id",
'home_goalie', 'home_goalie_id', 'away_goalie', 'away_goalie_id', 'details', 'home_score', 'away_score',
'xC', 'yC']
# Hold any games we didn't scrape for any reason
broken_games = []
def print_errors():
"""
Print any scraping errors.
:return: None
"""
global broken_games
if broken_games:
print('\nBroken pbp:')
for x in broken_games:
print(x)
broken_games = []
def scrape_list_of_games(games):
"""
Scrape an arbitrary list of games given the game id's
:param games: List of game_id's to scrape
:return: DataFrame of pbp info
"""
pbp_dfs = []
for game in games:
print(' '.join(['Scraping NWHL Game ', str(game)]))
pbp_df = json_pbp.scrape_pbp(game)
if not pbp_df.empty:
pbp_dfs.append(pbp_df)
else:
broken_games.append(game)
# If not empty...
if pbp_dfs:
return pd.concat(pbp_dfs, sort=True).reset_index(drop=True)[cols]
return None
def scrape_games(games, data_format='csv', rescrape=False, docs_dir=None):
"""
Scrape a list of games
:param games: list of game_ids
:param data_format: format you want data in - csv or pandas (csv is default)
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping
:return: Dictionary with DataFrames or None
"""
# First check if the inputs are good
shared.check_data_format(data_format)
# Check on the docs_dir and re_scrape
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
pbp_df = scrape_list_of_games(games)
print_errors()
if data_format.lower() == 'csv':
shared.to_csv(str(random.randint(1, 101)), pbp_df, None, "nwhl")
else:
return pbp_df
def scrape_date_range(from_date, to_date, data_format='csv', rescrape=False, docs_dir=None):
"""
Scrape games in given date range
:param from_date: date you want to scrape from
:param to_date: date you want to scrape to
:param data_format: format you want data in - csv or pandas (csv is default)
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (def. = None)
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping. (default is None)
:return: Dictionary with DataFrames and errors or None
"""
# First check if the inputs are good
shared.check_data_format(data_format)
shared.check_valid_dates(from_date, to_date)
# Check on the docs_dir and re_scrape
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
# Get dates and convert to just a list of game ids
games = html_schedule.scrape_dates(from_date, to_date)
game_ids = [game['game_id'] for game in games]
# Scrape all PBP
pbp_df = scrape_list_of_games(game_ids)
# Merge in subtype
pbp_df = pd.merge(pbp_df, pd.DataFrame(games, columns=['game_id', 'sub_type']), on="game_id", how="left")
print_errors()
if data_format.lower() == 'csv':
shared.to_csv(from_date + '--' + to_date, pbp_df, None, "nwhl")
else:
return pbp_df
def scrape_seasons(seasons, data_format='csv', rescrape=False, docs_dir=None):
"""
Given list of seasons it scrapes all the seasons
:param seasons: list of seasons
:param data_format: format you want data in - csv or pandas (csv is default)
:param rescrape: If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
:param docs_dir: Directory that either contains previously scraped docs or one that you want them to be deposited
in after scraping
:return: Dictionary with DataFrames and errors or None
"""
# First check if the inputs are good
shared.check_data_format(data_format)
# Check on the docs_dir and re_scrape
shared.add_dir(docs_dir)
shared.if_rescrape(rescrape)
# Holds all seasons scraped (if not csv)
master_pbps = []
for season in seasons:
games = html_schedule.scrape_seasons(season)
game_ids = [game['game_id'] for game in games]
# Scrape all PBP
pbp_df = scrape_list_of_games(game_ids)
# Merge in subtype
pbp_df = pd.merge(pbp_df, pd.DataFrame(games, columns=['game_id', 'sub_type']), on="game_id", how="left")
if data_format.lower() == 'csv':
shared.to_csv(str(season) + str(season + 1), pbp_df, None, "nwhl")
else:
master_pbps.append(pbp_df)
print_errors()
if data_format.lower() == 'pandas':
return pd.concat(master_pbps, sort=True)
================================================
FILE: hockey_scraper/nwhl/scrape_schedule.py
================================================
"""
Scrape the schedule info for nwhl games
"""
import time
from datetime import datetime
import re
from bs4 import BeautifulSoup
import hockey_scraper.utils.shared as shared
import hockey_scraper.utils.save_pages as sp
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
def scrape_dynamic(url):
"""
Dynamically scrape a given url and scroll down.
:param url: Page to get
:return source page
"""
browser = webdriver.Chrome(chrome_options=options)
browser.get(url)
time.sleep(5)
# Scroll down to get all the games - Do it a few times to make sure
for _ in range(5):
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
time.sleep(.2)
pg = browser.page_source
browser.close()
return pg
def get_schedule(url, name):
"""
Given a url it returns the raw html
:param url: url for page
:param name: Name for saved file
:return: raw html of game
"""
file_info = {
"url": url,
"name": str(name) + "_schedule",
"type": "html_schedule_nwhl",
"season": "nwhl",
'dir': shared.docs_dir
}
# Done manually due to custom scraping logic
if shared.docs_dir and sp.check_file_exists(file_info) and not shared.re_scrape:
pg = sp.get_page(file_info)
else:
pg = scrape_dynamic(file_info['url'])
sp.save_page(pg, file_info)
return pg
def get_season_codes():
"""
They use fucked up codes instead of actual years to represent seas
gitextract_48coyrer/
├── .gitignore
├── .vscode/
│ └── settings.json
├── CHANGELOG.rst
├── LICENSE.txt
├── MANIFEST.in
├── README.rst
├── docs/
│ ├── Makefile
│ ├── make.bat
│ └── source/
│ ├── cli.rst
│ ├── conf.py
│ ├── index.rst
│ ├── license_link.rst
│ ├── live_scrape.rst
│ ├── nhl_scrape_functions.rst
│ └── nwhl_scrape_functions.rst
├── hockey_scraper/
│ ├── __init__.py
│ ├── cli.py
│ ├── nhl/
│ │ ├── __init__.py
│ │ ├── game_scraper.py
│ │ ├── json_schedule.py
│ │ ├── live_scrape.py
│ │ ├── pbp/
│ │ │ ├── __init__.py
│ │ │ ├── espn_pbp.py
│ │ │ ├── html_pbp.py
│ │ │ └── json_pbp.py
│ │ ├── playing_roster.py
│ │ ├── scrape_functions.py
│ │ └── shifts/
│ │ ├── __init__.py
│ │ ├── html_shifts.py
│ │ └── json_shifts.py
│ ├── nwhl/
│ │ ├── __init__.py
│ │ ├── game_pbp.py
│ │ ├── scrape_functions.py
│ │ └── scrape_schedule.py
│ └── utils/
│ ├── __init__.py
│ ├── config.py
│ ├── merge_pbp_shifts.py
│ ├── player_name_fixes.json
│ ├── save_pages.py
│ ├── shared.py
│ ├── team_tri_codes.json
│ └── tri_code_conversion.json
├── readthedocs.yml
├── requirements.txt
├── setup.py
└── tests/
├── __init__.py
├── test_espn_pbp.py
├── test_game_scraper.py
├── test_html_pbp.py
├── test_html_shifts.py
├── test_json_pbp.py
├── test_json_schedule.py
├── test_json_shifts.py
├── test_nwhl.py
├── test_playing_roster.py
├── test_scrape_functions.py
└── test_shared.py
SYMBOL INDEX (208 symbols across 27 files)
FILE: hockey_scraper/cli.py
function validate_args (line 10) | def validate_args(user_args):
function run_cmd (line 38) | def run_cmd(user_args):
function main (line 59) | def main():
FILE: hockey_scraper/nhl/game_scraper.py
function check_goalie (line 33) | def check_goalie(row):
function get_players_json (line 50) | def get_players_json(game_json):
function combine_players_lists (line 94) | def combine_players_lists(json_players, roster_players, game_id):
function get_teams_and_players (line 125) | def get_teams_and_players(game_json, roster, game_id):
function combine_html_json_pbp (line 146) | def combine_html_json_pbp(json_df, html_df, game_id, date):
function combine_espn_html_pbp (line 194) | def combine_espn_html_pbp(html_df, espn_df, game_id, date, away_team, ho...
function scrape_pbp_live (line 232) | def scrape_pbp_live(game_id, date, roster, game_json, players, teams, es...
function scrape_pbp (line 251) | def scrape_pbp(game_id, date, roster, game_json, players, teams, espn_id...
function scrape_shifts (line 303) | def scrape_shifts(game_id, players, date):
function scrape_game (line 332) | def scrape_game(game_id, date, if_scrape_shifts):
FILE: hockey_scraper/nhl/json_schedule.py
function get_schedule (line 15) | def get_schedule(date):
function chunk_schedule_calls (line 34) | def chunk_schedule_calls(from_date, to_date):
function scrape_schedule (line 61) | def scrape_schedule(date_from, date_to, preseason=False, not_over=False):
function get_dates (line 118) | def get_dates(games):
FILE: hockey_scraper/nhl/live_scrape.py
function set_docs_dir (line 16) | def set_docs_dir(user_dir):
function check_date_format (line 29) | def check_date_format(date):
class LiveGame (line 45) | class LiveGame:
method __init__ (line 67) | def __init__(self, game_id, start_time, home_team, away_team, status, ...
method pbp_df (line 105) | def pbp_df(self):
method shifts_df (line 112) | def shifts_df(self):
method prev_pbp_df (line 119) | def prev_pbp_df(self):
method prev_pbp_df (line 126) | def prev_pbp_df(self):
method scrape (line 133) | def scrape(self, force=False):
method scrape_live_game (line 147) | def scrape_live_game(self, force=False):
method is_ongoing (line 214) | def is_ongoing(self):
method time_until_game (line 239) | def time_until_game(self):
method is_game_over (line 252) | def is_game_over(self, prev=False):
method is_intermission (line 266) | def is_intermission(self, prev=False):
class ScrapeLiveGames (line 281) | class ScrapeLiveGames:
method __init__ (line 292) | def __init__(self, date, preseason=False, if_scrape_shifts=False, paus...
method get_games (line 313) | def get_games(self):
method get_espn_ids (line 337) | def get_espn_ids(self, games):
method update_live_games (line 359) | def update_live_games(self, force=False, sleep_next=False):
method sleep_next_game (line 378) | def sleep_next_game(self):
method finished (line 400) | def finished(self):
FILE: hockey_scraper/nhl/pbp/espn_pbp.py
function event_type (line 12) | def event_type(play_description):
function get_game_ids (line 27) | def get_game_ids(response):
function get_teams (line 43) | def get_teams(response):
function get_espn_date (line 82) | def get_espn_date(date):
function get_espn_game_id (line 105) | def get_espn_game_id(date, home_team, away_team):
function get_espn_game (line 126) | def get_espn_game(date, home_team, away_team, game_id=None):
function parse_event (line 160) | def parse_event(event):
function parse_espn (line 185) | def parse_espn(espn_xml):
function scrape_game (line 215) | def scrape_game(date, home_team, away_team, game_id=None):
FILE: hockey_scraper/nhl/pbp/html_pbp.py
function cur_game_status (line 11) | def cur_game_status(doc):
function get_pbp (line 35) | def get_pbp(game_id):
function get_contents (line 58) | def get_contents(game_html):
function strip_html_pbp (line 85) | def strip_html_pbp(td):
function clean_html_pbp (line 128) | def clean_html_pbp(html):
function add_home_zone (line 146) | def add_home_zone(event_dict, home_team):
function add_zone (line 181) | def add_zone(event_dict, play_description):
function add_type (line 203) | def add_type(event_dict, event, players, home_team):
function add_strength (line 220) | def add_strength(event_dict, home_players, away_players):
function add_event_team (line 241) | def add_event_team(event_dict, event):
function add_period (line 258) | def add_period(event_dict, event):
function add_time (line 273) | def add_time(event_dict, event):
function add_score (line 290) | def add_score(event_dict, event, current_score, home_team):
function get_penalty (line 313) | def get_penalty(play_description, players, home_team):
function get_player_name (line 352) | def get_player_name(number, players, team, home_team):
function if_valid_event (line 384) | def if_valid_event(event):
function return_name_html (line 401) | def return_name_html(info):
function shot_type (line 414) | def shot_type(play_description):
function parse_fac (line 437) | def parse_fac(description, players, ev_team, home_team):
function parse_shot_miss_take_give (line 469) | def parse_shot_miss_take_give(description, players, ev_team, home_team):
function parse_hit (line 497) | def parse_hit(description, players, home_team):
function parse_block (line 526) | def parse_block(description, players, home_team):
function parse_goal (line 558) | def parse_goal(description, players, ev_team, home_team):
function parse_penalty (line 593) | def parse_penalty(description, players, home_team):
function add_event_players (line 638) | def add_event_players(event_dict, event, players, home_team):
function populate_players (line 671) | def populate_players(event_dict, players, away_players, home_players):
function parse_event (line 720) | def parse_event(event, players, home_team, current_score):
function parse_html (line 756) | def parse_html(html, players, teams):
function scrape_pbp (line 790) | def scrape_pbp(game_html, game_id, players, teams):
function scrape_game_live (line 824) | def scrape_game_live(game_id, players, teams):
function scrape_game (line 838) | def scrape_game(game_id, players, teams):
FILE: hockey_scraper/nhl/pbp/json_pbp.py
function get_pbp (line 11) | def get_pbp(game_id):
function get_teams (line 35) | def get_teams(pbp_json):
function change_event_name (line 54) | def change_event_name(event):
function parse_event (line 86) | def parse_event(event):
function parse_json (line 133) | def parse_json(game_json, game_id):
function scrape_game (line 161) | def scrape_game(game_id):
FILE: hockey_scraper/nhl/playing_roster.py
function get_roster (line 9) | def get_roster(game_id):
function get_content (line 30) | def get_content(roster):
function fix_name (line 52) | def fix_name(player):
function get_coaches (line 68) | def get_coaches(soup):
function get_players (line 90) | def get_players(soup):
function scrape_roster (line 152) | def scrape_roster(game_id):
FILE: hockey_scraper/nhl/scrape_functions.py
function print_errors (line 13) | def print_errors(detailed=True):
function scrape_list_of_games (line 60) | def scrape_list_of_games(games, if_scrape_shifts, verbose=False):
function scrape_schedule (line 100) | def scrape_schedule(from_date, to_date, data_format='pandas', rescrape=F...
function scrape_date_range (line 133) | def scrape_date_range(from_date, to_date, if_scrape_shifts, data_format=...
function scrape_seasons (line 168) | def scrape_seasons(seasons, if_scrape_shifts, data_format='csv', preseas...
function scrape_games (line 214) | def scrape_games(games, if_scrape_shifts, data_format='csv', rescrape=Fa...
FILE: hockey_scraper/nhl/shifts/html_shifts.py
function get_shifts (line 11) | def get_shifts(game_id):
function get_soup (line 39) | def get_soup(shifts_html):
function get_teams (line 60) | def get_teams(soup):
function analyze_shifts (line 79) | def analyze_shifts(shift, name, team, home_team, player_ids):
function parse_html (line 117) | def parse_html(html, player_ids, game_id):
function scrape_game (line 169) | def scrape_game(game_id, players):
FILE: hockey_scraper/nhl/shifts/json_shifts.py
function get_shifts (line 10) | def get_shifts(game_id):
function fix_team_tricode (line 35) | def fix_team_tricode(tricode):
function parse_shift (line 53) | def parse_shift(shift):
function parse_json (line 82) | def parse_json(shift_json, game_id):
function scrape_game (line 103) | def scrape_game(game_id):
FILE: hockey_scraper/nwhl/game_pbp.py
function scrape_page (line 28) | def scrape_page(url):
function get_pbp (line 76) | def get_pbp(game_id):
function parse_event (line 112) | def parse_event(event, score, teams, date, game_id, players):
function parse_json (line 130) | def parse_json(game_json, game_id,):
function scrape_pbp (line 184) | def scrape_pbp(game_id):
FILE: hockey_scraper/nwhl/scrape_functions.py
function print_errors (line 24) | def print_errors():
function scrape_list_of_games (line 39) | def scrape_list_of_games(games):
function scrape_games (line 62) | def scrape_games(games, data_format='csv', rescrape=False, docs_dir=None):
function scrape_date_range (line 90) | def scrape_date_range(from_date, to_date, data_format='csv', rescrape=Fa...
function scrape_seasons (line 128) | def scrape_seasons(seasons, data_format='csv', rescrape=False, docs_dir=...
FILE: hockey_scraper/nwhl/scrape_schedule.py
function scrape_dynamic (line 19) | def scrape_dynamic(url):
function get_schedule (line 42) | def get_schedule(url, name):
function get_season_codes (line 69) | def get_season_codes():
function parse_game (line 97) | def parse_game(game, season):
function get_season_games (line 144) | def get_season_games(season, season_code):
function scrape_dates (line 171) | def scrape_dates(from_date, to_date):
function scrape_season (line 201) | def scrape_season(season):
FILE: hockey_scraper/utils/merge_pbp_shifts.py
function label_priority (line 4) | def label_priority(row):
function group_shifts_cols (line 34) | def group_shifts_cols(shifts, type_group_cols):
function group_shifts_type (line 63) | def group_shifts_type(shifts, player_cols, player_id_cols):
function group_shifts (line 99) | def group_shifts(games_df, shifts):
function merge (line 141) | def merge(pbp_df, shifts_df):
FILE: hockey_scraper/utils/save_pages.py
function create_base_file_path (line 10) | def create_base_file_path(file_info):
function is_compressed (line 27) | def is_compressed(file_info):
function create_dir_structure (line 39) | def create_dir_structure(dir_name):
function create_season_dirs (line 56) | def create_season_dirs(file_info):
function check_file_exists (line 75) | def check_file_exists(file_info):
function get_page (line 97) | def get_page(file_info):
function save_page (line 118) | def save_page(page, file_info):
FILE: hockey_scraper/utils/shared.py
function fix_name (line 31) | def fix_name(name):
function get_team (line 42) | def get_team(team):
function convert_tricode (line 49) | def convert_tricode(tri):
function custom_formatwarning (line 58) | def custom_formatwarning(msg, *args, **kwargs):
function print_error (line 68) | def print_error(msg):
function print_warning (line 88) | def print_warning(msg):
function get_logger (line 102) | def get_logger(python_file):
function log_error (line 126) | def log_error(err, py_file):
function get_season (line 139) | def get_season(date):
function season_start_bound (line 163) | def season_start_bound(year):
function season_end_bound (line 185) | def season_end_bound(year):
function convert_to_seconds (line 202) | def convert_to_seconds(minutes):
function if_rescrape (line 222) | def if_rescrape(user_rescrape):
function add_dir (line 238) | def add_dir(user_dir):
function scrape_page (line 275) | def scrape_page(url):
function get_file (line 309) | def get_file(file_info, force=False):
function check_data_format (line 334) | def check_data_format(data_format):
function check_valid_dates (line 348) | def check_valid_dates(from_date, to_date):
function to_csv (line 365) | def to_csv(base_file_name, df, league, file_type):
FILE: setup.py
function read (line 5) | def read():
FILE: tests/test_game_scraper.py
function players (line 11) | def players():
function pbp_columns (line 60) | def pbp_columns():
function shifts_columns (line 73) | def shifts_columns():
function test_scrape_game (line 77) | def test_scrape_game(pbp_columns, shifts_columns):
function test_combine_players_lists (line 102) | def test_combine_players_lists(players):
FILE: tests/test_html_pbp.py
function game_id (line 13) | def game_id():
function cleaned_html (line 18) | def cleaned_html(game_id):
function pbp_cols (line 23) | def pbp_cols():
function event (line 36) | def event():
function players (line 45) | def players():
function teams (line 97) | def teams():
function current_score (line 102) | def current_score():
function test_parse_event (line 106) | def test_parse_event(event, players, teams, current_score):
function test_parse_html (line 128) | def test_parse_html(pbp_cols, players, teams, cleaned_html):
function test_get_pbp (line 137) | def test_get_pbp():
function test_get_soup (line 142) | def test_get_soup():
function test_strip_html_pbp (line 147) | def test_strip_html_pbp():
function test_clean_html_pbp (line 152) | def test_clean_html_pbp():
FILE: tests/test_html_shifts.py
function shift_cols (line 11) | def shift_cols():
function game_id (line 16) | def game_id():
function player_ids (line 20) | def player_ids():
function shifts_html (line 72) | def shifts_html():
function shifts_dfs (line 78) | def shifts_dfs(shifts_html, player_ids, game_id):
function test_get_shifts (line 85) | def test_get_shifts(shifts_html):
function test_get_soup (line 94) | def test_get_soup(shifts_html):
function test_analyze_shifts (line 112) | def test_analyze_shifts(player_ids):
function test_parse_html (line 125) | def test_parse_html(shifts_dfs, shift_cols):
function test_scrape_game (line 142) | def test_scrape_game(game_id, player_ids, shifts_dfs, shift_cols):
FILE: tests/test_json_pbp.py
function test_get_pbp (line 8) | def test_get_pbp():
function test_get_teams (line 14) | def test_get_teams():
function test_parse_json (line 19) | def test_parse_json():
function test_parse_event (line 34) | def test_parse_event():
FILE: tests/test_json_schedule.py
function test_get_schedule (line 7) | def test_get_schedule():
function test_scrape_schedule (line 12) | def test_scrape_schedule():
function test_get_dates (line 18) | def test_get_dates():
FILE: tests/test_json_shifts.py
function test_get_shifts (line 8) | def test_get_shifts():
function test_scrape_shifts (line 14) | def test_scrape_shifts():
function test_parse_shift (line 29) | def test_parse_shift():
FILE: tests/test_playing_roster.py
function scraped_roster (line 9) | def scraped_roster():
function test_fix_name (line 13) | def test_fix_name():
function test_get_players (line 20) | def test_get_players(scraped_roster):
function test_get_coaches (line 79) | def test_get_coaches(scraped_roster):
function test_scrape_roster (line 88) | def test_scrape_roster(scraped_roster):
FILE: tests/test_scrape_functions.py
function test_scrape_list_of_games (line 8) | def test_scrape_list_of_games():
FILE: tests/test_shared.py
function file_info (line 11) | def file_info():
function test_check_data_format (line 20) | def test_check_data_format():
function test_check_valid_dates (line 31) | def test_check_valid_dates():
function test_convert_to_seconds (line 39) | def test_convert_to_seconds():
function test_get_season (line 45) | def test_get_season():
function test_scrape_page (line 54) | def test_scrape_page(file_info):
function test_get_file (line 62) | def test_get_file(file_info):
function test_add_dir (line 87) | def test_add_dir():
Condensed preview — 57 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (294K chars).
[
{
"path": ".gitignore",
"chars": 1270,
"preview": ".DS_Store\n.idea/\n.csv/\ntests.py\nnotes.txt\nbuild/\n.pytest_cache\nupdate_season_data.py\n\n# Byte-compiled / optimized / DLL "
},
{
"path": ".vscode/settings.json",
"chars": 3,
"preview": "{\n}"
},
{
"path": "CHANGELOG.rst",
"chars": 2312,
"preview": "v1.2.6\n------\n\n * Added test coverage for most modules using pytest\n * Refactored large portion of 'html_pbp.py' and c"
},
{
"path": "LICENSE.txt",
"chars": 35150,
"preview": " GNU GENERAL PUBLIC LICENSE\n Version 3, 29 June 2007\n\n Copyright (C) 2007 Free "
},
{
"path": "MANIFEST.in",
"chars": 20,
"preview": "include README.rst\n\n"
},
{
"path": "README.rst",
"chars": 7718,
"preview": "\nThis repository is no longer maintained. Feel free to fork it.\n========================================================"
},
{
"path": "docs/Makefile",
"chars": 615,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHI"
},
{
"path": "docs/make.bat",
"chars": 822,
"preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
},
{
"path": "docs/source/cli.rst",
"chars": 1678,
"preview": "Command Line Interface\n======================\n\nThere also exists a cli tool called `hockey-scraper` which can be used to"
},
{
"path": "docs/source/conf.py",
"chars": 4512,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# hockey_scraper documentation build configuration file, created by\n# s"
},
{
"path": "docs/source/index.rst",
"chars": 303,
"preview": "Hockey-Scraper\n==============\n\nContents\n--------\n.. toctree::\n :maxdepth: 1\n\n nhl_scrape_functions\n live_scrape\n "
},
{
"path": "docs/source/license_link.rst",
"chars": 46,
"preview": "License\n=======\n.. include:: ../../LICENSE.txt"
},
{
"path": "docs/source/live_scrape.rst",
"chars": 11713,
"preview": "Live Scraping\n=============\n\nStandard Usage\n--------------\n\nTo get all the info for every game on a specific day we crea"
},
{
"path": "docs/source/nhl_scrape_functions.rst",
"chars": 6636,
"preview": "NHL Scraping Functions\n======================\n\nScraping\n--------\n\nThere are three ways to scrape games:\n\n\\1. *Scrape by "
},
{
"path": "docs/source/nwhl_scrape_functions.rst",
"chars": 1708,
"preview": "NWHL Scraping Functions\n=======================\n\nScraping\n--------\n\nThere are three ways to scrape games:\n\n\\1. *Scrape b"
},
{
"path": "hockey_scraper/__init__.py",
"chars": 288,
"preview": "from .nhl.live_scrape import ScrapeLiveGames, LiveGame\nfrom .nhl.scrape_functions import scrape_games, scrape_date_range"
},
{
"path": "hockey_scraper/cli.py",
"chars": 4112,
"preview": "\"\"\"\nInterface for running cli commands\n\"\"\"\nimport sys\nimport argparse\nfrom .utils.shared import print_error\nfrom .nhl.sc"
},
{
"path": "hockey_scraper/nhl/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "hockey_scraper/nhl/game_scraper.py",
"chars": 14576,
"preview": "\"\"\"\nThis module contains code to scrape data for a single game\n\"\"\"\n\nimport pandas as pd\n\nimport hockey_scraper.nhl.pbp.e"
},
{
"path": "hockey_scraper/nhl/json_schedule.py",
"chars": 5827,
"preview": "\"\"\"\nThis module contains functions to scrape the json schedule for any games or date range\n\"\"\"\nimport json\nfrom pytz imp"
},
{
"path": "hockey_scraper/nhl/live_scrape.py",
"chars": 15946,
"preview": "\"\"\"\nModule to scrape live game info\n\"\"\"\nimport datetime\nimport time\nimport warnings\nimport pandas as pd\nimport hockey_sc"
},
{
"path": "hockey_scraper/nhl/pbp/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "hockey_scraper/nhl/pbp/espn_pbp.py",
"chars": 7411,
"preview": "\"\"\"\nThis module contains code to scrape coordinates for games off of espn for any given game\n\"\"\"\n\nimport re\nimport xml.e"
},
{
"path": "hockey_scraper/nhl/pbp/html_pbp.py",
"chars": 28248,
"preview": "\"\"\"\nThis module contains functions to scrape the Html Play by Play for any given game\n\"\"\"\n\nimport re\nimport pandas as pd"
},
{
"path": "hockey_scraper/nhl/pbp/json_pbp.py",
"chars": 5381,
"preview": "\"\"\"\nThis module contains functions to scrape the Json Play by Play for any given game\n\"\"\"\n\nimport json\nimport pandas as "
},
{
"path": "hockey_scraper/nhl/playing_roster.py",
"chars": 4865,
"preview": "\"\"\"\nThis module contains functions to scrape the Html game roster for any given game\n\"\"\"\n\nfrom bs4 import BeautifulSoup\n"
},
{
"path": "hockey_scraper/nhl/scrape_functions.py",
"chars": 10719,
"preview": "\"\"\"\nFunctions to scrape by season, games, and date range\n\"\"\"\n\nimport time\nimport pandas as pd\nfrom datetime import datet"
},
{
"path": "hockey_scraper/nhl/shifts/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "hockey_scraper/nhl/shifts/html_shifts.py",
"chars": 6464,
"preview": "\"\"\"\nThis module contains functions to scrape the Html Toi Tables (or shifts) for any given game\n\"\"\"\n\nimport re\nimport pa"
},
{
"path": "hockey_scraper/nhl/shifts/json_shifts.py",
"chars": 3359,
"preview": "\"\"\"\nThis module contains functions to scrape the Json toi/shifts for any given game\n\"\"\"\n\nimport json\nimport pandas as pd"
},
{
"path": "hockey_scraper/nwhl/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "hockey_scraper/nwhl/game_pbp.py",
"chars": 6384,
"preview": "\"\"\"\nScrape the PBP info for a given game\n\"\"\"\nimport json\nimport time\nimport datetime\nimport pandas as pd\nfrom bs4 import"
},
{
"path": "hockey_scraper/nwhl/scrape_functions.py",
"chars": 5726,
"preview": "\"\"\"\nFunctions to scrape by season, games, and date range\n\"\"\"\nimport random\nimport pandas as pd\n\n#from . import html_sche"
},
{
"path": "hockey_scraper/nwhl/scrape_schedule.py",
"chars": 5861,
"preview": "\"\"\"\nScrape the schedule info for nwhl games\n\"\"\"\nimport time\nfrom datetime import datetime\nimport re\nfrom bs4 import Beau"
},
{
"path": "hockey_scraper/utils/__init__.py",
"chars": 35,
"preview": "from .merge_pbp_shifts import merge"
},
{
"path": "hockey_scraper/utils/config.py",
"chars": 338,
"preview": "\"\"\"\nBasic configurations\n\"\"\"\n\n# Directory where to save pages\n# When True assumes ~/hockey_scraper_data\n# Otherwise can "
},
{
"path": "hockey_scraper/utils/merge_pbp_shifts.py",
"chars": 7112,
"preview": "import pandas as pd\n\n\ndef label_priority(row):\n \"\"\"\n Priority for sorting\n \n Courtesy of Matt Barlowe (pre-N"
},
{
"path": "hockey_scraper/utils/player_name_fixes.json",
"chars": 12554,
"preview": "{\n \"_description\": \"Fixes some of the mistakes made with player names (converts to 'correct' name)\",\n \"_comment\": "
},
{
"path": "hockey_scraper/utils/save_pages.py",
"chars": 4847,
"preview": "\"\"\"\nSaves the scraped docs so you don't have to re-scrape them every time you want to parse the docs. \n\n\\**** Don't mess"
},
{
"path": "hockey_scraper/utils/shared.py",
"chars": 11935,
"preview": "\"\"\"\nThis file is a bunch of the shared functions or just general stuff used by the different scrapers in the package.\n\"\""
},
{
"path": "hockey_scraper/utils/team_tri_codes.json",
"chars": 1434,
"preview": "{\n \"_descrition\": \"# All the corresponding tri-codes for team names\",\n\n \"teams\": {\n \"ANAHEIM DUCKS\": \"ANA\","
},
{
"path": "hockey_scraper/utils/tri_code_conversion.json",
"chars": 175,
"preview": "{\n \"_description\": \"Conversion of new tri-code to old\",\n\n \"tri_codes\": { \n \"NJD\": \"N.J\",\n \"TBL\": \"T"
},
{
"path": "readthedocs.yml",
"chars": 142,
"preview": "# .readthedocs.yml\n\nbuild:\n os: ubuntu-20.04 # <- add this line\n tools:\n python: \"3.9\"\n\npython:\n version: 3.9\n s"
},
{
"path": "requirements.txt",
"chars": 125,
"preview": "BeautifulSoup4>=4.5.3\nrequests>=2.14.2\nlxml>=3.7.2\nhtml5lib>=0.999999999\npandas>=0.23.4\nsphinx>=1.5.1\npytest>=3.0.5\npytz"
},
{
"path": "setup.py",
"chars": 1033,
"preview": "import os\nfrom setuptools import setup, find_packages\n\n\ndef read():\n return open(os.path.join(os.path.dirname(__file_"
},
{
"path": "tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/test_espn_pbp.py",
"chars": 2903,
"preview": "# \"\"\" Tests for 'espn_pbp.py' \"\"\"\n\n# import pandas as pd\n# import pytest\n\n# from hockey_scraper.nhl.pbp import espn_pbp\n"
},
{
"path": "tests/test_game_scraper.py",
"chars": 6179,
"preview": "\"\"\" Tests for 'game_scraper.py' \"\"\"\n\nimport pandas as pd\nimport pytest\n\nfrom hockey_scraper.nhl import game_scraper, pla"
},
{
"path": "tests/test_html_pbp.py",
"chars": 8309,
"preview": "\"\"\" Tests for 'html_pbp.py' \"\"\"\n\nimport pandas as pd\nimport pytest\n\nfrom hockey_scraper.nhl.pbp import html_pbp\n\n\n# TODO"
},
{
"path": "tests/test_html_shifts.py",
"chars": 7378,
"preview": "\"\"\" Tests for 'html_shifts.py' \"\"\"\n\nimport bs4\nimport pandas as pd\nimport pytest\n\nfrom hockey_scraper.nhl.shifts import "
},
{
"path": "tests/test_json_pbp.py",
"chars": 2395,
"preview": "\"\"\"Tests for 'json_pbp.py'\"\"\"\n\nimport pandas as pd\n\nfrom hockey_scraper.nhl.pbp import json_pbp\n\n\ndef test_get_pbp():\n "
},
{
"path": "tests/test_json_schedule.py",
"chars": 2889,
"preview": "\"\"\"Tests for 'json_schedule.py'\"\"\"\nimport datetime\n\nfrom hockey_scraper.nhl import json_schedule\n\n\ndef test_get_schedule"
},
{
"path": "tests/test_json_shifts.py",
"chars": 1650,
"preview": "\"\"\"Tests for 'json_shifts.py'\"\"\"\n\nimport pandas as pd\n\nfrom hockey_scraper.nhl.shifts import json_shifts\n\n\ndef test_get_"
},
{
"path": "tests/test_nwhl.py",
"chars": 37,
"preview": "import pandas as pd\nimport pytest\n\n\n\n"
},
{
"path": "tests/test_playing_roster.py",
"chars": 3585,
"preview": "\"\"\"Test for 'playing_roster.py'\"\"\"\n\nimport pytest\n\nfrom hockey_scraper.nhl import playing_roster\n\n\n@pytest.fixture\ndef s"
},
{
"path": "tests/test_scrape_functions.py",
"chars": 1234,
"preview": "\"\"\" Tests for 'scrape_functions.py' \"\"\"\n\nimport pandas as pd\n\nfrom hockey_scraper.nhl import scrape_functions\n\n\ndef test"
},
{
"path": "tests/test_shared.py",
"chars": 3011,
"preview": "\"\"\" Tests for 'shared.py' \"\"\"\n\nimport os\nimport shutil\nimport pytest\n\nfrom hockey_scraper.utils import shared, config\n\n\n"
}
]
About this extraction
This page contains the full source code of the HarryShomer/Hockey-Scraper GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 57 files (272.4 KB), approximately 74.1k tokens, and a symbol index with 208 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.