Repository: cerndb/dist-keras
Branch: master
Commit: 06c4e39954d9
Files: 37
Total size: 126.4 MB
Directory structure:
gitextract_pj297o4k/
├── LICENSE
├── README.md
├── distkeras/
│   ├── __init__.py
│   ├── evaluators.py
│   ├── job_deployment.py
│   ├── networking.py
│   ├── parameter_servers.py
│   ├── predictors.py
│   ├── schemes.py
│   ├── trainers.py
│   ├── transformers.py
│   ├── utils.py
│   └── workers.py
├── docs/
│   ├── index.md
│   ├── license.md
│   └── optimizers.md
├── examples/
│   ├── cifar-10-preprocessing.ipynb
│   ├── data/
│   │   ├── atlas_higgs.csv
│   │   └── mnist.csv
│   ├── distributed_numpy_parsing.ipynb
│   ├── example_0_data_preprocessing.ipynb
│   ├── example_1_analysis.ipynb
│   ├── kafka_producer.py
│   ├── kafka_spark_high_throughput_ml_pipeline.ipynb
│   ├── mnist.ipynb
│   ├── mnist.py
│   ├── mnist_analysis.ipynb
│   ├── mnist_preprocessing.ipynb
│   └── workflow.ipynb
├── mkdocs.yml
├── resources/
│   └── blog-posts/
│       ├── css/
│       │   └── main.css
│       ├── js/
│       │   ├── highlight.pack.js
│       │   └── main.js
│       └── part-1-an-introduction.html
├── scripts/
│   ├── generate_secret.py
│   └── punchcard.py
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
Distributed Deep Learning with Keras and Apache Spark.
Copyright (C) 2016 Joeri Hermans
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
Distributed Keras Copyright (C) 2016 Joeri Hermans
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.
================================================
FILE: README.md
================================================
# Distributed Keras
Distributed Deep Learning with Apache Spark and Keras.
## Introduction
Distributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework so that a new distributed optimizer can be implemented with ease, enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of **ensembles** and models using **data parallel** methods.
Most of the distributed optimizers we provide are based on data parallel methods. A data parallel method, as described in [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf), is a learning paradigm where multiple replicas of a single model are used to optimize a single objective. Using this approach, we are able to significantly reduce the training time of a model. Depending on the parametrization, we also observed that it is possible to achieve better statistical model performance compared to a more traditional approach (e.g., the [SingleTrainer](#single-trainer) implementation), while spending less wallclock time on the training of the model. However, this is subject to further research.
**Attention**: A rather complete introduction to the problem of Distributed Deep Learning is presented in my Master Thesis [http://github.com/JoeriHermans/master-thesis](http://github.com/JoeriHermans/master-thesis). Furthermore, the thesis includes several *novel* insights, such as a redefinition of parameter staleness, and several new distributed optimizers such as AGN and ADAG.
## Installation
This section guides you through the installation of Distributed Keras. We assume that an Apache Spark installation is already available. In the following subsections, we describe two approaches to install the framework.
### pip
When you only require the framework for development purposes, just use `pip` to install dist-keras.
```bash
pip install --upgrade dist-keras
# OR
pip install --upgrade git+https://github.com/JoeriHermans/dist-keras.git
```
### git & pip
However, if you would like to contribute or run some of the examples, it is probably best to clone the repository directly from GitHub and install it afterwards using `pip`. This will also resolve possible missing dependencies.
```bash
git clone https://github.com/JoeriHermans/dist-keras
cd dist-keras
pip install -e .
```
### General notes
#### .bashrc
Make sure the following variables are set in your `.bashrc`. It is possible, depending on your system configuration, that the following configuration **doesn't have to be applied**.
```bash
# Example of a .bashrc configuration.
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH="$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
```
## Running an example
We would like to refer the reader to the `workflow.ipynb` notebook in the examples folder. This will give you a complete introduction to the problem of distributed deep learning, and will guide you through the steps that have to be executed.
Furthermore, we would also like to show how exactly you should process "big" datasets. This is shown in the examples starting with the prefix `example_`. Please execute them in the provided sequence.
### Spark 2.0
If you want to run the examples using Apache Spark 2.0.0 or higher, you will need to remove the line containing `sqlContext = SQLContext(sc)`. This is because in Spark 2.0+, the SQLContext and Hive context have been merged into the Spark session.
## Optimization Algorithms
### Single Trainer
This optimizer follows the traditional scheme of training a model, i.e., it uses sequential gradient updates to optimize the parameters. It does this by executing the training procedure on a single Spark executor.
```python
SingleTrainer(model, features_col, label_col, batch_size, optimizer, loss, metrics=["accuracy"])
```
### ADAG (Currently Recommended)
A DOWNPOUR variant which is able to achieve significantly better statistical performance while being less sensitive to hyperparameters. This optimizer was developed using insights gained while developing this framework. More research regarding parameter staleness is being conducted to further improve this optimizer.
```python
ADAG(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=12)
```
### Dynamic SGD
Dynamic SGD dynamically maintains a learning rate for every worker by incorporating parameter staleness. This optimization scheme was introduced in "Heterogeneity-aware Distributed Parameter Servers" at the SIGMOD 2017 conference [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf).
```python
DynSGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=10)
```
### Asynchronous Elastic Averaging SGD (AEASGD)
The distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) while the master performs exploitation. This approach differs from other settings explored in the literature, and focuses on how fast the center variable converges [[2]](https://arxiv.org/pdf/1412.6651.pdf).
In this section we show the asynchronous version of EASGD. Instead of waiting on the synchronization of other trainers, this method communicates the elastic difference (as described in the paper) with the parameter server. The only synchronization mechanism that has been implemented is to ensure no race conditions occur when updating the center variable.
```python
AEASGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers, batch_size, features_col,
label_col, num_epoch, communication_window, rho, learning_rate)
```
### Asynchronous Elastic Averaging Momentum SGD (AEAMSGD)
Asynchronous EAMSGD is a variant of asynchronous EASGD. It is based on Nesterov's momentum scheme, where the update of the local worker is modified to incorporate a momentum term [[2]](https://arxiv.org/pdf/1412.6651.pdf).
```python
EAMSGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers, batch_size,
features_col, label_col, num_epoch, communication_window, rho,
learning_rate, momentum)
```
### DOWNPOUR
An asynchronous stochastic gradient descent procedure introduced by Dean et al., which supports a large number of model replicas and leverages adaptive learning rates. This implementation is based on the pseudocode provided by [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).
```python
DOWNPOUR(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers, batch_size,
features_col, label_col, num_epoch, learning_rate, communication_window)
```
### Ensemble Training
In ensemble training, we train `n` models in parallel on the same dataset. All models are trained in parallel, but the training of a single model is done in a sequential manner using Keras optimizers. After the training process, one can combine and, for example, average the output of the models.
```python
EnsembleTrainer(keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col,
label_col, batch_size, num_ensembles)
```
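For instance, combining the members' predictions by averaging could look like the following sketch (NumPy only; the arrays are hypothetical stand-ins for the outputs of three trained models on the same batch):

```python
import numpy as np

# Hypothetical class-probability outputs of three ensemble members
# for the same batch of two samples.
outputs = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.6, 0.4], [0.3, 0.7]]),
    np.array([[0.8, 0.2], [0.1, 0.9]]),
]

# Average the predictions element-wise across the ensemble.
averaged = np.mean(outputs, axis=0)

# The averaged distribution yields the final class prediction.
predictions = np.argmax(averaged, axis=1)
```

Other combination strategies (e.g., majority voting on the argmax of every member) are equally possible; averaging is simply the most common choice for probabilistic outputs.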
### Model Averaging
Model averaging is a data parallel technique which will average the trainable parameters of model replicas after every epoch.
```python
AveragingTrainer(keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col,
label_col, num_epoch, batch_size, num_workers)
```
## Job deployment
We also support remote job deployment. For example, imagine you are developing your model in a local notebook using a small development set. To train it at scale, you would normally have to port your code to a cluster job and run it there. To simplify this process, we have developed a simple interface for submitting large-scale machine learning jobs.
In order to submit a job to a remote cluster, you simply run the following code:
```python
# Define the distributed optimization procedure, and its parameters.
trainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, metrics=["accuracy"], num_workers=20,
batch_size=32, communication_window=15, num_epoch=1,
features_col="features_normalized_dense", label_col="label_encoded")
# Define the job parameters.
job = Job(secret, job_name, data_path, num_executors, num_processes, trainer)
job.send('http://yourcluster:[port]')
job.wait_completion()
# Fetch the trained model, and history for training evaluation.
trained_model = job.get_trained_model()
history = job.get_history()
```
### Punchcard Server
Job scheduling and execution are handled by our `Punchcard` server. This server accepts requests from a remote location given a specific `secret`, which is basically a long identification string for a specific user. However, a user can have multiple secrets. At the moment, a job is only executed if no other jobs are running for the specified secret.
In order to submit jobs to `Punchcard`, we need to specify a secrets file. This file is a JSON list with the following structure:
```json
[
{
"secret": "secret_of_user_1",
"identity": "user1"
},
{
"secret": "secret_of_user_2",
"identity": "user2"
}
]
```
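For illustration, a minimal stdlib-only sketch (function names hypothetical) of how a server could load such a file and validate a submitted secret, analogous to what `Punchcard` does internally:

```python
import json

def load_secrets(path):
    # Parse the JSON list of {"secret": ..., "identity": ...} entries.
    with open(path) as f:
        return json.load(f)

def valid_secret(secret, secrets):
    # A submitted secret is accepted when any entry carries it.
    return any(entry["secret"] == secret for entry in secrets)
```

Given the example file above, `valid_secret("secret_of_user_1", secrets)` would evaluate to `True`.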
After the secrets file has been constructed, the Punchcard server can be started by issuing the following command.
```sh
python scripts/punchcard.py --secrets /path/to/secrets.json
```
#### Secret Generation
In order to simplify secret generation, we have added a custom script which generates a unique key for the specified identity. Such an entry can be generated by running the following command.
```sh
python scripts/generate_secret.py --identity userX
```
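The script handles this for you; as a rough illustration of the idea (not the script's exact algorithm), a hard-to-guess secret can be derived from the identity plus random bytes using only the standard library:

```python
import hashlib
import os

def generate_secret(identity):
    # Mix the identity with 32 random bytes and hash the result,
    # yielding a 64-character hexadecimal token.
    material = identity.encode("utf-8") + os.urandom(32)
    return hashlib.sha256(material).hexdigest()

# A secrets-file entry for the hypothetical identity "userX".
entry = {"identity": "userX", "secret": generate_secret("userX")}
```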
## Optimization Schemes
TODO
## General note
It is known that adding more asynchronous workers deteriorates the statistical performance of the model. There have been some studies which examine this particular effect. However, some of them conclude that adding more asynchronous workers actually contributes to something they call **implicit momentum** [[3]](https://arxiv.org/pdf/1605.09774.pdf). This remains subject to further investigation.
## Known issues
- Python 3 compatibility.
## TODO's
List of possible future additions.
- Save Keras model to HDFS.
- Load Keras model from HDFS.
- Compression / decompression of network transmissions.
- Stop on target loss.
- Multiple parameter servers for large Deep Networks.
- Python 3 compatibility.
- For every worker, spawn an additional thread which is responsible for sending updates to the parameter server. The actual worker thread will just submit tasks to this queue.
## Citing
If you use this framework in any academic work, please use the following BibTeX code.
```latex
@misc{dist_keras_joerihermans,
author = {Joeri R. Hermans, CERN IT-DB},
title = {Distributed Keras: Distributed Deep Learning with Apache Spark and Keras},
year = {2016},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/JoeriHermans/dist-keras/}},
}
```
## References
* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231). [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)
* Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693). [[2]](https://arxiv.org/pdf/1412.6651.pdf)
* Mitliagkas, Ioannis, et al. "Asynchrony begets Momentum, with an Application to Deep Learning." arXiv preprint arXiv:1605.09774 (2016). [[3]](https://arxiv.org/pdf/1605.09774.pdf)
* Pumperla, M. (2015). Elephas. GitHub repository. https://github.com/maxpumperla/elephas/ [4]
* Jiawei Jiang, Bin Cui, Ce Zhang and Lele Yu (2017). Heterogeneity-aware Distributed Parameter Servers [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf)
## Licensing
This project is licensed under the GNU GPLv3; see the `LICENSE` file for details.
================================================
FILE: distkeras/__init__.py
================================================
================================================
FILE: distkeras/evaluators.py
================================================
"""Evaluation module.
An evaluator will evaluate a dataframe according to specific requirements.
"""
class Evaluator(object):
"""An evaluator is an abstract class which will, given a label and a prediction,
will compute an evaluation metric.
# Arguments
label_col: string. Column name of the label.
prediction_col: string. Column name of the prediction.
"""
def __init__(self, label_col="label", prediction_col="prediction"):
self.label_column = label_col
self.prediction_column = prediction_col
def evaluate(self, dataframe):
"""Evalutes the specified dataframe.
# Arguments
dataframe: dataframe. Spark Dataframe.
"""
raise NotImplementedError
class AccuracyEvaluator(Evaluator):
"""Computes the accuracy of the prediction based on the label.
# Arguments
label_col: string. Label column.
prediction_col: string. Prediction column.
"""
def __init__(self, label_col="label", prediction_col="prediction"):
# Initialize the parent structure.
super(AccuracyEvaluator, self).__init__(label_col, prediction_col)
def evaluate(self, dataframe):
# Count the total number of instances.
num_instances = dataframe.count()
# Extract the matching indexes.
cleaned = dataframe.where(dataframe[self.prediction_column] == dataframe[self.label_column])
# Fetch the number of correctly guessed instances.
validated_instances = cleaned.count()
return float(validated_instances) / float(num_instances)
================================================
FILE: distkeras/job_deployment.py
================================================
"""Module which facilitates job deployment on remote Spark clusters.
This allows you to build models and architectures on, for example, remote
notebook servers, and submit the large scale training job on remote
Hadoop / Spark clusters."""
## BEGIN Imports. ##############################################################
from distkeras.utils import deserialize_keras_model
from distkeras.utils import get_os_username
from distkeras.utils import pickle_object
from distkeras.utils import serialize_keras_model
from distkeras.utils import unpickle_object
from flask import Flask
from flask import request
from os.path import expanduser
from threading import Lock
import base64
import json
import os
import subprocess
import threading
import time
import urllib2
## END Imports. ################################################################
class Punchcard(object):
def __init__(self, secrets_path="secrets.json", port=80):
self.application = Flask(__name__)
self.secrets_path = secrets_path
self.port = port
self.mutex = threading.Lock()
self.jobs = {}
# Trained models, keyed by secret (used by set_trained_model).
self.models = {}
def read_secrets(self):
with open(self.secrets_path) as f:
secrets_raw = f.read()
secrets = json.loads(secrets_raw)
return secrets
def valid_secret(self, secret, secrets):
num_secrets = len(secrets)
for i in range(0, num_secrets):
description = secrets[i]
if description['secret'] == secret:
return True
return False
def secret_in_use(self, secret):
return secret in self.jobs
def set_trained_model(self, job, model):
with self.mutex:
self.models[job.get_secret()] = model
def get_submitted_job(self, secret):
with self.mutex:
if self.secret_in_use(secret):
job = self.jobs[secret]
else:
job = None
return job
def define_routes(self):
## BEGIN Route definitions. ############################################
@self.application.route('/api/submit', methods=['POST'])
def submit_job():
# Parse the incoming JSON data.
data = json.loads(request.data)
# Fetch the required job arguments.
secret = data['secret']
job_name = data['job_name']
num_executors = data['num_executors']
num_processes = data['num_processes']
data_path = data['data_path']
trainer = unpickle_object(data['trainer'].decode('hex_codec'))
# Fetch the parameters for the job.
secrets = self.read_secrets()
with self.mutex:
if self.valid_secret(secret, secrets) and not self.secret_in_use(secret):
job = PunchcardJob(secret, job_name, data_path, num_executors, num_processes, trainer)
self.jobs[secret] = job
job.start()
return '', 200
return '', 403
@self.application.route('/api/state')
def job_state():
secret = request.args.get('secret')
job = self.get_submitted_job(secret)
# Check if the job exists.
if job is not None:
d = {}
d['job_name'] = job.get_job_name()
d['running'] = job.running()
return json.dumps(d), 200
return '', 404
@self.application.route('/api/cancel')
def cancel():
secret = request.args.get('secret')
job = self.get_submitted_job(secret)
if job is not None and job.running():
with self.mutex:
job.cancel()
del self.jobs[secret]
return '', 200
return '', 404
@self.application.route('/api/destroy')
def destroy_job():
secret = request.args.get('secret')
job = self.get_submitted_job(secret)
if job is not None and not job.running():
with self.mutex:
model = self.jobs[secret].get_trained_model()
history = self.jobs[secret].get_history()
model = pickle_object(serialize_keras_model(model)).encode('hex_codec')
history = pickle_object(history).encode('hex_codec')
d = {}
d['model'] = model
d['history'] = history
del self.jobs[secret]
return json.dumps(d), 200
return '', 400
## END Route definitions. ##############################################
def run(self):
self.define_routes()
self.application.run('0.0.0.0', self.port)
class PunchcardJob(object):
def __init__(self, secret, job_name, data_path, num_executors, num_processes, trainer):
self.secret = secret
self.job_name = job_name
self.data_path = data_path
self.num_executors = num_executors
self.num_processes = num_processes
self.trainer = trainer
self.is_running = True
self.thread = None
self.trained_model = None
self.history = None
def get_job_name(self):
return self.job_name
def get_secret(self):
return self.secret
def get_history(self):
return self.history
def get_trained_model(self):
return self.trained_model
def start(self):
self.trainer.determine_new_master()
self.thread = threading.Thread(target=self.run)
self.thread.setDaemon(True)
self.thread.start()
def cancel(self):
# Python threads cannot be killed from the outside; mark the job as
# stopped so the server no longer reports it as running.
self.is_running = False
def running(self):
return self.is_running
def join(self):
self.thread.join()
def run_job(self):
os.system("python ~/jobs/" + self.secret + ".py")
def clean_up(self):
home = expanduser("~")
os.remove(home + "/models/" + self.secret)
os.remove(home + "/histories/" + self.secret)
os.remove(home + "/trainers/" + self.secret)
def read_trained_model(self):
home = expanduser("~")
with open(home + "/models/" + self.secret, "r") as f:
self.trained_model = deserialize_keras_model(unpickle_object(f.read()))
def read_history(self):
home = expanduser("~")
with open(home + "/histories/" + self.secret, "r") as f:
self.history = unpickle_object(f.read())
def serialize_trainer(self):
trainer = pickle_object(self.trainer)
home = expanduser("~")
with open(home + "/trainers/" + self.secret, "w") as f:
f.write(trainer)
def generate_code(self):
source = """
from distkeras.evaluators import *
from distkeras.predictors import *
from distkeras.trainers import *
from distkeras.transformers import *
from distkeras.utils import *
from keras import *
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark import SQLContext
from os.path import expanduser
secret = '{secret}'
application_name = '{job_name}'
num_executors = {num_executors}
num_processes = {num_processes}
path_data = '{data_path}'
num_workers = num_processes * num_executors
# Allocate a Spark Context, and a Spark SQL context.
conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", "yarn-client")
conf.set("spark.executor.cores", num_processes)
conf.set("spark.executor.instances", num_executors)
conf.set("spark.executor.memory", "5g")
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Read the dataset from HDFS. For now we assume Parquet files.
dataset = sqlContext.read.parquet(path_data).repartition(num_workers)
# Deserialize the trainer object.
home = expanduser("~")
with open(home + "/trainers/" + secret, "r") as f:
trainer = unpickle_object(f.read())
# Train the model, and save it afterwards.
trained_model = trainer.train(dataset)
with open(home + "/models/" + secret, "w") as f:
f.write(pickle_object(serialize_keras_model(trained_model)))
# Save the history of the training process.
histories = trainer.get_history()
with open(home + "/histories/" + secret, "w") as f:
f.write(pickle_object(histories))
sc.stop()
""".format(
secret=self.secret,
job_name=self.job_name,
num_executors=self.num_executors,
num_processes=self.num_processes,
data_path=self.data_path
)
home = expanduser("~")
with open(home + "/jobs/" + self.secret + ".py", "w") as f:
f.write(source)
def run(self):
self.serialize_trainer()
self.generate_code()
self.run_job()
self.read_trained_model()
self.read_history()
self.clean_up()
self.is_running = False
class Job(object):
def __init__(self, secret, job_name, data_path, num_executors, num_processes, trainer):
self.secret = secret
self.job_name = job_name
self.num_executors = num_executors
self.num_processes = num_processes
self.data_path = data_path
self.trainer = trainer
self.trained_model = None
self.history = None
self.address = None
def set_num_executors(self, num_executors):
self.num_executors = num_executors
def set_num_processes(self, num_processes):
self.num_processes = num_processes
def get_trained_model(self):
return self.trained_model
def get_history(self):
return self.history
def is_finished(self):
address = self.address + '/api/state?secret=' + self.secret
request = urllib2.Request(address)
response = urllib2.urlopen(request)
data = json.load(response)
return not data['running']
def destroy_remote_job(self):
address = self.address + '/api/destroy?secret=' + self.secret
request = urllib2.Request(address)
response = urllib2.urlopen(request)
data = json.load(response)
model = unpickle_object(data['model'].decode('hex_codec'))
self.trained_model = deserialize_keras_model(model)
self.history = unpickle_object(data['history'].decode('hex_codec'))
def start(self):
self.thread = threading.Thread(target=self.run)
self.thread.start()
def wait_completion(self):
self.thread.join()
def cancel(self):
address = self.address + '/api/cancel?secret=' + self.secret
request = urllib2.Request(address)
urllib2.urlopen(request)
def send(self, address):
data = {}
data['secret'] = self.secret
data['job_name'] = self.job_name
data['num_executors'] = self.num_executors
data['num_processes'] = self.num_processes
data['data_path'] = self.data_path
data['trainer'] = pickle_object(self.trainer).encode('hex_codec')
request = urllib2.Request(address + "/api/submit")
request.add_header('Content-Type', 'application/json')
urllib2.urlopen(request, json.dumps(data))
self.address = address
self.start()
def run(self):
time.sleep(1)
while not self.is_finished():
time.sleep(10)
self.destroy_remote_job()
================================================
FILE: distkeras/networking.py
================================================
"""Networking utility functions."""
## BEGIN Imports. ##############################################################
import pickle
import socket
## END Imports. ################################################################
def determine_host_address():
"""Determines the human-readable host address of the local machine."""
host_address = socket.gethostbyname(socket.gethostname())
return host_address
def recvall(connection, num_bytes):
"""Reads `num_bytes` bytes from the specified connection.
# Arguments
connection: socket. Opened socket.
num_bytes: int. Number of bytes to read.
"""
byte_buffer = b''
buffer_size = 0
bytes_left = num_bytes
# Iterate until we received all data.
while buffer_size < num_bytes:
# Fetch the next frame from the network.
data = connection.recv(bytes_left)
if not data:
# The peer closed the connection before all bytes arrived.
raise IOError("connection closed prematurely")
# Compute the size of the frame.
delta = len(data)
buffer_size += delta
bytes_left -= delta
# Append the data to the buffer.
byte_buffer += data
return byte_buffer
def recv_data(connection):
"""Will fetch the next data frame from the connection.
The protocol for reading is structured as follows:
1. The first 20 bytes represents a string which holds the next number of bytes to read.
2. We convert the 20 byte string to an integer (e.g. '00000000000000000011' -> 11).
3. We read `num_bytes` from the socket (which is in our example 11).
4. Deserialize the retrieved string.
# Arguments
connection: socket. Opened socket.
"""
data = b''
# Fetch the serialized data length.
length = int(recvall(connection, 20).decode())
# Fetch the serialized data.
serialized_data = recvall(connection, length)
# Deserialize the data.
data = pickle.loads(serialized_data)
return data
def send_data(connection, data):
"""Sends the data to the other endpoint of the socket using our protocol.
The protocol for sending is structured as follows:
1. Serialize the data.
2. Obtain the buffer-size of the serialized data.
3. Serialize the buffer-size in 20 bytes (e.g. 11 -> '00000000000000000011').
4. Send the serialized buffer size.
5. Send the serialized data.
# Arguments
connection: socket. Opened socket.
data: any. Data to send.
"""
# Serialize the data.
serialized_data = pickle.dumps(data, -1)
length = len(serialized_data)
# Serialize the number of bytes in the data.
serialized_length = str(length).zfill(20)
# Send the data over the provided socket.
connection.sendall(serialized_length.encode())
connection.sendall(serialized_data)
def connect(host, port, disable_nagle=True):
fd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Check if Nagle's algorithm needs to be disabled.
if disable_nagle:
fd.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
else:
fd.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
# Connect to the specified URI.
fd.connect((host, port))
return fd
================================================
FILE: distkeras/parameter_servers.py
================================================
"""Parameter servers.
A parameter server is a process which will aggregate all the incoming gradient
or parameter updates of the workers and incorporate them into a single center variable.
This center variable will eventually be the produced model of the trainer.
"""
## BEGIN Imports. ##############################################################
import copy
import math
import numpy as np
import socket
import threading
from distkeras.networking import recv_data
from distkeras.networking import send_data
from distkeras.utils import deserialize_keras_model
## END Imports. ################################################################
class ParameterServer(object):
"""Abstract class which provides basic attributed and methods for all
parameter servers.
# Arguments
model: string. Serialized Keras model.
See: distkeras.utils.serialize_keras_model
"""
def __init__(self, model):
self.model = deserialize_keras_model(model)
self.num_updates = 1
def initialize(self):
"""Initializes the parameter server.
This method is called after self.start().
"""
raise NotImplementedError
def start(self):
"""Starts the parameter server in a new thread."""
raise NotImplementedError
def run(self):
"""Main event loop of the parameter server."""
raise NotImplementedError
def stop(self):
"""Notifies the parameter server thread to stop."""
raise NotImplementedError
def get_model(self):
"""Returns the Keras model which will be trained by the workers."""
return self.model
def next_update(self):
"""Increments the number of model updates by 1."""
self.num_updates += 1
def reset_update_counter(self):
"""Resets the model update counter."""
self.num_updates = 0
def get_num_updates(self):
"""Returns the number of model updates the parameter server has performed."""
return self.num_updates
class SocketParameterServer(ParameterServer):
"""Abstract class of a parameter server which is based on a socket implementation.
This means that this parameter server accepts multiple TCP connections from multiple
workers, and uses a costum protocol to transmit and receive the model parameters. This
is done by implementing a custom protocol. Which is fully described in the
distkeras.networking module.
# Arguments
model: string. Serialized Keras model.
See: distkeras.utils.serialize_keras_model
        port: int. Listening port number.
"""
def __init__(self, model, port=5000):
super(SocketParameterServer, self).__init__(model)
self.master_port = port
self.socket = None
self.running = False
self.connections = []
self.mutex = threading.Lock()
def initialize(self):
"""Sets up the listing port."""
# Reset the running flag.
self.running = True
# Prepare a socket.
file_descriptor = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm.
file_descriptor.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Check if the master port needs to be assigned by the OS.
if self.master_port is None:
file_descriptor.bind(('0.0.0.0', 0))
# Retrieve the port assigned by the OS.
self.master_port = int(file_descriptor.getsockname()[1])
else:
file_descriptor.bind(('0.0.0.0', self.master_port))
# Listen to the socket.
file_descriptor.listen(5)
# Assign the socket.
self.socket = file_descriptor
def handle_commit(self, conn, addr):
"""Handles parameter updates coming from the workers.
# Arguments:
conn: socket. The opened connection.
addr: addr. Address of the remote host.
"""
raise NotImplementedError
def handle_pull(self, conn, addr):
"""Handles parameter requests coming from the workers. This will
actually send the model parameters to the requesting host.
# Arguments:
conn: socket. The opened connection.
addr: addr. Address of the remote host.
"""
# Fetch the raw center variables.
with self.mutex:
center_variable = self.model.get_weights()
cv = copy.deepcopy(center_variable)
# Send the data over the socket.
send_data(conn, cv)
def cancel_accept(self):
"""This method will cancel the accept procedure. The method
is meant to be executed by the stop() procedure.
"""
file_descriptor = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
# Connect to the listening socket to cancel the accept.
file_descriptor.connect(("localhost", self.master_port))
file_descriptor.close()
except Exception as e:
print(e)
def handle_connection(self, conn, addr):
"""
A parameter server has two main functionalities. Nodes are able to
pull (p) the current state, or 'commit' a state. This is implemented
in the following functionality. Classes which implement these interfaces
should not worry about connection handling.
"""
try:
while self.running:
                # Fetch the current action.
                action = conn.recv(1).decode()
                # An empty read means the remote end closed the connection.
                if not action:
                    break
                # Check if the action is a commit (most of the cases).
                if action == 'c':
                    # Handle the commit.
                    self.handle_commit(conn, addr)
                elif action == 'p':
                    # Handle the pull.
                    self.handle_pull(conn, addr)
except Exception as e:
print(e)
def start(self):
"""Starts the parameter server."""
# Set the running flag.
self.running = True
def run(self):
"""Main event loop of the parameter server."""
# Listen for incoming connections.
while self.running:
try:
# Accept incoming connections.
conn, addr = self.socket.accept()
# Handle the connection.
thread = threading.Thread(target=self.handle_connection, args=(conn, addr))
thread.start()
                # Keep track of the handler thread.
self.connections.append(thread)
except Exception as e:
print(e)
def stop(self):
"""Stop the parameter server. This will also cleanup all existing connections."""
self.running = False
# Check if a socket is allocated.
if self.socket:
self.cleanup_connections()
self.finalize()
self.socket.close()
self.cancel_accept()
self.socket = None
self.connections = []
    def finalize(self):
        """Hook called when the parameter server stops.
        The default implementation does nothing; subclasses override it to
        write the final center variable back into the model.
        """
        pass
    def cleanup_connections(self):
        """Joins all connection handler threads."""
        # Wait for every handler thread to terminate.
        for thread in self.connections:
            thread.join()
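The single-byte action dispatch in `handle_connection` pairs with a worker-side client that writes `'c'` or `'p'` before each payload. A minimal sketch of that exchange, using a length-prefixed pickle framing as an assumed stand-in for `distkeras.networking.send_data`/`recv_data` (the real helpers may frame their messages differently), and a `socketpair` standing in for the worker's TCP connection:

```python
import pickle
import socket
import struct

# Assumed stand-ins for distkeras.networking.send_data / recv_data:
# a 4-byte big-endian length prefix followed by a pickled payload.
def send_data(conn, obj):
    payload = pickle.dumps(obj)
    conn.sendall(struct.pack('>I', len(payload)) + payload)

def recv_data(conn):
    length = struct.unpack('>I', conn.recv(4))[0]
    buf = b''
    while len(buf) < length:
        buf += conn.recv(length - len(buf))
    return pickle.loads(buf)

# Worker side of the protocol: the dispatch byte 'c' announces a commit.
def commit(conn, delta):
    conn.sendall(b'c')
    send_data(conn, {'delta': delta})

# A socketpair stands in for the worker's TCP connection.
server_side, worker_side = socket.socketpair()
commit(worker_side, [1.0, 2.0])
action = server_side.recv(1).decode()   # the byte read by handle_connection
data = recv_data(server_side)           # the payload read by handle_commit
worker_side.close()
server_side.close()
```

A pull works the same way with `b'p'`, except the data then flows from server to worker.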
class DeltaParameterServer(SocketParameterServer):
"""A parameter server which integrates all incoming deltas into the model.
# Arguments
model: string. Serialized Keras model.
See: distkeras.utils.serialize_keras_model
master_port: int. Port number of the parameter server.
"""
def __init__(self, model, master_port):
super(DeltaParameterServer, self).__init__(model, master_port)
self.center_variable = np.asarray(self.model.get_weights())
def handle_commit(self, conn, addr):
# Receive the parameters from the remote node.
data = recv_data(conn)
# Extract the delta from the dictionary.
delta = data['delta']
# Update the center variable with the delta.
with self.mutex:
self.center_variable = self.center_variable + delta
# Next iteration.
self.next_update()
def handle_pull(self, conn, addr):
"""Handles parameter requests coming from the workers. This will
actually send the model parameters to the requesting host.
# Arguments:
conn: socket. The opened connection.
addr: addr. Address of the remote host.
"""
# Fetch the raw center variables.
with self.mutex:
cv = copy.deepcopy(self.center_variable)
# Send the data over the socket.
send_data(conn, cv)
def finalize(self):
# Set the final weights of the model.
self.model.set_weights(self.center_variable)
class ADAGParameterServer(SocketParameterServer):
"""A parameter server which integrates the incoming gradient residuals into
the model, and integrates them using the ADAG scheme.
# Arguments
model: string. Keras model.
See: distkeras.utils.serialize_keras_model
master_port: int. Port number of the parameter server.
"""
def __init__(self, model, master_port):
super(ADAGParameterServer, self).__init__(model, master_port)
self.center_variable = np.asarray(self.model.get_weights())
def handle_commit(self, conn, addr):
# Receive the parameters from the remote node.
data = recv_data(conn)
# Extract the data from the dictionary.
r = data['residual']
with self.mutex:
# Update the center variable.
self.center_variable = self.center_variable + r
# Increment the number of parameter server updates.
self.next_update()
def handle_pull(self, conn, addr):
"""Handles parameter requests coming from the workers. This will
actually send the model parameters to the requesting host.
# Arguments:
conn: socket. The opened connection.
addr: addr. Address of the remote host.
"""
# Fetch the raw center variables.
with self.mutex:
cv = copy.deepcopy(self.center_variable)
# Send the data over the socket.
send_data(conn, cv)
def finalize(self):
# Set the weights of the model.
self.model.set_weights(self.center_variable)
class DynSGDParameterServer(SocketParameterServer):
"""DynSGD parameter server, keeps track of the staleness between updates
to maintain dynamic worker learning rates based on staleness.
# Arguments
model: string. Keras model
See: distkeras.utils.serialize_keras_model
master_port: int. Port number of the parameter server.
"""
def __init__(self, model, master_port):
super(DynSGDParameterServer, self).__init__(model, master_port)
def handle_pull(self, conn, addr):
"""Handles parameter requests coming from the workers. This will
actually send the model parameters to the requesting host.
This is a specific implementation for DynSGD.
# Arguments:
conn: socket. The opened connection.
addr: addr. Address of the remote host.
"""
# Allocate a new dictionary.
data = {}
# Fetch the raw center variables.
with self.mutex:
center_variable = self.model.get_weights()
cv = copy.deepcopy(center_variable)
# Store the number of updates (u) the PS executed.
data['update'] = self.num_updates
# Store the model (m).
data['model'] = cv
# Send the data over the socket.
send_data(conn, data)
def handle_commit(self, conn, addr):
data = recv_data(conn)
r = data['residual']
# Fetch the last iteration number
last_update = data['last_update']
du = (self.num_updates - last_update) + 1
r /= du
with self.mutex:
center_variable = self.model.get_weights()
center_variable = center_variable + r
self.model.set_weights(center_variable)
# Increment the number of parameter server updates.
self.next_update()
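The staleness rule in `DynSGDParameterServer.handle_commit` divides each residual by `(num_updates - last_update) + 1`, so a worker that committed against an up-to-date model contributes at full strength while stale workers are damped. The arithmetic in isolation:

```python
import numpy as np

def dynsgd_scale(residual, num_updates, last_update):
    """Scales a worker's residual by its staleness, as in
    DynSGDParameterServer.handle_commit."""
    # A worker that pulled at update t and commits at update T has
    # staleness (T - t) + 1; fresher updates are weighted more heavily.
    staleness = (num_updates - last_update) + 1
    return residual / staleness

fresh = dynsgd_scale(np.array([0.4, -0.2]), num_updates=10, last_update=10)  # staleness 1
stale = dynsgd_scale(np.array([0.4, -0.2]), num_updates=10, last_update=6)   # staleness 5
```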
class ExperimentalParameterServer(SocketParameterServer):
"""A parameter server which integrates the incoming gradient residuals into
the model, and integrates them using the ADAG scheme.
# Arguments
model: string. Keras model.
See: distkeras.utils.serialize_keras_model
master_port: int. Port number of the parameter server.
"""
def __init__(self, model, master_port, learning_rate):
super(ExperimentalParameterServer, self).__init__(model, master_port)
self.center_variable = np.asarray(self.model.get_weights())
self.inverse_learning_rate = 1.0 / learning_rate
def handle_commit(self, conn, addr):
# Receive the parameters from the remote node.
data = recv_data(conn)
# Extract the data from the dictionary.
r = data['residual']
worker_id = data['worker_id']
stale_cv = data['stale_center_variable']
with self.mutex:
diff_cv = np.subtract(self.center_variable, stale_cv)
d = 1 / (self.inverse_learning_rate * np.power(diff_cv, 2) + 1)
r = np.multiply(d, r)
# Update the center variable.
self.center_variable = self.center_variable + r
# Increment the number of parameter server updates.
self.next_update()
def handle_pull(self, conn, addr):
"""Handles parameter requests coming from the workers. This will
actually send the model parameters to the requesting host.
# Arguments:
conn: socket. The opened connection.
addr: addr. Address of the remote host.
"""
# Fetch the raw center variables.
with self.mutex:
cv = copy.deepcopy(self.center_variable)
# Send the data over the socket.
send_data(conn, cv)
def finalize(self):
# Set the weights of the model.
self.model.set_weights(self.center_variable)
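The damping in `ExperimentalParameterServer.handle_commit` is per-component: `d = 1 / (eta^-1 * (cv - stale_cv)^2 + 1)`, so components where the worker's view has drifted far from the center variable are suppressed. A numpy sketch of that factor:

```python
import numpy as np

def experimental_damping(center, stale, residual, learning_rate):
    """Per-component damping as in ExperimentalParameterServer.handle_commit:
    components where the stale copy drifted far from the center variable
    contribute less to the update."""
    diff = center - stale
    d = 1.0 / ((1.0 / learning_rate) * np.power(diff, 2) + 1.0)
    return np.multiply(d, residual)

center = np.array([1.0, 1.0])
stale = np.array([1.0, 0.0])   # second component is 1.0 out of date
r = experimental_damping(center, stale, np.array([0.5, 0.5]), learning_rate=1.0)
```

With a drift of 0 the factor is 1 (no damping); with a drift of 1 and learning rate 1 it is 1/2.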
================================================
FILE: distkeras/predictors.py
================================================
"""Predictors take a model and will transform the Dataframe by adding a prediction column."""
## BEGIN Imports. ##############################################################
import numpy as np
from pyspark.mllib.linalg import DenseVector
from distkeras.utils import serialize_keras_model
from distkeras.utils import deserialize_keras_model
from distkeras.utils import new_dataframe_row
## END Imports. ################################################################
class Predictor(object):
"""Abstract predictor class.
# Arguments
keras_model: Keras Model.
"""
def __init__(self, keras_model):
self.model = serialize_keras_model(keras_model)
def predict(self, dataframe):
"""Transforms the dataframe to add a prediction.
# Arguments
dataframe: dataframe. Spark Dataframe.
"""
raise NotImplementedError
class ModelPredictor(Predictor):
"""Takes a Keras model and adds a prediction column to the dataframe
given a features column.
# Arguments
keras_model: Keras model.
features_col: string. Name of the features column.
output_col: string. Name of the prediction column.
"""
def __init__(self, keras_model, features_col="features", output_col="prediction"):
super(ModelPredictor, self).__init__(keras_model)
assert isinstance(features_col, (str, list)), "'features_col' must be a string or a list of strings"
self.features_column = [features_col] if isinstance(features_col, str) else features_col
self.output_column = output_col
def _predict(self, iterator):
"""Lambda method which will append a prediction column to the provided rows.
# Arguments:
iterator: iterator. Spark Row iterator.
"""
model = deserialize_keras_model(self.model)
for row in iterator:
features = [np.asarray([row[c]]) for c in self.features_column]
prediction = model.predict(features)
dense_prediction = DenseVector(prediction[0])
new_row = new_dataframe_row(row, self.output_column, dense_prediction)
yield new_row
def predict(self, dataframe):
"""Returns a dataframe which is the old dataframe with an additional
prediction column.
"""
return dataframe.rdd.mapPartitions(self._predict).toDF()
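`ModelPredictor._predict` follows the standard `mapPartitions` pattern: deserialize the model once per partition, then stream rows through it lazily. A Spark-free sketch of the same pattern, with a hypothetical `StubModel` standing in for the deserialized Keras model and plain dicts standing in for Spark Rows:

```python
import numpy as np

class StubModel(object):
    """Hypothetical stand-in for a deserialized Keras model."""
    def predict(self, features):
        # One "prediction" per input: the feature sum.
        return [np.sum(f) for f in features]

def predict_partition(model, rows, features_col="features", output_col="prediction"):
    # The model is built once per partition; rows are streamed lazily.
    for row in rows:
        features = [np.asarray([row[c]]) for c in [features_col]]
        prediction = model.predict(features)
        new_row = dict(row)
        new_row[output_col] = prediction[0]
        yield new_row

rows = [{"features": [1.0, 2.0]}, {"features": [3.0, 4.0]}]
out = list(predict_partition(StubModel(), rows))
```

The real predictor additionally wraps the prediction in a `DenseVector` and builds a new Spark Row via `new_dataframe_row`.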
================================================
FILE: distkeras/schemes.py
================================================
"""Schemes module.
Module with schemes to automate a distributed learning process. These schemes will automatically
adjust the hyperparameters to improve training performance.
"""
## BEGIN Imports. ##############################################################
import math
## END Imports. ################################################################
class Scheme(object):
    """A 'Scheme' is a way to describe how a distributed optimization sequence
    should perform. For example, it is responsible for adjusting the learning
    rate of the parameter server if it notices that the loss doesn't decay.
However, this is only one of the possible solutions. Others include the
optimization of other hyperparameters such as the number of workers.
# Arguments
optimizer: trainer. A distributed optimizer.
        num_epoch: int. Total number of epochs.
evaluation_frequency: int. Frequency of hyperparameter evaluation.
"""
def __init__(self, optimizer, num_epoch=15, evaluation_frequency=5):
self.optimizer = optimizer
self.num_epoch = num_epoch
self.evaluation_frequency = evaluation_frequency
self.epoch_over_eval_frequency = int(self.num_epoch / self.evaluation_frequency)
self.initialize()
def initialize(self):
"""Initializes the hyperparameters to follow the scheme parameters."""
self.optimizer.set_num_epoch(self.get_epoch_over_evaluation_frequency())
def get_epoch_over_evaluation_frequency(self):
"""Returns the number of epochs per evaluation frequency."""
return self.epoch_over_eval_frequency
def optimize(self, training_set, validation_set):
raise NotImplementedError
class Emperor(Scheme):
    """The 'Emperor' optimization scheme makes hyperparameter changes based
    on the loss derivatives of the validation set.
# Arguments
optimizer: trainer. A distributed optimizer.
evaluate_loss: function. Function which evaluates the loss. This
function should accept a model, and a dataframe.
        num_epoch: int. Total number of epochs.
evaluation_frequency: int. Frequency of hyperparameter evaluation.
"""
def __init__(self, optimizer, evaluate_loss, num_epoch=15, evaluation_frequency=5,
loss_threshold=0.005):
super(Emperor, self).__init__(optimizer, num_epoch, evaluation_frequency)
self.previous_loss = float('inf')
self.loss_threshold = loss_threshold
self.evaluate_loss = evaluate_loss
def optimize(self, training_set, validation_set):
trained_model = None
# Fetch the number of evaluations, to match the number of epochs.
num_evaluations = self.get_epoch_over_evaluation_frequency() + 1
# Iterate over the number of evaluation epochs.
for i in range(0, num_evaluations):
# Train the model.
trained_model = self.optimizer.train(training_set)
self.optimizer.set_model(trained_model)
            # Evaluate the validation set, and fetch the loss.
loss = self.evaluate_loss(trained_model, validation_set)
print("Current loss: " + str(loss))
dl = math.fabs(loss - self.previous_loss)
self.previous_loss = loss
if dl <= self.loss_threshold:
print("Lowering learning rate.")
print("Old learning rate: " + str(self.optimizer.get_learning_rate()))
# Modify the learning rate.
learning_rate = self.optimizer.get_learning_rate()
learning_rate /= 10
self.optimizer.set_learning_rate(learning_rate)
print("New learning rate: "+ str(self.optimizer.get_learning_rate()))
return trained_model
================================================
FILE: distkeras/trainers.py
================================================
"""Model optimizers. Depending on the implementation, these classes will optimize the
Keras model in a distributed manner (with exception of the SingleTrainer)."""
## BEGIN Imports. ##############################################################
import numpy as np
import threading
import time
from distkeras.parameter_servers import ADAGParameterServer
from distkeras.parameter_servers import DeltaParameterServer
from distkeras.parameter_servers import DynSGDParameterServer
from distkeras.parameter_servers import ExperimentalParameterServer
from distkeras.utils import deserialize_keras_model
from distkeras.utils import history_executor
from distkeras.utils import history_executors_average
from distkeras.utils import pickle_object
from distkeras.utils import serialize_keras_model
from distkeras.utils import set_keras_base_directory
from distkeras.utils import unpickle_object
from distkeras.networking import determine_host_address
from distkeras.workers import ADAGWorker
from distkeras.workers import AEASGDWorker
from distkeras.workers import DOWNPOURWorker
from distkeras.workers import DynSGDWorker
from distkeras.workers import ExperimentalWorker
from distkeras.workers import EAMSGDWorker
from distkeras.workers import SequentialWorker
from keras import backend as K
## END Imports. ################################################################
class Trainer(object):
"""Abstract trainer class. This class provides all base functionality which
all optimizers need to implement.
# Arguments
keras_model: Keras model.
loss: string. String representing the loss.
See: https://keras.io/objectives/
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, loss, worker_optimizer, metrics=["accuracy"], loss_weights=None):
set_keras_base_directory()
self.master_model = serialize_keras_model(keras_model)
self.loss = loss
self.loss_weights = loss_weights
self.worker_optimizer = worker_optimizer
self.metrics = metrics
self.history = []
self.training_time_start = 0
self.training_time_end = 0
self.training_time = 0
self.max_mini_batches_prefetch = 100
def set_max_prefetch(self, max_mini_batches):
"""Sets the maximum amount of mini-batches that can be prefetched by a worker."""
self.max_mini_batches_prefetch = max_mini_batches
def set_model(self, model):
"""Sets the master model to be used by the trainer."""
self.master_model = serialize_keras_model(model)
def record_training_start(self):
"""Records the start of the training.
This private function is called when the training process starts.
"""
self.training_time = 0
self.training_time_start = time.time()
def record_training_end(self):
"""Records the end of the traing.
This private function is called when the training process is terminated.
"""
self.training_time_end = time.time()
self.training_time = self.training_time_end - self.training_time_start
def get_training_time(self):
"""Returns the told training time."""
return self.training_time
def get_history(self):
"""Returns all history object aggregated during training."""
return self.history
def get_averaged_history(self):
"""Returns the averaged history of the center variable."""
return history_executors_average(self.history)
def get_executor_history(self, executor_id):
"""Returns the history of a specific executor."""
return history_executor(self.history, executor_id)
def train(self, dataframe, shuffle=False):
"""Trains the specified model using the specified dataframe.
# Arguments
dataframe: dataframe. A Spark Dataframe containing the training data.
shuffle: boolean. Tells to shuffle the dataframe before training.
Warning: this will tell Spark to shuffle all partitions over
the network. It is recommended to shuffle the dataframe before
training and store it.
"""
raise NotImplementedError
def serialize(self):
return pickle_object(self)
class SingleTrainer(Trainer):
"""An optimizer which will train a network on a single machine.
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col="features",
label_col="label", num_epoch=1, batch_size=32, loss_weights=None):
super(SingleTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)
self.features_column = features_col
self.label_column = label_col
self.num_epoch = num_epoch
self.batch_size = batch_size
def allocate_worker(self):
"""Allocates a worker for the Single Trainer instance.
Only for internal use.
"""
        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,
                                  label_col=self.label_column, batch_size=self.batch_size,
                                  num_epoch=self.num_epoch, optimizer=self.worker_optimizer,
                                  loss=self.loss, loss_weights=self.loss_weights, metrics=self.metrics)
return worker
def train(self, dataframe, shuffle=False):
"""See distkeras.trainers.Trainer.train
# Arguments
dataframe: dataframe. A Spark Dataframe containing the training data.
shuffle: boolean. Tells to shuffle the dataframe before training.
Warning: this will tell Spark to shuffle all partitions over
the network. It is recommended to shuffle the dataframe before
training and store it.
"""
        # Check if the data needs to be shuffled.
        if shuffle:
            # The boolean argument shadows the name 'shuffle', so import the
            # shuffling utility under an alias.
            from distkeras.utils import shuffle as shuffle_dataframe
            dataframe = shuffle_dataframe(dataframe)
# Collect the dataframe on a single worker node.
dataframe = dataframe.coalesce(1)
# Cache the dataframe.
dataframe.cache()
# Allocate a worker.
worker = self.allocate_worker()
# Set the maximum number of mini-batches.
worker.set_max_prefetch(self.max_mini_batches_prefetch)
# Start recording training time.
self.record_training_start()
# Fetch the trained model.
self.master_model = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()[0]
# Stop recording of training time.
self.record_training_end()
return deserialize_keras_model(self.master_model)
class AveragingTrainer(Trainer):
"""A trainer which implements a data parallel technique using model averaging.
In this implementation, the model replicas are averages after every epoch.
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of model replicas to train in parallel.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col="features",
label_col="label", num_epoch=1, batch_size=32, num_workers=2, loss_weights=None):
super(AveragingTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)
self.features_column = features_col
self.label_column = label_col
self.num_epoch = num_epoch
self.batch_size = batch_size
self.num_workers = num_workers
self.parameter_buffer = np.asarray(keras_model.get_weights())
self.parameter_buffer.fill(0.0)
    def average_models(self, models):
        """Averages the specified list of Keras models, and assigns the
        averaged model as the master model.
        # Arguments:
            models: list. A list of serialized Keras models.
        """
        num_models = len(models)
        # Reset the accumulation buffer; otherwise the previous epoch's
        # average would leak into the new sum.
        self.parameter_buffer.fill(0.0)
        # Sum the weights of all models.
        for i in range(0, num_models):
            weights = np.asarray(deserialize_keras_model(models[i]).get_weights())
            self.parameter_buffer += weights
        # Average the parameters.
        self.parameter_buffer /= num_models
        temp_model = deserialize_keras_model(self.master_model)
        temp_model.set_weights(self.parameter_buffer)
        self.master_model = serialize_keras_model(temp_model)
def allocate_worker(self):
"""Allocates the AveragingWorker for internal use."""
        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,
                                  label_col=self.label_column, batch_size=self.batch_size, num_epoch=1,
                                  optimizer=self.worker_optimizer, loss=self.loss,
                                  loss_weights=self.loss_weights, metrics=self.metrics)
return worker
def train(self, dataframe, shuffle=False):
"""Applies model averaging to the model replicas distributed over the specified
number of Spark executors.
# Arguments
        dataframe: dataframe. A Spark Dataframe containing the training data.
shuffle: boolean. Tells to shuffle the dataframe before training.
Warning: this will tell Spark to shuffle all partitions over
the network. It is recommended to shuffle the dataframe before
training and store it.
"""
# Repartition the data in order to fit the number of workers.
num_partitions = dataframe.rdd.getNumPartitions()
        # Check if the dataframe needs to be shuffled.
        if shuffle:
            # The boolean argument shadows the name 'shuffle', so import the
            # shuffling utility under an alias.
            from distkeras.utils import shuffle as shuffle_dataframe
            dataframe = shuffle_dataframe(dataframe)
# Check if we need to repartition the dataframe.
if num_partitions >= self.num_workers:
dataframe = dataframe.coalesce(self.num_workers)
else:
dataframe = dataframe.repartition(self.num_workers)
# Start the training procedure.
self.record_training_start()
for i in range(0, self.num_epoch):
worker = self.allocate_worker()
# Set the maximum number of mini-batches.
worker.set_max_prefetch(self.max_mini_batches_prefetch)
models = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()
self.average_models(models)
# End the training procedure.
self.record_training_end()
return deserialize_keras_model(self.master_model)
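The averaging step itself reduces to summing each replica's layer weights and dividing by the replica count. In numpy terms, assuming replica weights are given as lists of per-layer arrays (as `model.get_weights()` returns them):

```python
import numpy as np

def average_weights(replicas):
    """Averages the layer weights of several model replicas, mirroring
    what AveragingTrainer.average_models does after each epoch."""
    num_models = len(replicas)
    # Accumulate layer by layer, starting from zeros shaped like the first replica.
    averaged = [np.zeros_like(layer) for layer in replicas[0]]
    for weights in replicas:
        for i, layer in enumerate(weights):
            averaged[i] += layer
    return [layer / num_models for layer in averaged]

replica_a = [np.array([1.0, 3.0]), np.array([[2.0]])]
replica_b = [np.array([3.0, 5.0]), np.array([[4.0]])]
avg = average_weights([replica_a, replica_b])
```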
class EnsembleTrainer(Trainer):
"""Utility trainer which will train ensemble methods in parallel.
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
batch_size: int. Mini-batch size.
num_ensembles: int. Number of ensembles to train.
loss_weights: optional list or dict specifying weights for different losses.
# Note
        This will not employ a data-parallel approach for the ensembles.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col="features",
label_col="label", batch_size=32, num_ensembles=2, loss_weights=None):
super(EnsembleTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)
self.features_column = features_col
self.label_column = label_col
self.batch_size = batch_size
self.num_ensembles = num_ensembles
    def allocate_worker(self):
        """Allocates the EnsembleWorker for internal use."""
        # EnsembleTrainer does not expose a num_epoch argument; every ensemble
        # member is trained for a single epoch over its partition.
        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,
                                  label_col=self.label_column, batch_size=self.batch_size, num_epoch=1,
                                  optimizer=self.worker_optimizer, loss=self.loss,
                                  loss_weights=self.loss_weights, metrics=self.metrics)
        return worker
def train(self, dataframe, shuffle=False):
"""Trains the specified number of ensemble models using the specified dataframe.
# Arguments
dataframe: dataframe. A Spark Dataframe containing the training data.
shuffle: boolean. Tells to shuffle the dataframe before training.
Warning: this will tell Spark to shuffle all partitions over
the network. It is recommended to shuffle the dataframe before
training and store it.
"""
# Allocate a worker.
worker = self.allocate_worker()
# Set the maximum number of mini-batches.
worker.set_max_prefetch(self.max_mini_batches_prefetch)
# Repartition in order to fit the number of workers.
num_partitions = dataframe.rdd.getNumPartitions()
        # Check if the dataframe needs to be shuffled before training.
        if shuffle:
            # The boolean argument shadows the name 'shuffle', so import the
            # shuffling utility under an alias.
            from distkeras.utils import shuffle as shuffle_dataframe
            dataframe = shuffle_dataframe(dataframe)
        # Check if we need to repartition the dataframe: one partition per ensemble member.
        if num_partitions >= self.num_ensembles:
            dataframe = dataframe.coalesce(self.num_ensembles)
        else:
            dataframe = dataframe.repartition(self.num_ensembles)
# Start the training procedure.
self.record_training_start()
# Train the models in parallel.
models = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()
# End the training procedure.
self.record_training_end()
return models
class DistributedTrainer(Trainer):
"""Abstract class which describes the properties of a distributed optimizer.
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, master_port=5000, loss_weights=None):
super(DistributedTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)
self.num_workers = num_workers
self.batch_size = batch_size
self.features_column = features_col
self.label_column = label_col
self.num_epoch = num_epoch
self.parameter_server = None
self.parameter_server_thread = None
self.master_host = determine_host_address()
self.master_port = master_port
self.learning_rate = 1.0
def set_minibatch_size(self, size):
"""Sets the size of the mini-batch."""
self.batch_size = size
def get_minibatch_size(self):
"""Returns the size of the mini-batch."""
return self.batch_size
def get_features_column(self):
"""Returns the name of the features column."""
return self.features_column
def get_label_column(self):
"""Returns the name of the label column."""
return self.label_column
def get_learning_rate(self):
"""Returns the learning rate of the worker which can be tuned by
the parameter server, or optimization scheme.
Note: this learning rate is independent of the learning rate of the optimizer.
"""
return self.learning_rate
def set_learning_rate(self, learning_rate):
"""Sets the learning rate which can be tuned by the parameter server,
or optimization scheme.
Note: this learning rate is independent of the learning rate of the optimizer.
"""
self.learning_rate = learning_rate
def set_num_epoch(self, num_epoch):
"""Sets the number of epochs."""
self.num_epoch = num_epoch
def get_num_epoch(self):
"""Returns the number of epochs."""
return self.num_epoch
def allocate_worker(self):
"""Allocates the worker implementation.
Implement this method in subclasses.
"""
raise NotImplementedError
def set_master(self, master):
"""Sets the master address of the parameter server."""
self.master_host = master
def determine_new_master(self):
"""Sets the new master address to the current host."""
self.master_host = determine_host_address()
def allocate_parameter_server(self):
"""Allocates the parameter server.
If an other type of parameter server is required, you can overwrite
this implementation.
"""
parameter_server = DeltaParameterServer(self.master_model, self.master_port)
return parameter_server
def set_num_workers(self, num_workers):
"""Sets the number of parallel workers to use."""
self.num_workers = num_workers
def get_num_workers(self):
"""Returns the number of parallel workers."""
return self.num_workers
def num_updates(self):
"""Returns the number of model updates the parameter server performed."""
return self.parameter_server.num_updates()
def service(self):
"""Executes the parameter server service."""
self.parameter_server.start()
self.parameter_server.initialize()
self.parameter_server.run()
def stop_service(self):
"""Stops the parameter server service."""
self.parameter_server.stop()
self.parameter_server_thread.join()
self.parameter_server_thread = None
def start_service(self):
"""Starts the parameter server service."""
# Check if a parameter server thread is already allocated.
        if self.parameter_server_thread is not None:
# Stop the parameter server service.
self.stop_service()
# Allocate a new parameter service thread.
self.parameter_server_thread = threading.Thread(target=self.service)
self.parameter_server_thread.start()
def train(self, dataframe, shuffle=False):
"""Training procedure of a distributed optimization process.
# Arguments
dataframe: dataframe. A Spark Dataframe containing the training data.
shuffle: boolean. Tells to shuffle the dataframe before training.
Warning: this will tell Spark to shuffle all partitions over
the network. It is recommended to shuffle the dataframe before
training and store it.
"""
# Check if a parameter server has been allocated.
if self.parameter_server is not None:
# Cleanup the old parameter server.
self.parameter_server.stop()
self.parameter_server = None
# Allocate the parameter server.
self.parameter_server = self.allocate_parameter_server()
# Start the communication service.
self.start_service()
# Allocate a worker.
worker = self.allocate_worker()
# Set the maximum number of mini-batches.
worker.set_max_prefetch(self.max_mini_batches_prefetch)
# Repartition in order to fit the number of workers.
num_partitions = dataframe.rdd.getNumPartitions()
        # Check if the dataframe needs to be shuffled before training.
        if shuffle:
            # The `shuffle` flag shadows distkeras.utils.shuffle, so bind the utility under an alias.
            from distkeras.utils import shuffle as shuffle_dataframe
            dataframe = shuffle_dataframe(dataframe)
# Check if we need to repartition the dataframe.
if num_partitions >= self.num_workers:
dataframe = dataframe.coalesce(self.num_workers)
else:
dataframe = dataframe.repartition(self.num_workers)
# Cache the dataframe.
dataframe.cache()
# Start the training procedure.
self.record_training_start()
        # Train the model on every partition; the epochs are iterated inside the worker.
        self.history = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()
# End the training procedure.
self.record_training_end()
# Stop the communication service.
self.stop_service()
return self.parameter_server.get_model()
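The coalesce-versus-repartition branch above is a deliberate design choice: `coalesce` only merges existing partitions and avoids a full shuffle, while `repartition` redistributes all rows over the network. The same decision rule, sketched as a hypothetical standalone helper (no Spark required):

```python
def choose_partitioning(num_partitions, num_workers):
    """Mirror the trainer's branch: shrink with coalesce (no shuffle),
    grow with repartition (full shuffle over the network)."""
    if num_partitions >= num_workers:
        return "coalesce"
    return "repartition"

# Shrinking 8 partitions down to 4 workers avoids a shuffle.
shrink = choose_partitioning(8, 4)
# Growing 2 partitions up to 4 workers requires one.
grow = choose_partitioning(2, 4)
```

`choose_partitioning` is illustrative only; the trainer inlines this logic directly.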
class AsynchronousDistributedTrainer(DistributedTrainer):
"""Abstract class for an asynchronous distributed trainer.
    This trainer also allows us to set a parallelism factor, which further parallelizes
    the Spark job. Imagine n machines optimizing a model in an asynchronous distributed
    setting. If some machines perform worse than others, the complete learning procedure
    can get stuck waiting on one slow machine, since every machine is assigned exactly
    one partition. To resolve this, we added a parallelism factor: the ratio of the
    number of jobs per machine (executor). For small dataframes we recommend setting
    this factor to 1. The effect is most prominent for large dataframes, where we
    recommend a ratio of 2 or 3.
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
# Note
By default, the parallelization factor is set to 1.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, master_port=5000, loss_weights=None):
super(AsynchronousDistributedTrainer, self).__init__(keras_model, worker_optimizer, loss, metrics,
num_workers, batch_size, features_col,
label_col, num_epoch, master_port, loss_weights)
# Initialize asynchronous methods variables.
self.parallelism_factor = 1
def allocate_worker(self):
"""Allocates the worker implementation.
Implement this method in subclasses.
"""
raise NotImplementedError
def set_parallelism_factor(self, factor):
"""Sets the parallelization factor.
# Arguments
factor: int. The new parallelization factor.
"""
self.parallelism_factor = factor
def get_parallelism_factor(self):
"""Returns the parallelization factor."""
return self.parallelism_factor
def train(self, dataframe, shuffle=False):
"""Training procedure of an asynchronous distributed optimization process.
# Arguments
dataframe: dataframe. A Spark Dataframe containing the training data.
shuffle: boolean. Tells to shuffle the dataframe before training.
Warning: this will tell Spark to shuffle all partitions over
the network. It is recommended to shuffle the dataframe before
training and store it.
"""
# Check if a parameter server has been allocated.
if self.parameter_server is not None:
# Cleanup the old parameter server.
self.parameter_server.stop()
self.parameter_server = None
# Allocate the parameter server.
self.parameter_server = self.allocate_parameter_server()
# Start the communication service.
self.start_service()
# Allocate a worker.
worker = self.allocate_worker()
# Set the maximum number of mini-batches.
worker.set_max_prefetch(self.max_mini_batches_prefetch)
# Repartition in order to fit the number of workers.
num_partitions = dataframe.rdd.getNumPartitions()
        # Check if the dataframe needs to be shuffled before training.
        if shuffle:
            # The `shuffle` flag shadows distkeras.utils.shuffle, so bind the utility under an alias.
            from distkeras.utils import shuffle as shuffle_dataframe
            dataframe = shuffle_dataframe(dataframe)
# Indicate the parallelism (number of worker times parallelism factor).
parallelism = self.parallelism_factor * self.num_workers
# Check if we need to repartition the dataframe.
if num_partitions >= parallelism:
dataframe = dataframe.coalesce(parallelism)
else:
dataframe = dataframe.repartition(parallelism)
# Start the training procedure.
self.record_training_start()
        # Train the model on every partition; the epochs are iterated inside the worker.
        self.history = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()
# End the training procedure.
self.record_training_end()
# Stop the communication service.
self.stop_service()
return self.parameter_server.get_model()
class AEASGD(AsynchronousDistributedTrainer):
"""Asynchronous Elastic Averaging SGD optimizer.
Introduced by Zhang et al.
https://arxiv.org/pdf/1412.6651.pdf
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
communication_window: int. Staleness parameter.
This parameter describes the number of mini-batches that will be
computed before updating the center variable. For EASGD based
algorithms we recommend large communication windows.
learning_rate: float. Learning rate.
rho: float. Elastic "exploration" variable.
Higher values mean that the model is allowed to "explore" its surroundings.
Smaller values are correlated with less exploration. We use the value
            recommended by the authors.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=32,
rho=5.0, learning_rate=0.1, master_port=5000, loss_weights=None):
super(AEASGD, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)
self.communication_window = communication_window
self.rho = rho
self.learning_rate = learning_rate
def allocate_worker(self):
"""Allocates the asynchronous EASGD worker."""
        # Allocate an AEASGD worker.
worker = AEASGDWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,
self.features_column, self.label_column, self.batch_size, self.num_epoch,
self.master_host, self.master_port, self.rho, self.learning_rate,
self.communication_window)
return worker
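The elastic-averaging rule that `AEASGDWorker` implements can be illustrated with plain NumPy. This is a hedged sketch of the symmetric EASGD update from Zhang et al. (worker and center variable pulled toward each other with strength `learning_rate * rho`), not the exact worker code, which also applies the optimizer's gradient step and communicates over the network:

```python
import numpy as np

def easgd_step(worker, center, learning_rate, rho):
    """One elastic update: worker and center move toward each other."""
    alpha = learning_rate * rho
    diff = worker - center
    return worker - alpha * diff, center + alpha * diff

w = np.array([1.0, 2.0])  # a worker's local parameters
c = np.array([0.0, 0.0])  # the center variable on the parameter server
for _ in range(10):
    w, c = easgd_step(w, c, learning_rate=0.1, rho=0.5)
# The elastic force shrinks the worker-center distance every step,
# which is why large communication windows remain stable for EASGD.
```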
class DOWNPOUR(AsynchronousDistributedTrainer):
"""DOWNPOUR Optimizer.
Asynchronous data-parallel optimizer introduced by Dean et al.
http://static.googleusercontent.com/media/research.google.com/en/archive/large_deep_networks_nips2012.pdf
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
communication_window: int. Staleness parameter.
This parameter describes the number of mini-batches that will be
computed before updating the center variable. For DOWNPOUR we
recommend small communication windows.
learning_rate: float. Learning rate.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=5, master_port=5000, loss_weights=None):
super(DOWNPOUR, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)
self.communication_window = communication_window
def allocate_worker(self):
"""Allocates the DOWNPOUR worker."""
# Allocate DOWNPOUR worker.
worker = DOWNPOURWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,
self.features_column, self.label_column, self.batch_size, self.num_epoch,
self.master_host, self.master_port, self.communication_window)
return worker
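DOWNPOUR's communication window means a worker accumulates updates locally for `communication_window` mini-batches before pushing one aggregated delta to the parameter server. A hedged NumPy sketch of that accumulate-then-push cycle (the real worker sends the delta over the network instead of applying it in-process):

```python
import numpy as np

def downpour_cycle(center, gradients, communication_window, learning_rate):
    """Accumulate `communication_window` gradient steps locally,
    then push the aggregated delta to the center variable once."""
    accumulated = np.zeros_like(center)
    for g in gradients[:communication_window]:
        accumulated -= learning_rate * g
    return center + accumulated  # single push to the parameter server

center = np.zeros(3)
grads = [np.ones(3)] * 5
new_center = downpour_cycle(center, grads, communication_window=5, learning_rate=0.1)
```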
class EAMSGD(AsynchronousDistributedTrainer):
"""Asynchronous Elastic Averaging w/ Momentum SGD optimizer.
Introduced by Zhang et al.
https://arxiv.org/pdf/1412.6651.pdf
# Arguments
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See https://keras.io/optimizers/
loss: string. String representing the loss.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
features_col: string or list of strings. Name(s) of the features column(s).
label_col: string or list of strings. Name(s) of the label column(s).
num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
communication_window: int. Staleness parameter.
This parameter describes the number of mini-batches that will be
computed before updating the center variable. For EASGD based
algorithms we recommend large communication windows.
learning_rate: float. Learning rate.
rho: float. Elastic "exploration" variable.
Higher values mean that the model is allowed to "explore" its surroundings.
Smaller values are correlated with less exploration. We use the value
            recommended by the authors.
momentum: float. Momentum term.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=32,
rho=5.0, learning_rate=0.1, momentum=0.9, master_port=5000, loss_weights=None):
super(EAMSGD, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)
self.communication_window = communication_window
self.rho = rho
self.learning_rate = learning_rate
self.momentum = momentum
def allocate_worker(self):
"""Allocates the asynchronous EAMSGD worker."""
        # Allocate an EAMSGD worker.
worker = EAMSGDWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,
self.features_column, self.label_column, self.batch_size, self.num_epoch,
self.master_host, self.master_port, self.rho, self.learning_rate,
self.momentum, self.communication_window)
return worker
class ADAG(AsynchronousDistributedTrainer):
"""Asynchronous Distributed Adaptive Gradient (Stochastic Gradient Descent).
Introduced by Hermans et al.
# Arguments:
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See: https://keras.io/optimizers/
loss: string. String representing the loss function.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
communication_window: int. Staleness parameter.
This parameter describes the number of mini-batches that will be
computed before updating the center variable. For DOWNPOUR based
algorithms we recommend large communication windows.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=12, master_port=5000, loss_weights=None):
# Initialize the parent object.
super(ADAG, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)
# Set algorithm parameters.
self.communication_window = communication_window
def allocate_worker(self):
"""Allocate an Adag worker."""
worker = ADAGWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,
self.features_column, self.label_column, self.batch_size, self.num_epoch,
self.master_host, self.master_port, self.communication_window)
return worker
def allocate_parameter_server(self):
"""Allocate the Adag parameter server."""
parameter_server = ADAGParameterServer(self.master_model, self.master_port)
return parameter_server
class DynSGD(AsynchronousDistributedTrainer):
    """Dynamic SGD, which dynamically maintains a learning rate for every worker
    and incorporates staleness.
    Introduced in SIGMOD 2017, "Heterogeneity-aware Parameter Servers".
http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf
# Arguments:
keras_model: model. Keras model to train.
worker_optimizer: string. String representing worker optimizer.
See: https://keras.io/optimizers/
loss: string. String representing the loss function.
See: https://keras.io/objectives/
metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
batch_size: int. Mini-batch size.
num_workers: int. Number of distributed workers.
communication_window: int. Staleness parameter.
This parameter describes the number of mini-batches that will be
computed before updating the center variable. For DOWNPOUR based
algorithms we recommend large communication windows.
master_port: int. port number for the parameter server.
loss_weights: optional list or dict specifying weights for different losses.
"""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=5, master_port=5000, loss_weights=None):
# Initialize the parent object.
super(DynSGD, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)
# Set algorithm parameters.
self.communication_window = communication_window
def allocate_worker(self):
"""Allocate DYNSGD worker."""
worker = DynSGDWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,
self.features_column, self.label_column, self.batch_size, self.num_epoch,
self.master_host, self.master_port, self.communication_window)
return worker
def allocate_parameter_server(self):
"""Allocate DYNSGD parameter server."""
parameter_server = DynSGDParameterServer(self.master_model, self.master_port)
return parameter_server
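DynSGD's key idea is that the parameter server scales each incoming update by the worker's staleness: updates computed against an old copy of the model contribute less. A minimal sketch of one such staleness heuristic; the exact scaling used by `DynSGDParameterServer` may differ, and the version counters here are hypothetical names:

```python
def staleness_scaled_rate(base_rate, current_version, worker_version):
    """Dampen the learning rate by the number of model updates the
    worker missed since it pulled its copy of the parameters."""
    staleness = max(1, current_version - worker_version)
    return base_rate / staleness

# A fresh worker keeps the full rate; a stale one is damped.
fresh = staleness_scaled_rate(0.1, current_version=10, worker_version=10)
stale = staleness_scaled_rate(0.1, current_version=10, worker_version=5)
```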
class Experimental(AsynchronousDistributedTrainer):
"""Experimental optimization scheme for development purposes."""
def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32,
features_col="features", label_col="label", num_epoch=1, communication_window=5,
learning_rate=1.0, master_port=5000, loss_weights=None):
# Initialize the parent object.
super(Experimental, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)
# Set the algorithm parameters.
self.communication_window = communication_window
self.learning_rate = learning_rate
def allocate_worker(self):
"""Allocate experimental worker."""
worker = ExperimentalWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,
self.features_column, self.label_column, self.batch_size, self.num_epoch,
self.master_host, self.master_port, self.communication_window,
self.num_workers, self.learning_rate)
return worker
def allocate_parameter_server(self):
"""Allocate experimental parameter server."""
parameter_server = ExperimentalParameterServer(self.master_model, self.master_port, self.learning_rate)
return parameter_server
================================================
FILE: distkeras/transformers.py
================================================
"""Commonly used Dataframe transformers.
A transformer will "transform" a Spark dataframe from one form into
another, for example by mapping a column to another value, or by adding
a column to a dataframe based on a collection of specified values.
"""
## BEGIN Imports. ##############################################################
import numpy as np
from distkeras.utils import new_dataframe_row
from distkeras.utils import to_one_hot_encoded_dense
from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.linalg import DenseVector
from pyspark.sql.functions import mean
from pyspark.sql.functions import stddev_pop
## END Imports. ################################################################
class Transformer(object):
"""Interface which defines a transformer object."""
def transform(self, dataframe):
"""Transforms the dataframe into an other dataframe.
# Returns
The transformed dataframe.
"""
raise NotImplementedError
class MinMaxTransformer(Transformer):
    """Will transform every feature of an instance into a specified range.
# Arguments
o_min: float. Original minimum of dataset.
o_max: float. Original maximum of dataset.
n_min: float. New minimum of dataset.
n_max: float. New maximum of dataset.
input_col: string. Name of input column.
output_col: string. Name of output column.
        is_vector: boolean. Indicates if the data element is a vector or
                   a singular value.
    # Summary
    Old range: [o_min, o_max]
    New range: [n_min, n_max]
"""
def __init__(self, o_min, o_max, n_min, n_max, input_col, output_col, is_vector=True):
self.o_min = float(o_min)
self.o_max = float(o_max)
self.n_min = float(n_min)
self.n_max = float(n_max)
self.scale = (self.n_max - self.n_min) / (self.o_max - self.o_min)
self.input_column = input_col
self.output_column = output_col
self.is_vector = is_vector
def _transform(self, row):
        """Rescales every instance as
        x' = n_min + (x - o_min) * (n_max - n_min) / (o_max - o_min),
        written below in the algebraically equivalent max-anchored form.
        """
if self.is_vector:
vector = row[self.input_column].toArray()
vector = self.scale * (vector - self.o_max) + self.n_max
new_value = DenseVector(vector)
else:
value = row[self.input_column]
new_value = self.scale * (value - self.o_max) + self.n_max
# Construct a new row with the normalized vector.
new_row = new_dataframe_row(row, self.output_column, new_value)
return new_row
def transform(self, dataframe):
"""Applies the min-max transformation to every row in the dataframe.
# Arguments
dataframe: dataframe. Spark Dataframe.
"""
return dataframe.rdd.map(self._transform).toDF()
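The `_transform` above anchors the rescaling at the maximum (`scale * (x - o_max) + n_max`), which is algebraically identical to the more familiar minimum-anchored form `n_min + scale * (x - o_min)`. A quick NumPy check of that equivalence, using an 8-bit-image-style range as example values:

```python
import numpy as np

o_min, o_max, n_min, n_max = 0.0, 255.0, 0.0, 1.0
scale = (n_max - n_min) / (o_max - o_min)
x = np.array([0.0, 127.5, 255.0])

max_anchored = scale * (x - o_max) + n_max   # form used by MinMaxTransformer
min_anchored = n_min + scale * (x - o_min)   # textbook min-max form
```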
class BinaryLabelTransformer(Transformer):
    """Transforms the specified column into a binary label vector given
    a specific label name. For the specified label this transformer generates
    [1, 0], and in the other case [0, 1].
# Arguments:
input_column: string. Column name of the label identifier.
output_column: string. Name of the new label which contains the binary label.
label: string. Name of the label which needs to serve as 1.
"""
def __init__(self, input_column, output_column, label):
self.input_column = input_column
self.output_column = output_column
self.label = label
def _transform(self, row):
"""Appends the desired binary label column."""
value = row[self.input_column]
vector = np.zeros(2)
# Check if the name matches.
if value == self.label:
vector[0] = 1.0
else:
vector[1] = 1.0
# Convert to a Spark DenseVector
vector = DenseVector(vector)
return new_dataframe_row(row, self.output_column, vector)
def transform(self, dataframe):
"""Applies the binary label transformation to the applied dataframe.
# Arguments
dataframe: dataframe. Spark Dataframe.
"""
return dataframe.rdd.map(self._transform).toDF()
class StandardTransformer(Transformer):
    """Will transform the specified columns to unit standard deviation,
    and center the data to mean 0.
# Arguments
columns: list. List of columns.
suffix: string. Suffix name of the column after processing.
# Note
We assume equal probability of the rows.
"""
def __init__(self, columns, suffix="_normalized"):
self.columns = columns
self.column_suffix = suffix
self.current_column = None
self.means = {}
self.stddevs = {}
def clean_mean_keys(self, means):
"""Cleans the keys of the specified dictionary (mean)."""
new_means = {}
for k in means:
new_means[k[4:-1]] = means[k]
return new_means
def clean_stddev_keys(self, stddevs):
"""Cleans the keys of the specified dictionary (stddev)."""
new_stddevs = {}
for k in stddevs:
new_stddevs[k[11:-5]] = stddevs[k]
return new_stddevs
def _transform(self, row):
"""Take the column, and normalize it with the computed means and std devs."""
mean = self.means[self.current_column]
stddev = self.stddevs[self.current_column]
x = row[self.current_column]
x_normalized = (x - mean) / stddev
output_column = self.current_column + self.column_suffix
new_row = new_dataframe_row(row, output_column, x_normalized)
return new_row
def transform(self, dataframe):
"""Applies standardization to the specified columns.
# Arguments
dataframe: dataframe. Spark Dataframe.
"""
# Compute the means of the specified columns.
means = [mean(x) for x in self.columns]
means = dataframe.select(means).collect()[0].asDict()
self.means = self.clean_mean_keys(means)
# Compute the standard deviation of the specified columns.
stddevs = [stddev_pop(x) for x in self.columns]
stddevs = dataframe.select(stddevs).collect()[0].asDict()
self.stddevs = self.clean_stddev_keys(stddevs)
# For every feature, add a new column to the dataframe.
for column in self.columns:
self.current_column = column
dataframe = dataframe.rdd.map(self._transform).toDF()
return dataframe
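The effect of `StandardTransformer` on a single column can be reproduced with NumPy: subtract the column mean and divide by the population standard deviation (NumPy's default `std`, matching Spark's `stddev_pop`), which gives the column mean 0 and unit standard deviation:

```python
import numpy as np

column = np.array([2.0, 4.0, 6.0, 8.0])
mean = column.mean()
stddev = column.std()          # population standard deviation, like stddev_pop
normalized = (column - mean) / stddev
```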
class DenseTransformer(Transformer):
    """Transforms sparse vectors into dense vectors.
# Arguments
input_col: string. Name of the input column of the sparse vector.
output_col: string. Name of the output column.
"""
def __init__(self, input_col, output_col):
self.input_column = input_col
self.output_column = output_col
def _transform(self, row):
"""Transforms the sparse vector to a dense vector while putting it in a new column."""
sparse_vector = row[self.input_column]
dense_vector = DenseVector(sparse_vector.toArray())
new_row = new_dataframe_row(row, self.output_column, dense_vector)
return new_row
def transform(self, dataframe):
"""Transforms every sparse vector in the input column to a dense vector.
# Arguments
dataframe: dataframe. Spark Dataframe.
# Returns
A transformed Spark Dataframe.
"""
return dataframe.rdd.map(self._transform).toDF()
class ReshapeTransformer(Transformer):
"""Transforms vectors into other dense shapes.
# Note:
    Only use this transformer in the last stage of the processing pipeline,
    since the arbitrary vector shapes will be passed directly to the models.
# Arguments:
input_col: string. Name of the input column containing the vector.
output_col: string. Name of the output column.
shape: tuple. Shape of the matrix.
"""
def __init__(self, input_col, output_col, shape):
self.input_column = input_col
self.output_column = output_col
self.shape = shape
def _transform(self, row):
"""Transforms the vector to a dense matrix while putting it in a new column."""
vector = row[self.input_column]
vector = np.asarray(vector)
reshaped = vector.reshape(self.shape).tolist()
new_row = new_dataframe_row(row, self.output_column, reshaped)
return new_row
def transform(self, dataframe):
"""Transforms every vector in the input column to a dense vector.
# Arguments
dataframe: dataframe. Spark Dataframe.
# Returns
A transformed Spark Dataframe.
"""
return dataframe.rdd.map(self._transform).toDF()
class OneHotTransformer(Transformer):
"""Transformer which transforms an integer index into a vector using one-hot-encoding.
# Arguments
output_dim: int. Dimension of output vector.
input_col: string. Name of input column.
output_col: string. Name of output column.
"""
def __init__(self, output_dim, input_col, output_col):
self.input_column = input_col
self.output_column = output_col
self.output_dimensionality = output_dim
def _transform(self, row):
"""Transforms every individual row.
Only for internal use.
"""
label = row[self.input_column]
vector = to_one_hot_encoded_dense(label, self.output_dimensionality)
new_row = new_dataframe_row(row, self.output_column, vector.tolist())
return new_row
def transform(self, dataframe):
"""Applies One-Hot encoding to every row in the dataframe.
# Arguments
dataframe: dataframe. A Spark Dataframe.
# Returns
A Spark Dataframe with one-hot encoded features.
"""
return dataframe.rdd.map(self._transform).toDF()
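`OneHotTransformer` delegates the actual encoding to `to_one_hot_encoded_dense` from `distkeras.utils`. The encoding itself is small enough to restate inline as a self-contained sketch:

```python
import numpy as np

def one_hot(label, n_dim):
    """Vector of zeros with a single 1.0 at the label index."""
    vector = np.zeros(n_dim)
    vector[int(label)] = 1.0
    return vector

# Class index 3 out of 10 classes becomes [0, 0, 0, 1, 0, ...].
encoded = one_hot(3, 10)
```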
class LabelIndexTransformer(Transformer):
"""Transformer which will transform a prediction vector into an integer label.
# Arguments
output_dim: int. Dimension of output vector.
input_col: string. Name of the input column.
output_col: string. Name of the output column.
default_index: int. Default "answer".
activation_threshold: float. Threshold of immediate activation.
"""
def __init__(self, output_dim, input_col="prediction", output_col="prediction_index",
default_index=0, activation_threshold=0.55):
self.input_column = input_col
self.output_column = output_col
self.output_dimensionality = output_dim
self.activation_threshold = activation_threshold
self.default_index = default_index
def get_index(self, vector):
        """Returns the first index that clears the activation threshold,
        or otherwise the index with the highest value."""
        max_value = 0.0
        max_index = self.default_index
        for index in range(0, self.output_dimensionality):
            if vector[index] >= self.activation_threshold:
                return index
            if vector[index] > max_value:
                max_value = vector[index]
                max_index = index
        return max_index
def _transform(self, row):
"""Transforms every row by adding a "predicted index" column to the dataframe. """
prediction = row[self.input_column]
index = float(self.get_index(prediction))
new_row = new_dataframe_row(row, self.output_column, index)
return new_row
def transform(self, dataframe):
"""Transforms the dataframe by adding a predicted index.
# Arguments
dataframe: dataframe. A Spark Dataframe.
# Returns
A Spark Dataframe with a "predicted" index.
"""
return dataframe.rdd.map(self._transform).toDF()
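`get_index` combines two rules: return the first class whose activation clears the threshold, otherwise fall back to the arg-max (or the default index when all activations are zero). The same logic in isolation, with the class defaults baked in as parameters:

```python
def get_index(vector, threshold=0.55, default_index=0):
    """First index at or above the threshold wins; otherwise the arg-max."""
    best, best_index = 0.0, default_index
    for index, value in enumerate(vector):
        if value >= threshold:
            return index          # immediate activation
        if value > best:
            best, best_index = value, index
    return best_index

early = get_index([0.1, 0.9, 0.8])   # 0.9 clears the threshold first
argmax = get_index([0.2, 0.3, 0.4])  # nothing clears it: arg-max wins
```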
================================================
FILE: distkeras/utils.py
================================================
"""Utility functions used throughout Distributed Keras."""
## BEGIN Import. ###############################################################
from keras import backend as K
from keras.models import model_from_json
from pyspark.mllib.linalg import DenseVector
from pyspark.sql import Row
from pyspark.sql.functions import rand
import pickle
import json
import numpy as np
import os
import pwd
## END Import. #################################################################
def get_os_username():
    """Returns the username of the current user on the operating system.
From: http://stackoverflow.com/questions/842059/is-there-a-portable-way-to-get-the-current-username-in-python
"""
return pwd.getpwuid(os.getuid())[0]
def set_keras_base_directory(base_dir='/tmp/' + get_os_username()):
"""Sets the base directory of Keras."""
K._keras_base_dir = base_dir
def to_one_hot_encoded_dense(value, n_dim=2):
"""Converts the value to a one-hot encoded vector.
# Arguments
        value: int or float. Index of the single "hot" entry.
n_dim: int. Dimension of the output vector.
"""
value = int(value)
vector = np.zeros(n_dim)
vector[value] = 1.0
return vector
def new_dataframe_row(old_row, column_name, column_value):
"""Constructs a new Spark Row based on the old row, and a new column name and value."""
row = Row(*(old_row.__fields__ + [column_name]))(*(old_row + (column_value, )))
return row
def json_to_dataframe_row(string):
"""Converts a JSON String to a Spark Dataframe row."""
dictionary = json.loads(string)
row = Row(**dictionary)
return row
def pickle_object(o):
    """Pickles the specified object."""
return pickle.dumps(o, -1)
def unpickle_object(string):
    """Unpickles the specified string into an object."""
return pickle.loads(string)
def serialize_keras_model(model):
"""Serializes the specified Keras model into a dictionary."""
dictionary = {}
dictionary['model'] = model.to_json()
dictionary['weights'] = model.get_weights()
return dictionary
def history_executors_average(history):
    """Returns the averaged training metrics over all the executors."""
    max_iteration = max(history, key=lambda x: x['iteration'])['iteration']
    max_executor = max(history, key=lambda x: x['worker_id'])['worker_id']
    histories = []
    averaged_history = []
    # Fetch the histories of the individual executors. Worker ids are 0-based,
    # so include the executor with the highest id.
    for i in range(0, max_executor + 1):
        histories.append(history_executor(history, i))
    # Construct the averaged history.
    for i in range(0, max_iteration + 1):
        num_executors = 0
        total = np.zeros(2)
        for j in range(0, max_executor + 1):
            if len(histories[j]) - 1 >= i:
                num_executors += 1
                total += histories[j][i]['history']
        # Average over the executors which reached this iteration.
        total /= num_executors
        averaged_history.append(total)
    return averaged_history
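`history_executors_average` aligns the per-executor histories by iteration and averages whichever executors produced a value at that iteration. The core of that alignment can be shown on plain metric arrays, without the worker-id bookkeeping:

```python
import numpy as np

def average_histories(histories):
    """Average metric vectors per iteration over the executors
    that reached that iteration."""
    length = max(len(h) for h in histories)
    averaged = []
    for i in range(length):
        present = [h[i] for h in histories if len(h) > i]
        averaged.append(np.mean(present, axis=0))
    return averaged

# Two executors; the second stopped one iteration early.
h1 = [np.array([1.0, 0.5]), np.array([0.5, 0.7])]
h2 = [np.array([3.0, 0.3])]
avg = average_histories([h1, h2])
```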
def history_executor(history, id):
"""Returns the history of a specific executor."""
executor_history = [h for h in history if h['worker_id'] == id]
executor_history.sort(key=lambda x: x['iteration'])
return executor_history
def deserialize_keras_model(dictionary):
"""Deserialized the Keras model using the specified dictionary."""
architecture = dictionary['model']
weights = dictionary['weights']
model = model_from_json(architecture)
model.set_weights(weights)
return model
def uniform_weights(model, constraints=[-0.5, 0.5]):
"""Initializes the parameters of the specified Keras model with uniform
weights between the specified ranges.
# Arguments
model: Keras model.
constraints: array. An array with two elements which defines the range
of the uniform initialization.
"""
# We assume the following: Keras will return a list of weight matrices.
# All layers, even the activation layers, will be randomly initialized.
weights = model.get_weights()
for layer in weights:
shape = layer.shape
if len(shape) > 1:
# Fill the matrix with random numbers.
n_rows = shape[0]
n_columns = shape[1]
for i in range(0, n_rows):
for j in range(0, n_columns):
layer[i][j] = np.random.uniform(low=constraints[0], high=constraints[1])
else:
# Fill the vector with random numbers.
n_elements = shape[0]
for i in range(0, n_elements):
layer[i] = np.random.uniform(low=constraints[0], high=constraints[1])
# Set the new weights in the model.
model.set_weights(weights)
def shuffle(dataset):
"""Shuffles the rows in the specified Spark Dataframe.
# Arguments
dataset: dataframe. A Spark Dataframe.
"""
dataset = dataset.orderBy(rand())
dataset.cache()
return dataset
def precache(dataset, num_workers):
"""Precaches the specified dataset.
Make sure the specified dataframe has the desired partitioning scheme.
# Arguments
dataset: dataframe. A Spark Dataframe.
num_workers: int. Number of workers you are going to use.
"""
dataset = dataset.repartition(num_workers)
dataset.cache()
dataset.count()
return dataset
================================================
FILE: distkeras/workers.py
================================================
"""Workers module.
This module contains all worker specific implementations for different optimization
algorithms.
"""
## BEGIN Imports. ##############################################################
from distkeras.networking import connect
from distkeras.networking import recv_data
from distkeras.networking import send_data
from distkeras.utils import deserialize_keras_model
from distkeras.utils import serialize_keras_model
from distkeras.utils import set_keras_base_directory
from distkeras.utils import shuffle
from distkeras.utils import uniform_weights
from keras.optimizers import Optimizer, serialize, deserialize
import keras.backend as K
from itertools import tee
from multiprocessing import Pool
import numpy as np
import threading
import tensorflow as tf
import sys
# "queue" module in python 3 is named "Queue" in python 2
use_python3 = sys.version_info[0] == 3
if use_python3:
import queue
else:
import Queue as queue
import random
import socket
import time
## END Imports. ################################################################
class Worker(object):
"""Abstract class of a worker.
This class provides basic functionality and properties all workers share.
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label",
batch_size=32, num_epoch=1, learning_rate=1.0):
assert isinstance(optimizer, (str, Optimizer)), "'optimizer' must be a string or a Keras Optimizer instance"
assert isinstance(features_col, (str, list)), "'features_col' must be a string or a list of strings"
assert isinstance(label_col, (str, list)), "'label_col' must be a string or a list of strings"
self.model = model
self.optimizer = {'class_name': optimizer, 'config': {}} if isinstance(optimizer, str) else serialize(optimizer)
self.loss = loss
self.loss_weights = loss_weights
self.metrics = metrics
self.features_column = [features_col] if isinstance(features_col, str) else features_col
self.label_column = [label_col] if isinstance(label_col, str) else label_col
self.batch_size = batch_size
self.num_epoch = num_epoch
self.max_mini_batches = 100
self.prefetching_thread = None
self.mini_batches = None
self.is_prefetching = True
self.worker_id = -1
self.learning_rate = learning_rate
self.num_inputs = len(self.features_column)
self.num_outputs = len(self.label_column)
self.current_epoch = 0
def set_max_prefetch(self, max_mini_batches):
"""Sets the maximum number of mini-batches that can be prefetched."""
self.max_mini_batches = max_mini_batches
def set_learning_rate(self, learning_rate):
"""Sets the learning rate of the worker."""
self.learning_rate = learning_rate
def get_learning_rate(self):
"""Returns the learning rate of the worker."""
return self.learning_rate
def set_worker_id(self, worker_id):
"""Sets the worker id.
# Arguments
worker_id: int. Worker identifier.
"""
self.worker_id = worker_id
def get_worker_id(self):
"""Returns the worker id."""
return self.worker_id
def prepare_model(self):
"""Prepares the model for training."""
# Set the Keras directory.
set_keras_base_directory()
if K.backend() == 'tensorflow':
# set GPU option allow_growth to False for GPU-enabled tensorflow
config = tf.ConfigProto()
config.gpu_options.allow_growth = False
sess = tf.Session(config=config)
K.set_session(sess)
# Deserialize the Keras model.
self.model = deserialize_keras_model(self.model)
self.optimizer = deserialize(self.optimizer)
# Compile the model with the specified loss and optimizer.
self.model.compile(loss=self.loss, loss_weights = self.loss_weights,
optimizer=self.optimizer, metrics=self.metrics)
def get_next_minibatch(self):
"""Returns the next mini-batch."""
return self.mini_batches.get(timeout=10)
def start_prefetching_thread(self, iterator):
"""Starts the data prefetching thread."""
self.mini_batches = queue.Queue()
self.iterator = iterator
self.prefetching_thread = threading.Thread(target=self.prefetching)
self.prefetching_thread.start()
def prefetching(self):
partition_iterators_all_epochs = tee(self.iterator, self.num_epoch)
for iter_one_epoch in partition_iterators_all_epochs:
self.current_epoch += 1
self.is_prefetching = True
try:
while self.is_prefetching:
if self.mini_batches.qsize() < self.max_mini_batches:
batch = [next(iter_one_epoch) for _ in range(self.batch_size)]
batch_iterator_copies = tee(batch, self.num_inputs + self.num_outputs)
feature_iterators = batch_iterator_copies[:self.num_inputs]
label_iterators = batch_iterator_copies[self.num_inputs:]
X = [np.asarray([x[self.features_column[i]] for x in iterator])
for i, iterator in enumerate(feature_iterators)]
Y = [np.asarray([x[self.label_column[i]] for x in iterator])
for i, iterator in enumerate(label_iterators)]
self.mini_batches.put([X, Y])
except Exception as e:
print(e)
self.is_prefetching = False
def optimize(self):
"""Optimization procedure of a worker."""
raise NotImplementedError
def train(self, worker_id, iterator):
"""Training procedure for the worker node.
# Arguments
worker_id: int. Partition index provided by Spark. Can be used as a worker_id.
iterator: iterator. Data iterator.
"""
# Prepare the optimization procedure.
self.start_prefetching_thread(iterator)
self.set_worker_id(worker_id)
self.prepare_model()
# Start the optimization procedure.
try:
self.optimize()
except Exception as e:
# Stop the prefetching process.
self.is_prefetching = False
print(e)
# Wait for the prefetching thread to stop.
self.prefetching_thread.join()
return iter([serialize_keras_model(self.model)])
class SequentialWorker(Worker):
"""Implementation for sequential gradient updates on a single worker.
Will train a model on a single worker node.
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"],
features_col="features", label_col="label", batch_size=32, num_epoch=1):
# Initialize the parent class.
super(SequentialWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col,
label_col, batch_size, num_epoch)
self.training_history = []
def add_history(self, h):
"""Appends the specified training metrics; the base Worker class does not track history."""
self.training_history.append(h)
def optimize(self):
"""Training procedure with sequential gradient updates."""
while True:
X, Y = self.get_next_minibatch()
h = self.model.train_on_batch(X, Y)
self.add_history(h)
class NetworkWorker(Worker):
"""Abstract class of a worker who shares the variables using the network."""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, learning_rate=1.0):
super(NetworkWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col,
label_col, batch_size, num_epoch, learning_rate)
self.master_host = master_host
self.master_port = master_port
self.socket = None
self.center_variable = None
self.disable_nagle = True
self.training_history = []
self.worker_id = 0
def connect(self):
"""Connect with the remote parameter server."""
self.socket = connect(self.master_host, self.master_port, self.disable_nagle)
def pull(self):
"""Requests the center variable from the parameter server."""
# Request a pull from the parameter server.
self.socket.sendall(b'p')
# Fetch the center variable from the parameter server.
self.center_variable = np.asarray(recv_data(self.socket))
def commit(self, residual):
"""Sends the gradient residual to the parameter server."""
# Prepare the data structure.
data = {}
data['worker_id'] = self.get_worker_id()
data['delta'] = residual
# Request a commit from the parameter server.
self.socket.sendall(b'c')
# Send the data to the parameter server.
send_data(self.socket, data)
def set_tcp_no_delay(self, flag):
"""Disables or enables Nagle's algorithm.
(True -> TCP_NODELAY = 1)
(False -> TCP_NODELAY = 0)
# Arguments:
flag: boolean. Indicates if Nagle's algorithm should be disabled.
"""
self.disable_nagle = flag
def tcp_no_delay(self):
"""Returns the value TCP_NODELAY of the flag (Nagle's algorithm).
# Returns
True, if Nagle's algorithm is disabled. False otherwise.
"""
return self.disable_nagle
def get_master_host(self):
"""Returns the host address of the master parameter server."""
return self.master_host
def get_master_port(self):
"""Returns the port of the master parameter server."""
return self.master_port
def add_history(self, h):
"""Appends the specified history data."""
d = {}
d['history'] = h
d['worker_id'] = self.worker_id
d['iteration'] = self.iteration
d['timestamp'] = time.time()
self.training_history.append(d)
def optimize(self):
"""Optimization procedure of a network worker."""
raise NotImplementedError
def train(self, worker_id, iterator):
"""Training procedure of a networked worker with a parameter server."""
self.start_prefetching_thread(iterator)
self.set_worker_id(worker_id)
self.prepare_model()
self.connect()
self.pull()
self.model.set_weights(self.center_variable)
try:
self.optimize()
except Exception as e:
# Stop the prefetching process.
self.is_prefetching = False
print(e)
self.socket.close()
self.prefetching_thread.join(timeout=1)
return iter(self.training_history)
class ADAGWorker(NetworkWorker):
"""Implements the training procedure for ADAG.
Introduced by Hermans et al.
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=5):
# Initialize the parent object.
super(ADAGWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,
batch_size, num_epoch, master_host, master_port)
# Initialize ADAG parameters.
self.communication_window = communication_window
self.iteration = 1
def commit(self, residual):
"""Sends the gradient residual to the parameter server."""
# Prepare the data structure.
data = {}
data['worker_id'] = self.get_worker_id()
data['residual'] = residual
# Request a commit from the parameter server.
self.socket.sendall(b'c')
# Send the data to the parameter server.
send_data(self.socket, data)
def optimize(self):
"""Optimization procedure of ADAG."""
W1 = np.asarray(self.model.get_weights())
while True:
X, Y = self.get_next_minibatch()
h = self.model.train_on_batch(X, Y)
self.add_history(h)
if self.iteration % self.communication_window == 0:
W2 = np.asarray(self.model.get_weights())
delta = W2 - W1
delta /= self.communication_window
self.commit(delta)
self.pull()
self.model.set_weights(self.center_variable)
W1 = self.center_variable
self.iteration += 1
class DOWNPOURWorker(NetworkWorker):
"""Implements the training procedure for the distributed DOWNPOUR optimizer.
Introduced by Dean et al.
http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=3):
# Initialize the parent object.
super(DOWNPOURWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,
batch_size, num_epoch, master_host, master_port)
self.communication_window = communication_window
self.iteration = 1
def optimize(self):
"""Specific optimization procedure for DOWNPOUR."""
W1 = np.asarray(self.model.get_weights())
while True:
X, Y = self.get_next_minibatch()
if self.iteration % self.communication_window == 0:
W2 = np.asarray(self.model.get_weights())
delta = W2 - W1
self.commit(delta)
self.pull()
self.model.set_weights(self.center_variable)
W1 = self.center_variable
h = self.model.train_on_batch(X, Y)
self.add_history(h)
self.iteration += 1
class AEASGDWorker(NetworkWorker):
"""Implementation of asynchronous EASGD worker.
Introduced by Zhang et al.
https://arxiv.org/pdf/1412.6651.pdf
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=['accuracy'], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, rho=5.0,
learning_rate=0.01, communication_window=32):
# Initialize the parent object.
super(AEASGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,
batch_size, num_epoch, master_host, master_port)
# Initialize AEASGD specific variables.
self.rho = rho
self.learning_rate = learning_rate
self.communication_window = communication_window
self.alpha = self.rho * self.learning_rate
self.iteration = 1
def optimize(self):
"""Specific training procedure for AEASGD."""
while True:
X, Y = self.get_next_minibatch()
if self.iteration % self.communication_window == 0:
self.pull()
W = np.asarray(self.model.get_weights())
E = self.alpha * (W - self.center_variable)
W = W - E
self.model.set_weights(W)
self.commit(E)
h = self.model.train_on_batch(X, Y)
self.add_history(h)
self.iteration += 1
class EAMSGDWorker(NetworkWorker):
"""Worker implementation of Asynchronous EA Momentum SGD.
Introduced by Zhang et al.
https://arxiv.org/pdf/1412.6651.pdf
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=['accuracy'], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, rho=5.0,
learning_rate=0.01, momentum=0.9, communication_window=32):
# Initialize the parent object.
super(EAMSGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,
batch_size, num_epoch, master_host, master_port)
# Initialize EAMSGD specific variables.
self.rho = rho
self.learning_rate = learning_rate
self.momentum = momentum
self.communication_window = communication_window
self.alpha = self.learning_rate * self.rho
self.iteration = 1
def optimize(self):
"""Specific training procedure of asynchronous EAMSGD."""
r = np.asarray(self.model.get_weights())
r.fill(0.0)
while True:
X, Y = self.get_next_minibatch()
if self.iteration % self.communication_window == 0:
self.pull()
W = np.asarray(self.model.get_weights())
E = self.alpha * (W - self.center_variable)
W = W - E
self.model.set_weights(W)
self.commit(E)
r_t = self.momentum * r
W_copy = np.asarray(self.model.get_weights())
W = np.asarray(self.model.get_weights())
W += r_t
self.model.set_weights(W)
h = self.model.train_on_batch(X, Y)
self.add_history(h)
gradient = np.asarray(self.model.get_weights()) - W
r = r_t - self.learning_rate * gradient
W_copy -= r
self.model.set_weights(W_copy)
self.iteration += 1
class DynSGDWorker(NetworkWorker):
"""Implements the training procedure for DynSGD."""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=5):
# Initialize the parent object.
super(DynSGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,
batch_size, num_epoch, master_host, master_port)
# Initialize DynSGD parameters.
self.communication_window = communication_window
self.iteration = 1
self.last_update = 0
def pull(self):
"""Requests the center variable and last update from the parameter server."""
# Request a pull from the parameter server.
self.socket.sendall(b'p')
# Fetch the dictionary from the parameter server.
data = recv_data(self.socket)
self.center_variable = np.asarray(data['model'])
self.last_update = data['update']
def commit(self, residual):
"""Sends the gradient residual to the parameter server."""
# Prepare the data structure.
data = {}
data['worker_id'] = self.get_worker_id()
data['residual'] = residual
data['last_update'] = self.last_update
# Request a commit from the parameter server.
self.socket.sendall(b'c')
# Send the data to the parameter server.
send_data(self.socket, data)
def optimize(self):
"""Optimization procedure of DynSGD."""
W1 = np.asarray(self.model.get_weights())
while True:
X, Y = self.get_next_minibatch()
h = self.model.train_on_batch(X, Y)
self.add_history(h)
if self.iteration % self.communication_window == 0:
W2 = np.asarray(self.model.get_weights())
delta = W2 - W1
self.commit(delta)
self.pull()
self.model.set_weights(self.center_variable)
W1 = self.center_variable
self.iteration += 1
class ExperimentalWorker(NetworkWorker):
"""Implements the training procedure for ADAG.
Introduced by Hermans et al.
"""
def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label",
batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=5,
num_workers=2, learning_rate=1.0):
# Initialize the parent object.
super(ExperimentalWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,
batch_size, num_epoch, master_host, master_port, learning_rate)
# Initialize ADAG parameters.
self.communication_window = communication_window
self.num_workers = num_workers
self.current_num_workers = self.num_workers
self.inverse_learning_rate = 1 / self.learning_rate
self.iteration = 1
def commit(self, residual):
"""Sends the gradient residual to the parameter server."""
# Prepare the data structure.
data = {}
data['worker_id'] = self.get_worker_id()
data['residual'] = residual
data['stale_center_variable'] = self.center_variable
# Request a commit from the parameter server.
self.socket.sendall(b'c')
# Send the data to the parameter server.
send_data(self.socket, data)
def pull(self):
"""Requests the center variable from the parameter server."""
# Request a pull from the parameter server.
self.socket.sendall(b'p')
# Fetch the center variable from the parameter server.
self.center_variable = np.asarray(recv_data(self.socket))
def optimize(self):
"""Optimization procedure of ADAG."""
W1 = np.asarray(self.model.get_weights())
while True:
X, Y = self.get_next_minibatch()
h = self.model.train_on_batch(X, Y)
self.add_history(h)
if self.iteration % self.communication_window == 0:
W2 = np.asarray(self.model.get_weights())
delta = W2 - W1
delta /= self.communication_window
self.commit(delta)
self.pull()
self.model.set_weights(self.center_variable)
W1 = self.center_variable
self.iteration += 1
================================================
FILE: docs/index.md
================================================
# Distributed Keras
Distributed Keras (DK) is a **distributed deep learning framework** built on top of Apache Spark and Keras, with the goal of significantly reducing training time through distributed machine learning algorithms. We designed the framework so that a developer can implement a new distributed optimizer with ease, allowing them to focus on research and model development.
Most of our methods follow the data parallel approach described in the paper [Large Scale Distributed Deep Networks](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf). In this paradigm, replicas of a model are distributed over several "trainers", and every model replica is trained on a different partition of the dataset. The gradient (or all network weights, depending on the implementation details) is communicated to the parameter server after every gradient update. The parameter server is responsible for handling the gradient updates of all workers and for incorporating them into a single master model, which is returned to the user after the training procedure is complete.
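The paradigm above can be illustrated with a minimal NumPy sketch. This is an illustration only, not the dist-keras implementation: the "workers" compute weight deltas on their own data partitions (here with a linear model and a hand-written gradient), and the "parameter server" folds each committed delta into the center variable.

```python
import numpy as np

np.random.seed(0)

def worker_delta(weights, partition, learning_rate=0.1):
    """Computes a weight delta from one data partition (linear model, MSE)."""
    X, y = partition
    gradient = 2.0 * X.T.dot(X.dot(weights) - y) / len(y)
    return -learning_rate * gradient

# Synthetic dataset, split into four partitions (one per worker).
X = np.random.randn(100, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X.dot(true_w)
partitions = [(X[i::4], y[i::4]) for i in range(4)]

# The "master model" held by the parameter server.
center_variable = np.zeros(3)
for _ in range(200):
    for partition in partitions:
        # A worker pulls the center variable, computes an update, commits it;
        # the parameter server incorporates the committed delta.
        delta = worker_delta(center_variable.copy(), partition)
        center_variable += delta

print(np.round(center_variable, 2))
```

After training, the center variable recovers the weights that generated the data, mirroring how the master model accumulates the work of all replicas.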
## Installation
We rely on [Keras](https://keras.io) for the construction of models, and thus inherit the Keras dependencies. Furthermore, PySpark is also a dependency of this project, since DK uses Apache Spark to distribute the data and the model replicas.
### Pip
You can use `pip` if you only need the DK framework without the examples.
```bash
pip install git+https://github.com/JoeriHermans/dist-keras.git
```
### Git
However, if you would like to play with the examples and notebooks, simply install the framework using the approach described below.
```bash
git clone https://github.com/JoeriHermans/dist-keras
cd dist-keras
pip install -e .
```
## Getting Started
We recommend starting with the `workflow` notebook located in the `examples` directory. This Python notebook will guide you through the general steps you need to perform: setting up a Spark context, reading the data, applying preprocessing, and training and evaluating your model in a distributed way.
!!! Note
The **workflow.ipynb** notebook can be run on your local machine. However, we recommend running it on a Spark cluster, since the distributed trainers only start to outperform the *SingleTrainer* when the number of workers (cores multiplied by executors) exceeds roughly 10.
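The general steps above can be sketched as follows. This is a sketch under assumptions: the trainer name and parameter names follow the examples in this repository and may differ between versions, and the CSV path and model are placeholders. Check the `examples` directory for the exact API.

```python
def run_distributed_workflow(master="local[*]"):
    """Sketch of the general workflow (hypothetical parameters; see the
    examples directory for the exact dist-keras API)."""
    # Imports are local so the sketch can be defined without a Spark cluster.
    from pyspark.sql import SparkSession
    from keras.models import Sequential
    from keras.layers import Dense
    from distkeras.trainers import SingleTrainer

    # 1. Set up a Spark session / context.
    spark = SparkSession.builder.master(master).appName("workflow").getOrCreate()

    # 2. Read the data into a Spark DataFrame.
    dataset = spark.read.csv("examples/data/mnist.csv",
                             header=True, inferSchema=True)

    # 3. Apply preprocessing here (vector assembly, normalization, ...).

    # 4. Define a Keras model and hand it to a trainer.
    model = Sequential([Dense(10, input_shape=(784,), activation="softmax")])
    trainer = SingleTrainer(keras_model=model, worker_optimizer="adam",
                            loss="categorical_crossentropy",
                            features_col="features", label_col="label",
                            batch_size=32)
    trained_model = trainer.train(dataset)

    # 5. Evaluate the trained model, then stop the session.
    spark.stop()
    return trained_model
```

Swapping `SingleTrainer` for one of the distributed trainers is the only change needed to train with a parameter server instead of a single worker.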
## Support
For issues, bugs, questions, and suggestions, please use the appropriate channels on [GitHub](https://github.com/JoeriHermans/dist-keras/).
After the installation process is complete, you can start exploring the functionality by browsing the examples. We have also prepared a notebook which compares the different distributed optimizers with each other; it is located at `examples/experiment.ipynb`. Other examples show you how to use the different distributed optimizers with Apache Spark for distributed preprocessing.
## References
* Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693).
* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231).
* Pumperla, M. (2015). Elephas. GitHub repository: https://github.com/maxpumperla/elephas/.
## Licensing
Distributed Keras is released under the [GNU General Public License v3](license.md).
================================================
FILE: docs/license.md
================================================
# GNU General Public License
**Version 3, 29 June 2007**
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
## Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
## Terms And Conditions
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey,
SYMBOL INDEX (311 symbols across 16 files)
FILE: distkeras/evaluators.py
class Evaluator (line 6) | class Evaluator(object):
method __init__ (line 15) | def __init__(self, label_col="label", prediction_col="prediction"):
method evaluate (line 19) | def evaluate(self, dataframe):
class AccuracyEvaluator (line 28) | class AccuracyEvaluator(Evaluator):
method __init__ (line 36) | def __init__(self, label_col="label", prediction_col="prediction"):
method evaluate (line 40) | def evaluate(self, dataframe):
FILE: distkeras/job_deployment.py
class Punchcard (line 37) | class Punchcard(object):
method __init__ (line 39) | def __init__(self, secrets_path="secrets.json", port=80):
method read_secrets (line 46) | def read_secrets(self):
method valid_secret (line 53) | def valid_secret(self, secret, secrets):
method secret_in_use (line 61) | def secret_in_use(self, secret):
method set_trained_model (line 64) | def set_trained_model(self, job, model):
method get_submitted_job (line 68) | def get_submitted_job(self, secret):
method define_routes (line 77) | def define_routes(self):
method run (line 147) | def run(self):
class PunchcardJob (line 152) | class PunchcardJob(object):
method __init__ (line 154) | def __init__(self, secret, job_name, data_path, num_executors, num_pro...
method get_job_name (line 166) | def get_job_name(self):
method get_secret (line 169) | def get_secret(self):
method get_history (line 172) | def get_history(self):
method get_trained_model (line 175) | def get_trained_model(self):
method start (line 178) | def start(self):
method cancel (line 184) | def cancel(self):
method running (line 187) | def running(self):
method join (line 190) | def join(self):
method run_job (line 193) | def run_job(self):
method clean_up (line 196) | def clean_up(self):
method read_trained_model (line 202) | def read_trained_model(self):
method read_history (line 207) | def read_history(self):
method serialize_trainer (line 212) | def serialize_trainer(self):
method generate_code (line 218) | def generate_code(self):
method run (line 274) | def run(self):
class Job (line 284) | class Job(object):
method __init__ (line 286) | def __init__(self, secret, job_name, data_path, num_executors, num_pro...
method set_num_executors (line 297) | def set_num_executors(self, num_executors):
method set_num_processes (line 300) | def set_num_processes(self, num_processes):
method get_trained_model (line 303) | def get_trained_model(self):
method get_history (line 306) | def get_history(self):
method is_finished (line 309) | def is_finished(self):
method destroy_remote_job (line 317) | def destroy_remote_job(self):
method start (line 326) | def start(self):
method wait_completion (line 330) | def wait_completion(self):
method cancel (line 333) | def cancel(self):
method send (line 338) | def send(self, address):
method run (line 352) | def run(self):
FILE: distkeras/networking.py
function determine_host_address (line 11) | def determine_host_address():
function recvall (line 18) | def recvall(connection, num_bytes):
function recv_data (line 42) | def recv_data(connection):
function send_data (line 65) | def send_data(connection, data):
function connect (line 89) | def connect(host, port, disable_nagle=True):
FILE: distkeras/parameter_servers.py
class ParameterServer (line 26) | class ParameterServer(object):
method __init__ (line 35) | def __init__(self, model):
method initialize (line 39) | def initialize(self):
method start (line 46) | def start(self):
method run (line 50) | def run(self):
method stop (line 54) | def stop(self):
method get_model (line 58) | def get_model(self):
method next_update (line 62) | def next_update(self):
method reset_update_counter (line 66) | def reset_update_counter(self):
method get_num_updates (line 70) | def get_num_updates(self):
class SocketParameterServer (line 75) | class SocketParameterServer(ParameterServer):
method __init__ (line 89) | def __init__(self, model, port=5000):
method initialize (line 97) | def initialize(self):
method handle_commit (line 117) | def handle_commit(self, conn, addr):
method handle_pull (line 126) | def handle_pull(self, conn, addr):
method cancel_accept (line 141) | def cancel_accept(self):
method handle_connection (line 153) | def handle_connection(self, conn, addr):
method start (line 174) | def start(self):
method run (line 179) | def run(self):
method stop (line 194) | def stop(self):
method finalize (line 206) | def finalize(self):
method cleanup_connections (line 210) | def cleanup_connections(self):
class DeltaParameterServer (line 219) | class DeltaParameterServer(SocketParameterServer):
method __init__ (line 228) | def __init__(self, model, master_port):
method handle_commit (line 232) | def handle_commit(self, conn, addr):
method handle_pull (line 243) | def handle_pull(self, conn, addr):
method finalize (line 257) | def finalize(self):
class ADAGParameterServer (line 262) | class ADAGParameterServer(SocketParameterServer):
method __init__ (line 272) | def __init__(self, model, master_port):
method handle_commit (line 276) | def handle_commit(self, conn, addr):
method handle_pull (line 287) | def handle_pull(self, conn, addr):
method finalize (line 301) | def finalize(self):
class DynSGDParameterServer (line 306) | class DynSGDParameterServer(SocketParameterServer):
method __init__ (line 316) | def __init__(self, model, master_port):
method handle_pull (line 319) | def handle_pull(self, conn, addr):
method handle_commit (line 342) | def handle_commit(self, conn, addr):
class ExperimentalParameterServer (line 357) | class ExperimentalParameterServer(SocketParameterServer):
method __init__ (line 367) | def __init__(self, model, master_port, learning_rate):
method handle_commit (line 372) | def handle_commit(self, conn, addr):
method handle_pull (line 388) | def handle_pull(self, conn, addr):
method finalize (line 402) | def finalize(self):
FILE: distkeras/predictors.py
class Predictor (line 15) | class Predictor(object):
method __init__ (line 22) | def __init__(self, keras_model):
method predict (line 25) | def predict(self, dataframe):
class ModelPredictor (line 34) | class ModelPredictor(Predictor):
method __init__ (line 44) | def __init__(self, keras_model, features_col="features", output_col="p...
method _predict (line 50) | def _predict(self, iterator):
method predict (line 64) | def predict(self, dataframe):
FILE: distkeras/schemes.py
class Scheme (line 13) | class Scheme(object):
method __init__ (line 26) | def __init__(self, optimizer, num_epoch=15, evaluation_frequency=5):
method initialize (line 33) | def initialize(self):
method get_epoch_over_evaluation_frequency (line 37) | def get_epoch_over_evaluation_frequency(self):
method optimize (line 41) | def optimize(self, training_set, validation_set):
class Emperor (line 45) | class Emperor(Scheme):
method __init__ (line 57) | def __init__(self, optimizer, evaluate_loss, num_epoch=15, evaluation_...
method optimize (line 64) | def optimize(self, training_set, validation_set):
FILE: distkeras/trainers.py
class Trainer (line 39) | class Trainer(object):
method __init__ (line 54) | def __init__(self, keras_model, loss, worker_optimizer, metrics=["accu...
method set_max_prefetch (line 67) | def set_max_prefetch(self, max_mini_batches):
method set_model (line 71) | def set_model(self, model):
method record_training_start (line 75) | def record_training_start(self):
method record_training_end (line 83) | def record_training_end(self):
method get_training_time (line 91) | def get_training_time(self):
method get_history (line 95) | def get_history(self):
method get_averaged_history (line 99) | def get_averaged_history(self):
method get_executor_history (line 103) | def get_executor_history(self, executor_id):
method train (line 107) | def train(self, dataframe, shuffle=False):
method serialize (line 119) | def serialize(self):
class SingleTrainer (line 123) | class SingleTrainer(Trainer):
method __init__ (line 141) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 149) | def allocate_worker(self):
method train (line 161) | def train(self, dataframe, shuffle=False):
class AveragingTrainer (line 192) | class AveragingTrainer(Trainer):
method __init__ (line 212) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method average_models (line 223) | def average_models(self, models):
method allocate_worker (line 242) | def allocate_worker(self):
method train (line 250) | def train(self, dataframe, shuffle=False):
class EnsembleTrainer (line 285) | class EnsembleTrainer(Trainer):
method __init__ (line 305) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 313) | def allocate_worker(self):
method train (line 321) | def train(self, dataframe, shuffle=False):
class DistributedTrainer (line 355) | class DistributedTrainer(Trainer):
method __init__ (line 375) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method set_minibatch_size (line 389) | def set_minibatch_size(self, size):
method get_minibatch_size (line 393) | def get_minibatch_size(self):
method get_features_column (line 397) | def get_features_column(self):
method get_label_column (line 401) | def get_label_column(self):
method get_learning_rate (line 405) | def get_learning_rate(self):
method set_learning_rate (line 413) | def set_learning_rate(self, learning_rate):
method set_num_epoch (line 421) | def set_num_epoch(self, num_epoch):
method get_num_epoch (line 425) | def get_num_epoch(self):
method allocate_worker (line 429) | def allocate_worker(self):
method set_master (line 436) | def set_master(self, master):
method determine_new_master (line 440) | def determine_new_master(self):
method allocate_parameter_server (line 444) | def allocate_parameter_server(self):
method set_num_workers (line 454) | def set_num_workers(self, num_workers):
method get_num_workers (line 458) | def get_num_workers(self):
method num_updates (line 462) | def num_updates(self):
method service (line 466) | def service(self):
method stop_service (line 472) | def stop_service(self):
method start_service (line 478) | def start_service(self):
method train (line 488) | def train(self, dataframe, shuffle=False):
class AsynchronousDistributedTrainer (line 535) | class AsynchronousDistributedTrainer(DistributedTrainer):
method __init__ (line 568) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 576) | def allocate_worker(self):
method set_parallelism_factor (line 583) | def set_parallelism_factor(self, factor):
method get_parallelism_factor (line 591) | def get_parallelism_factor(self):
method train (line 595) | def train(self, dataframe, shuffle=False):
class AEASGD (line 642) | class AEASGD(AsynchronousDistributedTrainer):
method __init__ (line 672) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 681) | def allocate_worker(self):
class DOWNPOUR (line 692) | class DOWNPOUR(AsynchronousDistributedTrainer):
method __init__ (line 720) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 726) | def allocate_worker(self):
class EAMSGD (line 736) | class EAMSGD(AsynchronousDistributedTrainer):
method __init__ (line 769) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 779) | def allocate_worker(self):
class ADAG (line 790) | class ADAG(AsynchronousDistributedTrainer):
method __init__ (line 815) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 823) | def allocate_worker(self):
method allocate_parameter_server (line 831) | def allocate_parameter_server(self):
class DynSGD (line 838) | class DynSGD(AsynchronousDistributedTrainer):
method __init__ (line 865) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 873) | def allocate_worker(self):
method allocate_parameter_server (line 881) | def allocate_parameter_server(self):
class Experimental (line 888) | class Experimental(AsynchronousDistributedTrainer):
method __init__ (line 891) | def __init__(self, keras_model, worker_optimizer, loss, metrics=["accu...
method allocate_worker (line 901) | def allocate_worker(self):
method allocate_parameter_server (line 910) | def allocate_parameter_server(self):
FILE: distkeras/transformers.py
class Transformer (line 23) | class Transformer(object):
method transform (line 26) | def transform(self, dataframe):
class MinMaxTransformer (line 35) | class MinMaxTransformer(Transformer):
method __init__ (line 53) | def __init__(self, o_min, o_max, n_min, n_max, input_col, output_col, ...
method _transform (line 63) | def _transform(self, row):
method transform (line 80) | def transform(self, dataframe):
class BinaryLabelTransformer (line 89) | class BinaryLabelTransformer(Transformer):
method __init__ (line 100) | def __init__(self, input_column, output_column, label):
method _transform (line 105) | def _transform(self, row):
method transform (line 119) | def transform(self, dataframe):
class StandardTransformer (line 128) | class StandardTransformer(Transformer):
method __init__ (line 139) | def __init__(self, columns, suffix="_normalized"):
method clean_mean_keys (line 146) | def clean_mean_keys(self, means):
method clean_stddev_keys (line 155) | def clean_stddev_keys(self, stddevs):
method _transform (line 164) | def _transform(self, row):
method transform (line 175) | def transform(self, dataframe):
class DenseTransformer (line 197) | class DenseTransformer(Transformer):
method __init__ (line 205) | def __init__(self, input_col, output_col):
method _transform (line 209) | def _transform(self, row):
method transform (line 217) | def transform(self, dataframe):
class ReshapeTransformer (line 228) | class ReshapeTransformer(Transformer):
method __init__ (line 241) | def __init__(self, input_col, output_col, shape):
method _transform (line 246) | def _transform(self, row):
method transform (line 255) | def transform(self, dataframe):
class OneHotTransformer (line 266) | class OneHotTransformer(Transformer):
method __init__ (line 275) | def __init__(self, output_dim, input_col, output_col):
method _transform (line 280) | def _transform(self, row):
method transform (line 291) | def transform(self, dataframe):
class LabelIndexTransformer (line 302) | class LabelIndexTransformer(Transformer):
method __init__ (line 313) | def __init__(self, output_dim, input_col="prediction", output_col="pre...
method get_index (line 321) | def get_index(self, vector):
method _transform (line 334) | def _transform(self, row):
method transform (line 342) | def transform(self, dataframe):
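Two of the transformers indexed above have simple per-cell math: MinMaxTransformer rescales values from an observed range [o_min, o_max] into a new range [n_min, n_max], and OneHotTransformer expands a label index into a dense vector of length output_dim. The sketch below shows that core math in plain Python, without the Spark DataFrame plumbing the real classes wrap it in; function names here are illustrative.

```python
def min_max_rescale(x, o_min, o_max, n_min, n_max):
    """Rescale x from the observed range [o_min, o_max] to [n_min, n_max]."""
    return (x - o_min) / (o_max - o_min) * (n_max - n_min) + n_min

def one_hot(index, output_dim):
    """Dense one-hot vector of length output_dim with a 1.0 at `index`."""
    v = [0.0] * output_dim
    v[int(index)] = 1.0
    return v

# Typical MNIST-style usage: pixel intensities 0..255 to 0..1, labels 0..9 one-hot.
scaled = min_max_rescale(128.0, 0.0, 255.0, 0.0, 1.0)   # ≈ 0.502
encoded = one_hot(3, 10)
```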
FILE: distkeras/utils.py
function get_os_username (line 28) | def get_os_username():
function set_keras_base_directory (line 36) | def set_keras_base_directory(base_dir='/tmp/' + get_os_username()):
function to_one_hot_encoded_dense (line 41) | def to_one_hot_encoded_dense(value, n_dim=2):
function new_dataframe_row (line 55) | def new_dataframe_row(old_row, column_name, column_value):
function json_to_dataframe_row (line 62) | def json_to_dataframe_row(string):
function pickle_object (line 70) | def pickle_object(o):
function unpickle_object (line 75) | def unpickle_object(string):
function serialize_keras_model (line 80) | def serialize_keras_model(model):
function history_executors_average (line 89) | def history_executors_average(history):
function history_executor (line 113) | def history_executor(history, id):
function deserialize_keras_model (line 121) | def deserialize_keras_model(dictionary):
function uniform_weights (line 131) | def uniform_weights(model, constraints=[-0.5, 0.5]):
function shuffle (line 161) | def shuffle(dataset):
function precache (line 173) | def precache(dataset, num_workers):
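Several utilities above (pickle_object, unpickle_object, serialize_keras_model, deserialize_keras_model) exist so that models and weights can cross process boundaries as plain bytes. A minimal round-trip sketch of that pattern follows; the exact protocol arguments and dictionary layout in dist-keras may differ, but serializing a Keras model as an architecture string plus weight arrays is the common approach.

```python
import pickle

def pickle_object(o):
    """Serialize any Python object to bytes using the highest pickle protocol."""
    return pickle.dumps(o, -1)

def unpickle_object(data):
    """Inverse of pickle_object."""
    return pickle.loads(data)

# A serialized model in this style is typically a dict holding the architecture
# (e.g. the JSON from model.to_json()) plus the weight arrays from get_weights().
fake_model = {"model": '{"layers": []}', "weights": [[0.1, 0.2], [0.3]]}
restored = unpickle_object(pickle_object(fake_model))
```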
FILE: distkeras/workers.py
class Worker (line 49) | class Worker(object):
method __init__ (line 55) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method set_max_prefetch (line 79) | def set_max_prefetch(self, max_mini_batches):
method set_learning_rate (line 83) | def set_learning_rate(self, learning_rate):
method get_learning_rate (line 87) | def get_learning_rate(self):
method set_worker_id (line 91) | def set_worker_id(self, worker_id):
method get_worker_id (line 99) | def get_worker_id(self):
method prepare_model (line 103) | def prepare_model(self):
method get_next_minibatch (line 121) | def get_next_minibatch(self):
method start_prefetching_thread (line 125) | def start_prefetching_thread(self, iterator):
method prefetching (line 132) | def prefetching(self):
method optimize (line 153) | def optimize(self):
method train (line 157) | def train(self, worker_id, iterator):
class SequentialWorker (line 181) | class SequentialWorker(Worker):
method __init__ (line 187) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method optimize (line 193) | def optimize(self):
class NetworkWorker (line 205) | class NetworkWorker(Worker):
method __init__ (line 208) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method connect (line 220) | def connect(self):
method pull (line 224) | def pull(self):
method commit (line 231) | def commit(self, residual):
method set_tcp_no_delay (line 242) | def set_tcp_no_delay(self, flag):
method tcp_no_delay (line 252) | def tcp_no_delay(self):
method get_master_host (line 260) | def get_master_host(self):
method get_master_port (line 264) | def get_master_port(self):
method add_history (line 268) | def add_history(self, h):
method optimize (line 277) | def optimize(self):
method train (line 281) | def train(self, worker_id, iterator):
class ADAGWorker (line 301) | class ADAGWorker(NetworkWorker):
method __init__ (line 307) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method commit (line 316) | def commit(self, residual):
method optimize (line 327) | def optimize(self):
class DOWNPOURWorker (line 345) | class DOWNPOURWorker(NetworkWorker):
method __init__ (line 352) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method optimize (line 360) | def optimize(self):
class AEASGDWorker (line 377) | class AEASGDWorker(NetworkWorker):
method __init__ (line 384) | def __init__(self, model, optimizer, loss, loss_weights, metrics=['acc...
method optimize (line 397) | def optimize(self):
class EAMSGDWorker (line 413) | class EAMSGDWorker(NetworkWorker):
method __init__ (line 420) | def __init__(self, model, optimizer, loss, loss_weights, metrics=['acc...
method optimize (line 434) | def optimize(self):
class DynSGDWorker (line 461) | class DynSGDWorker(NetworkWorker):
method __init__ (line 464) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method pull (line 474) | def pull(self):
method commit (line 483) | def commit(self, residual):
method optimize (line 495) | def optimize(self):
class ExperimentalWorker (line 512) | class ExperimentalWorker(NetworkWorker):
method __init__ (line 518) | def __init__(self, model, optimizer, loss, loss_weights, metrics=["acc...
method commit (line 531) | def commit(self, residual):
method pull (line 543) | def pull(self):
method optimize (line 550) | def optimize(self):
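The Worker base class above exposes start_prefetching_thread / prefetching / get_next_minibatch, which suggests a background thread that fills a bounded buffer of minibatches so the training loop never stalls on data parsing. The sketch below shows that producer-consumer pattern with a standard-library Queue; names and the sentinel mechanism are illustrative, not the dist-keras implementation.

```python
import threading
from queue import Queue

def start_prefetching_thread(iterator, max_prefetch=4):
    """Spawn a background thread that pulls minibatches from `iterator`
    into a bounded queue; returns a function that yields the next batch."""
    queue = Queue(maxsize=max_prefetch)   # bounded: caps memory usage
    sentinel = object()                    # marks iterator exhaustion

    def prefetching():
        for minibatch in iterator:
            queue.put(minibatch)          # blocks while the queue is full
        queue.put(sentinel)

    threading.Thread(target=prefetching, daemon=True).start()

    def get_next_minibatch():
        item = queue.get()                # blocks until a batch is ready
        return None if item is sentinel else item

    return get_next_minibatch

# Consume three "minibatches" produced by the background thread.
next_batch = start_prefetching_thread(iter([[1, 2], [3, 4], [5, 6]]))
batches = []
while True:
    b = next_batch()
    if b is None:
        break
    batches.append(b)
```

The bounded maxsize is the important design choice: an unbounded queue would let a fast parser buffer the whole partition in memory, while a bound throttles the producer to roughly the consumer's pace.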
FILE: examples/kafka_producer.py
function usage (line 20) | def usage():
function allocate_producer (line 27) | def allocate_producer(bootstrap_server):
function read_data (line 32) | def read_data():
function produce (line 45) | def produce(producer, topic, data):
function main (line 49) | def main():
FILE: examples/mnist.py
function evaluate_accuracy (line 173) | def evaluate_accuracy(model, test_set, features="matrix"):
FILE: resources/blog-posts/js/highlight.pack.js
function n (line 2) | function n(e){return e.replace(/[&<>]/gm,function(e){return I[e]})}
function t (line 2) | function t(e){return e.nodeName.toLowerCase()}
function r (line 2) | function r(e,n){var t=e&&e.exec(n);return t&&0===t.index}
function a (line 2) | function a(e){return k.test(e)}
function i (line 2) | function i(e){var n,t,r,i,o=e.className+" ";if(o+=e.parentNode?e.parentN...
function o (line 2) | function o(e,n){var t,r={};for(t in e)r[t]=e[t];if(n)for(t in n)r[t]=n[t...
function u (line 2) | function u(e){var n=[];return function r(e,a){for(var i=e.firstChild;i;i...
function c (line 2) | function c(e,r,a){function i(){return e.length&&r.length?e[0].offset!==r...
function s (line 2) | function s(e){function n(e){return e&&e.source||e}function t(t,r){return...
function l (line 2) | function l(e,t,a,i){function o(e,n){for(var t=0;t<n.c.length;t++)if(r(n....
function f (line 2) | function f(e,t){t=t||y.languages||E(x);var r={r:0,value:n(e)},a=r;return...
function g (line 2) | function g(e){return y.tabReplace||y.useBR?e.replace(M,function(e,n){ret...
function h (line 2) | function h(e,n,t){var r=n?L[n]:t,a=[e.trim()];return e.match(/\bhljs\b/)...
function p (line 2) | function p(e){var n,t,r,o,s,p=i(e);a(p)||(y.useBR?(n=document.createElem...
function d (line 2) | function d(e){y=o(y,e)}
function b (line 2) | function b(){if(!b.called){b.called=!0;var e=document.querySelectorAll("...
function v (line 2) | function v(){addEventListener("DOMContentLoaded",b,!1),addEventListener(...
function m (line 2) | function m(n,t){var r=x[n]=t(e);r.aliases&&r.aliases.forEach(function(e)...
function N (line 2) | function N(){return E(x)}
function R (line 2) | function R(e){return e=(e||"").toLowerCase(),x[e]||x[L[e]]}
FILE: resources/blog-posts/js/main.js
function addRippleEffects (line 36) | function addRippleEffects() {
function renderMath (line 44) | function renderMath() {
FILE: scripts/generate_secret.py
function generate_secret (line 19) | def generate_secret(identity):
function parse_arguments (line 26) | def parse_arguments():
function main (line 34) | def main():
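Per its preview, scripts/generate_secret.py emits a JSON structure to append to a secrets file so Punchcard can authenticate remote job requests. The actual fields and hashing scheme are not shown in this index, so the following is a hypothetical sketch of such a generator using standard-library randomness; the "identity"/"secret" keys are assumptions.

```python
import hashlib
import json
import secrets

def generate_secret(identity):
    """Return a JSON entry pairing an identity with a fresh random secret
    (hypothetical field names; the dist-keras script's format may differ)."""
    secret = hashlib.sha256(secrets.token_bytes(32)).hexdigest()
    return json.dumps({"identity": identity, "secret": secret})

entry = json.loads(generate_secret("worker-01"))
```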
FILE: scripts/punchcard.py
function parse_arguments (line 20) | def parse_arguments():
function start_punchcard (line 29) | def start_punchcard(port, secrets):
function main (line 33) | def main():
Condensed preview — 37 files, each entry showing the file path, character count, and a content snippet; the full structured content runs to 588K chars.
[
{
"path": "LICENSE",
"chars": 35128,
"preview": " GNU GENERAL PUBLIC LICENSE\n Version 3, 29 June 2007\n\n Copyright (C) 2007 Free "
},
{
"path": "README.md",
"chars": 12759,
"preview": "# Distributed Keras\n\nDistributed Deep Learning with Apache Spark and Keras.\n\n\n## Introduction\n\nDistributed Keras is a di"
},
{
"path": "distkeras/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "distkeras/evaluators.py",
"chars": 1604,
"preview": "\"\"\"Evaluation module.\n\nAn evaluator will evaluate a dataframe according to specific requirements.\n\"\"\"\n\nclass Evaluator(o"
},
{
"path": "distkeras/job_deployment.py",
"chars": 11411,
"preview": "\"\"\"Module which facilitates job deployment on remote Spark clusters.\nThis allows you to build models and architectures o"
},
{
"path": "distkeras/networking.py",
"chars": 3122,
"preview": "\"\"\"Networking utility functions.\"\"\"\n\n## BEGIN Imports. ##############################################################\n\ni"
},
{
"path": "distkeras/parameter_servers.py",
"chars": 14329,
"preview": "\"\"\"Parameter servers.\n\nA parameter server is a process which will aggregate all the incoming gradient\nor parameter updat"
},
{
"path": "distkeras/predictors.py",
"chars": 2399,
"preview": "\"\"\"Predictors take a model and will transform the Dataframe by adding a prediction column.\"\"\"\n\n## BEGIN Imports. #######"
},
{
"path": "distkeras/schemes.py",
"chars": 3817,
"preview": "\"\"\"Schemes module.\n\nModule with schemes to automatize a distributed learning process. These schemes will automatically\na"
},
{
"path": "distkeras/trainers.py",
"chars": 43180,
"preview": "\"\"\"Model optimizers. Depending on the implementation, these classes will optimize the\nKeras model in a distributed manne"
},
{
"path": "distkeras/transformers.py",
"chars": 12004,
"preview": "\"\"\"Commonly used Dataframe transformers.\n\nA transformer will \"transform\" a Spark dataframe from one form into\nthe other."
},
{
"path": "distkeras/utils.py",
"chars": 5381,
"preview": "\"\"\"Utility functions used throughout Distributed Keras.\"\"\"\n\n## BEGIN Import. ###########################################"
},
{
"path": "distkeras/workers.py",
"chars": 22490,
"preview": "\"\"\"Workers module.\n\nThis module contains all worker specific implementations for different optimization\nalgorithms.\n\"\"\"\n"
},
{
"path": "docs/index.md",
"chars": 3777,
"preview": "# Distributed Keras\n\nDistributed Keras (DK) is a **distributed deep learning framework** built op top of Apache Spark an"
},
{
"path": "docs/license.md",
"chars": 35045,
"preview": "# GNU General Public License\n**Version 3, 29 June 2007**\n\n Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf"
},
{
"path": "docs/optimizers.md",
"chars": 4369,
"preview": "# Optimizers\n\nOptimizers, or trainers, are the main component in Distributed Keras (DK). All trainers share a single int"
},
{
"path": "examples/cifar-10-preprocessing.ipynb",
"chars": 17728,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# CIFAR-10 Preprocessing\"\n ]\n },"
},
{
"path": "examples/distributed_numpy_parsing.ipynb",
"chars": 19524,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Distributed Numpy Parsing\\n\",\n "
},
{
"path": "examples/example_0_data_preprocessing.ipynb",
"chars": 15254,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Data Preprocessing\\n\",\n \"\\n\",\n"
},
{
"path": "examples/example_1_analysis.ipynb",
"chars": 49727,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Model Development and Evaluation\\"
},
{
"path": "examples/kafka_producer.py",
"chars": 1754,
"preview": "\"\"\"\nThis example will be used as a Kafka producer to generate dummy\ndata for our Spark Streaming example.\n\"\"\"\n\n## BEGIN "
},
{
"path": "examples/kafka_spark_high_throughput_ml_pipeline.ipynb",
"chars": 19880,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Kafka and Spark High Throughput D"
},
{
"path": "examples/mnist.ipynb",
"chars": 25370,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# MNIST using Distributed Keras\\n\","
},
{
"path": "examples/mnist.py",
"chars": 8226,
"preview": "\"\"\"MNIST classification using Distributed Keras.\n\nATTENTION:\nBefore running this example, make sure you put the MNIST da"
},
{
"path": "examples/mnist_analysis.ipynb",
"chars": 44343,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# MNIST Analysis with Distributed K"
},
{
"path": "examples/mnist_preprocessing.ipynb",
"chars": 14700,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# MNIST Preprocessing\\n\",\n \"\\n\","
},
{
"path": "examples/workflow.ipynb",
"chars": 39161,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Distributed Deep Learning with Ap"
},
{
"path": "mkdocs.yml",
"chars": 721,
"preview": "# Project information\nsite_name: Distributed Keras\nsite_description: Distributed Deep Learning with Apache Spark and Ker"
},
{
"path": "resources/blog-posts/css/main.css",
"chars": 2807,
"preview": "/**\n * joerihermans.com main stylesheet.\n *\n * @author Joeri Hermans\n * @version 0,1\n * @since 28 June 2016\n */\n\n/** "
},
{
"path": "resources/blog-posts/js/highlight.pack.js",
"chars": 31924,
"preview": "/*! highlight.js v9.5.0 | BSD3 License | git.io/hljslicense */\n!function(e){var n=\"object\"==typeof window&&window||\"obje"
},
{
"path": "resources/blog-posts/js/main.js",
"chars": 2046,
"preview": "/**\n * Main JavaScript file for additional main site functionality.\n *\n * @author Joeri Hermans\n * @version 0.1\n * @sin"
},
{
"path": "resources/blog-posts/part-1-an-introduction.html",
"chars": 48811,
"preview": "<!DOCTYPE html>\n<html lang=\"en\">\n <title>Distributed Deep Learning with Apache Spark and Keras - Part 1 - An introducti"
},
{
"path": "scripts/generate_secret.py",
"chars": 1105,
"preview": "\"\"\"Generates a JSON structure that needs to be added to the\nsecrets file.\n\nAuthor: Joeri Hermans\n\"\"\"\n\n## BEGIN Imports. "
},
{
"path": "scripts/punchcard.py",
"chars": 1150,
"preview": "\"\"\"Script which starts the Punchcard daemon. Punchcard will accept remote job\nrequests and execute them on the local clu"
},
{
"path": "setup.py",
"chars": 850,
"preview": "\"\"\"Setup-module for DistKeras.\n\nThis software enables distrubuted Machine Learning on Apache Spark using Keras.\n\nSee:\nht"
}
]
// ... and 2 more files omitted from this preview
About this extraction
This page contains the full source code of the cerndb/dist-keras GitHub repository, extracted and formatted as plain text: 37 files (126.4 MB), approximately 156.1k tokens, and a symbol index of 311 extracted functions, classes, methods, constants, and types. Extracted with GitExtract.