Repository: cerndb/dist-keras Branch: master Commit: 06c4e39954d9 Files: 37 Total size: 126.4 MB Directory structure: gitextract_pj297o4k/ ├── LICENSE ├── README.md ├── distkeras/ │ ├── __init__.py │ ├── evaluators.py │ ├── job_deployment.py │ ├── networking.py │ ├── parameter_servers.py │ ├── predictors.py │ ├── schemes.py │ ├── trainers.py │ ├── transformers.py │ ├── utils.py │ └── workers.py ├── docs/ │ ├── index.md │ ├── license.md │ └── optimizers.md ├── examples/ │ ├── cifar-10-preprocessing.ipynb │ ├── data/ │ │ ├── atlas_higgs.csv │ │ └── mnist.csv │ ├── distributed_numpy_parsing.ipynb │ ├── example_0_data_preprocessing.ipynb │ ├── example_1_analysis.ipynb │ ├── kafka_producer.py │ ├── kafka_spark_high_throughput_ml_pipeline.ipynb │ ├── mnist.ipynb │ ├── mnist.py │ ├── mnist_analysis.ipynb │ ├── mnist_preprocessing.ipynb │ └── workflow.ipynb ├── mkdocs.yml ├── resources/ │ └── blog-posts/ │ ├── css/ │ │ └── main.css │ ├── js/ │ │ ├── highlight.pack.js │ │ └── main.js │ └── part-1-an-introduction.html ├── scripts/ │ ├── generate_secret.py │ └── punchcard.py └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: LICENSE ================================================ GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007 Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The GNU General Public License is a free, copyleft license for software and other kinds of works. The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. 
We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. 
The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. 
Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. 
The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. 
Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. 
You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. 
You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. 
Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. 
If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. 
When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. 
If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. 
If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. 
For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. 
You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. 
The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. 
SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Distributed Deep Learning with Keras and Apache Spark. 
Copyright (C) 2016 Joeri Hermans This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: Distributed Keras Copyright (C) 2016 Joeri Hermans This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box". You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see . The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read . 
================================================
FILE: README.md
================================================

# Distributed Keras

Distributed Deep Learning with Apache Spark and Keras.

## Introduction

Distributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer can be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of **ensembles** and models using **data parallel** methods.

Most of the distributed optimizers we provide are based on data parallel methods. A data parallel method, as described in [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf), is a learning paradigm where multiple replicas of a single model are used to optimize a single objective. Using this approach, we are able to significantly reduce the training time of a model. Depending on the parametrization, we also observed that it is possible to achieve better statistical model performance compared to a more traditional approach (e.g., the [SingleTrainer](#single-trainer) implementation), while spending less wall-clock time on training the model. However, this is subject to further research.

**Attention**: A rather complete introduction to the problem of Distributed Deep Learning is presented in my Master Thesis [http://github.com/JoeriHermans/master-thesis](http://github.com/JoeriHermans/master-thesis). Furthermore, the thesis includes several *novel* insights, such as a redefinition of parameter staleness, and several new distributed optimizers such as AGN and ADAG.

## Installation

This section will guide you through the installation of Distributed Keras. However, we will assume that an Apache Spark installation is available.
In the following subsections, we describe two approaches to achieve this.

### pip

When you only require the framework for development purposes, just use `pip` to install dist-keras.

```bash
pip install --upgrade dist-keras

# OR

pip install --upgrade git+https://github.com/JoeriHermans/dist-keras.git
```

### git & pip

However, if you would like to contribute or run some of the examples, it is probably best to clone the repository directly from GitHub and install it afterwards using `pip`. This will also resolve possible missing dependencies.

```bash
git clone https://github.com/JoeriHermans/dist-keras
cd dist-keras
pip install -e .
```

### General notes

#### .bashrc

Make sure the following variables are set in your `.bashrc`. It is possible, depending on your system configuration, that the following configuration **doesn't have to be applied**.

```bash
# Example of a .bashrc configuration.
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH="$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
```

## Running an example

We would like to refer the reader to the `workflow.ipynb` notebook in the examples folder. This will give you a complete introduction to the problem of distributed deep learning, and will guide you through the steps that have to be executed. Furthermore, we would also like to show how exactly you should process "big" datasets. This is shown in the examples starting with the prefix `example_`. Please execute them in the provided sequence.

### Spark 2.0

If you want to run the examples using Apache Spark 2.0.0 or higher, you will need to remove the line containing `sqlContext = SQLContext(sc)`. We need to do this because in Spark 2.0+, the SQLContext and Hive context have been merged into the Spark session.

## Optimization Algorithms

### Single Trainer

This optimizer follows the traditional scheme of training a model, i.e., it uses sequential gradient updates to optimize the parameters.
It does this by executing the training procedure on a single Spark executor.

```python
SingleTrainer(model, features_col, label_col, batch_size, optimizer, loss, metrics=["accuracy"])
```

### ADAG (Currently Recommended)

A DOWNPOUR variant which is able to achieve significantly better statistical performance while being less sensitive to hyperparameters. This optimizer was developed using insights gained while developing this framework. More research regarding parameter staleness is still being conducted to further improve this optimizer.

```python
ADAG(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32, features_col="features", label_col="label", num_epoch=1, communication_window=12)
```

### Dynamic SGD

Dynamic SGD dynamically maintains a learning rate for every worker by incorporating parameter staleness. This optimization scheme is introduced in "Heterogeneity-aware Distributed Parameter Servers" at the SIGMOD 2017 conference [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf).

```python
DynSGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32, features_col="features", label_col="label", num_epoch=1, communication_window=10)
```

### Asynchronous Elastic Averaging SGD (AEASGD)

The distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) and the master to perform exploitation. This approach differs from other settings explored in the literature, and focuses on how fast the center variable converges [[2]](https://arxiv.org/pdf/1412.6651.pdf). In this section we show the asynchronous version of EASGD. Instead of waiting on the synchronization of other trainers, this method communicates the elastic difference (as described in the paper) with the parameter server. The only synchronization mechanism that has been implemented is to ensure no race conditions occur when updating the center variable.
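The elastic update described above can be written down in a few lines. The sketch below is a minimal illustration with NumPy arrays, not part of the dist-keras API; the function name `elastic_update` and its parameters are hypothetical, chosen to mirror the `rho` and `learning_rate` arguments of the trainer.

```python
import numpy as np

def elastic_update(worker_weights, center_variable, rho, learning_rate):
    """Apply one asynchronous elastic averaging step (hypothetical sketch).

    The elastic difference pulls the worker towards the center variable,
    while the center variable moves towards the worker by the same amount.
    """
    alpha = learning_rate * rho
    # Elastic difference between the worker replica and the center variable.
    elastic_difference = alpha * (worker_weights - center_variable)
    new_worker = worker_weights - elastic_difference
    new_center = center_variable + elastic_difference
    return new_worker, new_center
```

A small `rho` yields a small elastic difference, so the worker is allowed to explore further away from the center variable before being pulled back.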
```python
AEASGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers, batch_size, features_col, label_col, num_epoch, communication_window, rho, learning_rate)
```

### Asynchronous Elastic Averaging Momentum SGD (AEAMSGD)

Asynchronous EAMSGD is a variant of asynchronous EASGD. It is based on Nesterov's momentum scheme, where the update of the local worker is modified to incorporate a momentum term [[2]](https://arxiv.org/pdf/1412.6651.pdf).

```python
EAMSGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers, batch_size, features_col, label_col, num_epoch, communication_window, rho, learning_rate, momentum)
```

### DOWNPOUR

An asynchronous stochastic gradient descent procedure introduced by Dean et al., which supports a large number of model replicas and leverages adaptive learning rates. This implementation is based on the pseudocode provided by [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).

```python
DOWNPOUR(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers, batch_size, features_col, label_col, num_epoch, learning_rate, communication_window)
```

### Ensemble Training

In ensemble training, we train `n` models in parallel on the same dataset. All models are trained in parallel, but the training of a single model is done in a sequential manner using Keras optimizers. After the training process, one can combine and, for example, average the output of the models.

```python
EnsembleTrainer(keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col, label_col, batch_size, num_ensembles)
```

### Model Averaging

Model averaging is a data parallel technique which averages the trainable parameters of model replicas after every epoch.

```python
AveragingTrainer(keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col, label_col, num_epoch, batch_size, num_workers)
```

## Job deployment

We also support remote job deployment.
For example, imagine you are developing your model on a local notebook using a small development set. However, to run it at scale, you would first need to turn it into a cluster job and submit it to a remote cluster. To simplify this process, we have developed a simplified interface for large-scale machine learning jobs. In order to submit a job to a remote cluster, you simply run the following code:

```python
# Define the distributed optimization procedure, and its parameters.
trainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, metrics=["accuracy"], num_workers=20, batch_size=32, communication_window=15, num_epoch=1, features_col="features_normalized_dense", label_col="label_encoded")

# Define the job parameters.
job = Job(secret, job_name, data_path, num_executors, num_processes, trainer)
job.send('http://yourcluster:[port]')
job.wait_completion()

# Fetch the trained model, and history for training evaluation.
trained_model = job.get_trained_model()
history = job.get_history()
```

### Punchcard Server

Job scheduling and execution is handled by our `Punchcard` server. This server will accept requests from a remote location given a specific `secret`, which is basically a long identification string of a specific user. However, a user can have multiple secrets. At the moment, a job is only executed if there are no other jobs running for the specified secret.

In order to submit jobs to `Punchcard`, we need to specify a secrets file. This file is a JSON document with the following structure:

```json
[
    {
        "secret": "secret_of_user_1",
        "identity": "user1"
    },
    {
        "secret": "secret_of_user_2",
        "identity": "user2"
    }
]
```

After the secrets file has been constructed, the Punchcard server can be started by issuing the following command.
```sh
python scripts/punchcard.py --secrets /path/to/secrets.json
```

#### Secret Generation

In order to simplify secret generation, we have added a custom script which will generate a unique key for the specified identity. The structure can be generated by running the following command.

```sh
python scripts/generate_secret.py --identity userX
```

## Optimization Schemes

TODO

## General note

It is known that adding more asynchronous workers deteriorates the statistical performance of the model. There have been some studies which examine this particular effect. However, some of them conclude that adding more asynchronous workers actually contributes something they call **implicit momentum** [[3]](https://arxiv.org/pdf/1605.09774.pdf). However, this is subject to further investigation.

## Known issues

- Python 3 compatibility.

## TODO's

List of possible future additions.

- Save Keras model to HDFS.
- Load Keras model from HDFS.
- Compression / decompression of network transmissions.
- Stop on target loss.
- Multiple parameter servers for large Deep Networks.
- Python 3 compatibility.
- For every worker, spawn an additional thread which is responsible for sending updates to the parameter server. The actual worker thread will just submit tasks to this queue.

## Citing

If you use this framework in any academic work, please use the following BibTeX code.

```latex
@misc{dist_keras_joerihermans,
  author = {Joeri R. Hermans, CERN IT-DB},
  title = {Distributed Keras: Distributed Deep Learning with Apache Spark and Keras},
  year = {2016},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/JoeriHermans/dist-keras/}},
}
```

## References

* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231).
  [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)
* Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693). [[2]](https://arxiv.org/pdf/1412.6651.pdf)
* Mitliagkas, I., et al. (2016). Asynchrony begets momentum, with an application to deep learning. arXiv preprint arXiv:1605.09774. [[3]](https://arxiv.org/pdf/1605.09774.pdf)
* Pumperla, M. (2015). Elephas. GitHub Repository https://github.com/maxpumperla/elephas/. [4]
* Jiang, J., Cui, B., Zhang, C., & Yu, L. (2017). Heterogeneity-aware distributed parameter servers. [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf)

## Licensing

![GPLv3](resources/gpl_v3.png) ![CERN](resources/cern_logo.jpg)

================================================ FILE: distkeras/__init__.py ================================================ ================================================ FILE: distkeras/evaluators.py ================================================ """Evaluation module. An evaluator will evaluate a dataframe according to specific requirements. """ class Evaluator(object): """An evaluator is an abstract class which, given a label and a prediction, will compute an evaluation metric. # Arguments label_col: string. Column name of the label. prediction_col: string. Column name of the prediction. """ def __init__(self, label_col="label", prediction_col="prediction"): self.label_column = label_col self.prediction_column = prediction_col def evaluate(self, dataframe): """Evaluates the specified dataframe. # Arguments dataframe: dataframe. Spark Dataframe. """ raise NotImplementedError class AccuracyEvaluator(Evaluator): """Computes the accuracy of the prediction based on the label. # Arguments label_col: string. Label column. prediction_col: string. Prediction column. """ def __init__(self, label_col="label", prediction_col="prediction"): # Initialize the parent structure.
super(AccuracyEvaluator, self).__init__(label_col, prediction_col) def evaluate(self, dataframe): # Count the total number of instances. num_instances = dataframe.count() # Extract the matching indexes. cleaned = dataframe.where(dataframe[self.prediction_column] == dataframe[self.label_column]) # Fetch the number of correctly guessed instances. validated_instances = cleaned.count() return float(validated_instances) / float(num_instances) ================================================ FILE: distkeras/job_deployment.py ================================================ """Module which facilitates job deployment on remote Spark clusters. This allows you to build models and architectures on, for example, remote notebook servers, and submit the large scale training job on remote Hadoop / Spark clusters.""" ## BEGIN Imports. ############################################################## from distkeras.utils import deserialize_keras_model from distkeras.utils import get_os_username from distkeras.utils import pickle_object from distkeras.utils import serialize_keras_model from distkeras.utils import unpickle_object from flask import Flask from flask import request from os.path import expanduser from threading import Lock import base64 import json import os import subprocess import threading import time import urllib2 ## END Imports. 
################################################################ class Punchcard(object): def __init__(self, secrets_path="secrets.json", port=80): self.application = Flask(__name__) self.secrets_path = secrets_path self.port = port self.mutex = threading.Lock() self.jobs = {} def read_secrets(self): with open(self.secrets_path) as f: secrets_raw = f.read() secrets = json.loads(secrets_raw) return secrets def valid_secret(self, secret, secrets): num_secrets = len(secrets) for i in range(0, num_secrets): description = secrets[i] if description['secret'] == secret: return True return False def secret_in_use(self, secret): return secret in self.jobs def set_trained_model(self, job, model): with self.mutex: self.models[job.get_secret()] = model def get_submitted_job(self, secret): with self.mutex: if self.secret_in_use(secret): job = self.jobs[secret] else: job = None return job def define_routes(self): ## BEGIN Route definitions. ############################################ @self.application.route('/api/submit', methods=['POST']) def submit_job(): # Parse the incoming JSON data. data = json.loads(request.data) # Fetch the required job arguments. secret = data['secret'] job_name = data['job_name'] num_executors = data['num_executors'] num_processes = data['num_processes'] data_path = data['data_path'] trainer = unpickle_object(data['trainer'].decode('hex_codec')) # Fetch the parameters for the job. secrets = self.read_secrets() with self.mutex: if self.valid_secret(secret, secrets) and not self.secret_in_use(secret): job = PunchcardJob(secret, job_name, data_path, num_executors, num_processes, trainer) self.jobs[secret] = job job.start() return '', 200 return '', 403 @self.application.route('/api/state') def job_state(): secret = request.args.get('secret') job = self.get_submitted_job(secret) # Check if the job exists. 
if job is not None: d = {} d['job_name'] = job.get_job_name() d['running'] = job.running() return json.dumps(d), 200 return '', 404 @self.application.route('/api/cancel') def cancel(): secret = request.args.get('secret') job = self.get_submitted_job(secret) if job is not None and job.running(): with self.mutex: job.cancel() del self.jobs[secret] return '', 200 @self.application.route('/api/destroy') def destroy_job(): secret = request.args.get('secret') job = self.get_submitted_job(secret) if job is not None and not job.running(): with self.mutex: model = self.jobs[secret].get_trained_model() history = self.jobs[secret].get_history() model = pickle_object(serialize_keras_model(model)).encode('hex_codec') history = pickle_object(history).encode('hex_codec') d = {} d['model'] = model d['history'] = history del self.jobs[secret] return json.dumps(d), 200 return '', 400 ## END Route definitions. ############################################## def run(self): self.define_routes() self.application.run('0.0.0.0', self.port) class PunchcardJob(object): def __init__(self, secret, job_name, data_path, num_executors, num_processes, trainer): self.secret = secret self.job_name = job_name self.data_path = data_path self.num_executors = num_executors self.num_processes = num_processes self.trainer = trainer self.is_running = True self.thread = None self.trained_model = None self.history = None def get_job_name(self): return self.job_name def get_secret(self): return self.secret def get_history(self): return self.history def get_trained_model(self): return self.trained_model def start(self): self.trainer.determine_new_master() self.thread = threading.Thread(target=self.run) self.thread.setDaemon(True) self.thread.start() def cancel(self): self.thread.exit() def running(self): return self.is_running def join(self): self.thread.join() def run_job(self): os.system("python ~/jobs/" + self.secret + ".py") def clean_up(self): home = expanduser("~") os.remove(home + "/models/" + 
self.secret) os.remove(home + "/histories/" + self.secret) os.remove(home + "/trainers/" + self.secret) def read_trained_model(self): home = expanduser("~") with open(home + "/models/" + self.secret, "r") as f: self.trained_model = deserialize_keras_model(unpickle_object(f.read())) def read_history(self): home = expanduser("~") with open(home + "/histories/" + self.secret, "r") as f: self.history = unpickle_object(f.read()) def serialize_trainer(self): trainer = pickle_object(self.trainer) home = expanduser("~") with open(home + "/trainers/" + self.secret, "w") as f: f.write(trainer) def generate_code(self): source = """ from distkeras.evaluators import * from distkeras.predictors import * from distkeras.trainers import * from distkeras.trainers import * from distkeras.transformers import * from distkeras.utils import * from keras import * from pyspark import SparkConf from pyspark import SparkContext from pyspark import SQLContext from os.path import expanduser secret = '{secret}' application_name = '{job_name}' num_executors = {num_executors} num_processes = {num_processes} path_data = '{data_path}' num_workers = num_processes * num_executors # Allocate a Spark Context, and a Spark SQL context. conf = SparkConf() conf.set("spark.app.name", application_name) conf.set("spark.master", "yarn-client") conf.set("spark.executor.cores", num_processes) conf.set("spark.executor.instances", num_executors) conf.set("spark.executor.memory", "5g") conf.set("spark.locality.wait", "0") conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); sc = SparkContext(conf=conf) sqlContext = SQLContext(sc) # Read the dataset from HDFS. For now we assume Parquet files. dataset = sqlContext.read.parquet(path_data).repartition(num_workers) # Deserialize the trainer object. home = expanduser("~") with open(home + "/trainers/" + secret, "r") as f: trainer = unpickle_object(f.read()) # Train the model, and save it afterwards. 
trained_model = trainer.train(dataset) with open(home + "/models/" + secret, "w") as f: f.write(pickle_object(serialize_keras_model(trained_model))) # Save the history of the training process. histories = trainer.get_history() with open(home + "/histories/" + secret, "w") as f: f.write(pickle_object(histories)) sc.stop() """.format( secret=self.secret, job_name=self.job_name, num_executors=self.num_executors, num_processes=self.num_processes, data_path=self.data_path ) home = expanduser("~") with open(home + "/jobs/" + self.secret + ".py", "w") as f: f.write(source) def run(self): self.serialize_trainer() self.generate_code() self.run_job() self.read_trained_model() self.read_history() self.clean_up() self.is_running = False class Job(object): def __init__(self, secret, job_name, data_path, num_executors, num_processes, trainer): self.secret = secret self.job_name = job_name self.num_executors = 20 self.num_processes = 1 self.data_path = data_path self.trainer = trainer self.trained_model = None self.history = None self.address = None def set_num_executors(self, num_executors): self.num_executors = num_executors def set_num_processes(self, num_processes): self.num_processes = num_processes def get_trained_model(self): return self.trained_model def get_history(self): return self.history def is_finished(self): address = self.address + '/api/state?secret=' + self.secret request = urllib2.Request(address) response = urllib2.urlopen(request) data = json.load(response) return not data['running'] def destroy_remote_job(self): address = self.address + '/api/destroy?secret=' + self.secret request = urllib2.Request(address) response = urllib2.urlopen(request) data = json.load(response) model = unpickle_object(data['model'].decode('hex_codec')) self.trained_model = deserialize_keras_model(model) self.history = unpickle_object(data['history'].decode('hex_codec')) def start(self): self.thread = threading.Thread(target=self.run) self.thread.start() def wait_completion(self): 
self.thread.join() def cancel(self): address = self.address + '/api/cancel?secret=' + self.secret request = urllib2.Request(address) urllib2.urlopen(request) def send(self, address): data = {} data['secret'] = self.secret data['job_name'] = self.job_name data['num_executors'] = self.num_executors data['num_processes'] = self.num_processes data['data_path'] = self.data_path data['trainer'] = pickle_object(self.trainer).encode('hex_codec') request = urllib2.Request(address + "/api/submit") request.add_header('Content-Type', 'application/json') urllib2.urlopen(request, json.dumps(data)) self.address = address self.start() def run(self): time.sleep(1) while not self.is_finished(): time.sleep(10) self.destroy_remote_job() ================================================ FILE: distkeras/networking.py ================================================ """Networking utility functions.""" ## BEGIN Imports. ############################################################## import pickle import socket ## END Imports. ################################################################ def determine_host_address(): """Determines the human-readable host address of the local machine.""" host_address = socket.gethostbyname(socket.gethostname()) return host_address def recvall(connection, num_bytes): """Reads `num_bytes` bytes from the specified connection. # Arguments connection: socket. Opened socket. num_bytes: int. Number of bytes to read. """ byte_buffer = b'' buffer_size = 0 bytes_left = num_bytes # Iterate until we received all data. while buffer_size < num_bytes: # Fetch the next frame from the network. data = connection.recv(bytes_left) # Compute the size of the frame. delta = len(data) buffer_size += delta bytes_left -= delta # Append the data to the buffer. byte_buffer += data return byte_buffer def recv_data(connection): """Will fetch the next data frame from the connection. The protocol for reading is structured as follows: 1. 
The first 20 bytes represents a string which holds the next number of bytes to read. 2. We convert the 20 byte string to an integer (e.g. '00000000000000000011' -> 11). 3. We read `num_bytes` from the socket (which is in our example 11). 4. Deserialize the retrieved string. # Arguments connection: socket. Opened socket. """ data = b'' # Fetch the serialized data length. length = int(recvall(connection, 20).decode()) # Fetch the serialized data. serialized_data = recvall(connection, length) # Deserialize the data. data = pickle.loads(serialized_data) return data def send_data(connection, data): """Sends the data to the other endpoint of the socket using our protocol. The protocol for sending is structured as follows: 1. Serialize the data. 2. Obtain the buffer-size of the serialized data. 3. Serialize the buffer-size in 20 bytes (e.g. 11 -> '00000000000000000011'). 4. Send the serialized buffer size. 5. Send the serialized data. # Arguments connection: socket. Opened socket. data: any. Data to send. """ # Serialize the data. serialized_data = pickle.dumps(data, -1) length = len(serialized_data) # Serialize the number of bytes in the data. serialized_length = str(length).zfill(20) # Send the data over the provided socket. connection.sendall(serialized_length.encode()) connection.sendall(serialized_data) def connect(host, port, disable_nagle=True): fd = socket.socket(socket.AF_INET, socket.SOCK_STREAM) # Check if Nagle's algorithm needs to be disabled. if disable_nagle: fd.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) else: fd.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0) # Connect to the specified URI. fd.connect((host, port)) return fd ================================================ FILE: distkeras/parameter_servers.py ================================================ """Parameter servers. 
A parameter server is a process which will aggregate all the incoming gradient or parameter updates of the workers and incorporate them into a single center variable. This center variable will eventually be the produced model of the trainer. """ ## BEGIN Imports. ############################################################## import copy import math import numpy as np import socket import threading from distkeras.networking import recv_data from distkeras.networking import send_data from distkeras.utils import deserialize_keras_model ## END Imports. ################################################################ class ParameterServer(object): """Abstract class which provides basic attributes and methods for all parameter servers. # Arguments model: string. Serialized Keras model. See: distkeras.utils.serialize_keras_model """ def __init__(self, model): self.model = deserialize_keras_model(model) self.num_updates = 1 def initialize(self): """Initializes the parameter server. This method is called after self.start(). """ raise NotImplementedError def start(self): """Starts the parameter server in a new thread.""" raise NotImplementedError def run(self): """Main event loop of the parameter server.""" raise NotImplementedError def stop(self): """Notifies the parameter server thread to stop.""" raise NotImplementedError def get_model(self): """Returns the Keras model which will be trained by the workers.""" return self.model def next_update(self): """Increments the number of model updates by 1.""" self.num_updates += 1 def reset_update_counter(self): """Resets the model update counter.""" self.num_updates = 0 def get_num_updates(self): """Returns the number of model updates the parameter server has performed.""" return self.num_updates class SocketParameterServer(ParameterServer): """Abstract class of a parameter server which is based on a socket implementation.
This means that this parameter server accepts multiple TCP connections from multiple workers, and uses a custom protocol, fully described in the distkeras.networking module, to transmit and receive the model parameters. # Arguments model: string. Serialized Keras model. See: distkeras.utils.serialize_keras_model port: int. Listening port number. """ def __init__(self, model, port=5000): super(SocketParameterServer, self).__init__(model) self.master_port = port self.socket = None self.running = False self.connections = [] self.mutex = threading.Lock() def initialize(self): """Sets up the listening port.""" # Reset the running flag. self.running = True # Prepare a socket. file_descriptor = socket.socket(socket.AF_INET, socket.SOCK_STREAM) # Disable Nagle's algorithm. file_descriptor.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) # Check if the master port needs to be assigned by the OS. if self.master_port is None: file_descriptor.bind(('0.0.0.0', 0)) # Retrieve the port assigned by the OS. self.master_port = int(file_descriptor.getsockname()[1]) else: file_descriptor.bind(('0.0.0.0', self.master_port)) # Listen to the socket. file_descriptor.listen(5) # Assign the socket. self.socket = file_descriptor def handle_commit(self, conn, addr): """Handles parameter updates coming from the workers. # Arguments: conn: socket. The opened connection. addr: addr. Address of the remote host. """ raise NotImplementedError def handle_pull(self, conn, addr): """Handles parameter requests coming from the workers. This will actually send the model parameters to the requesting host. # Arguments: conn: socket. The opened connection. addr: addr. Address of the remote host. """ # Fetch the raw center variables. with self.mutex: center_variable = self.model.get_weights() cv = copy.deepcopy(center_variable) # Send the data over the socket. send_data(conn, cv) def cancel_accept(self): """This method will cancel the accept procedure.
The method is meant to be executed by the stop() procedure. """ file_descriptor = socket.socket(socket.AF_INET, socket.SOCK_STREAM) try: # Connect to the listening socket to cancel the accept. file_descriptor.connect(("localhost", self.master_port)) file_descriptor.close() except Exception as e: print(e) def handle_connection(self, conn, addr): """ A parameter server has two main functionalities. Nodes are able to pull (p) the current state, or 'commit' a state. This is implemented in the following functionality. Classes which implement these interfaces should not worry about connection handling. """ try: while self.running: # Fetch the current action. action = conn.recv(1).decode() # Check if the action is a commit (most of the cases). if action == 'c': # Handle the commit. self.handle_commit(conn, addr) elif action == 'p': # Handle the pull. self.handle_pull(conn, addr) except Exception as e: print(e) def start(self): """Starts the parameter server.""" # Set the running flag. self.running = True def run(self): """Main event loop of the parameter server.""" # Listen for incoming connections. while self.running: try: # Accept incoming connections. conn, addr = self.socket.accept() # Handle the connection. thread = threading.Thread(target=self.handle_connection, args=(conn, addr)) thread.start() # Store the connection in the dictionary. self.connections.append(thread) except Exception as e: print(e) def stop(self): """Stop the parameter server. This will also cleanup all existing connections.""" self.running = False # Check if a socket is allocated. if self.socket: self.cleanup_connections() self.finalize() self.socket.close() self.cancel_accept() self.socket = None self.connections = [] def finalize(self): """Method that is called when the parameter server stops.""" print("Not executed") def cleanup_connections(self): """Clean all existing connections up.""" # Iterate over all connections. for thread in self.connections: # Fetch the thread object. 
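The single-byte action dispatch in `handle_connection` above can be exercised without a real network: each worker prefixes its payload with `'c'` (commit) or `'p'` (pull). A minimal local sketch of that dispatch using a connected socket pair (the `read_action` helper is made up for illustration, not part of the library):

```python
import socket

def read_action(conn):
    # Read the one-byte action code a worker sends before its payload:
    # 'c' announces a commit (parameter update), 'p' a pull (parameter request).
    return conn.recv(1).decode()

# Local demonstration with a connected socket pair (no listening socket needed).
server_side, worker_side = socket.socketpair()
worker_side.sendall(b'c')                     # worker announces a commit
assert read_action(server_side) == 'c'
worker_side.sendall(b'p')                     # worker announces a pull
assert read_action(server_side) == 'p'
server_side.close()
worker_side.close()
```

The one-byte prefix keeps the dispatch loop trivial: the server blocks on `recv(1)` and routes to `handle_commit` or `handle_pull` accordingly.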
thread.join() del thread class DeltaParameterServer(SocketParameterServer): """A parameter server which integrates all incoming deltas into the model. # Arguments model: string. Serialized Keras model. See: distkeras.utils.serialize_keras_model master_port: int. Port number of the parameter server. """ def __init__(self, model, master_port): super(DeltaParameterServer, self).__init__(model, master_port) self.center_variable = np.asarray(self.model.get_weights()) def handle_commit(self, conn, addr): # Receive the parameters from the remote node. data = recv_data(conn) # Extract the delta from the dictionary. delta = data['delta'] # Update the center variable with the delta. with self.mutex: self.center_variable = self.center_variable + delta # Next iteration. self.next_update() def handle_pull(self, conn, addr): """Handles parameter requests coming from the workers. This will actually send the model parameters to the requesting host. # Arguments: conn: socket. The opened connection. addr: addr. Address of the remote host. """ # Fetch the raw center variables. with self.mutex: cv = copy.deepcopy(self.center_variable) # Send the data over the socket. send_data(conn, cv) def finalize(self): # Set the final weights of the model. self.model.set_weights(self.center_variable) class ADAGParameterServer(SocketParameterServer): """A parameter server which integrates the incoming gradient residuals into the model, and integrates them using the ADAG scheme. # Arguments model: string. Keras model. See: distkeras.utils.serialize_keras_model master_port: int. Port number of the parameter server. """ def __init__(self, model, master_port): super(ADAGParameterServer, self).__init__(model, master_port) self.center_variable = np.asarray(self.model.get_weights()) def handle_commit(self, conn, addr): # Receive the parameters from the remote node. data = recv_data(conn) # Extract the data from the dictionary. r = data['residual'] with self.mutex: # Update the center variable. 
self.center_variable = self.center_variable + r # Increment the number of parameter server updates. self.next_update() def handle_pull(self, conn, addr): """Handles parameter requests coming from the workers. This will actually send the model parameters to the requesting host. # Arguments: conn: socket. The opened connection. addr: addr. Address of the remote host. """ # Fetch the raw center variables. with self.mutex: cv = copy.deepcopy(self.center_variable) # Send the data over the socket. send_data(conn, cv) def finalize(self): # Set the weights of the model. self.model.set_weights(self.center_variable) class DynSGDParameterServer(SocketParameterServer): """DynSGD parameter server, keeps track of the staleness between updates to maintain dynamic worker learning rates based on staleness. # Arguments model: string. Keras model See: distkeras.utils.serialize_keras_model master_port: int. Port number of the parameter server. """ def __init__(self, model, master_port): super(DynSGDParameterServer, self).__init__(model, master_port) def handle_pull(self, conn, addr): """Handles parameter requests coming from the workers. This will actually send the model parameters to the requesting host. This is a specific implementation for DynSGD. # Arguments: conn: socket. The opened connection. addr: addr. Address of the remote host. """ # Allocate a new dictionary. data = {} # Fetch the raw center variables. with self.mutex: center_variable = self.model.get_weights() cv = copy.deepcopy(center_variable) # Store the number of updates (u) the PS executed. data['update'] = self.num_updates # Store the model (m). data['model'] = cv # Send the data over the socket. 
send_data(conn, data) def handle_commit(self, conn, addr): data = recv_data(conn) r = data['residual'] # Fetch the last iteration number last_update = data['last_update'] du = (self.num_updates - last_update) + 1 r /= du with self.mutex: center_variable = self.model.get_weights() center_variable = center_variable + r self.model.set_weights(center_variable) # Increment the number of parameter server updates. self.next_update() class ExperimentalParameterServer(SocketParameterServer): """A parameter server which integrates the incoming gradient residuals into the model, and integrates them using the ADAG scheme. # Arguments model: string. Keras model. See: distkeras.utils.serialize_keras_model master_port: int. Port number of the parameter server. """ def __init__(self, model, master_port, learning_rate): super(ExperimentalParameterServer, self).__init__(model, master_port) self.center_variable = np.asarray(self.model.get_weights()) self.inverse_learning_rate = 1.0 / learning_rate def handle_commit(self, conn, addr): # Receive the parameters from the remote node. data = recv_data(conn) # Extract the data from the dictionary. r = data['residual'] worker_id = data['worker_id'] stale_cv = data['stale_center_variable'] with self.mutex: diff_cv = np.subtract(self.center_variable, stale_cv) d = 1 / (self.inverse_learning_rate * np.power(diff_cv, 2) + 1) r = np.multiply(d, r) # Update the center variable. self.center_variable = self.center_variable + r # Increment the number of parameter server updates. self.next_update() def handle_pull(self, conn, addr): """Handles parameter requests coming from the workers. This will actually send the model parameters to the requesting host. # Arguments: conn: socket. The opened connection. addr: addr. Address of the remote host. """ # Fetch the raw center variables. with self.mutex: cv = copy.deepcopy(self.center_variable) # Send the data over the socket. send_data(conn, cv) def finalize(self): # Set the weights of the model. 
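The staleness correction in `DynSGDParameterServer.handle_commit` above divides an incoming residual by the number of parameter-server updates the worker missed, plus one. A small NumPy sketch of that scaling with illustrative values (`scale_residual` is a hypothetical helper, not part of the library):

```python
import numpy as np

def scale_residual(residual, num_updates, last_update):
    # Staleness: how many PS updates happened since this worker last pulled,
    # plus one so a perfectly fresh worker divides by 1 (no damping).
    du = (num_updates - last_update) + 1
    return residual / du

r = np.array([4.0, -2.0, 8.0])
# The worker pulled at update 5 and the server is now at update 7 -> du = 3.
scaled = scale_residual(r, num_updates=7, last_update=5)
print(scaled)  # each component damped by a factor of 3
```

The effect is a per-worker dynamic learning rate: the staler a worker's view of the center variable, the more its contribution is damped.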
self.model.set_weights(self.center_variable) ================================================ FILE: distkeras/predictors.py ================================================ """Predictors take a model and will transform the Dataframe by adding a prediction column.""" ## BEGIN Imports. ############################################################## import numpy as np from pyspark.mllib.linalg import DenseVector from distkeras.utils import serialize_keras_model from distkeras.utils import deserialize_keras_model from distkeras.utils import new_dataframe_row ## END Imports. ################################################################ class Predictor(object): """Abstract predictor class. # Arguments keras_model: Keras Model. """ def __init__(self, keras_model): self.model = serialize_keras_model(keras_model) def predict(self, dataframe): """Transforms the dataframe to add a prediction. # Arguments dataframe: dataframe. Spark Dataframe. """ raise NotImplementedError class ModelPredictor(Predictor): """Takes a Keras model and adds a prediction column to the dataframe given a features column. # Arguments keras_model: Keras model. features_col: string. Name of the features column. output_col: string. Name of the prediction column. """ def __init__(self, keras_model, features_col="features", output_col="prediction"): super(ModelPredictor, self).__init__(keras_model) assert isinstance(features_col, (str, list)), "'features_col' must be a string or a list of strings" self.features_column = [features_col] if isinstance(features_col, str) else features_col self.output_column = output_col def _predict(self, iterator): """Lambda method which will append a prediction column to the provided rows. # Arguments: iterator: iterator. Spark Row iterator. 
""" model = deserialize_keras_model(self.model) for row in iterator: features = [np.asarray([row[c]]) for c in self.features_column] prediction = model.predict(features) dense_prediction = DenseVector(prediction[0]) new_row = new_dataframe_row(row, self.output_column, dense_prediction) yield new_row def predict(self, dataframe): """Returns a dataframe which is the old dataframe with an additional prediction column. """ return dataframe.rdd.mapPartitions(self._predict).toDF() ================================================ FILE: distkeras/schemes.py ================================================ """Schemes module. Module with schemes to automatize a distributed learning process. These schemes will automatically adjust the hyperparameters to improve training performance. """ ## BEGIN Imports. ############################################################## import math ## END Imports. ################################################################ class Scheme(object): """A 'Scheme' is way to describe how a distributed optimization sequence should perform. For example, it is responsible for adjusting the learning rate of the parameter server if it notices that the loss doesn't decay. However, this is only one of the possible solutions. Others include the optimization of other hyperparameters such as the number of workers. # Arguments optimizer: trainer. A distributed optimizer. num_epoch: int. Total number of epoch. evaluation_frequency: int. Frequency of hyperparameter evaluation. 
""" def __init__(self, optimizer, num_epoch=15, evaluation_frequency=5): self.optimizer = optimizer self.num_epoch = num_epoch self.evaluation_frequency = evaluation_frequency self.epoch_over_eval_frequency = int(self.num_epoch / self.evaluation_frequency) self.initialize() def initialize(self): """Initializes the hyperparameters to follow the scheme parameters.""" self.optimizer.set_num_epoch(self.get_epoch_over_evaluation_frequency()) def get_epoch_over_evaluation_frequency(self): """Returns the number of epochs per evaluation frequency.""" return self.epoch_over_eval_frequency def optimize(self, training_set, validation_set): raise NotImplementedError class Emperor(Scheme): """The 'Emporor' optimization schema will make hyperparameter changes based on the loss derrivatives of the validation set. # Arguments optimizer: trainer. A distributed optimizer. evaluate_loss: function. Function which evaluates the loss. This function should accept a model, and a dataframe. num_epoch: int. Total number of epoch. evaluation_frequency: int. Frequency of hyperparameter evaluation. """ def __init__(self, optimizer, evaluate_loss, num_epoch=15, evaluation_frequency=5, loss_threshold=0.005): super(Emperor, self).__init__(optimizer, num_epoch, evaluation_frequency) self.previous_loss = float('inf') self.loss_threshold = loss_threshold self.evaluate_loss = evaluate_loss def optimize(self, training_set, validation_set): trained_model = None # Fetch the number of evaluations, to match the number of epochs. num_evaluations = self.get_epoch_over_evaluation_frequency() + 1 # Iterate over the number of evaluation epochs. for i in range(0, num_evaluations): # Train the model. trained_model = self.optimizer.train(training_set) self.optimizer.set_model(trained_model) # Evaluate the training set, and fetch the loss. 
loss = self.evaluate_loss(trained_model, validation_set) print("Current loss: " + str(loss)) dl = math.fabs(loss - self.previous_loss) self.previous_loss = loss if dl <= self.loss_threshold: print("Lowering learning rate.") print("Old learning rate: " + str(self.optimizer.get_learning_rate())) # Modify the learning rate. learning_rate = self.optimizer.get_learning_rate() learning_rate /= 10 self.optimizer.set_learning_rate(learning_rate) print("New learning rate: "+ str(self.optimizer.get_learning_rate())) return trained_model ================================================ FILE: distkeras/trainers.py ================================================ """Model optimizers. Depending on the implementation, these classes will optimize the Keras model in a distributed manner (with exception of the SingleTrainer).""" ## BEGIN Imports. ############################################################## import numpy as np import threading import time from distkeras.parameter_servers import ADAGParameterServer from distkeras.parameter_servers import DeltaParameterServer from distkeras.parameter_servers import DynSGDParameterServer from distkeras.parameter_servers import ExperimentalParameterServer from distkeras.utils import deserialize_keras_model from distkeras.utils import history_executor from distkeras.utils import history_executors_average from distkeras.utils import pickle_object from distkeras.utils import serialize_keras_model from distkeras.utils import set_keras_base_directory from distkeras.utils import unpickle_object from distkeras.networking import determine_host_address from distkeras.workers import ADAGWorker from distkeras.workers import AEASGDWorker from distkeras.workers import DOWNPOURWorker from distkeras.workers import DynSGDWorker from distkeras.workers import ExperimentalWorker from distkeras.workers import EAMSGDWorker from distkeras.workers import SequentialWorker from keras import backend as K ## END Imports. 
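The plateau rule in `Emperor.optimize` above can be isolated: whenever the absolute change in validation loss drops below `loss_threshold`, the learning rate is divided by 10. A self-contained sketch of that schedule (the function name and loss values are illustrative, not library API):

```python
import math

def anneal(learning_rate, losses, loss_threshold=0.005):
    # Divide the learning rate by 10 whenever the validation loss plateaus,
    # i.e. the absolute change between evaluations falls below the threshold.
    previous_loss = float('inf')
    for loss in losses:
        if math.fabs(loss - previous_loss) <= loss_threshold:
            learning_rate /= 10
        previous_loss = loss
    return learning_rate

# Two plateaus in the loss trace -> the learning rate is divided by 100.
print(anneal(0.1, [0.90, 0.40, 0.399, 0.398]))
```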
################################################################

class Trainer(object):
    """Abstract trainer class. This class provides all base functionality which
    all optimizers need to implement.

    # Arguments
        keras_model: Keras model.
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
                 See: https://keras.io/metrics/
        loss_weights: optional list or dict specifying weights for different losses.
    """

    def __init__(self, keras_model, loss, worker_optimizer, metrics=["accuracy"], loss_weights=None):
        set_keras_base_directory()
        self.master_model = serialize_keras_model(keras_model)
        self.loss = loss
        self.loss_weights = loss_weights
        self.worker_optimizer = worker_optimizer
        self.metrics = metrics
        self.history = []
        self.training_time_start = 0
        self.training_time_end = 0
        self.training_time = 0
        self.max_mini_batches_prefetch = 100

    def set_max_prefetch(self, max_mini_batches):
        """Sets the maximum number of mini-batches that can be prefetched by a worker."""
        self.max_mini_batches_prefetch = max_mini_batches

    def set_model(self, model):
        """Sets the master model to be used by the trainer."""
        self.master_model = serialize_keras_model(model)

    def record_training_start(self):
        """Records the start of the training.

        This private function is called when the training process starts.
        """
        self.training_time = 0
        self.training_time_start = time.time()

    def record_training_end(self):
        """Records the end of the training.

        This private function is called when the training process is terminated.
""" self.training_time_end = time.time() self.training_time = self.training_time_end - self.training_time_start def get_training_time(self): """Returns the told training time.""" return self.training_time def get_history(self): """Returns all history object aggregated during training.""" return self.history def get_averaged_history(self): """Returns the averaged history of the center variable.""" return history_executors_average(self.history) def get_executor_history(self, executor_id): """Returns the history of a specific executor.""" return history_executor(self.history, executor_id) def train(self, dataframe, shuffle=False): """Trains the specified model using the specified dataframe. # Arguments dataframe: dataframe. A Spark Dataframe containing the training data. shuffle: boolean. Tells to shuffle the dataframe before training. Warning: this will tell Spark to shuffle all partitions over the network. It is recommended to shuffle the dataframe before training and store it. """ raise NotImplementedError def serialize(self): return pickle_object(self) class SingleTrainer(Trainer): """An optimizer which will train a network on a single machine. # Arguments keras_model: model. Keras model to train. worker_optimizer: string. String representing worker optimizer. See https://keras.io/optimizers/ loss: string. String representing the loss. See: https://keras.io/objectives/ metrics: list of strings representing model evaluation metrics. Default is ["accuracy"]. See: https://keras.io/metrics/ features_col: string or list of strings. Name(s) of the features column(s). label_col: string or list of strings. Name(s) of the label column(s). num_epoch: int. Number of epochs. batch_size: int. Mini-batch size. loss_weights: optional list or dict specifying weights for different losses. 
""" def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col="features", label_col="label", num_epoch=1, batch_size=32, loss_weights=None): super(SingleTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights) self.features_column = features_col self.label_column = label_col self.num_epoch = num_epoch self.batch_size = batch_size def allocate_worker(self): """Allocates a worker for the Single Trainer instance. Only for internal use. """ worker = SequentialWorker(model=self.master_model, features_col=self.features_column, label_col=self.label_column, batch_size=self.batch_size, num_epoch = self.num_epoch, optimizer=self.worker_optimizer, loss=self.loss, loss_weights=self.loss_weights, metrics = self.metrics) return worker def train(self, dataframe, shuffle=False): """See distkeras.trainers.Trainer.train # Arguments dataframe: dataframe. A Spark Dataframe containing the training data. shuffle: boolean. Tells to shuffle the dataframe before training. Warning: this will tell Spark to shuffle all partitions over the network. It is recommended to shuffle the dataframe before training and store it. """ # Check if the data needs to be shuffled. if shuffle: dataframe = shuffle(dataframe) # Collect the dataframe on a single worker node. dataframe = dataframe.coalesce(1) # Cache the dataframe. dataframe.cache() # Allocate a worker. worker = self.allocate_worker() # Set the maximum number of mini-batches. worker.set_max_prefetch(self.max_mini_batches_prefetch) # Start recording training time. self.record_training_start() # Fetch the trained model. self.master_model = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()[0] # Stop recording of training time. self.record_training_end() return deserialize_keras_model(self.master_model) class AveragingTrainer(Trainer): """A trainer which implements a data parallel technique using model averaging. 
In this implementation, the model replicas are averages after every epoch. # Arguments keras_model: model. Keras model to train. worker_optimizer: string. String representing worker optimizer. See https://keras.io/optimizers/ loss: string. String representing the loss. See: https://keras.io/objectives/ metrics: list of strings representing model evaluation metrics. Default is ["accuracy"]. See: https://keras.io/metrics/ features_col: string or list of strings. Name(s) of the features column(s). label_col: string or list of strings. Name(s) of the label column(s). num_epoch: int. Number of epochs. batch_size: int. Mini-batch size. num_workers: int. Number of model replicas to train in parallel. loss_weights: optional list or dict specifying weights for different losses. """ def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], features_col="features", label_col="label", num_epoch=1, batch_size=32, num_workers=2, loss_weights=None): super(AveragingTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights) self.features_column = features_col self.label_column = label_col self.num_epoch = num_epoch self.batch_size = batch_size self.num_workers = num_workers self.parameter_buffer = np.asarray(keras_model.get_weights()) self.parameter_buffer.fill(0.0) def average_models(self, models): """Averages the specified list of Keras models, and assigns the averaged model as the master model. # Arguments: models: list. A list of serialized Keras models. """ num_models = len(models) # Get all weights of the models. for i in range(0, num_models): weights = np.asarray(deserialize_keras_model(models[i]).get_weights()) self.parameter_buffer += weights # Average the parameters. 
self.parameter_buffer /= num_models temp_model = deserialize_keras_model(self.master_model) temp_model.set_weights(self.parameter_buffer) self.master_model = serialize_keras_model(temp_model) def allocate_worker(self): """Allocates the AveragingWorker for internal use.""" worker = SequentialWorker(model=self.master_model, features_col=self.features_column, label_col=self.label_column, batch_size=self.batch_size, num_epoch = 1, optimizer=self.worker_optimizer, loss=self.loss, loss_weights=self.loss_weights, metrics = self.metrics) return worker def train(self, dataframe, shuffle=False): """Applies model averaging to the model replicas distributed over the specified number of Spark executors. # Arguments dataframe: dataframe: A Spark Dataframe containing the training data. shuffle: boolean. Tells to shuffle the dataframe before training. Warning: this will tell Spark to shuffle all partitions over the network. It is recommended to shuffle the dataframe before training and store it. """ # Repartition the data in order to fit the number of workers. num_partitions = dataframe.rdd.getNumPartitions() # Check if the dataframe needs to be shuffled. if shuffle: dataframe = shuffle(dataframe) # Check if we need to repartition the dataframe. if num_partitions >= self.num_workers: dataframe = dataframe.coalesce(self.num_workers) else: dataframe = dataframe.repartition(self.num_workers) # Start the training procedure. self.record_training_start() for i in range(0, self.num_epoch): worker = self.allocate_worker() # Set the maximum number of mini-batches. worker.set_max_prefetch(self.max_mini_batches_prefetch) models = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect() self.average_models(models) # End the training procedure. self.record_training_end() return deserialize_keras_model(self.master_model) class EnsembleTrainer(Trainer): """Utility trainer which will train ensemble methods in parallel. # Arguments keras_model: model. Keras model to train. 
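`AveragingTrainer.average_models` above reduces to summing each replica's per-layer weight arrays into a zeroed buffer and dividing by the replica count. With plain NumPy arrays standing in for Keras weights (toy values):

```python
import numpy as np

# Three "replicas", each a list of per-layer weight arrays.
replicas = [
    [np.array([1.0, 2.0]), np.array([10.0])],
    [np.array([3.0, 4.0]), np.array([20.0])],
    [np.array([5.0, 6.0]), np.array([30.0])],
]

# Accumulate into a zeroed buffer, then divide: the averaged "master" weights.
buffer = [np.zeros_like(layer) for layer in replicas[0]]
for weights in replicas:
    for i, layer in enumerate(weights):
        buffer[i] += layer
averaged = [layer / len(replicas) for layer in buffer]
print(averaged)  # [array([3., 4.]), array([20.])]
```

The averaged weights are then loaded back into the master model with `set_weights`, and the next epoch starts from that average.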
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
                 See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        batch_size: int. Mini-batch size.
        num_ensembles: int. Number of ensembles to train.
        loss_weights: optional list or dict specifying weights for different losses.

    # Note
        This will not employ a data-parallel approach for the ensembles.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 features_col="features", label_col="label", batch_size=32,
                 num_ensembles=2, loss_weights=None):
        super(EnsembleTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)
        self.features_column = features_col
        self.label_column = label_col
        self.batch_size = batch_size
        self.num_ensembles = num_ensembles

    def allocate_worker(self):
        """Allocates the EnsembleWorker for internal use."""
        # Note: EnsembleTrainer does not take a num_epoch argument, so every
        # ensemble member is trained for a single epoch.
        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,
                                  label_col=self.label_column, batch_size=self.batch_size,
                                  num_epoch=1, optimizer=self.worker_optimizer,
                                  loss=self.loss, loss_weights=self.loss_weights,
                                  metrics=self.metrics)

        return worker

    def train(self, dataframe, shuffle=False):
        """Trains the specified number of ensemble models using the specified dataframe.

        # Arguments
            dataframe: dataframe. A Spark Dataframe containing the training data.
            shuffle: boolean. Tells to shuffle the dataframe before training.
                     Warning: this will tell Spark to shuffle all partitions over
                     the network. It is recommended to shuffle the dataframe
                     before training and store it.
        """
        # Allocate a worker.
        worker = self.allocate_worker()
        # Set the maximum number of mini-batches.
        worker.set_max_prefetch(self.max_mini_batches_prefetch)
        # Repartition in order to fit the number of ensembles.
        num_partitions = dataframe.rdd.getNumPartitions()
        # Check if the dataframe needs to be shuffled before training.
        if shuffle:
            dataframe = shuffle(dataframe)
        # Check if we need to repartition the dataframe. Note: an ensemble
        # trainer partitions by the number of ensembles, not workers.
        if num_partitions >= self.num_ensembles:
            dataframe = dataframe.coalesce(self.num_ensembles)
        else:
            dataframe = dataframe.repartition(self.num_ensembles)
        # Start the training procedure.
        self.record_training_start()
        # Train the models in parallel.
        models = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()
        # End the training procedure.
        self.record_training_end()

        return models


class DistributedTrainer(Trainer):
    """Abstract class which describes the properties of a distributed optimizer.

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics. Default is ["accuracy"].
                 See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.
""" def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2, batch_size=32, features_col="features", label_col="label", num_epoch=1, master_port=5000, loss_weights=None): super(DistributedTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights) self.num_workers = num_workers self.batch_size = batch_size self.features_column = features_col self.label_column = label_col self.num_epoch = num_epoch self.parameter_server = None self.parameter_server_thread = None self.master_host = determine_host_address() self.master_port = master_port self.learning_rate = 1.0 def set_minibatch_size(self, size): """Sets the size of the mini-batch.""" self.batch_size = size def get_minibatch_size(self): """Returns the size of the mini-batch.""" return self.batch_size def get_features_column(self): """Returns the name of the features column.""" return self.features_column def get_label_column(self): """Returns the name of the label column.""" return self.label_column def get_learning_rate(self): """Returns the learning rate of the worker which can be tuned by the parameter server, or optimization scheme. Note: this learning rate is independent of the learning rate of the optimizer. """ return self.learning_rate def set_learning_rate(self, learning_rate): """Sets the learning rate which can be tuned by the parameter server, or optimization scheme. Note: this learning rate is independent of the learning rate of the optimizer. """ self.learning_rate = learning_rate def set_num_epoch(self, num_epoch): """Sets the number of epochs.""" self.num_epoch = num_epoch def get_num_epoch(self): """Returns the number of epochs.""" return self.num_epoch def allocate_worker(self): """Allocates the worker implementation. Implement this method in subclasses. 
""" raise NotImplementedError def set_master(self, master): """Sets the master address of the parameter server.""" self.master_host = master def determine_new_master(self): """Sets the new master address to the current host.""" self.master_host = determine_host_address() def allocate_parameter_server(self): """Allocates the parameter server. If an other type of parameter server is required, you can overwrite this implementation. """ parameter_server = DeltaParameterServer(self.master_model, self.master_port) return parameter_server def set_num_workers(self, num_workers): """Sets the number of parallel workers to use.""" self.num_workers = num_workers def get_num_workers(self): """Returns the number of parallel workers.""" return self.num_workers def num_updates(self): """Returns the number of model updates the parameter server performed.""" return self.parameter_server.num_updates() def service(self): """Executes the parameter server service.""" self.parameter_server.start() self.parameter_server.initialize() self.parameter_server.run() def stop_service(self): """Stops the parameter server service.""" self.parameter_server.stop() self.parameter_server_thread.join() self.parameter_server_thread = None def start_service(self): """Starts the parameter server service.""" # Check if a parameter server thread is already allocated. if not self.parameter_server_thread is None: # Stop the parameter server service. self.stop_service() # Allocate a new parameter service thread. self.parameter_server_thread = threading.Thread(target=self.service) self.parameter_server_thread.start() def train(self, dataframe, shuffle=False): """Training procedure of a distributed optimization process. # Arguments dataframe: dataframe. A Spark Dataframe containing the training data. shuffle: boolean. Tells to shuffle the dataframe before training. Warning: this will tell Spark to shuffle all partitions over the network. It is recommended to shuffle the dataframe before training and store it. 
""" # Check if a parameter server has been allocated. if self.parameter_server is not None: # Cleanup the old parameter server. self.parameter_server.stop() self.parameter_server = None # Allocate the parameter server. self.parameter_server = self.allocate_parameter_server() # Start the communication service. self.start_service() # Allocate a worker. worker = self.allocate_worker() # Set the maximum number of mini-batches. worker.set_max_prefetch(self.max_mini_batches_prefetch) # Repartition in order to fit the number of workers. num_partitions = dataframe.rdd.getNumPartitions() # Check if the dataframe needs to be shuffled before training. if shuffle: dataframe = shuffle(dataframe) # Check if we need to repartition the dataframe. if num_partitions >= self.num_workers: dataframe = dataframe.coalesce(self.num_workers) else: dataframe = dataframe.repartition(self.num_workers) # Cache the dataframe. dataframe.cache() # Start the training procedure. self.record_training_start() # Iterate through the epochs. self.history = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect() # End the training procedure. self.record_training_end() # Stop the communication service. self.stop_service() return self.parameter_server.get_model() class AsynchronousDistributedTrainer(DistributedTrainer): """Abstract class for an asynchronous distributed trainer. This trainer also allows us to set a parallelism factor. This parallelism factor allows us to further parallelize the Spark job. For example, imagine having n machines optimizing a model in an asynchronous distributed setting. If for some, but likely reason, some machines are performing worse compared to others. It will cause the complete learning procedure to be stuck on this one particular machine since every machine will be assigned a single partition. In order to resolve this, we added a parallelization factor. This factor indicates the ratio of the number of jobs per machine (executor). 
    For small dataframes, we recommend setting this factor to 1. The effect
    really becomes prominent when the dataframe is large; in that case we
    recommend a ratio of 2 or 3.

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics.
                 Default is ["accuracy"]. See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.

    # Note
        By default, the parallelization factor is set to 1.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, master_port=5000, loss_weights=None):
        super(AsynchronousDistributedTrainer, self).__init__(
            keras_model, worker_optimizer, loss, metrics, num_workers, batch_size,
            features_col, label_col, num_epoch, master_port, loss_weights)
        # Initialize asynchronous methods variables.
        self.parallelism_factor = 1

    def allocate_worker(self):
        """Allocates the worker implementation.

        Implement this method in subclasses.
        """
        raise NotImplementedError

    def set_parallelism_factor(self, factor):
        """Sets the parallelization factor.

        # Arguments
            factor: int. The new parallelization factor.
        """
        self.parallelism_factor = factor

    def get_parallelism_factor(self):
        """Returns the parallelization factor."""
        return self.parallelism_factor

    def train(self, dataframe, shuffle=False):
        """Training procedure of an asynchronous distributed optimization process.

        # Arguments
            dataframe: dataframe. A Spark Dataframe containing the training data.
            shuffle: boolean. Tells to shuffle the dataframe before training.
                     Warning: this will tell Spark to shuffle all partitions over
                     the network. It is recommended to shuffle the dataframe once
                     before training and to store the shuffled result.
        """
        # Check if a parameter server has been allocated.
        if self.parameter_server is not None:
            # Clean up the old parameter server.
            self.parameter_server.stop()
            self.parameter_server = None
        # Allocate the parameter server.
        self.parameter_server = self.allocate_parameter_server()
        # Start the communication service.
        self.start_service()
        # Allocate a worker.
        worker = self.allocate_worker()
        # Set the maximum number of mini-batches.
        worker.set_max_prefetch(self.max_mini_batches_prefetch)
        # Repartition in order to fit the number of workers.
        num_partitions = dataframe.rdd.getNumPartitions()
        # Check if the dataframe needs to be shuffled before training. As in
        # DistributedTrainer.train, the boolean `shuffle` argument shadows the
        # `shuffle` helper from distkeras.utils, so the helper has to be
        # imported under an alias (e.g. `shuffle_dataframe`).
        if shuffle:
            dataframe = shuffle_dataframe(dataframe)
        # Indicate the parallelism (number of workers times parallelism factor).
        parallelism = self.parallelism_factor * self.num_workers
        # Check if we need to repartition the dataframe.
        if num_partitions >= parallelism:
            dataframe = dataframe.coalesce(parallelism)
        else:
            dataframe = dataframe.repartition(parallelism)
        # Start the training procedure.
        self.record_training_start()
        # Iterate through the epochs.
        self.history = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()
        # End the training procedure.
        self.record_training_end()
        # Stop the communication service.
        self.stop_service()

        return self.parameter_server.get_model()


class AEASGD(AsynchronousDistributedTrainer):
    """Asynchronous Elastic Averaging SGD optimizer.

    Introduced by Zhang et al. https://arxiv.org/pdf/1412.6651.pdf

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics.
                 Default is ["accuracy"]. See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        communication_window: int. Staleness parameter.
                              This parameter describes the number of mini-batches
                              that will be computed before updating the center
                              variable. For EASGD based algorithms we recommend
                              large communication windows.
        learning_rate: float. Learning rate.
        rho: float. Elastic "exploration" variable. Higher values mean that the
             model is allowed to "explore" its surroundings; smaller values are
             correlated with less exploration. We use the value recommended by
             the authors.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, communication_window=32,
                 rho=5.0, learning_rate=0.1, master_port=5000, loss_weights=None):
        super(AEASGD, self).__init__(keras_model, worker_optimizer, loss, metrics,
                                     num_workers, batch_size, features_col,
                                     label_col, num_epoch, master_port, loss_weights)
        self.communication_window = communication_window
        self.rho = rho
        self.learning_rate = learning_rate

    def allocate_worker(self):
        """Allocates the asynchronous EASGD worker."""
        # Allocate an AEASGD worker.
        worker = AEASGDWorker(self.master_model, self.worker_optimizer, self.loss,
                              self.loss_weights, self.metrics, self.features_column,
                              self.label_column, self.batch_size, self.num_epoch,
                              self.master_host, self.master_port, self.rho,
                              self.learning_rate, self.communication_window)

        return worker


class DOWNPOUR(AsynchronousDistributedTrainer):
    """DOWNPOUR optimizer.

    Asynchronous data-parallel optimizer introduced by Dean et al.
    http://static.googleusercontent.com/media/research.google.com/en/archive/large_deep_networks_nips2012.pdf

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics.
                 Default is ["accuracy"]. See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        communication_window: int. Staleness parameter.
                              This parameter describes the number of mini-batches
                              that will be computed before updating the center
                              variable. For DOWNPOUR we recommend small
                              communication windows.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, communication_window=5,
                 master_port=5000, loss_weights=None):
        super(DOWNPOUR, self).__init__(keras_model, worker_optimizer, loss, metrics,
                                       num_workers, batch_size, features_col,
                                       label_col, num_epoch, master_port, loss_weights)
        self.communication_window = communication_window

    def allocate_worker(self):
        """Allocates the DOWNPOUR worker."""
        # Allocate a DOWNPOUR worker.
        worker = DOWNPOURWorker(self.master_model, self.worker_optimizer, self.loss,
                                self.loss_weights, self.metrics, self.features_column,
                                self.label_column, self.batch_size, self.num_epoch,
                                self.master_host, self.master_port,
                                self.communication_window)

        return worker


class EAMSGD(AsynchronousDistributedTrainer):
    """Asynchronous Elastic Averaging with Momentum SGD optimizer.

    Introduced by Zhang et al. https://arxiv.org/pdf/1412.6651.pdf

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See https://keras.io/optimizers/
        loss: string. String representing the loss.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics.
                 Default is ["accuracy"]. See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        communication_window: int. Staleness parameter.
                              This parameter describes the number of mini-batches
                              that will be computed before updating the center
                              variable. For EASGD based algorithms we recommend
                              large communication windows.
        learning_rate: float. Learning rate.
        rho: float. Elastic "exploration" variable. Higher values mean that the
             model is allowed to "explore" its surroundings.
             Smaller values are correlated with less exploration. We use the
             value recommended by the authors.
        momentum: float. Momentum term.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, communication_window=32,
                 rho=5.0, learning_rate=0.1, momentum=0.9, master_port=5000,
                 loss_weights=None):
        super(EAMSGD, self).__init__(keras_model, worker_optimizer, loss, metrics,
                                     num_workers, batch_size, features_col,
                                     label_col, num_epoch, master_port, loss_weights)
        self.communication_window = communication_window
        self.rho = rho
        self.learning_rate = learning_rate
        self.momentum = momentum

    def allocate_worker(self):
        """Allocates the asynchronous EAMSGD worker."""
        # Allocate an EAMSGD worker.
        worker = EAMSGDWorker(self.master_model, self.worker_optimizer, self.loss,
                              self.loss_weights, self.metrics, self.features_column,
                              self.label_column, self.batch_size, self.num_epoch,
                              self.master_host, self.master_port, self.rho,
                              self.learning_rate, self.momentum,
                              self.communication_window)

        return worker


class ADAG(AsynchronousDistributedTrainer):
    """Asynchronous Distributed Adaptive Gradient (Stochastic Gradient Descent).

    Introduced by Hermans et al.

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See: https://keras.io/optimizers/
        loss: string. String representing the loss function.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics.
                 Default is ["accuracy"]. See: https://keras.io/metrics/
        features_col: string or list of strings. Name(s) of the features column(s).
        label_col: string or list of strings. Name(s) of the label column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        communication_window: int. Staleness parameter.
                              This parameter describes the number of mini-batches
                              that will be computed before updating the center
                              variable. For DOWNPOUR based algorithms we recommend
                              large communication windows.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, communication_window=12,
                 master_port=5000, loss_weights=None):
        # Initialize the parent object.
        super(ADAG, self).__init__(keras_model, worker_optimizer, loss, metrics,
                                   num_workers, batch_size, features_col,
                                   label_col, num_epoch, master_port, loss_weights)
        # Set algorithm parameters.
        self.communication_window = communication_window

    def allocate_worker(self):
        """Allocates an ADAG worker."""
        worker = ADAGWorker(self.master_model, self.worker_optimizer, self.loss,
                            self.loss_weights, self.metrics, self.features_column,
                            self.label_column, self.batch_size, self.num_epoch,
                            self.master_host, self.master_port,
                            self.communication_window)

        return worker

    def allocate_parameter_server(self):
        """Allocates the ADAG parameter server."""
        parameter_server = ADAGParameterServer(self.master_model, self.master_port)

        return parameter_server


class DynSGD(AsynchronousDistributedTrainer):
    """Dynamic SGD. Dynamically maintains a learning rate for every worker
    and incorporates staleness.

    Introduced in SIGMOD 2017, "Heterogeneity-aware Parameter Servers".
    http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf

    # Arguments
        keras_model: model. Keras model to train.
        worker_optimizer: string. String representing worker optimizer.
                          See: https://keras.io/optimizers/
        loss: string. String representing the loss function.
              See: https://keras.io/objectives/
        metrics: list of strings representing model evaluation metrics.
                 Default is ["accuracy"]. See: https://keras.io/metrics/
        features_col: string or list of strings.
            Name(s) of the features column(s).
        num_epoch: int. Number of epochs.
        batch_size: int. Mini-batch size.
        num_workers: int. Number of distributed workers.
        communication_window: int. Staleness parameter.
                              This parameter describes the number of mini-batches
                              that will be computed before updating the center
                              variable. For DOWNPOUR based algorithms we recommend
                              large communication windows.
        master_port: int. Port number for the parameter server.
        loss_weights: optional list or dict specifying weights for different losses.
    """

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, communication_window=5,
                 master_port=5000, loss_weights=None):
        # Initialize the parent object.
        super(DynSGD, self).__init__(keras_model, worker_optimizer, loss, metrics,
                                     num_workers, batch_size, features_col,
                                     label_col, num_epoch, master_port, loss_weights)
        # Set algorithm parameters.
        self.communication_window = communication_window

    def allocate_worker(self):
        """Allocates a DynSGD worker."""
        worker = DynSGDWorker(self.master_model, self.worker_optimizer, self.loss,
                              self.loss_weights, self.metrics, self.features_column,
                              self.label_column, self.batch_size, self.num_epoch,
                              self.master_host, self.master_port,
                              self.communication_window)

        return worker

    def allocate_parameter_server(self):
        """Allocates the DynSGD parameter server."""
        parameter_server = DynSGDParameterServer(self.master_model, self.master_port)

        return parameter_server


class Experimental(AsynchronousDistributedTrainer):
    """Experimental optimization scheme for development purposes."""

    def __init__(self, keras_model, worker_optimizer, loss, metrics=["accuracy"],
                 num_workers=2, batch_size=32, features_col="features",
                 label_col="label", num_epoch=1, communication_window=5,
                 learning_rate=1.0, master_port=5000, loss_weights=None):
        # Initialize the parent object.
        super(Experimental, self).__init__(keras_model, worker_optimizer, loss,
                                           metrics, num_workers, batch_size,
                                           features_col, label_col, num_epoch,
                                           master_port, loss_weights)
        # Set the algorithm parameters.
        self.communication_window = communication_window
        self.learning_rate = learning_rate

    def allocate_worker(self):
        """Allocates the experimental worker."""
        worker = ExperimentalWorker(self.master_model, self.worker_optimizer,
                                    self.loss, self.loss_weights, self.metrics,
                                    self.features_column, self.label_column,
                                    self.batch_size, self.num_epoch,
                                    self.master_host, self.master_port,
                                    self.communication_window, self.num_workers,
                                    self.learning_rate)

        return worker

    def allocate_parameter_server(self):
        """Allocates the experimental parameter server."""
        parameter_server = ExperimentalParameterServer(self.master_model,
                                                       self.master_port,
                                                       self.learning_rate)

        return parameter_server


================================================
FILE: distkeras/transformers.py
================================================
"""Commonly used Dataframe transformers.

A transformer will "transform" a Spark dataframe from one form into another.
For example, mapping a column to another value, or adding a column to a
dataframe based on a collection of specified values.
"""

## BEGIN Imports. ##############################################################

import numpy as np

from distkeras.utils import new_dataframe_row
from distkeras.utils import to_one_hot_encoded_dense

from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.linalg import DenseVector

from pyspark.sql.functions import mean
from pyspark.sql.functions import stddev_pop

## END Imports. ################################################################


class Transformer(object):
    """Interface which defines a transformer object."""

    def transform(self, dataframe):
        """Transforms the dataframe into another dataframe.

        # Returns
            The transformed dataframe.
""" raise NotImplementedError class MinMaxTransformer(Transformer): """Will transform every feature of an instance between a specified range. # Arguments o_min: float. Original minimum of dataset. o_max: float. Original maximum of dataset. n_min: float. New minimum of dataset. n_max: float. New maximum of dataset. input_col: string. Name of input column. output_col: string. Name of output column. is_vector. boolean. Indicates if the data element is a vector or a singular value. # Summary New range: [o_min; o_max] Old range: [n_min; n_max] """ def __init__(self, o_min, o_max, n_min, n_max, input_col, output_col, is_vector=True): self.o_min = float(o_min) self.o_max = float(o_max) self.n_min = float(n_min) self.n_max = float(n_max) self.scale = (self.n_max - self.n_min) / (self.o_max - self.o_min) self.input_column = input_col self.output_column = output_col self.is_vector = is_vector def _transform(self, row): """Rescale every instance like this: x' = \frac{x - min}{max - min} """ if self.is_vector: vector = row[self.input_column].toArray() vector = self.scale * (vector - self.o_max) + self.n_max new_value = DenseVector(vector) else: value = row[self.input_column] new_value = self.scale * (value - self.o_max) + self.n_max # Construct a new row with the normalized vector. new_row = new_dataframe_row(row, self.output_column, new_value) return new_row def transform(self, dataframe): """Applies the min-max transformation to every row in the dataframe. # Arguments dataframe: dataframe. Spark Dataframe. """ return dataframe.rdd.map(self._transform).toDF() class BinaryLabelTransformer(Transformer): """Transformers the specified a column to a binary label, i.e., [0, 1] give a specific label name. Given the specified label, this transformer will generate [1,0], in the other case [0,1]. # Arguments: input_column: string. Column name of the label identifier. output_column: string. Name of the new label which contains the binary label. label: string. 
Name of the label which needs to serve as 1. """ def __init__(self, input_column, output_column, label): self.input_column = input_column self.output_column = output_column self.label = label def _transform(self, row): """Appends the desired binary label column.""" value = row[self.input_column] vector = np.zeros(2) # Check if the name matches. if value == self.label: vector[0] = 1.0 else: vector[1] = 1.0 # Convert to a Spark DenseVector vector = DenseVector(vector) return new_dataframe_row(row, self.output_column, vector) def transform(self, dataframe): """Applies the binary label transformation to the applied dataframe. # Arguments dataframe: dataframe. Spark Dataframe. """ return dataframe.rdd.map(self._transform).toDF() class StandardTransformer(Transformer): """Will transform the specified columns to unit standard deviation (if specified), and centers the data to mean 0 (if specified). # Arguments columns: list. List of columns. suffix: string. Suffix name of the column after processing. # Note We assume equal probability of the rows. 
""" def __init__(self, columns, suffix="_normalized"): self.columns = columns self.column_suffix = suffix self.current_column = None self.means = {} self.stddevs = {} def clean_mean_keys(self, means): """Cleans the keys of the specified dictionary (mean).""" new_means = {} for k in means: new_means[k[4:-1]] = means[k] return new_means def clean_stddev_keys(self, stddevs): """Cleans the keys of the specified dictionary (stddev).""" new_stddevs = {} for k in stddevs: new_stddevs[k[11:-5]] = stddevs[k] return new_stddevs def _transform(self, row): """Take the column, and normalize it with the computed means and std devs.""" mean = self.means[self.current_column] stddev = self.stddevs[self.current_column] x = row[self.current_column] x_normalized = (x - mean) / stddev output_column = self.current_column + self.column_suffix new_row = new_dataframe_row(row, output_column, x_normalized) return new_row def transform(self, dataframe): """Applies standardization to the specified columns. # Arguments dataframe: dataframe. Spark Dataframe. """ # Compute the means of the specified columns. means = [mean(x) for x in self.columns] means = dataframe.select(means).collect()[0].asDict() self.means = self.clean_mean_keys(means) # Compute the standard deviation of the specified columns. stddevs = [stddev_pop(x) for x in self.columns] stddevs = dataframe.select(stddevs).collect()[0].asDict() self.stddevs = self.clean_stddev_keys(stddevs) # For every feature, add a new column to the dataframe. for column in self.columns: self.current_column = column dataframe = dataframe.rdd.map(self._transform).toDF() return dataframe class DenseTransformer(Transformer): """Transformes sparse vectors into dense vectors. # Arguments input_col: string. Name of the input column of the sparse vector. output_col: string. Name of the output column. 
""" def __init__(self, input_col, output_col): self.input_column = input_col self.output_column = output_col def _transform(self, row): """Transforms the sparse vector to a dense vector while putting it in a new column.""" sparse_vector = row[self.input_column] dense_vector = DenseVector(sparse_vector.toArray()) new_row = new_dataframe_row(row, self.output_column, dense_vector) return new_row def transform(self, dataframe): """Transforms every sparse vector in the input column to a dense vector. # Arguments dataframe: dataframe. Spark Dataframe. # Returns A transformed Spark Dataframe. """ return dataframe.rdd.map(self._transform).toDF() class ReshapeTransformer(Transformer): """Transforms vectors into other dense shapes. # Note: Only use this transformer in the last stage of the processing pipeline. Since the arbitrary vector shapes will be directly passed on to the models. # Arguments: input_col: string. Name of the input column containing the vector. output_col: string. Name of the output column. shape: tuple. Shape of the matrix. """ def __init__(self, input_col, output_col, shape): self.input_column = input_col self.output_column = output_col self.shape = shape def _transform(self, row): """Transforms the vector to a dense matrix while putting it in a new column.""" vector = row[self.input_column] vector = np.asarray(vector) reshaped = vector.reshape(self.shape).tolist() new_row = new_dataframe_row(row, self.output_column, reshaped) return new_row def transform(self, dataframe): """Transforms every vector in the input column to a dense vector. # Arguments dataframe: dataframe. Spark Dataframe. # Returns A transformed Spark Dataframe. """ return dataframe.rdd.map(self._transform).toDF() class OneHotTransformer(Transformer): """Transformer which transforms an integer index into a vector using one-hot-encoding. # Arguments output_dim: int. Dimension of output vector. input_col: string. Name of input column. output_col: string. Name of output column. 
""" def __init__(self, output_dim, input_col, output_col): self.input_column = input_col self.output_column = output_col self.output_dimensionality = output_dim def _transform(self, row): """Transforms every individual row. Only for internal use. """ label = row[self.input_column] vector = to_one_hot_encoded_dense(label, self.output_dimensionality) new_row = new_dataframe_row(row, self.output_column, vector.tolist()) return new_row def transform(self, dataframe): """Applies One-Hot encoding to every row in the dataframe. # Arguments dataframe: dataframe. A Spark Dataframe. # Returns A Spark Dataframe with one-hot encoded features. """ return dataframe.rdd.map(self._transform).toDF() class LabelIndexTransformer(Transformer): """Transformer which will transform a prediction vector into an integer label. # Arguments output_dim: int. Dimension of output vector. input_col: string. Name of the input column. output_col: string. Name of the output column. default_index: int. Default "answer". activation_threshold: float. Threshold of immediate activation. """ def __init__(self, output_dim, input_col="prediction", output_col="prediction_index", default_index=0, activation_threshold=0.55): self.input_column = input_col self.output_column = output_col self.output_dimensionality = output_dim self.activation_threshold = activation_threshold self.default_index = default_index def get_index(self, vector): """Returns the index with the highest value or with activation threshold.""" max = 0.0 max_index = self.default_index for index in range(0, self.output_dimensionality): if vector[index] >= self.activation_threshold: return index if vector[index] > max: max = vector[index] max_index = index return max_index def _transform(self, row): """Transforms every row by adding a "predicted index" column to the dataframe. 
""" prediction = row[self.input_column] index = float(self.get_index(prediction)) new_row = new_dataframe_row(row, self.output_column, index) return new_row def transform(self, dataframe): """Transforms the dataframe by adding a predicted index. # Arguments dataframe: dataframe. A Spark Dataframe. # Returns A Spark Dataframe with a "predicted" index. """ return dataframe.rdd.map(self._transform).toDF() ================================================ FILE: distkeras/utils.py ================================================ """Utility functions used throughout Distributed Keras.""" ## BEGIN Import. ############################################################### from keras import backend as K from keras.models import model_from_json from keras import backend as K from pyspark.mllib.linalg import DenseVector from pyspark.sql import Row from pyspark.sql.functions import rand import pickle import json import numpy as np import os import pwd ## END Import. ################################################################# def get_os_username(): """Returns the username of user on the operating system. From: http://stackoverflow.com/questions/842059/is-there-a-portable-way-to-get-the-current-username-in-python """ return pwd.getpwuid(os.getuid())[0] def set_keras_base_directory(base_dir='/tmp/' + get_os_username()): """Sets the base directory of Keras.""" K._keras_base_dir = base_dir def to_one_hot_encoded_dense(value, n_dim=2): """Converts the value to a one-hot encoded vector. # Arguments value: float. Value of the single "hot" value. n_dim: int. Dimension of the output vector. 
""" value = int(value) vector = np.zeros(n_dim) vector[value] = 1.0 return vector def new_dataframe_row(old_row, column_name, column_value): """Constructs a new Spark Row based on the old row, and a new column name and value.""" row = Row(*(old_row.__fields__ + [column_name]))(*(old_row + (column_value, ))) return row def json_to_dataframe_row(string): """Converts a JSON String to a Spark Dataframe row.""" dictionary = json.loads(string) row = Row(**dictionary) return row def pickle_object(o): """Pickles the specified model and its weights.""" return pickle.dumps(o, -1) def unpickle_object(string): """Unpickles the specified string into a model.""" return pickle.loads(string) def serialize_keras_model(model): """Serializes the specified Keras model into a dictionary.""" dictionary = {} dictionary['model'] = model.to_json() dictionary['weights'] = model.get_weights() return dictionary def history_executors_average(history): """Returns the averaged training metrics for all the executors.""" max_iteration = max(history, key=lambda x: x['iteration'])['iteration'] max_executor = max(history, key=lambda x: x['worker_id'])['worker_id'] histories = [] averaged_history = [] # Fetch the histories of the individual executors. for i in range(0, max_executor): histories.append(history_executor(history, i)) # Construct the averaged history. for i in range(0, max_iteration): num_executors = 0 sum = np.zeros(2) for j in range(0, max_executor): if len(histories[j]) - 1 >= i: num_executors += 1 sum += histories[j][i]['history'] # Average the history. 
sum /= num_executors averaged_history.append(sum) return averaged_history def history_executor(history, id): """Returns the history of a specific executor.""" executor_history = [h for h in history if h['worker_id'] == id] executor_history.sort(key=lambda x: x['iteration']) return executor_history def deserialize_keras_model(dictionary): """Deserialized the Keras model using the specified dictionary.""" architecture = dictionary['model'] weights = dictionary['weights'] model = model_from_json(architecture) model.set_weights(weights) return model def uniform_weights(model, constraints=[-0.5, 0.5]): """Initializes the parameters of the specified Keras model with uniform weights between the specified ranges. # Arguments model: Keras model. constraints: array. An array with two elements which defines the range of the uniform initalization. """ # We assume the following: Keras will return a list of weight matrices. # All layers, even the activiation layers, will be randomly initialized. weights = model.get_weights() for layer in weights: shape = layer.shape if len(shape) > 1: # Fill the matrix with random numbers. n_rows = shape[0] n_columns = shape[1] for i in range(0, n_rows): for j in range(0, n_columns): layer[i][j] = np.random.uniform(low=constraints[0], high=constraints[1]) else: # Fill the vector with random numbers. n_elements = shape[0] for i in range(0, n_elements): layer[i] = np.random.uniform(low=constraints[0], high=constraints[1]) # Set the new weights in the model. model.set_weights(weights) def shuffle(dataset): """Shuffles the rows in the specified Spark Dataframe. # Arguments dataset: dataframe. A Spark Dataframe. """ dataset = dataset.orderBy(rand()) dataset.cache() return dataset def precache(dataset, num_workers): """Precaches the specified dataset. Make sure the specified dataframe has the desired partitioning scheme. # Arguments dataset: dataframe. A Spark Dataframe. num_workers: int. Number of workers you are going to use. 
""" dataset = dataset.repartition(num_workers) dataset.cache() dataset.count() return dataset ================================================ FILE: distkeras/workers.py ================================================ """Workers module. This module contains all worker specific implementations for different optimization algorithms. """ ## BEGIN Imports. ############################################################## from distkeras.networking import connect from distkeras.networking import recv_data from distkeras.networking import send_data from distkeras.utils import deserialize_keras_model from distkeras.utils import serialize_keras_model from distkeras.utils import set_keras_base_directory from distkeras.utils import shuffle from distkeras.utils import uniform_weights from keras.optimizers import Optimizer, serialize, deserialize import keras.backend as K from itertools import tee from multiprocessing import Pool import numpy as np import threading import tensorflow as tf import sys # "queue" module in python 3 is named "Queue" in python 2 use_python3 = sys.version_info[0] == 3 if use_python3: import queue else: import Queue as queue import random import socket import time ## END Imports. ################################################################ class Worker(object): """Abstract class of a worker. This class provides basic functionality and properties all workers share. 
""" def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label", batch_size=32, num_epoch=1, learning_rate=1.0): assert isinstance(optimizer, (str, Optimizer)), "'optimizer' must be a string or a Keras Optimizer instance" assert isinstance(features_col, (str, list)), "'features_col' must be a string or a list of strings" assert isinstance(label_col, (str, list)), "'label_col' must be a string or a list of strings" self.model = model self.optimizer = {'class_name': optimizer, 'config': {}} if isinstance(optimizer, str) else serialize(optimizer) self.loss = loss self.loss_weights = loss_weights self.metrics= metrics self.features_column = [features_col] if isinstance(features_col, str) else features_col self.label_column = [label_col] if isinstance(label_col, str) else label_col self.batch_size = batch_size self.num_epoch = num_epoch self.max_mini_batches = 100 self.prefetching_thread = None self.mini_batches = None self.is_prefetching = True self.worker_id = -1 self.learning_rate = learning_rate self.num_inputs = len(self.features_column) self.num_outputs = len(self.label_column) self.current_epoch = 0 def set_max_prefetch(self, max_mini_batches): """Sets the maximum number of mini-batches that can be prefetched.""" self.max_mini_batches = max_mini_batches def set_learning_rate(self, learning_rate): """Sets the learning rate of the worker.""" self.learning_rate = learning_rate def get_learning_rate(self): """Returns the learning rate of the worker.""" return self.learning_rate def set_worker_id(self, worker_id): """Sets the worker id. # Arguments worker_id: int. Worker identifier. """ self.worker_id = worker_id def get_worker_id(self): """Returns the worker id.""" return self.worker_id def prepare_model(self): """Prepares the model for training.""" # Set the Keras directory. 
set_keras_base_directory() if K.backend() == 'tensorflow': # set GPU option allow_growth to False for GPU-enabled tensorflow config = tf.ConfigProto() config.gpu_options.allow_growth = False sess = tf.Session(config=config) K.set_session(sess) # Deserialize the Keras model. self.model = deserialize_keras_model(self.model) self.optimizer = deserialize(self.optimizer) # Compile the model with the specified loss and optimizer. self.model.compile(loss=self.loss, loss_weights = self.loss_weights, optimizer=self.optimizer, metrics=self.metrics) def get_next_minibatch(self): """Returns the next mini-batch.""" return self.mini_batches.get(timeout=10) def start_prefetching_thread(self, iterator): """Starts the data prefetching thread.""" self.mini_batches = queue.Queue() self.iterator = iterator self.prefetching_thread = threading.Thread(target=self.prefetching) self.prefetching_thread.start() def prefetching(self): partition_iterators_all_epochs = tee(self.iterator, self.num_epoch) for iter_one_epoch in partition_iterators_all_epochs: self.current_epoch += 1 self.is_prefetching = True try: while self.is_prefetching: if self.mini_batches.qsize() < self.max_mini_batches: batch = [next(iter_one_epoch) for _ in range(self.batch_size)] batch_iterator_copies = tee(batch, self.num_inputs + self.num_outputs) feature_iterators = batch_iterator_copies[:self.num_inputs] label_iterators = batch_iterator_copies[self.num_inputs:] X = [np.asarray([x[self.features_column[i]] for x in iterator]) for i, iterator in enumerate(feature_iterators)] Y = [np.asarray([x[self.label_column[i]] for x in iterator]) for i, iterator in enumerate(label_iterators)] self.mini_batches.put([X, Y]) except Exception as e: print(e) self.is_prefetching = False def optimize(self): """Optimization procedure of a worker.""" raise NotImplementedError def train(self, worker_id, iterator): """Training procedure for the worker node. # Arguments worker_id: int. Partition index provided by Spark. 
Can be used as a worker_id. iterator: iterator. Data iterator. """ # Prepare the optimization procedure. self.start_prefetching_thread(iterator) self.set_worker_id(worker_id) self.prepare_model() # Start the optimization procedure. try: self.optimize() except Exception as e: # Stop the prefetching process. self.is_prefetching = False print(e) # Wait for the prefetching thread to stop. self.prefetching_thread.join() return iter([serialize_keras_model(self.model)]) class SequentialWorker(Worker): """Implementation for sequential gradient updates on a single worker. Will train a model on a single worker node. """ def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label", batch_size=32, num_epoch=1): # Initialize the parent class. super(SequentialWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch) def optimize(self): """Training procedure with sequential gradient updates. # Returns Trained serialized Keras model. 
""" while True: X, Y = self.get_next_minibatch() h = self.model.train_on_batch(X, Y) self.add_history(h) class NetworkWorker(Worker): """Abstract class of a worker who shares the variables using the network.""" def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label", batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, learning_rate=1.0): super(NetworkWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch, learning_rate) self.master_host = master_host self.master_port = master_port self.socket = None self.center_variable = None self.disable_nagle = True self.training_history = [] self.worker_id = 0 def connect(self): """Connect with the remote parameter server.""" self.socket = connect(self.master_host, self.master_port, self.disable_nagle) def pull(self): """Requests the center variable from the parameter server.""" # Request a pull from the parameter server. self.socket.sendall(b'p') # Fetch the center variable from the parameter server. self.center_variable = np.asarray(recv_data(self.socket)) def commit(self, residual): """Sends the gradient residual to the parameter server.""" # Prepare the datastructure. data = {} data['worker_id'] = self.get_worker_id() data['delta'] = residual # Request a commit from the parameter server. self.socket.sendall(b'c') # Send the data to the paramter server. send_data(self.socket, data) def set_tcp_no_delay(self, flag): """Disables or enables Nagle's algorithm. (True -> TCP_NODELAY = 1) (False -> TCP_NODELAY = 0) # Arguments: flag: boolean. Indicates if Nagle's algorithm should be disabled. """ self.disable_nagle = flag def tcp_no_delay(self): """Returns the value TCP_NODELAY of the flag (Nagle's algorithm). # Returns True, if Nagle's algorithm is disabled. False otherwise. 
""" return self.disable_nagle def get_master_host(self): """Returns the host address of the master parameter server.""" return self.master_host def get_master_port(self): """Returns the port of the master parameter server.""" return self.master_port def add_history(self, h): """Appends the specified history data.""" d = {} d['history'] = h d['worker_id'] = self.worker_id d['iteration'] = self.iteration d['timestamp'] = time.time() self.training_history.append(d) def optimize(self): """Optimization procedure of a network worker.""" raise NotImplementedError def train(self, worker_id, iterator): """Training procedure of a networked worker with a parameter server.""" self.start_prefetching_thread(iterator) self.set_worker_id(worker_id) self.prepare_model() self.connect() self.pull() self.model.set_weights(self.center_variable) try: self.optimize() except Exception as e: # Stop the prefetching process. self.is_prefetching = False print(e) self.socket.close() self.prefetching_thread.join(timeout=1) return iter(self.training_history) class ADAGWorker(NetworkWorker): """Implements the training procedure for ADAG. Introduced by Hermans et al. """ def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label", batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=5): # Initialize the parent object. super(ADAGWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch, master_host, master_port) # Initialize ADAG parameters. self.communication_window = communication_window self.iteration = 1 def commit(self, residual): """Sends the gradient residual to the parameter server.""" # Prepare the datastructure. data = {} data['worker_id'] = self.get_worker_id() data['residual'] = residual # Request a commit from the parameter server. self.socket.sendall(b'c') # Send the data to the paramter server. 
send_data(self.socket, data) def optimize(self): """Optimization procedure of ADAG.""" W1 = np.asarray(self.model.get_weights()) while True: X, Y = self.get_next_minibatch() h = self.model.train_on_batch(X, Y) self.add_history(h) if self.iteration % self.communication_window == 0: W2 = np.asarray(self.model.get_weights()) delta = W2 - W1 delta /= self.communication_window self.commit(delta) self.pull() self.model.set_weights(self.center_variable) W1 = self.center_variable self.iteration += 1 class DOWNPOURWorker(NetworkWorker): """Implements the training procedure for the distributed DOWNPOUR optimizer. Introduced by Dean et al. http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf """ def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label", batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=3): # Initialize the parent object. super(DOWNPOURWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch, master_host, master_port) self.communication_window = communication_window self.iteration = 1 def optimize(self): """Specific optimization procedure for DOWNPOUR.""" W1 = np.asarray(self.model.get_weights()) while True: X, Y = self.get_next_minibatch() if self.iteration % self.communication_window == 0: W2 = np.asarray(self.model.get_weights()) delta = W2 - W1 self.commit(delta) self.pull() self.model.set_weights(self.center_variable) W1 = self.center_variable h = self.model.train_on_batch(X, Y) self.add_history(h) self.iteration += 1 class AEASGDWorker(NetworkWorker): """Implementation of asynchronous EASGD worker. Introduced by Zhang et al. 
https://arxiv.org/pdf/1412.6651.pdf """ def __init__(self, model, optimizer, loss, loss_weights, metrics=['accuracy'], features_col="features", label_col="label", batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, rho=5.0, learning_rate=0.01, communication_window=32): # Initialize the parent object. super(AEASGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch, master_host, master_port) # Initialize AEASGD specific variables. self.rho = rho self.learning_rate = learning_rate self.communication_window = communication_window self.alpha = self.rho * self.learning_rate self.iteration = 1 def optimize(self): """Specific training procedure for AEASGD.""" while True: X, Y = self.get_next_minibatch() if self.iteration % self.communication_window == 0: self.pull() W = np.asarray(self.model.get_weights()) E = self.alpha * (W - self.center_variable) W = W - E self.model.set_weights(W) self.commit(E) h = self.model.train_on_batch(X, Y) self.add_history(h) self.iteration += 1 class EAMSGDWorker(NetworkWorker): """Worker implementation of Asynchronous EA Momentum SGD. Introduced by Zhang et al. https://arxiv.org/pdf/1412.6651.pdf """ def __init__(self, model, optimizer, loss, loss_weights, metrics=['accuracy'], features_col="features", label_col="label", batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, rho=5.0, learning_rate=0.01, momentum=0.9, communication_window=32): # Initialize the parent object. super(EAMSGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch, master_host, master_port) # Initialize EAMSGD specific variables. 
self.rho = rho self.learning_rate = learning_rate self.momentum = momentum self.communication_window = communication_window self.alpha = self.learning_rate * self.rho self.iteration = 1 def optimize(self): """Specific training procedure of asynchronous EAMSGD.""" r = np.asarray(self.model.get_weights()) r.fill(0.0) while True: X, Y = self.get_next_minibatch() if self.iteration % self.communication_window == 0: self.pull() W = np.asarray(self.model.get_weights()) E = self.alpha * (W - self.center_variable) W = W - E self.model.set_weights(W) self.commit(E) r_t = self.momentum * r W_copy = np.asarray(self.model.get_weights()) W = np.asarray(self.model.get_weights()) W += r_t self.model.set_weights(W) h = self.model.train_on_batch(X, Y) self.add_history(h) gradient = np.asarray(self.model.get_weights()) - W r = r_t - self.learning_rate * gradient W_copy -= r self.model.set_weights(W_copy) self.iteration += 1 class DynSGDWorker(NetworkWorker): """Implements the training procedure for DynSGD.""" def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features", label_col="label", batch_size=32, num_epoch=1, master_host="localhost", master_port=5000, communication_window=5): # Initialize the parent object. super(DynSGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col, batch_size, num_epoch, master_host, master_port) # Initialize DynSGD parameters. self.communication_window = communication_window self.iteration = 1 self.last_update = 0 def pull(self): """Requests the center variable and last update from the parameter server.""" # Request a pull from the parameter server. self.socket.sendall(b'p') # Fetch the dictionary from the parameter server. data = recv_data(self.socket) self.center_variable = np.asarray(data['model']) self.last_update = data['update'] def commit(self, residual): """Sends the gradient residual to the parameter server.""" # Prepare the datastructure. 
        data = {}
        data['worker_id'] = self.get_worker_id()
        data['residual'] = residual
        data['last_update'] = self.last_update
        # Request a commit from the parameter server.
        self.socket.sendall(b'c')
        # Send the data to the parameter server.
        send_data(self.socket, data)

    def optimize(self):
        """Optimization procedure of DynSGD."""
        W1 = np.asarray(self.model.get_weights())
        while True:
            X, Y = self.get_next_minibatch()
            h = self.model.train_on_batch(X, Y)
            self.add_history(h)
            if self.iteration % self.communication_window == 0:
                W2 = np.asarray(self.model.get_weights())
                delta = W2 - W1
                self.commit(delta)
                self.pull()
                self.model.set_weights(self.center_variable)
                W1 = self.center_variable
            self.iteration += 1


class ExperimentalWorker(NetworkWorker):
    """Implements an experimental training procedure based on ADAG.

    ADAG was introduced by Hermans et al.
    """

    def __init__(self, model, optimizer, loss, loss_weights, metrics=["accuracy"], features_col="features",
                 label_col="label", batch_size=32, num_epoch=1, master_host="localhost",
                 master_port=5000, communication_window=5, num_workers=2, learning_rate=1.0):
        # Initialize the parent object.
        super(ExperimentalWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col,
                                                 label_col, batch_size, num_epoch, master_host, master_port,
                                                 learning_rate)
        # Initialize the experimental parameters.
        self.communication_window = communication_window
        self.num_workers = num_workers
        self.current_num_workers = self.num_workers
        self.inverse_learning_rate = 1 / self.learning_rate
        self.iteration = 1

    def commit(self, residual):
        """Sends the gradient residual to the parameter server."""
        # Prepare the datastructure.
        data = {}
        data['worker_id'] = self.get_worker_id()
        data['residual'] = residual
        data['stale_center_variable'] = self.center_variable
        # Request a commit from the parameter server.
        self.socket.sendall(b'c')
        # Send the data to the parameter server.
        send_data(self.socket, data)

    def pull(self):
        """Requests the center variable from the parameter server."""
        # Request a pull from the parameter server.
        self.socket.sendall(b'p')
        # Fetch the center variable from the parameter server.
        self.center_variable = np.asarray(recv_data(self.socket))

    def optimize(self):
        """Optimization procedure of the experimental worker."""
        W1 = np.asarray(self.model.get_weights())
        while True:
            X, Y = self.get_next_minibatch()
            h = self.model.train_on_batch(X, Y)
            self.add_history(h)
            if self.iteration % self.communication_window == 0:
                W2 = np.asarray(self.model.get_weights())
                delta = W2 - W1
                delta /= self.communication_window
                self.commit(delta)
                self.pull()
                self.model.set_weights(self.center_variable)
                W1 = self.center_variable
            self.iteration += 1



================================================
FILE: docs/index.md
================================================
# Distributed Keras

Distributed Keras (DK) is a **distributed deep learning framework** built on top of Apache Spark and Keras, with the goal of significantly reducing training time through distributed machine learning algorithms. We designed the framework in such a way that a developer can implement a new distributed optimizer with ease, enabling them to focus on research and model development.

Most of our methods follow the data parallel approach described in the paper on [Large Scale Distributed Deep Networks](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf). In this paradigm, replicas of a model are distributed over several "trainers", and every model replica is trained on a different partition of the dataset. The gradient (or all network weights, depending on the implementation details) is communicated with the parameter server after every gradient update.
The parameter server is responsible for handling the gradient updates of all workers and incorporating them into a single master model, which is returned to the user after the training procedure is complete.

## Installation

We rely on [Keras](https://keras.io) for the construction of models, and thus inherit the Keras dependencies. Furthermore, PySpark is also a dependency of this project, since DK uses Apache Spark to distribute the data and the model replicas.

### Pip

You can use `pip` if you only need the DK framework without the examples.

```bash
pip install git+https://github.com/JoeriHermans/dist-keras.git
```

### Git

However, if you would like to play with the examples and notebooks, install the framework using the approach described below.

```bash
git clone https://github.com/JoeriHermans/dist-keras
cd dist-keras
pip install -e .
```

## Getting Started

We recommend starting with the `workflow` notebook located in the `examples` directory. This Python notebook will guide you through all the general steps you need to perform. These include setting up a Spark Context, reading the data, applying preprocessing, and training and evaluating your model in a distributed way.

!!! Note
    The **workflow.ipynb** notebook can be run on your local machine. However, we recommend running it on a Spark cluster, since the distributed trainers only start to outperform the *SingleTrainer* when the number of workers (cores multiplied by executors) exceeds roughly 10.

## Support

For issues, bugs, questions, and suggestions, please use the appropriate channels on [GitHub](https://github.com/JoeriHermans/dist-keras/).

After the installation process is complete, you can start exploring the functionality by browsing the examples. We have also prepared a notebook which compares the different distributed optimizers with each other. This notebook is located at `examples/experiment.ipynb`.
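The data-parallel scheme described above can be sketched in plain NumPy. This is an illustrative toy, not the dist-keras API: the function names `worker_round` and `server_apply` and the toy learning rates are invented for the example. Each worker pulls the center variable, performs a few local updates, and commits the accumulated delta; the parameter server incorporates the deltas into the master model.

```python
import numpy as np


def server_apply(center, delta, learning_rate=1.0):
    """Parameter-server side: incorporate a worker's delta into the center variable."""
    return center + learning_rate * delta


def worker_round(center, gradients, communication_window=5):
    """Worker side: start from the pulled center variable, apply local SGD
    steps, and return the accumulated delta to commit."""
    w = center.copy()
    for g in gradients[:communication_window]:
        w -= 0.1 * g       # local SGD step with a toy learning rate
    return w - center      # residual committed to the server


center = np.zeros(3)
# Two workers compute different local gradients on their own partitions.
delta_a = worker_round(center, [np.ones(3)] * 5)
delta_b = worker_round(center, [2 * np.ones(3)] * 5)
for d in (delta_a, delta_b):
    center = server_apply(center, d)
print(center)  # both deltas have been folded into the master model
```

The real workers differ mainly in *when* they communicate (the `communication_window`) and in what extra metadata they attach to a commit (e.g. staleness information in DynSGD).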
Other examples are also provided, showing you how to use the different distributed optimizers together with Apache Spark for distributed preprocessing.

## References

* Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693).

* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.

* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231).

* Pumperla, M. (2015). Elephas. GitHub repository: https://github.com/maxpumperla/elephas/.

## Licensing

![GPLv3](images/gpl_v3.png) ![CERN](images/cern_logo.jpg)



================================================
FILE: docs/license.md
================================================
# GNU General Public License

**Version 3, 29 June 2007**

Copyright (C) 2007 Free Software Foundation, Inc.

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

## Preamble

The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price.
Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. 
If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. ## Terms And Conditions 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. 
Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. 
However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. 
Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. 
b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. 
b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. 
A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. 
But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. 
If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. 
Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. 
The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. 
"Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. 
If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. 
If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. 
If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Distributed Deep Learning using Keras and Apache Spark. Copyright (C) 2016 Joeri Hermans This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: Distributed Keras Copyright (C) 2016 Joeri Hermans This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 
This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box". You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see . The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read .

================================================ FILE: docs/optimizers.md ================================================

# Optimizers

Optimizers, or trainers, are the main component in Distributed Keras (DK). All trainers share a single interface, the `Trainer` class defined in `distkeras/trainers.py`. This class also holds the serialized model, the loss, and the Keras optimizer the workers need to use. Generally, a trainer will run on a single worker. In the context of Apache Spark, this means that the thread responsible for executing the `foreachPartition` or `mapPartitions` operation is assigned a trainer. In reality, however, training the model itself will utilise more physical cores; in fact, it will employ all available cores, thus bypassing resource managers such as YARN.

## Single Trainer

A single trainer is simply a trainer which uses a single thread (as discussed above) to train a model. It is usually used as a baseline metric for new distributed optimizers.
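Conceptually, this baseline is nothing more than plain sequential mini-batch gradient descent. As an illustration of what a single worker does, here is a toy NumPy sketch on a linear-regression objective — no Spark or Keras involved, and all names and values are purely illustrative:

```python
import numpy as np

# Toy stand-in for a single-worker baseline: sequential mini-batch SGD
# on a noise-free linear regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
eta, batch_size = 0.1, 32
for epoch in range(20):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)  # gradient of the mean squared error
        w -= eta * grad

# w is now close to true_w; the distributed optimizers are benchmarked
# against exactly this kind of sequential loop.
```

The distributed trainers parallelize this loop over Spark partitions while keeping the per-worker update essentially identical.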
```python
SingleTrainer(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_epoch=1,
              batch_size=32, features_col="features", label_col="label")
```

**Parameters**:

- **keras_model**: The Keras model which should be trained.
- **worker_optimizer**: Keras optimizer for the workers.
- **loss**: Loss function the workers should use.
- **metrics**: List of Keras metrics to evaluate during training.
- **num_epoch**: Number of epoch iterations over the data.
- **batch_size**: Mini-batch size.
- **features_col**: Column of the feature vector in the Spark DataFrame.
- **label_col**: Column of the label in the Spark DataFrame.

## EASGD

The distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) and the master to perform exploitation. This approach differs from other settings explored in the literature, which focus on how fast the center variable converges [(paper)](https://arxiv.org/pdf/1412.6651.pdf). We want to note that the basic version of EASGD is a synchronous algorithm, i.e., once a worker is done processing a batch of the data, it will wait until all other workers have submitted their variables (this includes the weight parameterization, iteration number, and worker id) to the parameter server before starting the next data batch.

```python
EASGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2,
      features_col="features", label_col="label", rho=5.0, learning_rate=0.01,
      batch_size=32, num_epoch=1, master_port=5000)
```

**Parameters**: TODO

## Asynchronous EASGD

In this section we propose the asynchronous version of EASGD. Instead of waiting for the synchronization of the other trainers, this method communicates the elastic difference (as described in the paper) with the parameter server. The only synchronization mechanism that has been implemented is to ensure no race conditions occur when updating the center variable.
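To make the role of the elastic difference concrete, the core EASGD update can be sketched in a few lines of NumPy on a toy one-dimensional objective. The `eta` and `rho` values here are illustrative, not the library defaults:

```python
import numpy as np

# Toy EASGD sketch: two workers minimise f(x) = 0.5 * x^2, while the elastic
# term pulls each worker towards the center variable maintained by the master.
eta, rho = 0.1, 0.5   # learning rate and elasticity; illustrative values only
alpha = eta * rho     # the "moving rate" from the EASGD paper

workers = [np.array([4.0]), np.array([-2.0])]
center = np.array([0.0])

for _ in range(250):
    for i, x in enumerate(workers):
        grad = x                                    # gradient of 0.5 * x^2
        diff = x - center                           # the elastic difference
        workers[i] = x - eta * grad - alpha * diff  # worker: explore, but stay elastic
        center = center + alpha * diff              # master: exploit the worker's progress

# Both workers and the center variable settle near the minimiser at 0.
```

In the asynchronous variant each worker performs this exchange with the parameter server on its own schedule; the only locking needed is around the center-variable update.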
```python
AsynchronousEASGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2,
                  batch_size=1000, features_col="features", label_col="label",
                  communication_window=3, rho=0.01, learning_rate=0.01, master_port=5000,
                  num_epoch=1)
```

**Parameters**: TODO

## Asynchronous EAMSGD

Asynchronous EAMSGD is a variant of asynchronous EASGD. It is based on Nesterov's momentum scheme, where the update of the local worker is modified to incorporate a momentum term.

```python
AsynchronousEAMSGD(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2,
                   batch_size=32, features_col="features", label_col="label",
                   communication_window=10, rho=5.0, learning_rate=0.01, momentum=0.9,
                   master_port=5000, num_epoch=1)
```

**Parameters**: TODO

## DOWNPOUR

An asynchronous stochastic gradient descent procedure which supports a large number of model replicas and leverages adaptive learning rates. This implementation is based on the pseudocode provided by [Zhang et al.](https://arxiv.org/pdf/1412.6651.pdf)

```python
DOWNPOUR(keras_model, worker_optimizer, loss, metrics=["accuracy"], num_workers=2,
         batch_size=1000, features_col="features", label_col="label",
         communication_window=5, master_port=5000, num_epoch=1, learning_rate=0.01)
```

**Parameters**: TODO

## Custom distributed optimizer

TODO

### Synchronized Distributed Trainer

TODO

### Asynchronous Distributed Trainer

TODO

### Implementing a custom worker

TODO

================================================ FILE: examples/cifar-10-preprocessing.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CIFAR-10 Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Data Science & Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we download the
CIFAR-10 dataset, and prepare it in such a way it can be processed by Spark." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import cPickle as pickle\n", "\n", "import csv\n", "\n", "import numpy as np\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import OneHotEncoder\n", "\n", "from distkeras.trainers import *\n", "from distkeras.predictors import *\n", "from distkeras.transformers import *\n", "from distkeras.evaluators import *\n", "from distkeras.utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading and decompressing the dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2017-01-26 15:42:04-- https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\n", "Resolving www.cs.toronto.edu... 128.100.3.30\n", "Connecting to www.cs.toronto.edu|128.100.3.30|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 170498071 (163M) [application/x-gzip]\n", "Saving to: “cifar-10-python.tar.gz”\n", "\n", "100%[======================================>] 170,498,071 4.88M/s in 33s \n", "\n", "2017-01-26 15:42:40 (4.89 MB/s) - “cifar-10-python.tar.gz” saved [170498071/170498071]\n", "\n" ] } ], "source": [ "!rm cifar-10-python.tar.gz\n", "!rm -r cifar-10-batches-py\n", "!wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cifar-10-batches-py/\n", "cifar-10-batches-py/data_batch_4\n", "cifar-10-batches-py/readme.html\n", "cifar-10-batches-py/test_batch\n", "cifar-10-batches-py/data_batch_3\n", "cifar-10-batches-py/batches.meta\n", "cifar-10-batches-py/data_batch_2\n", "cifar-10-batches-py/data_batch_5\n", "cifar-10-batches-py/data_batch_1\n" ] } ], "source": [ "!tar -xvzf cifar-10-python.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the dataset in memory for further processing" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of training instances: 50000\n" ] } ], "source": [ "# Define the required datastructures.\n", "training_instances = []\n", "training_labels = []\n", "\n", "# Iterate through all training batches, and load them in memory.\n", "for i in range(1, 6):\n", " path = \"cifar-10-batches-py/data_batch_\" + str(i)\n", " fd = open(path, \"rb\")\n", " d = pickle.load(fd)\n", " fd.close()\n", " # Add the training data to our datastructures.\n", " num_instances = len(d['data'])\n", " for j in range(0, num_instances):\n", " training_instances.append(d['data'][j])\n", " training_labels.append(d['labels'][j])\n", " \n", "print(\"Number of training instances: \" + str(len(training_instances)))" ] }, { "cell_type": "code", "execution_count": 5, 
"metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of test instances: 10000\n" ] } ], "source": [ "# Define the required datastructures.\n", "test_instances = []\n", "test_labels = []\n", "\n", "# Load the test batch.\n", "path = \"cifar-10-batches-py/test_batch\"\n", "fd = open(path, \"rb\")\n", "d = pickle.load(fd)\n", "fd.close()\n", "# Add the test set to our datastructures.\n", "num_instances = len(d['data'])\n", "for j in range(0, num_instances):\n", " test_instances.append(d['data'][j])\n", " test_labels.append(d['labels'][j])\n", " \n", "print(\"Number of test instances: \" + str(len(test_instances)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have the training and test set in memory. We now basically have two options to prepare it for Apache Spark. First, we simply \"parallelize\" the data, and continue from there. However, this requires some additional logic. The second approach is to write it to a file which Spark will be able to read (CSV, Parquet, Avro...). Due to the simplicity of the second approach, we will choose to write the contents of our datastructures to a CSV file." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of columns: 3073\n" ] } ], "source": [ "# First, prepare the column names.\n", "columns = ['label']\n", "# Now, add the pixel column names.
Note, first 1024 pixels are red, then green and finally blue.\n", "for c in ['r','g','b']:\n", " for i in range(0, 1024):\n", " column_name = \"p_\" + str(i) + \"_\" + c\n", " columns.append(column_name)\n", " \n", "# Now, we should have 3072 (data) + 1 (label) column names.\n", "print(\"Number of columns: \" + str(len(columns)))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size training set: 50000\n", "Size test set: 10000\n" ] } ], "source": [ "training_set = []\n", "test_set = []\n", "\n", "# Prepare the training set.\n", "for i in range(0, len(training_instances)):\n", " row = np.insert(training_instances[i], 0, training_labels[i])\n", " training_set.append(row)\n", "\n", "# Prepare the test set.\n", "for i in range(0, len(test_instances)):\n", " row = np.insert(test_instances[i], 0, test_labels[i])\n", " test_set.append(row)\n", " \n", "print(\"Size training set: \" + str(len(training_set)))\n", "print(\"Size test set: \" + str(len(test_set)))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def save(path, columns, dataset):\n", " with open(path, 'wb') as f:\n", " w = csv.writer(f)\n", " # Write the columns.\n", " w.writerow(columns)\n", " # Iterate through all instances in the training set.\n", " n = len(dataset)\n", " for i in range(0, n):\n", " w.writerow(dataset[i].tolist())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Save the datasets to disk.\n", "save(\"cifar-10-training.csv\", columns, training_set)\n", "save(\"cifar-10-test.csv\", columns, test_set)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cifar-10-test.csv\r\n", "cifar-10-training.csv\r\n" ] } ], "source": [ "# Confirming that produced CSV's 
are present.\n", "!ls | grep cifar | grep csv" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Deleted data/cifar-10-training.csv\n", "Deleted data/cifar-10-test.csv\n" ] } ], "source": [ "# Remove the old training and test set from HDFS.\n", "!hdfs dfs -rm data/cifar-10-training.csv\n", "!hdfs dfs -rm data/cifar-10-test.csv\n", "# Copy the training and test set to HDFS.\n", "!hdfs dfs -copyFromLocal cifar-10-training.csv data/cifar-10-training.csv\n", "!hdfs dfs -copyFromLocal cifar-10-test.csv data/cifar-10-test.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further distributed preprocessing with Apache Spark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up a Spark Context" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"CIFAR-10 Preprocessing Notebook\"\n", "using_spark_2 = False\n", "local = False\n", "path_train = \"data/cifar-10-training.csv\"\n", "path_test = \"data/cifar-10-test.csv\"\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_processes = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 20\n", " num_processes = 1\n", " \n", "num_workers = num_executors * num_processes" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "# Use the DataBricks CSV reader; it has some nice functionality for handling invalid values.\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conf = SparkConf()\n", 
"conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", `num_processes`)\n", "conf.set(\"spark.executor.instances\", `num_executors`)\n", "conf.set(\"spark.executor.memory\", \"4g\")\n", "conf.set(\"spark.locality.wait\", \"0\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\");\n", "\n", "# Check if the user is running Spark 2.0 +\n", "if using_spark_2:\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the raw CSV files" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "# Read the training set.\n", "raw_dataset_train = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true') \\\n", " .load(path_train)\n", "# Read the testing set.\n", "raw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true') \\\n", " .load(path_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set size: 50000\n", "Test set size: 10000\n" ] } ], "source": [ "# Count the number of instances in the training and test set (to check).\n", "print(\"Training set size: \" + str(raw_dataset_train.count()))\n", "print(\"Test set size: \" + str(raw_dataset_test.count()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing for further preprocessing, training and 
testing\n", "\n", "To ensure compatibility with Apache Spark, we vectorize the columns and add the resulting vectors as a separate column. To achieve this, we first need a list of the required columns. This is shown in the cell below." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "features = raw_dataset_train.columns\n", "features.remove('label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have a list of column names, we can pass it to Spark's [VectorAssembler](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler). The VectorAssembler will take a list of features, vectorize them, and place the result in the column defined by `outputCol`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Assemble the columns.\n", "vector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n", "dataset_train = vector_assembler.transform(raw_dataset_train)\n", "dataset_test = vector_assembler.transform(raw_dataset_test)\n", "# Repartition the dataset.\n", "dataset_train = dataset_train.repartition(num_workers)\n", "dataset_test = dataset_test.repartition(num_workers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the inputs for our neural network (the features column) after applying the VectorAssembler, we should also define the outputs. Since we are dealing with a classification task, the output of our neural network should be a one-hot encoded vector with 10 elements. For this, we provide a `OneHotTransformer` which accomplishes exactly this task." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nb_classes = 10\n", "encoder = OneHotTransformer(nb_classes, input_col=\"label\", output_col=\"label_encoded\")\n", "dataset_train = encoder.transform(dataset_train)\n", "dataset_test = encoder.transform(dataset_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, normalize the pixel intensities to the range [0, 1]." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Allocate a MinMaxTransformer.\n", "# Note: the raw pixel intensities lie in [0, 255].\n", "transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\n", " o_min=0.0, o_max=255.0, \\\n", " input_col=\"features\", \\\n", " output_col=\"features_normalized\")\n", "# Transform the datasets.\n", "dataset_train = transformer.transform(dataset_train)\n", "dataset_test = transformer.transform(dataset_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving the datasets to Parquet." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Deleted data/cifar-10-train-preprocessed.parquet\n", "Deleted data/cifar-10-test-preprocessed.parquet\n" ] } ], "source": [ "# Delete the old preprocessed Parquet files.\n", "!hdfs dfs -rm -r data/cifar-10-train-preprocessed.parquet\n", "!hdfs dfs -rm -r data/cifar-10-test-preprocessed.parquet" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dataset_train.write.parquet(\"data/cifar-10-train-preprocessed.parquet\")\n", "dataset_test.write.parquet(\"data/cifar-10-test-preprocessed.parquet\")" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 } ================================================ FILE: examples/data/atlas_higgs.csv ================================================ [File too large to display: 52.7 MB] ================================================ FILE: examples/data/mnist.csv ================================================ [File too large to display: 73.2 MB] ================================================ FILE: examples/distributed_numpy_parsing.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed Numpy Parsing\n", "\n", "Joeri R. 
Hermans \n", "*Department of Data Science & Knowledge Engineering* \n", "*Maastricht University, The Netherlands* \n", "\n", "This notebook will show you how to parse a collection of Numpy files straight from HDFS into a Spark DataFrame.\n", "\n", "## Cluster Configuration\n", "\n", "In the following sections, we set up the cluster properties." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "\n", "import os\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from pyspark.sql.types import *\n", "\n", "from pyspark.sql import Row\n", "\n", "from pyspark.storagelevel import StorageLevel\n", "\n", "# Use the DataBricks AVRO reader.\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of desired executors: 20\n", "Number of desired processes / executor: 1\n", "Total number of workers: 20\n" ] } ], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Numpy Parsing\"\n", "using_spark_2 = False\n", "local = False\n", "\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_processes = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 20\n", " num_processes = 1\n", "\n", "# This variable is derived from the number of cores and executors,\n", "# and will be used to assign the number of model trainers.\n", "num_workers = num_executors * num_processes\n", "\n", "print(\"Number of desired executors: \" + str(num_executors))\n", "print(\"Number of desired processes / executor: \" + str(num_processes))\n", "print(\"Total number of workers: \" + str(num_workers))" 
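] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional sanity check (a sketch; it assumes the `pydoop` package, which is used for the distributed parsing below, is installed on the driver), you can verify that the HDFS client can reach the cluster by listing a directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional sanity check: confirm that pydoop can talk to HDFS.\n", "# This assumes pydoop is available on the driver (it is required later anyway).\n", "import pydoop.hdfs as hdfs\n", "\n", "print(hdfs.ls(\"/\"))"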
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Do not change anything here.\n", "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", str(num_processes))\n", "conf.set(\"spark.executor.instances\", str(num_executors))\n", "conf.set(\"spark.executor.memory\", \"5g\")\n", "conf.set(\"spark.locality.wait\", \"0\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n", "conf.set(\"spark.kryoserializer.buffer.max\", \"2000\")\n", "conf.set(\"spark.executor.heartbeatInterval\", \"6000s\")\n", "conf.set(\"spark.network.timeout\", \"10000000s\")\n", "conf.set(\"spark.shuffle.spill\", \"true\")\n", "conf.set(\"spark.driver.memory\", \"10g\")\n", "conf.set(\"spark.driver.maxResultSize\", \"10g\")\n", "\n", "# Check if the user is running Spark 2.0+.\n", "if using_spark_2:\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports.\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)\n", "\n", "# Check if we are using Spark 2.0.\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtaining the required file-paths\n", "\n", "We will now obtain a list of file paths (*.npy), which we will map with a custom lambda function to read all the data into a DataFrame." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define the command that needs to be executed; it will list all the numpy files in the specified directory.\n", "cmd = \"hdfs dfs -ls /user/jhermans/data/cms/RelValWjet_Pt_3000_3500_13_GEN-SIM-RECO_evt3150/*.npy | awk '{print $NF}'\"\n", "# Fetch the output of the command, and construct a list.\n", "# Strip the output first to avoid an empty trailing path.\n", "output = os.popen(cmd).read()\n", "file_paths = output.strip().split(\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Spark DataFrame from the specified list\n", "\n", "Before we convert the list to a Spark DataFrame, we first need to specify the schema. We do this by converting every element in the list to a Spark row. Afterwards, Spark will be able to automatically infer the schema of the DataFrame." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rows = []\n", "\n", "for path in file_paths:\n", " row = Row(**{'path': path})\n", " rows.append(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are able to create the Spark DataFrame. Note: for Spark 2.0, use `spark.` instead of `sqlContext.`." 
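] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, a sketch of the Spark 2.0+ variant (not executed in this notebook): with the configuration cell above, `sc` holds a `SparkSession`, which exposes `createDataFrame` directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Spark 2.0+ sketch: the SparkSession (bound to `sc` above) creates the DataFrame directly.\n", "# df = sc.createDataFrame(rows)"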
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of paths to be parsed: 393\n", "root\n", " |-- path: string (nullable = true)\n", "\n" ] } ], "source": [ "df = sqlContext.createDataFrame(rows)\n", "# Repartition the dataset for increased parallelism.\n", "df = df.repartition(20)\n", "\n", "print(\"Number of paths to be parsed: \" + str(df.count()))\n", "df.printSchema()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Row(path=u'/user/jhermans/data/cms/RelValWjet_Pt_3000_3500_13_GEN-SIM-RECO_evt3150/trackparams220.npy')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example content of the dataframe.\n", "df.take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing your Numpy files\n", "\n", "This is a fairly straightforward operation where we basically map all the file paths using a custom lambda function to read the numpy files from HDFS." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of columns: 190\n", "First five columns: \n", "sis_25_x\n", "normalizedChi2\n", "sis_25_z\n", "sis_25_y\n", "sis_48_x\n" ] } ], "source": [ "# Development cell; this code will be executed in the lambdas.\n", "\n", "import pydoop.hdfs as hdfs\n", "\n", "with hdfs.open(file_paths[0]) as f:\n", " data = np.load(f)\n", "\n", "# Obtain the fields (columns) of your numpy data.\n", "fields = []\n", "for k in data[0].dtype.fields:\n", " fields.append(k)\n", " \n", "print(\"Number of columns: \" + str(len(data.dtype.fields)))\n", "\n", "print(\"First five columns: \")\n", "i = 0\n", "for k in data.dtype.fields:\n", " print(k)\n", " i += 1\n", " if i == 5:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a working prototype, let's construct a Spark mapper which will fetch the data in a distributed manner from HDFS. Note that if you would like to adjust the data in any way after reading, you can do so by modifying the lambda function, or by executing another map after the data has been read." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- TrackId: long (nullable = true)\n", " |-- charge: long (nullable = true)\n", " |-- chi2: double (nullable = true)\n", " |-- d0: double (nullable = true)\n", " |-- dsz: double (nullable = true)\n", " |-- dxy: double (nullable = true)\n", " |-- dz: double (nullable = true)\n", " |-- eta: double (nullable = true)\n", " |-- evt: long (nullable = true)\n", " |-- lambda: double (nullable = true)\n", " |-- lumi: long (nullable = true)\n", " |-- ndof: double (nullable = true)\n", " |-- normalizedChi2: double (nullable = true)\n", " |-- p: double (nullable = true)\n", " |-- phi: double (nullable = true)\n", " |-- pix_0_x: double (nullable = true)\n", " |-- pix_0_y: double (nullable = true)\n", " |-- pix_0_z: double (nullable = true)\n", " |-- pix_1_x: double (nullable = true)\n", " |-- pix_1_y: double (nullable = true)\n", " |-- pix_1_z: double (nullable = true)\n", " |-- pix_2_x: double (nullable = true)\n", " |-- pix_2_y: double (nullable = true)\n", " |-- pix_2_z: double (nullable = true)\n", " |-- pix_3_x: double (nullable = true)\n", " |-- pix_3_y: double (nullable = true)\n", " |-- pix_3_z: double (nullable = true)\n", " |-- pix_4_x: double (nullable = true)\n", " |-- pix_4_y: double (nullable = true)\n", " |-- pix_4_z: double (nullable = true)\n", " |-- pt: double (nullable = true)\n", " |-- px: double (nullable = true)\n", " |-- py: double (nullable = true)\n", " |-- pz: double (nullable = true)\n", " |-- qoverp: double (nullable = true)\n", " |-- run: long (nullable = true)\n", " |-- sis_0_x: double (nullable = true)\n", " |-- sis_0_y: double (nullable = true)\n", " |-- sis_0_z: double (nullable = true)\n", " |-- sis_10_x: double (nullable = true)\n", " |-- sis_10_y: double (nullable = true)\n", " |-- sis_10_z: double (nullable = true)\n", " |-- sis_11_x: double (nullable = true)\n", " |-- 
sis_11_y: double (nullable = true)\n", " |-- sis_11_z: double (nullable = true)\n", " |-- sis_12_x: double (nullable = true)\n", " |-- sis_12_y: double (nullable = true)\n", " |-- sis_12_z: double (nullable = true)\n", " |-- sis_13_x: double (nullable = true)\n", " |-- sis_13_y: double (nullable = true)\n", " |-- sis_13_z: double (nullable = true)\n", " |-- sis_14_x: double (nullable = true)\n", " |-- sis_14_y: double (nullable = true)\n", " |-- sis_14_z: double (nullable = true)\n", " |-- sis_15_x: double (nullable = true)\n", " |-- sis_15_y: double (nullable = true)\n", " |-- sis_15_z: double (nullable = true)\n", " |-- sis_16_x: double (nullable = true)\n", " |-- sis_16_y: double (nullable = true)\n", " |-- sis_16_z: double (nullable = true)\n", " |-- sis_17_x: double (nullable = true)\n", " |-- sis_17_y: double (nullable = true)\n", " |-- sis_17_z: double (nullable = true)\n", " |-- sis_18_x: double (nullable = true)\n", " |-- sis_18_y: double (nullable = true)\n", " |-- sis_18_z: double (nullable = true)\n", " |-- sis_19_x: double (nullable = true)\n", " |-- sis_19_y: double (nullable = true)\n", " |-- sis_19_z: double (nullable = true)\n", " |-- sis_1_x: double (nullable = true)\n", " |-- sis_1_y: double (nullable = true)\n", " |-- sis_1_z: double (nullable = true)\n", " |-- sis_20_x: double (nullable = true)\n", " |-- sis_20_y: double (nullable = true)\n", " |-- sis_20_z: double (nullable = true)\n", " |-- sis_21_x: double (nullable = true)\n", " |-- sis_21_y: double (nullable = true)\n", " |-- sis_21_z: double (nullable = true)\n", " |-- sis_22_x: double (nullable = true)\n", " |-- sis_22_y: double (nullable = true)\n", " |-- sis_22_z: double (nullable = true)\n", " |-- sis_23_x: double (nullable = true)\n", " |-- sis_23_y: double (nullable = true)\n", " |-- sis_23_z: double (nullable = true)\n", " |-- sis_24_x: double (nullable = true)\n", " |-- sis_24_y: double (nullable = true)\n", " |-- sis_24_z: double (nullable = true)\n", " |-- sis_25_x: double 
(nullable = true)\n", " |-- sis_25_y: double (nullable = true)\n", " |-- sis_25_z: double (nullable = true)\n", " |-- sis_26_x: double (nullable = true)\n", " |-- sis_26_y: double (nullable = true)\n", " |-- sis_26_z: double (nullable = true)\n", " |-- sis_27_x: double (nullable = true)\n", " |-- sis_27_y: double (nullable = true)\n", " |-- sis_27_z: double (nullable = true)\n", " |-- sis_28_x: double (nullable = true)\n", " |-- sis_28_y: double (nullable = true)\n", " |-- sis_28_z: double (nullable = true)\n", " |-- sis_29_x: double (nullable = true)\n", " |-- sis_29_y: double (nullable = true)\n", " |-- sis_29_z: double (nullable = true)\n", " |-- sis_2_x: double (nullable = true)\n", " |-- sis_2_y: double (nullable = true)\n", " |-- sis_2_z: double (nullable = true)\n", " |-- sis_30_x: double (nullable = true)\n", " |-- sis_30_y: double (nullable = true)\n", " |-- sis_30_z: double (nullable = true)\n", " |-- sis_31_x: double (nullable = true)\n", " |-- sis_31_y: double (nullable = true)\n", " |-- sis_31_z: double (nullable = true)\n", " |-- sis_32_x: double (nullable = true)\n", " |-- sis_32_y: double (nullable = true)\n", " |-- sis_32_z: double (nullable = true)\n", " |-- sis_33_x: double (nullable = true)\n", " |-- sis_33_y: double (nullable = true)\n", " |-- sis_33_z: double (nullable = true)\n", " |-- sis_34_x: double (nullable = true)\n", " |-- sis_34_y: double (nullable = true)\n", " |-- sis_34_z: double (nullable = true)\n", " |-- sis_35_x: double (nullable = true)\n", " |-- sis_35_y: double (nullable = true)\n", " |-- sis_35_z: double (nullable = true)\n", " |-- sis_36_x: double (nullable = true)\n", " |-- sis_36_y: double (nullable = true)\n", " |-- sis_36_z: double (nullable = true)\n", " |-- sis_37_x: double (nullable = true)\n", " |-- sis_37_y: double (nullable = true)\n", " |-- sis_37_z: double (nullable = true)\n", " |-- sis_38_x: double (nullable = true)\n", " |-- sis_38_y: double (nullable = true)\n", " |-- sis_38_z: double (nullable = true)\n", 
" |-- sis_39_x: double (nullable = true)\n", " |-- sis_39_y: double (nullable = true)\n", " |-- sis_39_z: double (nullable = true)\n", " |-- sis_3_x: double (nullable = true)\n", " |-- sis_3_y: double (nullable = true)\n", " |-- sis_3_z: double (nullable = true)\n", " |-- sis_40_x: double (nullable = true)\n", " |-- sis_40_y: double (nullable = true)\n", " |-- sis_40_z: double (nullable = true)\n", " |-- sis_41_x: double (nullable = true)\n", " |-- sis_41_y: double (nullable = true)\n", " |-- sis_41_z: double (nullable = true)\n", " |-- sis_42_x: double (nullable = true)\n", " |-- sis_42_y: double (nullable = true)\n", " |-- sis_42_z: double (nullable = true)\n", " |-- sis_43_x: double (nullable = true)\n", " |-- sis_43_y: double (nullable = true)\n", " |-- sis_43_z: double (nullable = true)\n", " |-- sis_44_x: double (nullable = true)\n", " |-- sis_44_y: double (nullable = true)\n", " |-- sis_44_z: double (nullable = true)\n", " |-- sis_45_x: double (nullable = true)\n", " |-- sis_45_y: double (nullable = true)\n", " |-- sis_45_z: double (nullable = true)\n", " |-- sis_46_x: double (nullable = true)\n", " |-- sis_46_y: double (nullable = true)\n", " |-- sis_46_z: double (nullable = true)\n", " |-- sis_47_x: double (nullable = true)\n", " |-- sis_47_y: double (nullable = true)\n", " |-- sis_47_z: double (nullable = true)\n", " |-- sis_48_x: double (nullable = true)\n", " |-- sis_48_y: double (nullable = true)\n", " |-- sis_48_z: double (nullable = true)\n", " |-- sis_49_x: double (nullable = true)\n", " |-- sis_49_y: double (nullable = true)\n", " |-- sis_49_z: double (nullable = true)\n", " |-- sis_4_x: double (nullable = true)\n", " |-- sis_4_y: double (nullable = true)\n", " |-- sis_4_z: double (nullable = true)\n", " |-- sis_5_x: double (nullable = true)\n", " |-- sis_5_y: double (nullable = true)\n", " |-- sis_5_z: double (nullable = true)\n", " |-- sis_6_x: double (nullable = true)\n", " |-- sis_6_y: double (nullable = true)\n", " |-- sis_6_z: double 
(nullable = true)\n", " |-- sis_7_x: double (nullable = true)\n", " |-- sis_7_y: double (nullable = true)\n", " |-- sis_7_z: double (nullable = true)\n", " |-- sis_8_x: double (nullable = true)\n", " |-- sis_8_y: double (nullable = true)\n", " |-- sis_8_z: double (nullable = true)\n", " |-- sis_9_x: double (nullable = true)\n", " |-- sis_9_y: double (nullable = true)\n", " |-- sis_9_z: double (nullable = true)\n", " |-- theta: double (nullable = true)\n", " |-- vx: double (nullable = true)\n", " |-- vy: double (nullable = true)\n", " |-- vz: double (nullable = true)\n", "\n" ] } ], "source": [ "def parse(iterator):\n", " rows = []\n", " \n", " # MODIFY TO YOUR NEEDS IF NECESSARY\n", " for row in iterator:\n", " path = row['path']\n", " # Load the file from HDFS.\n", " with hdfs.open(path) as f:\n", " data = np.load(f)\n", " # Add all rows in the current path.\n", " for r in data:\n", " d = {}\n", " # Use a name other than `f` to avoid shadowing the file handle above.\n", " for field in fields:\n", " d[field] = r[field].item()\n", " rows.append(Row(**d))\n", " \n", " return iter(rows)\n", "\n", "# Apply the lambda function.\n", "dataset = df.rdd.mapPartitions(parse).toDF()\n", "dataset.printSchema()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: examples/example_0_data_preprocessing.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preprocessing\n", "\n", "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": 
[ { "name": "stdout", "output_type": "stream", "text": [ "07 December 2016\r\n" ] } ], "source": [ "!(date +%d\\ %B\\ %G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will be preprocessing a **4.6 GB** CSV file containing simulated ATLAS events. Afterwards we will save the processed data to the [Parquet](https://parquet.apache.org/) format for further analysis. After the completion of this notebook, we will have a processed dataset ready for model development, training and evaluation." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import numpy as np\n", "\n", "import time\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from pyspark.ml.feature import StandardScaler\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import StringIndexer\n", "\n", "from distkeras.utils import shuffle\n", "from distkeras.transformers import OneHotTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spark preparation and configuration\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Edit the variables in the cell below. If you are running Spark in local mode, please set the `local` flag to true and adjust the resources you wish to use on your local machine. The same goes for the case when you are running Spark 2.0 and higher." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Deep Learning: Data Preprocessing\"\n", "using_spark_2 = False\n", "local = False\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_cores = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 8\n", " num_cores = 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following cells you are not required to change anything. Adjusting the configuration in the cell above should be sufficient for running this notebook." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "# Use the DataBricks CSV reader; it has some nice functionality for handling invalid values.\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", str(num_cores))\n", "conf.set(\"spark.executor.instances\", str(num_executors))\n", "conf.set(\"spark.executor.memory\",\"2g\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n", "\n", "# Check if the user is running Spark 2.0+.\n", "if using_spark_2:\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports.\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset 
preprocessing\n", "\n", "### Reading" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "\n", "# Read the dataset.\n", "raw_dataset = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true').load(\"data/atlas_higgs.csv\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- EventId: integer (nullable = true)\n", " |-- DER_mass_MMC: double (nullable = true)\n", " |-- DER_mass_transverse_met_lep: double (nullable = true)\n", " |-- DER_mass_vis: double (nullable = true)\n", " |-- DER_pt_h: double (nullable = true)\n", " |-- DER_deltaeta_jet_jet: double (nullable = true)\n", " |-- DER_mass_jet_jet: double (nullable = true)\n", " |-- DER_prodeta_jet_jet: double (nullable = true)\n", " |-- DER_deltar_tau_lep: double (nullable = true)\n", " |-- DER_pt_tot: double (nullable = true)\n", " |-- DER_sum_pt: double (nullable = true)\n", " |-- DER_pt_ratio_lep_tau: double (nullable = true)\n", " |-- DER_met_phi_centrality: double (nullable = true)\n", " |-- DER_lep_eta_centrality: double (nullable = true)\n", " |-- PRI_tau_pt: double (nullable = true)\n", " |-- PRI_tau_eta: double (nullable = true)\n", " |-- PRI_tau_phi: double (nullable = true)\n", " |-- PRI_lep_pt: double (nullable = true)\n", " |-- PRI_lep_eta: double (nullable = true)\n", " |-- PRI_lep_phi: double (nullable = true)\n", " |-- PRI_met: double (nullable = true)\n", " |-- PRI_met_phi: double (nullable = true)\n", " |-- PRI_met_sumet: double (nullable = true)\n", " |-- PRI_jet_num: integer (nullable = true)\n", " |-- PRI_jet_leading_pt: double (nullable = true)\n", " |-- PRI_jet_leading_eta: double (nullable = true)\n", " |-- PRI_jet_leading_phi: double (nullable = true)\n", " 
|-- PRI_jet_subleading_pt: double (nullable = true)\n", " |-- PRI_jet_subleading_eta: double (nullable = true)\n", " |-- PRI_jet_subleading_phi: double (nullable = true)\n", " |-- PRI_jet_all_pt: double (nullable = true)\n", " |-- Weight: double (nullable = true)\n", " |-- Label: string (nullable = true)\n", "\n" ] } ], "source": [ "# Double-check the inferred schema.\n", "raw_dataset.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature processing\n", "\n", "Next, we will take all the columns in the CSV except the *EventId*, *Weight*, and *Label* columns, since they are not relevant features." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Record the starting time of the data preprocessing.\n", "time_start = time.time()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Row(features=DenseVector([138.47, 51.655, 97.827, 27.98, 0.91, 124.711, 2.666, 3.064, 41.928, 197.76, 1.582, 1.396, 0.2, 32.638, 1.017, 0.381, 51.626, 2.273, -2.414, 16.824, -0.277, 258.733, 2.0, 67.435, 2.15, 0.444, 46.062, 1.24, -2.475, 113.497]))]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First, we would like to extract the desired features from the raw dataset.\n", "# We do this by constructing a list with all desired columns.\n", "features = raw_dataset.columns\n", "features.remove('EventId')\n", "features.remove('Weight')\n", "features.remove('Label')\n", "\n", "# Next, we use Spark's VectorAssembler to \"assemble\" (create) a vector of all desired features.\n", "# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\n", "vector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n", "\n", "# This transformer will take all columns specified in features, and create an 
additional column \"features\" which will contain all the desired features aggregated into a single vector.\n", "dataset = vector_assembler.transform(raw_dataset)\n", "\n", "# Show what happened after applying the vector assembler.\n", "# Note: the \"features\" column got appended to the end.\n", "dataset.select(\"features\").take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature normalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply feature normalization with standard scaling using Spark's [StandardScaler](http://spark.apache.org/docs/latest/ml-features.html#standardscaler). This will transform every feature to have a mean of 0 and a standard deviation of 1." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "standard_scaler = StandardScaler(inputCol=\"features\", outputCol=\"features_normalized\", withStd=True, withMean=True)\n", "standard_scaler_model = standard_scaler.fit(dataset)\n", "dataset = standard_scaler_model.transform(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is divided into 2 classes, i.e., Signal (s) and Background (b). To make our lives easier later on, we map every label to a numeric index that can then be one-hot encoded (of course, this is a design decision). We achieve this by applying a [StringIndexer](http://spark.apache.org/docs/latest/ml-features.html#stringindexer). Again, a StringIndexer is a built-in feature transformer provided by Apache Spark." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Row(Label=u's', label_index=1.0),\n", " Row(Label=u'b', label_index=0.0),\n", " Row(Label=u'b', label_index=0.0),\n", " Row(Label=u'b', label_index=0.0),\n", " Row(Label=u'b', label_index=0.0)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_indexer = StringIndexer(inputCol=\"Label\", outputCol=\"label_index\").fit(dataset)\n", "dataset = label_indexer.transform(dataset)\n", "\n", "# Show the result of the label transformation.\n", "dataset.select(\"Label\", \"label_index\").take(5)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DataFrame[features_normalized: vector, label_index: double, label: vector]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We observe that Keras is not able to work with these raw indexes.\n", "# What it actually expects is a vector with the same size as the output layer.\n", "# Our framework provides functionality to do this with ease: given an expected\n", "# vector dimensionality, it prepares a zero vector of that dimensionality,\n", "# and sets the element at the specified label index to one.\n", "\n", "# For example:\n", "# 1. Assume we have a label index: 3\n", "# 2. 
Output dimensionality: 5\n", "# With these parameters, we obtain the following vector in the DataFrame column: [0,0,0,1,0]\n", "\n", "# First, we fetch the columns of interest.\n", "dataset = dataset.select(\"features_normalized\", \"label_index\")\n", "\n", "# Number of classes (signal and background).\n", "nb_classes = 2\n", "\n", "# Construct a one-hot encoded vector using the provided index.\n", "transformer = OneHotTransformer(output_dim=nb_classes, input_col=\"label_index\", output_col=\"label\")\n", "dataset = transformer.transform(dataset)\n", "\n", "# Only select the columns we need (less data shuffling) while training.\n", "dataset = dataset.select(\"features_normalized\", \"label_index\", \"label\")\n", "dataset.cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset randomization\n", "\n", "We shuffle the complete dataset in order to be able to draw stochastic samples from the dataframe." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Randomize the dataset.\n", "dataset = shuffle(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset saving\n", "\n", "Finally, we save the shuffled and processed dataset to disk for later use." 
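,
 "\n",
 "For completeness, a minimal sketch of how the stored Parquet file can be read back later (assuming the same `sqlContext` as above; with Spark 2.0 and higher you would use your session object instead):\n",
 "\n",
 "```python\n",
 "# Read the preprocessed dataset back from disk.\n",
 "dataset = sqlContext.read.parquet(\"data/processed.parquet\")\n",
 "dataset.printSchema()\n",
 "```"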
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Store the preprocessed dataset as a Parquet file.\n", "dataset.write.save(\"data/processed.parquet\", format=\"parquet\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total time: 694.612912178 seconds.\n", "Total time: 11.5768818696 minutes.\n" ] } ], "source": [ "time_end = time.time()\n", "dt = time_end - time_start\n", "print(\"Total time: \" + str(dt) + \" seconds.\")\n", "print(\"Total time: \" + str(dt / 60) + \" minutes.\")" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 } ================================================ FILE: examples/example_1_analysis.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Development and Evaluation\n", "\n", "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is dedicated to the development and evaluation of a Keras model based on a large [preprocessed dataset](https://github.com/JoeriHermans/dist-keras/blob/master/examples/data_preprocessing.ipynb)." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline \n", "\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "from keras.models import Sequential\n", "from keras.layers.core import Dense, Dropout, Activation\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from pyspark.ml.feature import StandardScaler\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import StringIndexer\n", "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "\n", "from distkeras.transformers import LabelIndexTransformer\n", "from distkeras.predictors import ModelPredictor\n", "from distkeras.trainers import SingleTrainer\n", "from distkeras.trainers import AEASGD\n", "from distkeras.trainers import DOWNPOUR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spark Configuration and Preparation\n", "\n", "Edit the variables in the cell below. If you are running Spark in local mode, please set the `local` flag to true and adjust the resources you wish to use on your local machine. Likewise, set the `using_spark_2` flag to true when you are running Spark 2.0 or higher." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Deep Learning: Analysis\"\n", "using_spark_2 = False\n", "local = False\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_cores = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 8\n", " num_cores = 2" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of desired executors: 8\n", "Number of desired cores / executor: 2\n", "Total number of workers: 16\n" ] } ], "source": [ "# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\n", "num_workers = num_executors * num_cores\n", "\n", "print(\"Number of desired executors: \" + `num_executors`)\n", "print(\"Number of desired cores / executor: \" + `num_cores`)\n", "print(\"Total number of workers: \" + `num_workers`)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", `num_cores`)\n", "conf.set(\"spark.executor.instances\", `num_executors`)\n", "conf.set(\"spark.executor.memory\",\"2g\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\");\n", "\n", "# Check if the user is running Spark 2.0 +\n", "if using_spark_2:\n", " # Add the missing import.\n", " from pyspark.sql import SparkSession\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports\n", " from pyspark import SQLContext\n", " sqlContext = 
SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "After the Spark Context (or Spark Session if you are using Spark 2.0) has been set up, we can start reading the preprocessed dataset from storage." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "# Read the dataset.\n", "raw_dataset = reader.read.parquet(\"data/processed.parquet\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- features_normalized: vector (nullable = true)\n", " |-- label_index: double (nullable = true)\n", " |-- label: vector (nullable = true)\n", "\n" ] } ], "source": [ "# Check the schema.\n", "raw_dataset.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After reading the dataset from storage, we will extract several metrics such as `nb_features`, which basically is the number of input neurons, and `nb_classes`, which is the number of classes (signal and background)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of features: 30\n", "Number of classes: 2\n" ] } ], "source": [ "nb_features = len(raw_dataset.select(\"features_normalized\").take(1)[0][\"features_normalized\"])\n", "nb_classes = len(raw_dataset.select(\"label\").take(1)[0][\"label\"])\n", "\n", "print(\"Number of features: \" + str(nb_features))\n", "print(\"Number of classes: \" + str(nb_classes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we split up the dataset for training and testing purposes, and fetch some additional statistics on the number of training and testing instances." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DataFrame[features_normalized: vector, label_index: double, label: vector]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Finally, we create a trainingset and a testset.\n", "(training_set, test_set) = raw_dataset.randomSplit([0.7, 0.3])\n", "training_set.cache()\n", "test_set.cache()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of testset instances: 6377863\n", "Number of trainingset instances: 14872137\n", "Total number of instances: 21250000\n" ] } ], "source": [ "# Distribute the training and test set to the workers.\n", "test_set = test_set.repartition(num_workers)\n", "training_set = training_set.repartition(num_workers)\n", "\n", "num_test_set = test_set.count()\n", "num_training_set = training_set.count()\n", "\n", "print(\"Number of testset instances: \" + str(num_test_set))\n", "print(\"Number of trainingset instances: \" + str(num_training_set))\n", "print(\"Total number of instances: \" + str(num_test_set + num_training_set))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model construction" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = Sequential()\n", "model.add(Dense(500, input_shape=(nb_features,)))\n", "model.add(Activation('relu'))\n", "model.add(Dropout(0.4))\n", "model.add(Dense(500))\n", "model.add(Activation('relu'))\n", "model.add(Dropout(0.6))\n", "model.add(Dense(500))\n", "model.add(Activation('relu'))\n", "model.add(Dense(nb_classes))\n", "model.add(Activation('softmax'))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"____________________________________________________________________________________________________\n", "Layer (type) Output Shape Param # Connected to \n", "====================================================================================================\n", "dense_1 (Dense) (None, 500) 15500 dense_input_1[0][0] \n", "____________________________________________________________________________________________________\n", "activation_1 (Activation) (None, 500) 0 dense_1[0][0] \n", "____________________________________________________________________________________________________\n", "dropout_1 (Dropout) (None, 500) 0 activation_1[0][0] \n", "____________________________________________________________________________________________________\n", "dense_2 (Dense) (None, 500) 250500 dropout_1[0][0] \n", "____________________________________________________________________________________________________\n", "activation_2 (Activation) (None, 500) 0 dense_2[0][0] \n", "____________________________________________________________________________________________________\n", "dropout_2 (Dropout) (None, 500) 0 activation_2[0][0] \n", "____________________________________________________________________________________________________\n", "dense_3 (Dense) (None, 500) 250500 dropout_2[0][0] \n", "____________________________________________________________________________________________________\n", "activation_3 (Activation) (None, 500) 0 dense_3[0][0] \n", "____________________________________________________________________________________________________\n", "dense_4 (Dense) (None, 2) 1002 activation_3[0][0] \n", "____________________________________________________________________________________________________\n", "activation_4 (Activation) (None, 2) 0 dense_4[0][0] \n", "====================================================================================================\n", "Total params: 517502\n", 
"____________________________________________________________________________________________________\n" ] } ], "source": [ "# Summarize the model.\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "optimizer = 'adagrad'\n", "loss = 'categorical_crossentropy'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model evaluation" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate(model):\n", " global test_set\n", "\n", " metric_name = \"f1\"\n", " evaluator = MulticlassClassificationEvaluator(metricName=metric_name, predictionCol=\"prediction_index\", labelCol=\"label_index\")\n", " # Clear the prediction column from the testset.\n", " test_set = test_set.select(\"features_normalized\", \"label\", \"label_index\")\n", " # Apply a prediction from the supplied trained model.\n", " predictor = ModelPredictor(keras_model=model, features_col=\"features_normalized\")\n", " test_set = predictor.predict(test_set)\n", " # Transform the prediction vector to an indexed label.\n", " index_transformer = LabelIndexTransformer(output_dim=nb_classes)\n", " test_set = index_transformer.transform(test_set)\n", " # Compute the F1 score.\n", " score = evaluator.evaluate(test_set)\n", " \n", " return score" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "results = {}\n", "time_spent = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model training and evaluation\n", "\n", "In the next sections we train and evaluate the models trained by different (distributed) optimizers.\n", "\n", "### Single Trainer" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = SingleTrainer(keras_model=model, loss=loss, worker_optimizer=optimizer, \n", " 
features_col=\"features_normalized\", num_epoch=1, batch_size=64)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time spent (SingleTrainer): 5927.329083919525 seconds.\n", "F1 (SingleTrainer): 0.839630118149035\n" ] } ], "source": [ "# Fetch the training time.\n", "dt = trainer.get_training_time()\n", "print(\"Time spent (SingleTrainer): \" + `dt` + \" seconds.\")\n", "\n", "# Evaluate the model.\n", "score = evaluate(trained_model)\n", "print(\"F1 (SingleTrainer): \" + `score`)\n", "\n", "# Store the training metrics.\n", "results['single'] = score\n", "time_spent['single'] = dt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Asynchronous EASGD" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = AEASGD(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers, batch_size=64,\n", " features_col=\"features_normalized\", num_epoch=1, communication_window=32, \n", " rho=5.0, learning_rate=0.1)\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time spent (AEASGD): 903.8733949661255 seconds.\n", "F1 (AEASGD): 0.8326362659335457\n" ] } ], "source": [ "# Fetch the training time.\n", "dt = trainer.get_training_time()\n", "print(\"Time spent (AEASGD): \" + `dt` + \" seconds.\")\n", "\n", "# Evaluate the model.\n", "score = evaluate(trained_model)\n", "print(\"F1 (AEASGD): \" + `score`)\n", "\n", "# Store the training metrics.\n", "results['aeasgd'] = score\n", "time_spent['aeasgd'] = dt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DOWNPOUR" ] }, { "cell_type": "code", "execution_count": 15, 
"metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = DOWNPOUR(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers,\n", " batch_size=64, communication_window=5, learning_rate=0.1, num_epoch=1,\n", " features_col=\"features_normalized\")\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time spent (DOWNPOUR): 774.4893491268158 seconds.\n", "F1 (DOWNPOUR): 0.8345395134754954\n" ] } ], "source": [ "# Fetch the training time.\n", "dt = trainer.get_training_time()\n", "print(\"Time spent (DOWNPOUR): \" + `dt` + \" seconds.\")\n", "\n", "# Evaluate the model.\n", "score = evaluate(trained_model)\n", "print(\"F1 (DOWNPOUR): \" + `score`)\n", "\n", "# Store the training metrics.\n", "results['downpour'] = score\n", "time_spent['downpour'] = dt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see from the plots below, the distributed optimizers finish a single epoch roughly 7 times faster than the single trainer. However, to achieve this they use 16 times the amount of resources, and even that is not a very descriptive measure, since some of the jobs are scheduled on the same machines, some machines have a higher load, etc. Nevertheless, the statistical performance of the optimizers is within a 1% error margin, which means that the classifiers have near-identical performance. Furthermore, it is our guess that the statistical performance of the distributed optimizers can be improved further by adding adaptive learning rates." 
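,
 "\n",
 "The speedup figures quoted above can be reproduced from the `time_spent` dictionary that was filled in during training. A minimal sketch (the exact numbers depend on your run and cluster load):\n",
 "\n",
 "```python\n",
 "# Relative speedup of every distributed optimizer w.r.t. the single trainer.\n",
 "for method in ['aeasgd', 'downpour']:\n",
 "    speedup = time_spent['single'] / time_spent[method]\n",
 "    print(method + \": \" + str(speedup) + \"x\")\n",
 "```"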
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGSCAYAAAAxVMH8AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3X2YH2V97/H3RzAgKAmIJKWCoijG+lASDOADtMaCD9Wi\neFqWUh+o5WgBMZWq7bFKwdMqVoIoniJaHyqsUtRiBYlAFQEjqQRFa6BFoQExwUgIEeQx3/PHzOov\nPzfJ7mY3s2Tfr+vai/3d852Ze5LJ8tl77plJVSFJktSVR3TdAUmSNLUZRiRJUqcMI5IkqVOGEUmS\n1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJP1Skp90uO/fSPLJEda+Nsnfj2Lbz04y\nv+fzHyR50hi6KWkCGEYk9dpi74dIst7Pn6r6SVW9bhSbGE1ffxt4Uc/nw4CnjGJ9kmQ09ZJGzjAi\naaOSPCfJfyT5bpIz03ju0ChGkgVJvtN+/ztJzmq/f2mSxUmWJvlI2/aEJNcm+Rzwn337eUKSxe33\nv5vkunbdKzbQtb2TXJHk+iTH9Gznr5MsSfKdJH/ahp6Tgde123sb8Argw+3nHdtj/EaSbyf5XJLt\n222tSPKRJNcBjx+/P1VJvQwjkjbln4A/q6pnA48FBoBvA3Pa5c8F7kvyaOB5wJVJHgu8BTi4quYA\n65K8qq1/OvDuqpo9zL6GRjsWAMe36750A/3aD3gx8BzgrUlmJTkUeFxVzWvb/wx4HPAu4BNVNaeq\nTgUuAI5tt38/8H7g5VW1H/AfwJ+3+9gN+EJVPauqbhnxn5ikUdm26w5ImrySTAdSVd9pm84BXlxV\n5yZZk2Q3mv9hXwAcSBNGPtt+/yzgW+3lje2Bm4FrgGVVdf0mdv1N4P1JPgWcB6wdpuaiqrq77ecl\nwP7AC4DfT3IwEGAn4MnDHVrP9/u0ff1a29dHApe2y+6qqkv7V5Y0vgwjkjZlQ3MlFgN/AtwIXEkz\nJ+NJVfXDJE8H/rWq3rjehpInAPdsaodV9d4kF9FcTrk6ydyqWt1f1vd5Xfvfv6mqz/btd2PzQwIs\nqaoXD7Nsk32VtPm8TCOp13rBo6rWAA8meWbbNEATPACuorkUcxXNpY2j+NU8kG8B85P8JkCSXYa+\n79/HsJ1I9qqq66rqPcBNwB7DlL2kne/xGOCFbR8uA96QZLt2O09tv19LM0oypPfz9cBeSZ7RrrND\nkqHRFCetSluAYURSr12TLE9yS/vfFwNHA/+U5LvAnTSXYaC5lLI7cGVV/QK4vW2jqn4KHAtc0K63\niGbuBozsLpi/SPL9dmLs9VV13TA11wAX04SQ06pqRVV9ZaitnXT6EZqfc18D5iW5pr2E81ngb5Ms\nBaYBRwL/2O7vm/zq0s56fU1yYZJZI+i/pFFI1Ra7k0+SJOnXODIiSZI6ZRiRJEmdMoxIkqROGUYk\nSVKnDCOSJKlThhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOG\nEUmS1CnDiCRJ6pRhRJIkdcowIkmSOtV5GElyU5J1w3x9qKfm5CS3JbknySVJ9u7bxnZJzkyyKsna\nJOcn2a2vZuck5yRZk2R1ko8l2XFLHackSRpe52EE2A+Y1fP1e0AB5wEkeTtwHHAMMA+4G1iUZFrP\nNk4HXgYcDhwE7A58vm8/5wKzgflt7UHAWRNyRJIkacRSVV33YT1JTgdeWlVPbT/fBry/qha2n3cC\nVgKvrarz2s8/BY6oqi+2NfsAy4ADqmpJktnAfwJzq+ratuZQ4ELg8VW1YssepS
RJGjIZRkZ+Kckj\ngT8GPt5+3otmtOSyoZqqugu4GjiwbdoP2Lav5gZgeU/NAcDqoSDSupRmBGb/iTgWSZI0MpMqjACv\nBKYDn2o/z6IJDCv76la2ywBmAve3IWVDNbOA23sXVtVDwB09NZIkqQPbdt2BPkcDX5ksl02SPBY4\nFLgZuLfb3kiS9LCyPfBEYFFV/WxjhZMmjCTZE3gRcFhP8wogNKMfvaMjM4Fre2qmJdmpb3RkZrts\nqKb/7pptgF16aoZzKHDO6I5EkiT1+GOam0g2aNKEEZpRkZXARUMNVXVTkhU0d8BcB7+cwLo/cGZb\ndg3wYFvTO4F1T2BxW7MYmJFk3555I/Npgs7VG+nTzQCf+cxnmD179mYe3tSyYMECFi5c2HU3NAV4\nrmlL8VwbnWXLlnHUUUdB+//SjZkUYSRJgNcBn6yqdX2LTwfemeRGmgM6BbgVuACaCa1JPg6clmQ1\nsBY4A7iqqpa0NdcnWQScneRNwDTgQ8DgJi4J3Qswe/Zs5syZMy7HOlVMnz7dPzNtEZ5r2lI818Zs\nk9McJkUYobk8swfwif4FVXVqkh1ongkyA7gCeElV3d9TtgB4CDgf2A64GDi2b1NHAh+muYtmXVt7\nwvgehiRJGq1JEUaq6hJgm40sPwk4aSPL7wOOb782VHMncNSYOylJkibEZLu1V5IkTTGGEU2IgYGB\nrrugKcJzTVuK59rEMYxoQviPVluK55q2FM+1iWMYkSRJnTKMSJKkThlGJElSpwwjkiSpU4YRSZLU\nKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQpw4gkSeqUYUSS\nJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUY\nkSRJnTKMSJKkThlGJElSpwwjkiSpU4YRSZLUKcOIJEnq1KQII0l2T/LPSVYluSfJd5PM6as5Oclt\n7fJLkuzdt3y7JGe221ib5Pwku/XV7JzknCRrkqxO8rEkO26JY5QkScPrPIwkmQFcBdwHHArMBt4K\nrO6peTtwHHAMMA+4G1iUZFrPpk4HXgYcDhwE7A58vm9357bbn9/WHgScNe4HJUmSRmzbrjsAvANY\nXlVv6Gn7n76aE4BTqurLAEleA6wEDgPOS7ITcDRwRFVd3ta8HliWZF5VLUkymybszK2qa9ua44EL\nk5xYVSsm8BglSdIGdD4yArwc+HaS85KsTLI0yS+DSZK9gFnAZUNtVXUXcDVwYNu0H02w6q25AVje\nU3MAsHooiLQuBQrYf9yPSpIkjchkCCNPAt4E3AAcAvw/4Iwkf9Iun0UTGFb2rbeyXQYwE7i/DSkb\nqpkF3N67sKoeAu7oqZEkSVvYZLhM8whgSVX9Tfv5u0meAbwR+OfuuiVJkraEyRBGfgIs62tbBryq\n/X4FEJrRj97RkZnAtT0105Ls1Dc6MrNdNlTTf3fNNsAuPTXDWrBgAdOnT1+vbWBggIGBgY2tJknS\nlDA4OMjg4OB6bWvWrBnx+pMhjFwF7NPXtg/tJNaquinJCpo7YK4DaCes7g+c2dZfAzzY1nyxrdkH\n2BNY3NYsBmYk2bdn3sh8mqBz9cY6uHDhQubMmbOxEkmSpqzhfkFfunQpc+fOHdH6kyGMLASuSvJX\nwHk0IeMNwJ/11JwOvDPJjcDNwCnArcAF0ExoTfJx4LQkq4G1wBnAVVW1pK25Pski4OwkbwKmAR8C\nBr2TRluD5cuXs2rVqq67oS1k1113Zc899+y6G9K46DyMVNW3k7wSeC/wN8BNwAlV9dmemlOT7EDz\nTJAZwBXAS6rq/p5NLQAeAs4HtgMuBo7t292RwIdp7qJZ19aeMBHHJW1Jy5cvZ599ZnPvvfd03RVt\nIdtvvwM33LDMQKKtQudhBKCqLgIu2kTNSc
BJG1l+H3B8+7WhmjuBo8bUSWkSW7VqVRtEPkPzXD9t\n3ZZx771HsWrVKsOItgqTIoxIGi+zAec3SXp4mQzPGZEkSVOYYUSSJHXKMCJJkjplGJEkSZ0yjEiS\npE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJElSpwwj\nkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQp\nw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6lTnYSTJ\nu5Os6/v6QV/NyUluS3JPkkuS7N23fLskZyZZlWRtkvOT7NZXs3OSc5KsSbI6yceS7LgljlGSJG1Y\n52Gk9X1gJjCr/Xr+0IIkbweOA44B5gF3A4uSTOtZ/3TgZcDhwEHA7sDn+/ZxLjAbmN/WHgScNQHH\nIkmSRmHbrjvQerCqfrqBZScAp1TVlwGSvAZYCRwGnJdkJ+Bo4IiquryteT2wLMm8qlqSZDZwKDC3\nqq5ta44HLkxyYlWtmNCjkyRJGzRZRkaekuTHSX6Y5DNJ9gBIshfNSMllQ4VVdRdwNXBg27QfTajq\nrbkBWN5TcwCweiiItC4FCth/Yg5JkiSNxGQII98CXkczcvFGYC/gG+18jlk0gWFl3zor22XQXN65\nvw0pG6qZBdzeu7CqHgLu6KmRJEkd6PwyTVUt6vn4/SRLgP8B/hC4vpterW/BggVMnz59vbaBgQEG\nBgY66pEkSZPH4OAgg4OD67WtWbNmxOt3Hkb6VdWaJP8F7A18HQjN6Efv6MhMYOiSywpgWpKd+kZH\nZrbLhmr6767ZBtilp2aDFi5cyJw5c0Z/MJIkTQHD/YK+dOlS5s6dO6L1J8NlmvUkeTRNELmtqm6i\nCQvze5bvRDPP45tt0zXAg301+wB7AovbpsXAjCT79uxqPk3QuXpijkSSJI1E5yMjSd4P/BvNpZnf\nBP4WeAD4bFtyOvDOJDcCNwOnALcCF0AzoTXJx4HTkqwG1gJnAFdV1ZK25voki4Czk7wJmAZ8CBj0\nThpJkrrVeRgBHk/zDJDHAj8FrgQOqKqfAVTVqUl2oHkmyAzgCuAlVXV/zzYWAA8B5wPbARcDx/bt\n50jgwzR30axra0+YoGOSJEkj1HkYqapNzgKtqpOAkzay/D7g+PZrQzV3AkeNvoeSJGkiTbo5I5Ik\naWoxjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlG\nJElSpwwjkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlT\nhhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ\n6pRhRJIkdcowIkmSOmUYkSRJnZp0YSTJO5KsS3JaX/vJSW5Lck+SS5Ls3bd8uyRnJlmVZG2S85Ps\n1lezc5JzkqxJsjrJx5LsuCWOS5IkDW9MYSTJo5Ls0PP5CUnekuSQzelMkucAxwDf7Wt/O3Bcu2we\ncDewKMm0nrLTgZcBhwMHAbsDn+/bxbnAbGB+W3sQcNbm9FmSJG2esY6MXAC8BiDJDOBq4K3ABUne\nNJYNJnk08BngDcCdfYtPAE6pqi9X1ffbfe8OHNauuxNwNLCgqi6vqmuB1wPPSzKvrZkNHAr8aVV9\nu6q+CRwPHJFk1lj6LEmSNt9Yw8gc4Ir2+1cDK4En0ISEN49xm2cC/1ZV/97bmGQvYBZw2VBbVd1F\nE4AObJv2A7btq7kBWN5TcwCwug0qQy4FCth/jH2WJEmbadsxrrcDsLb9/hDgC1W1Lsm3aELJqCQ5\nAvhtml
DRbxZNYFjZ176yXQYwE7i/DSkbqpkF3N67sKoeSnJHT40kSdrCxhpGbgQOS/JFmksfC9v2\n3YD+QLBRSR5PM9/jRVX1wBj7M6EWLFjA9OnT12sbGBhgYGCgox5JkjR5DA4OMjg4uF7bmjVrRrz+\nWMPIyTSTQRcCl1XV4rb9EODaDa41vLnA44ClSdK2bQMclOQ44GlAaEY/ekdHZvbsawUwLclOfaMj\nM9tlQzX9d9dsA+zSUzOshQsXMmfOnFEeliRJU8Nwv6AvXbqUuXPnjmj9Mc0ZqarzgT1pLqu8uGfR\nZcCCUW7uUuCZNJdpnt1+fZtmMuuzq+pHNGFh/tAK7YTV/YFvtk3XAA/21ezT9nEoKC0GZiTZt2ff\n82mCztWj7LMkSRonYx0ZoapW0DeiUFVLxrCdu4Ef9LYluRv4WVUta5tOB96Z5EbgZuAU4Faau3qo\nqruSfBw4LclqmvksZwBXDfWpqq5Psgg4u73jZxrwIWCwPRZJktSBEYeRJF8YaW1VvWps3fnVJvq2\nd2r7XJOzgBk0d/K8pKru7ylbADwEnA9sB1wMHNu33SOBD9OMxqxra0/YzL5KkqTNMJqRkd6ZKAFe\n2bZ9u22bSxMURhxaNqSqXjhM20nASRtZ5z6a54Ycv5GaO4GjNrd/kiRp/Iw4jFTV64e+T/I+4Dzg\njVX1UNu2DfARRnk3jSRJmtrG+tCzo4F/GAoi0DyzAzitXSZJkjQiYw0j29LcctvvaZuxTUmSNAWN\n9W6aTwAfT/JkYOgOmv2Bd7TLJEmSRmSsYeREmtt63wr8Rtv2E+D9wAfGoV+SJGmKGFMYqap1wKnA\nqe0DyBjmvTCSJEmbNOaHng0xhEiSpM0xpsmmSWYm+ecktyV5MMlDvV/j3UlJkrT1GuvIyCdp3vty\nCs1ckdpotSRJ0gaMNYw8H3hBVX1nPDsjSZKmnrE+E+QWmkfCS5IkbZaxhpG3AO9N8sTx64okSZqK\nxnqZ5nPADsAPk9wDPNC7sKp22dyOSZKkqWGsYeQt49oLSZI0ZY31oWefGu+OSJKkqWnMDz1Lsg1w\nGDC7bfpP4Eu9b/KVJEnalDGFkSR7AxcBvwnc0Db/FXBLkpdV1Q/HqX+SJGkrN9a7ac4AfgjsUVVz\nqmoOzUPQbmqXSZIkjchYL9McDBxQVXcMNVTVz5K8A7hqXHomSZKmhLGOjNwHPGaY9kcD94+9O5Ik\naaoZaxj5MvDRJPvnVw4A/hH40vh1T5Ikbe3GGkbeTDNnZDFwb/t1FXAjcML4dE2SJE0FY33OyJ3A\nH7R31Qzd2rusqm4ct55JkqQpYczPGQFow4cBRJIkjdmYLtMk+XySvxym/W1J/mXzuyVJkqaKsc4Z\nOYjmoWf9vtIukyRJGpGxhpFHAw8O0/4AsNPYuyNJkqaasYaR7wF/NEz7EcAPxt4dSZI01Yx1Ausp\nwBeSPBn497ZtPjAA/K/x6JgkSZoaxnpr778lOQz4a+DVwC+A64AXVdXl49g/SZK0lRvzrb1VdSFw\n4Tj2RZIkTUFjnTNCkhlJ3pDk75Ls0rbNSfKb49c9SZK0tRvTyEiSZwGXAmuAJwIfA+4AXgXsCbxm\nnPonSZK2cmMdGTkN+GRVPYXmvTRDLmKUzxlJ8sYk302ypv36ZpIX99WcnOS2JPckuaR9DH3v8u2S\nnJlkVZK1Sc5Psltfzc5Jzmn3sTrJx5LsOLrDliRJ422sYeQ5wFnDtP8YmDXKbd0CvB2YA8yluTvn\ngiSzAZK8HTgOOAaYB9wNLEoyrWcbpwMvAw6nCUO7A5/v28+5NO/Rmd/WHrSBY5AkSVvQWCew3sfw\nDzd7KvDT0WyonQjb651J3gQcACyjeQvwKVX1ZYAkrwFWAocB5yXZCTgaOGLoTp4krweWJZlXVUva\nYHMoMLeqrm1rjgcuTHJiVa0YTZ8lSdL4GevIyJeAdyV5ZPu5kuwJvI9f
H5EYsSSPSHIEsAPwzSR7\n0Yy0XDZUU1V3AVcDB7ZN+9GEqt6aG4DlPTUHAKuHgkjrUqCA/cfaX0mStPnGGkbeSvNI+NuBRwGX\nAz8Efg78n9FuLMkzkqylGXH5CPDKNlDMogkMK/tWWcmvLgfNBO5vQ8qGama1ff2lqnqIZtLtaC8r\nSZKkcTTWh56tAX4vyfOBZ9EEk2uq6rKNr7lB1wPPBqbTPETt00l84Z4kSVPAqMJIkgOBxw7N36iq\nK9tHwr8N2CHJvwLHV9V9o9luVT0I/Kj9eG2SeTRzRU4FQjP60Ts6MhMYuuSyApiWZKe+0ZGZ7bKh\nmv67a7YBdump2aAFCxYwffr09doGBgYYGBjY9MFJkrSVGxwcZHBwcL22NWvWjHj90Y6MvAv4OjA0\nmfSZwNnAp2gmm/4lcBtw0ii32+8RwHZVdVOSFTR3wFzX7nMnmnkeZ7a119C8QXg+8MW2Zh+a550s\nbmsWAzOS7Nszb2Q+TdC5elOdWbhwIXPmzNnMQ5Ikaes03C/oS5cuZe7cuSNaf7Rh5LeBv+n5fASw\npKr+DCDJLcDfMoowkuTvgK/QTDh9DPDHwMHAIW3J6TR32NwI3Ezzkr5bgQugmdCa5OPAaUlWA2uB\nM4CrqmpJW3N9kkXA2e2dOtOADwGD3kkjSVK3RhtGdmb9yyUH0wSJIf8B7DHKbe5GM7LyGzRPdL0O\nOKSq/h2gqk5NsgPNM0FmAFcAL6mq+3u2sQB4CDgf2A64GDi2bz9HAh+muYtmXVt7wij7KkmSxtlo\nw8hKYC/glvahY3OAd/csfwzwwGg2WFVvGEHNSWxktKWdo3J8+7WhmjuBo0bTN0mSNPFGe2vvRcB7\nk7wA+HvgHpqRiiHPornFV5IkaURGOzLyN8AXaJ4r8nPgtX2XS44GvjpOfZMkSVPAqMJIVa0CDkoy\nHfh5++CwXv+LJqRIkiSNyOY89Gy49js2rzuSJGmqGevj4CVJksaFYUSSJHXKMCJJkjplGJEkSZ0y\njEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJElS\npwwjkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJ\nktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE51HkaS/FWSJUnuSrIyyReTPHWYupOT3Jbk\nniSXJNm7b/l2Sc5MsirJ2iTnJ9mtr2bnJOckWZNkdZKPJdlxoo9RkiRtWOdhBHgB8CFgf+BFwCOB\nryZ51FBBkrcDxwHHAPOAu4FFSab1bOd04GXA4cBBwO7A5/v2dS4wG5jf1h4EnDX+hyRJkkZq2647\nUFUv7f2c5HXA7cBc4Mq2+QTglKr6clvzGmAlcBhwXpKdgKOBI6rq8rbm9cCyJPOqakmS2cChwNyq\nuratOR64MMmJVbVigg9VkiQNYzKMjPSbARRwB0CSvYBZwGVDBVV1F3A1cGDbtB9NsOqtuQFY3lNz\nALB6KIi0Lm33tf9EHIgkSdq0SRVGkoTmcsuVVfWDtnkWTWBY2Ve+sl0GMBO4vw0pG6qZRTPi8ktV\n9RBN6JmFJEnqROeXafp8BHg68LyuOyJJkraMSRNGknwYeCnwgqr6Sc+iFUBoRj96R0dmAtf21ExL\nslPf6MjMdtlQTf/dNdsAu/TUDGvBggVMnz59vbaBgQEGBgZGcGSSJG3dBgcHGRwcXK9tzZo1I15/\nUoSRNoj8AXBwVS3vXVZVNyVZQXMHzHVt/U408zzObMuuAR5sa77Y1uwD7AksbmsWAzOS7Nszb2Q+\nTdC5emP9W7hwIXPmzNmsY5QkaWs13C/oS5cuZe7cuSNav/MwkuQjwADwCuDuJDPbRWuq6t72+9OB\ndya5EbgZOAW4FbgAmgmtST4OnJZk
(base64-encoded PNG of the timing bar chart omitted)\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the time.\n", "fig = plt.figure()\n", "st = fig.suptitle(\"Lower is better.\", fontsize=\"x-small\")\n", "\n", "plt.bar(range(len(time_spent)), time_spent.values(), align='center')\n", "plt.xticks(range(len(time_spent)), time_spent.keys())\n", "plt.xlabel(\"Optimizers\")\n", "plt.ylabel(\"Seconds\")\n", "plt.ylim([0, 7000])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "(base64-encoded PNG of the F1 bar chart omitted)\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the statistical performance of the optimizers.\n", "fig = plt.figure()\n", "st = fig.suptitle(\"Higher is better.\", fontsize=\"x-small\")\n", "\n", "plt.bar(range(len(results)), results.values(), align='center')\n", "plt.xticks(range(len(results)), results.keys())\n", "plt.xlabel(\"Optimizers\")\n", "plt.ylabel(\"F1\")\n", "plt.ylim([0.83,0.85])\n", "plt.show()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }

================================================
FILE: examples/kafka_producer.py
================================================
"""
This example will be used as a Kafka producer to generate dummy data for our Spark
Streaming example.
"""

## BEGIN Imports. ##############################################################

from kafka import KafkaProducer

import sys
import pandas
import time
import json

## END Imports. ################################################################


def usage():
    print("Distributed Keras Example: Kafka Producer")
    print("")
    print("Usage:")
    print("python kafka_producer.py [bootstrap_server]")
    exit(0)


def allocate_producer(bootstrap_server):
    producer = KafkaProducer(bootstrap_servers=[bootstrap_server])
    return producer


def read_data():
    path = 'data/atlas_higgs.csv'
    # Use Pandas to infer the types.
    data = pandas.read_csv(path)
    # Remove the unneeded columns.
    del data['Label']
    del data['Weight']
    # Convert the data to a list of dictionaries.
    data = data.transpose().to_dict().values()
    return data


def produce(producer, topic, data):
    for row in data:
        producer.send(topic, json.dumps(row))


def main():
    # Check if the required number of arguments has been specified.
    if len(sys.argv) != 2:
        usage()
    # Fetch the bootstrap server from the arguments.
    bootstrap_server = sys.argv[1]
    # Allocate the producer.
    producer = allocate_producer(bootstrap_server)
    # Read the data from the CSV file.
    data = read_data()
    iteration = 1
    # Transmit the data in a continuous loop, sleeping 5 seconds after every iteration.
    while True:
        print("Iteration " + str(iteration) + ".")
        produce(producer, 'Machine_Learning', data)
        iteration += 1
        time.sleep(5)


if __name__ == "__main__":
    main()

================================================
FILE: examples/kafka_spark_high_throughput_ml_pipeline.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Kafka and Spark High Throughput Deep Learning Production Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15 November 2016\r\n" ] } ], "source": [ "!(date +%d\\ %B\\ %G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will show the reader how to set up a production-ready machine learning pipeline using [Apache Kafka](https://kafka.apache.org) and [Apache Spark](https://spark.apache.org), together with our distributed deep learning framework [Distributed Keras](https://github.com/JoeriHermans/dist-keras), which is built using [Keras](https://keras.io).\n", "\n", "***Note before starting this notebook:*** Do not forget to run the Kafka producer (as explained in this notebook)."
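The producer in `kafka_producer.py` above serializes every CSV row to JSON before sending it to the `Machine_Learning` topic. A minimal stand-in for the payload of one such message (the row values below are made up for illustration; of the column names, only `EventId` is actually referenced later in the pipeline):

```python
import json

# A made-up row, shaped like one element of data.transpose().to_dict().values().
row = {"EventId": 100000, "DER_mass_MMC": 138.47, "PRI_tau_pt": 51.655}

# The producer transmits exactly this JSON string for every row
# (sort_keys is added here only to make the output deterministic).
payload = json.dumps(row, sort_keys=True)
print(payload)
# prints {"DER_mass_MMC": 138.47, "EventId": 100000, "PRI_tau_pt": 51.655}
```

On the consumer side, the streaming notebook below reverses this step by mapping each raw message through `json_to_dataframe_row`.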
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "- [Introduction and problem statement](#Problem-statement)\n", "- [Preliminaries](#Preliminaries)\n", " - [Installation and requirements](#Installation-and-requirements)\n", " - [Pretrained model](#Pretrained-model)\n", " - [Kafka producer](#Kafka-producer)\n", "- [Usage](#Usage)\n", "- [Experiments](#Experiments)\n", "- [Conclusion](#Conclusion)\n", "- [Acknowledgments](#Acknowledgments)\n", "- [References](#References)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import json\n", "\n", "from keras.models import Sequential\n", "from keras.layers.core import Dense, Dropout, Activation\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "from pyspark.streaming import StreamingContext\n", "from pyspark.streaming.kafka import KafkaUtils\n", "\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import Normalizer\n", "\n", "from distkeras.trainers import *\n", "from distkeras.predictors import *\n", "from distkeras.transformers import *\n", "from distkeras.evaluators import *\n", "from distkeras.utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem statement\n", "\n", "The problem of building an efficient machine learning production pipeline is quite similar to that of building an efficient training procedure. However, in contrast to training, in production (model serving) most data will arrive in a streaming fashion. Usually, one just reads from a particular source using Spark Streaming. However, integration with Apache Kafka is also possible. Kafka allows us to scale our streaming application if a bottleneck occurs. 
At CERN we employ Apache Kafka for different use-cases in [IT](https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-10-benchmarking-apache-kafka-openstack-vms) (IT Group), [BE](https://indico.cern.ch/event/533714/contributions/2173938/attachments/1292041/1924841/CALS2-Hadoop-IT.pdf) (Beams Group), and ATLAS.\n", "\n", "However, building a distributed streaming application has some practical considerations, as mentioned in [1]. This includes specifying the *retention* of the data in your buffer (i.e., how long data may stay in the buffer, or how large the buffer may grow before older data is discarded), the usage of *compression*, the number of *brokers* and *partitions*, and how to throttle incoming data. Of course, these settings are always application- and infrastructure-dependent. But since this is a general-purpose framework, we will show in the following sections how to build a scalable deep learning production (model serving) pipeline using the technologies mentioned above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminaries\n", "\n", "### Installation and requirements\n", "\n", "#### Cluster requirements\n", "\n", "We assume that you already have a running Kafka and Spark cluster. Furthermore, in order to run this example, we require that the topic **\"Machine_Learning\"** is available on this Kafka cluster.\n", "\n", "#### Kafka Python\n", "\n", "In order to manage your Python dependencies, it is recommended to install a Python distribution like [Anaconda](https://www.continuum.io/downloads). In the following sections, we assume that Spark is already added to your PATH variable. In order to run our Kafka producer (located in the *examples* directory), we first need [Kafka Python](https://github.com/dpkp/kafka-python). 
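The retention, compression, and partitioning considerations discussed above can be made concrete with a small sketch. The dictionary keys below are real Kafka topic configuration keys and kafka-python `KafkaProducer` keyword arguments, but every value is an illustrative example for this notebook, not a recommendation:

```python
# Broker/topic-level settings (Kafka topic configuration keys).
topic_settings = {
    "retention.ms": 24 * 60 * 60 * 1000,  # discard messages older than 24 hours
    "retention.bytes": 10 * 1024 ** 3,    # or cap each partition's log at 10 GiB
    "compression.type": "snappy",         # store batches compressed on the broker
}

# Producer-side settings (kafka-python KafkaProducer keyword arguments).
producer_settings = {
    "compression_type": "gzip",  # compress batches before transmission
    "linger_ms": 50,             # wait up to 50 ms to batch messages together
}

# More partitions allow more consumers to read the topic in parallel;
# this notebook later reads from 3 partitions at once.
num_partitions = 3

print(topic_settings["retention.ms"])  # → 86400000
```

Tightening `retention.ms`/`retention.bytes` bounds the buffer when consumers fall behind, while producer-side batching and compression trade a little latency for throughput.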
This is done by simply running Pip in your shell:\n", "\n", "```pip install kafka-python```\n", "\n", "### Pretrained model\n", "\n", "In order to run a production classification pipeline, you need access to a trained model. Keras provides an API to store and load trained models, and the same procedure can be used with Distributed Keras and Spark to load a pretrained model for production use-cases. In this example, however, we will construct a neural network with randomly initialized weights (which simulates such a pretrained model). The structure of the model (input and output data) is equivalent to the neural network in the *workflow notebook*. So if you train a model with the distributed training methods described in the workflow notebook and save it afterwards, you can load it in this notebook without any problems; just make sure the model variable is set to your trained Keras model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As defined in the *workflow* notebook, our neural network will use 30 features and will be trained to classify two classes (signal and background)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nb_features = 30\n", "nb_classes = 2 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As described above, we construct a randomly initialized neural network to simulate a pretrained network." 
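The save/load round trip mentioned above can be sketched as follows. This is a hedged illustration, not repo code: `model.save` and `load_model` are the standard Keras persistence APIs, the file name is hypothetical, and the import is guarded so the sketch stands on its own even without Keras installed:

```python
# Guarded import: Keras may not be installed where this sketch is run.
try:
    from keras.models import load_model
except ImportError:
    load_model = None

PRETRAINED_PATH = "trained_higgs_model.h5"  # hypothetical file name


def obtain_model(path=PRETRAINED_PATH):
    """Load a model previously stored with model.save(path)."""
    if load_model is None:
        raise RuntimeError("Keras is not available in this environment.")
    return load_model(path)

# In the workflow notebook, after training:      trained_model.save(PRETRAINED_PATH)
# In this notebook, instead of a random model:   model = obtain_model()
```

The loaded model carries both the architecture and the trained weights, so the randomly initialized network built below would simply be replaced by `obtain_model()`.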
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "____________________________________________________________________________________________________\n", "Layer (type) Output Shape Param # Connected to \n", "====================================================================================================\n", "dense_1 (Dense) (None, 500) 15500 dense_input_1[0][0] \n", "____________________________________________________________________________________________________\n", "activation_1 (Activation) (None, 500) 0 dense_1[0][0] \n", "____________________________________________________________________________________________________\n", "dropout_1 (Dropout) (None, 500) 0 activation_1[0][0] \n", "____________________________________________________________________________________________________\n", "dense_2 (Dense) (None, 1000) 501000 dropout_1[0][0] \n", "____________________________________________________________________________________________________\n", "activation_2 (Activation) (None, 1000) 0 dense_2[0][0] \n", "____________________________________________________________________________________________________\n", "dropout_2 (Dropout) (None, 1000) 0 activation_2[0][0] \n", "____________________________________________________________________________________________________\n", "dense_3 (Dense) (None, 500) 500500 dropout_2[0][0] \n", "____________________________________________________________________________________________________\n", "activation_3 (Activation) (None, 500) 0 dense_3[0][0] \n", "____________________________________________________________________________________________________\n", "dense_4 (Dense) (None, 2) 1002 activation_3[0][0] \n", "____________________________________________________________________________________________________\n", "activation_4 (Activation) (None, 2) 0 dense_4[0][0] \n", 
"====================================================================================================\n", "Total params: 1018002\n", "____________________________________________________________________________________________________\n" ] } ], "source": [ "model = Sequential()\n", "model.add(Dense(500, input_shape=(nb_features,)))\n", "model.add(Activation('relu'))\n", "model.add(Dropout(0.4))\n", "model.add(Dense(1000))\n", "model.add(Activation('relu'))\n", "model.add(Dropout(0.4))\n", "model.add(Dense(500))\n", "model.add(Activation('relu'))\n", "model.add(Dense(nb_classes))\n", "model.add(Activation('softmax'))\n", "\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Kafka producer\n", "\n", "In order to run the Kafka producer, change to the *examples* directory. Next, fetch the address of a bootstrap server. Once you have this address, run the following command in a separate shell to start the Kafka producer:\n", "\n", "```python kafka_producer.py [bootstrap_server]```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following cell, please modify the required parameters according to your requirements." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Keras Kafka Pipeline\"\n", "using_spark_2 = False\n", "local = False\n", "if local:\n", "    # Tell master to use local resources.\n", "    master = \"local[*]\"\n", "    num_cores = 3\n", "    num_executors = 1\n", "else:\n", "    # Tell master to use YARN.\n", "    master = \"yarn-client\"\n", "    num_executors = 8\n", "    num_cores = 2\n", "# Define Kafka-specific parameters.\n", "zk = \"zookeeper_host:2181\"  # ZooKeeper address\n", "topic = \"Machine_Learning\"  # Topic name\n", "consumer_name = \"dist-keras-consumer\"  # Consumer identifier\n", "# Define Spark Streaming specific parameters.\n", "batch_duration = 10  # In seconds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will allocate a Spark Context (sc) with a Spark Streaming Context (ssc) using the parameters you provided above." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", str(num_cores))\n", "conf.set(\"spark.executor.instances\", str(num_executors))\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n", "\n", "# Check if the user is running Spark 2.0+.\n", "if using_spark_2:\n", "    sc = SparkSession.builder.config(conf=conf) \\\n", "        .appName(application_name) \\\n", "        .getOrCreate()\n", "else:\n", "    # Create the Spark context.\n", "    sc = SparkContext(conf=conf)\n", "    # Add the missing imports.\n", "    from pyspark import SQLContext\n", "    sqlContext = SQLContext(sc)\n", "# Allocate the streaming context with the batch duration defined above.\n", "ssc = StreamingContext(sc, batch_duration)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we allocate a 
Kafka stream using the previously defined parameters. The final parameter, which is passed as a dictionary, tells the consumer group to read from (in this case) 3 different partitions at once.\n", "\n", "For more detailed information on Spark's Kafka API, we refer to the documentation [http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html](http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Allocate a Kafka stream.\n", "kafkaStream = KafkaUtils.createStream(ssc, zk, consumer_name, {topic: 3})" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def predict(df):\n", " \"\"\"Adds a prediction column to the specified DataFrame using the pretrained model.\"\"\"\n", " predictor = ModelPredictor(keras_model=model, features_col=\"features_normalized\", output_col=\"prediction\")\n", " \n", " return predictor.predict(df)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def post_process(df):\n", " \"\"\"\n", " Adds a column to the specified DataFrame by converting the raw\n", " model prediction (which is an array) to a predicted class (identified by an index).\n", " Since we only have two classes, the output dimension is 2. This will cause the\n", " LabelIndexTransformer to output a 0 or a 1 given the raw neural network classification.\n", " \"\"\"\n", " transformer = LabelIndexTransformer(output_dim=2, input_col=\"prediction\", output_col=\"predicted_index\")\n", " \n", " return transformer.transform(df)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def prepare_dataframe(df):\n", " \"\"\"\n", " Takes the specified DataFrame and adds two columns:\n", " \n", " 1.
features\n", " Every row will hold a vector of the specified features.\n", " 2. features_normalized\n", " Every row will hold a normalized vector of features based\n", " on the features vector created before.\n", " \"\"\"\n", " features = df.columns\n", " features.remove('EventId')\n", " vector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n", " df = vector_assembler.transform(df)\n", " normalizer = Normalizer(inputCol=\"features\", outputCol=\"features_normalized\", p=2.0)\n", " df = normalizer.transform(df)\n", " \n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In *process_instances* we process the incoming RDDs into predictions. Since this notebook only serves demonstration purposes, we simply print the number of instances which were classified as \"signal\" by the pretrained model." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def process_instances(rdd):\n", " # Check if there is new data available.\n", " if not rdd.isEmpty():\n", " df = rdd.toDF() # Convert the RDD to a Spark DataFrame.\n", " df = prepare_dataframe(df) # Create a feature column and normalize the batch.\n", " df = predict(df) # Add the raw Neural Network predictions.\n", " df = post_process(df) # Convert the raw Neural Network predictions to a class (index).\n", " # Extract the instances which are interesting (signal).\n", " df = df.filter(df['predicted_index'] == 0)\n", " # TODO: Do something with your DataFrame (e.g., storing to HDFS).\n", " print(df.count())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fetch the raw instances from the Kafka stream.\n", "raw_instances = kafkaStream.map(lambda x: x[1])\n", "# Convert the raw instances (which are JSON strings) to Spark rows.\n", "instances = raw_instances.map(json_to_dataframe_row)\n", "# Process every RDD in the
DStream.\n", "instances.foreachRDD(process_instances)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "33023\n", "46801\n", "45446\n", "48116\n", "22459\n", "45999\n" ] } ], "source": [ "ssc.start()\n", "ssc.awaitTermination()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experiments\n", "\n", "TODO\n", "\n", "## Conclusion\n", "\n", "In this notebook we demonstrated how to construct a high throughput model serving pipeline using Apache Spark, Apache Kafka and Distributed Keras. Furthermore, we also showed that this infrastructure provides an easily scalable approach for production use-cases. However, since Distributed Keras is still being developed, some bugs might still show up. So please notify us when any of these occur on your system.\n", "\n", "**Contact**: [joeri.hermans@cern.ch](mailto:joeri.hermans@cern.ch)\n", " [luca.canali@cern.ch](mailto:luca.canali@cern.ch)\n", " [zbigniew.baranowski@cern.ch](mailto:zbigniew.baranowski@cern.ch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Acknowledgements\n", "\n", "Many thanks to Zbigniew Baranowski and Luca Canali of the IT-DB group for their collaboration on this work." 
] } ], "metadata": { "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 } ================================================ FILE: examples/mnist.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST using Distributed Keras\n", "\n", "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!(date +%d\\ %B\\ %G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will show you how to process the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using Distributed Keras. As in the [workflow](https://github.com/JoeriHermans/dist-keras/blob/master/examples/workflow.ipynb) notebook, we will guide you through the complete machine learning pipeline.\n", "\n", "## Preparation\n", "\n", "To get started, we first load all the required imports. Please make sure you installed `dist-keras` and `seaborn`. Furthermore, we assume that you have access to an installation which provides Apache Spark.\n", "\n", "Before you start this notebook, place the MNIST dataset (which is provided in this repository) on HDFS. If HDFS is not available, place it on the local filesystem instead, but make sure the path to the file is identical for all computing nodes."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "\n", "import seaborn as sns\n", "\n", "from keras.optimizers import *\n", "from keras.models import Sequential\n", "from keras.layers.core import *\n", "from keras.layers.convolutional import *\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from matplotlib import pyplot as plt\n", "import matplotlib.patches as mpatches\n", "\n", "from pyspark.ml.feature import StandardScaler\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import OneHotEncoder\n", "from pyspark.ml.feature import MinMaxScaler\n", "from pyspark.ml.feature import StringIndexer\n", "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "\n", "from distkeras.trainers import *\n", "from distkeras.predictors import *\n", "from distkeras.transformers import *\n", "from distkeras.evaluators import *\n", "from distkeras.utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following cell, adapt the parameters to fit your personal requirements." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Keras MNIST Notebook\"\n", "using_spark_2 = False\n", "local = False\n", "path_train = \"data/mnist_train.csv\"\n", "path_test = \"data/mnist_test.csv\"\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_processes = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 20\n", " num_processes = 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\n", "num_workers = num_executors * num_processes\n", "\n", "print(\"Number of desired executors: \" + `num_executors`)\n", "print(\"Number of desired processes / executor: \" + `num_processes`)\n", "print(\"Total number of workers: \" + `num_workers`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "# Use the DataBricks CSV reader, this has some nice functionality regarding invalid values.\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", `num_processes`)\n", "conf.set(\"spark.executor.instances\", `num_executors`)\n", "conf.set(\"spark.executor.memory\", \"4g\")\n", "conf.set(\"spark.locality.wait\", \"0\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\");\n", "\n", "# Check if the 
user is running Spark 2.0 +\n", "if using_spark_2:\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "# Read the training dataset.\n", "raw_dataset_train = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true') \\\n", " .load(path_train)\n", "# Read the testing dataset.\n", "raw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true') \\\n", " .load(path_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown in the output of the cell above, every pixel is associated with a separate column. In order to ensure compatibility with Apache Spark, we vectorize the columns and add the resulting vectors as a separate column. To achieve this, we first need a list of the required columns. This is shown in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# First, we would like to extract the desired features from the raw dataset.\n", "# We do this by constructing a list with all desired columns.\n", "# This is identical for the test set.\n", "features = raw_dataset_train.columns\n", "features.remove('label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have a list of column names, we can pass this to Spark's [VectorAssembler](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler).
This VectorAssembler will take a list of features, vectorize them, and place them in a column defined in `outputCol`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Next, we use Spark's VectorAssembler to \"assemble\" (create) a vector of all desired features.\n", "# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\n", "vector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n", "# This transformer will take all columns specified in features, and create an additional column \"features\" which will contain all the desired features aggregated into a single vector.\n", "dataset_train = vector_assembler.transform(raw_dataset_train)\n", "dataset_test = vector_assembler.transform(raw_dataset_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the inputs for our Neural Network (features column) after applying the VectorAssembler, we should also define the outputs. Since we are dealing with a classification task, the output of our Neural Network should be a one-hot encoded vector with 10 elements. For this, we provide a `OneHotTransformer` which accomplishes exactly this task." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define the number of output classes.\n", "nb_classes = 10\n", "encoder = OneHotTransformer(nb_classes, input_col=\"label\", output_col=\"label_encoded\")\n", "dataset_train = encoder.transform(dataset_train)\n", "dataset_test = encoder.transform(dataset_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MNIST\n", "\n", "[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits. Every image is a 28 by 28 pixel grayscale image. This means that every pixel has a value between 0 and 255. Some examples of instances within this dataset are shown in the cells below."
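The one-hot encoding performed by `OneHotTransformer` above maps an integer class label to a vector with a single active element. A minimal NumPy sketch of the idea (illustrative only; `one_hot` is a hypothetical helper, not part of dist-keras):

```python
import numpy as np

def one_hot(label, nb_classes=10):
    # Allocate a zero vector and activate the element at the label's index.
    vector = np.zeros(nb_classes)
    vector[label] = 1.0
    return vector

# The digit 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
print(one_hot(3))
```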
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def show_instances(column):\n", " global dataset_train\n", "\n", " num_instances = 6 # Number of instances you would like to draw.\n", " x_dimension = 3 # Number of images to draw on the x-axis.\n", " y_dimension = 2 # Number of images to draw on the y-axis.\n", "\n", " # Fetch num_instances instances from the dataset.\n", " instances = dataset_train.select(column).take(num_instances)\n", " # Process the instances.\n", " for i in range(0, num_instances):\n", " instance = instances[i]\n", " instance = instance[column].toArray().reshape((28, 28))\n", " instances[i] = instance\n", "\n", " # Draw the sampled instances.\n", " fig, axn = plt.subplots(y_dimension, x_dimension, sharex=True, sharey=True)\n", " num_axn = len(axn.flat)\n", " for i in range(0, num_axn):\n", " ax = axn.flat[i]\n", " h = sns.heatmap(instances[i], ax=ax)\n", " h.set_yticks([])\n", " h.set_xticks([])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "show_instances(\"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization\n", "\n", "In this section, we normalize the feature vectors to the [0, 1] range."
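The min-max rescaling described above maps a value x from an original range [o_min, o_max] to a new range [n_min, n_max]. A NumPy sketch of the same arithmetic (an illustration of the formula only, not the `MinMaxTransformer` implementation):

```python
import numpy as np

def min_max(x, o_min=0.0, o_max=255.0, n_min=0.0, n_max=1.0):
    # Linearly rescale x from [o_min, o_max] to [n_min, n_max].
    return (x - o_min) / (o_max - o_min) * (n_max - n_min) + n_min

pixels = np.array([0.0, 127.5, 255.0])
print(min_max(pixels))  # [0.  0.5 1. ]
```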
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Reselect the original columns in case you ran this cell before.\n", "dataset_train = dataset_train.select(\"features\", \"label\", \"label_encoded\")\n", "dataset_test = dataset_test.select(\"features\", \"label\", \"label_encoded\")\n", "# Allocate a MinMaxTransformer using Distributed Keras.\n", "# o_min -> original_minimum\n", "# n_min -> new_minimum\n", "transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\n", " o_min=0.0, o_max=255.0, \\\n", " input_col=\"features\", \\\n", " output_col=\"features_normalized\")\n", "# Transform the dataset.\n", "dataset_train = transformer.transform(dataset_train)\n", "dataset_test = transformer.transform(dataset_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "show_instances(\"features_normalized\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convolutions\n", "\n", "In order to make the dense vectors compatible with convolution operations in Keras, we add another column which contains the matrix form of these images. We provide a utility class (ReshapeTransformer), which helps you with this."
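The reshape applied in the next cell is equivalent to the following NumPy operation (a sketch, assuming the row-major pixel ordering of the CSV; the actual work is done by `ReshapeTransformer`):

```python
import numpy as np

flat = np.arange(784, dtype=np.float32)  # a flattened 28x28 image
matrix = flat.reshape((28, 28, 1))       # height x width x channels

# Row-major order: element 28 of the flat vector lands at row 1, column 0.
print(matrix.shape)  # (28, 28, 1)
```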
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "reshape_transformer = ReshapeTransformer(\"features_normalized\", \"matrix\", (28, 28, 1))\n", "dataset_train = reshape_transformer.transform(dataset_train)\n", "dataset_test = reshape_transformer.transform(dataset_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Development" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multilayer Perceptron" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mlp = Sequential()\n", "mlp.add(Dense(1000, input_shape=(784,)))\n", "mlp.add(Activation('relu'))\n", "mlp.add(Dense(250))\n", "mlp.add(Activation('relu'))\n", "mlp.add(Dense(10))\n", "mlp.add(Activation('softmax'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mlp.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "optimizer_mlp = 'adam'\n", "loss_mlp = 'categorical_crossentropy'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convolutional network" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Taken from Keras MNIST example.\n", "\n", "# Declare model parameters.\n", "img_rows, img_cols = 28, 28\n", "# number of convolutional filters to use\n", "nb_filters = 32\n", "# size of pooling area for max pooling\n", "pool_size = (2, 2)\n", "# convolution kernel size\n", "kernel_size = (3, 3)\n", "input_shape = (img_rows, img_cols, 1)\n", "\n", "# Construct the model.\n", "convnet = Sequential()\n", "convnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],\n", " border_mode='valid',\n", " input_shape=input_shape))\n", "convnet.add(Activation('relu'))\n", "convnet.add(Convolution2D(nb_filters, kernel_size[0], 
kernel_size[1]))\n", "convnet.add(Activation('relu'))\n", "convnet.add(MaxPooling2D(pool_size=pool_size))\n", "\n", "convnet.add(Flatten())\n", "convnet.add(Dense(225))\n", "convnet.add(Activation('relu'))\n", "convnet.add(Dense(nb_classes))\n", "convnet.add(Activation('softmax'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "convnet.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "optimizer_convnet = 'adam'\n", "loss_convnet = 'categorical_crossentropy'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation\n", "\n", "We define a utility function which will compute the accuracy for us." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate_accuracy(model, test_set, features=\"features_normalized_dense\"):\n", " evaluator = AccuracyEvaluator(prediction_col=\"prediction_index\", label_col=\"label\")\n", " predictor = ModelPredictor(keras_model=model, features_col=features)\n", " transformer = LabelIndexTransformer(output_dim=nb_classes)\n", " test_set = test_set.select(features, \"label\")\n", " test_set = predictor.predict(test_set)\n", " test_set = transformer.transform(test_set)\n", " score = evaluator.evaluate(test_set)\n", " \n", " return score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dataset_train.printSchema()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset_train = dataset_train.select(\"features_normalized\", \"matrix\",\"label\", \"label_encoded\")\n", "dataset_test = dataset_test.select(\"features_normalized\", \"matrix\",\"label\", \"label_encoded\")" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dense_transformer = DenseTransformer(input_col=\"features_normalized\", output_col=\"features_normalized_dense\")\n", "dataset_train = dense_transformer.transform(dataset_train)\n", "dataset_test = dense_transformer.transform(dataset_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Assign the training and test set.\n", "training_set = dataset_train.repartition(num_workers)\n", "test_set = dataset_test.repartition(num_workers)\n", "# Cache them.\n", "training_set.cache()\n", "test_set.cache()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(training_set.count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DOWNPOUR (Multilayer Perceptron)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = DOWNPOUR(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\n", " batch_size=4, communication_window=5, num_epoch=1,\n", " features_col=\"features_normalized_dense\", label_col=\"label_encoded\")\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer.parameter_server.num_updates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAG (Multilayer Perceptron)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false
}, "outputs": [], "source": [ "trainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\n", " batch_size=4, communication_window=15, num_epoch=1,\n", " features_col=\"features_normalized_dense\", label_col=\"label_encoded\")\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "trainer.parameter_server.num_updates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AEASGD (Multilayer Perceptron)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trainer = AEASGD(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\n", " batch_size=4, communication_window=35, num_epoch=1, features_col=\"features_normalized_dense\",\n", " label_col=\"label_encoded\")\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer.parameter_server.num_updates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DOWNPOUR (Convolutional network)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = DOWNPOUR(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet,\n", "
num_workers=num_workers, batch_size=4, communication_window=5,\n", " num_epoch=1, features_col=\"matrix\", label_col=\"label_encoded\")\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set, \"matrix\")))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer.parameter_server.num_updates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAG (Convolutional network)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = ADAG(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet,\n", " num_workers=num_workers, batch_size=15, communication_window=5, num_epoch=1,\n", " features_col=\"matrix\", label_col=\"label_encoded\")\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set, \"matrix\")))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "trainer.parameter_server.num_updates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AEASGD (Convolutional network)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = AEASGD(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet, \n", " num_workers=num_workers, batch_size=35,
communication_window=32, num_epoch=1,\n", " features_col=\"matrix\", label_col=\"label_encoded\")\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set, \"matrix\")))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer.parameter_server.num_updates" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 0 } ================================================ FILE: examples/mnist.py ================================================ """MNIST classification using Distributed Keras. ATTENTION: Before running this example, make sure you put the MNIST dataset on HDFS. 1. unzip mnist.zip 2. hdfs dfs -mkdir data 3. hdfs dfs -copyFromLocal mnist_train.csv data/mnist_train.csv 4. 
hdfs dfs -copyFromLocal mnist_test.csv data/mnist_test.csv """ from distkeras.evaluators import * from distkeras.predictors import * from distkeras.trainers import * from distkeras.transformers import * from distkeras.utils import * from keras.layers.convolutional import * from keras.layers.core import * from keras.models import Sequential from keras.optimizers import * from pyspark import SparkConf from pyspark import SparkContext from pyspark.ml.evaluation import MulticlassClassificationEvaluator from pyspark.ml.feature import OneHotEncoder from pyspark.ml.feature import StandardScaler from pyspark.ml.feature import StringIndexer from pyspark.ml.feature import VectorAssembler import pwd import os # First, setup the Spark variables. You can modify them to your needs. application_name = "Distributed Keras MNIST Notebook" using_spark_2 = False local = False path_train = "data/mnist_train.csv" path_test = "data/mnist_test.csv" if local: # Tell master to use local resources. master = "local[*]" num_processes = 3 num_executors = 1 else: # Tell master to use YARN. master = "yarn-client" num_executors = 20 num_processes = 1 # This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers. num_workers = num_executors * num_processes print("Number of desired executors: " + `num_executors`) print("Number of desired processes / executor: " + `num_processes`) print("Total number of workers: " + `num_workers`) # Use the DataBricks CSV reader, this has some nice functionality regarding invalid values. 
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell' conf = SparkConf() conf.set("spark.app.name", application_name) conf.set("spark.master", master) conf.set("spark.executor.cores", `num_processes`) conf.set("spark.executor.instances", `num_executors`) conf.set("spark.executor.memory", "4g") conf.set("spark.locality.wait", "0") conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); conf.set("spark.local.dir", "/tmp/" + get_os_username() + "/dist-keras"); # Check if the user is running Spark 2.0 + if using_spark_2: sc = SparkSession.builder.config(conf=conf) \ .appName(application_name) \ .getOrCreate() else: # Create the Spark context. sc = SparkContext(conf=conf) # Add the missing imports from pyspark import SQLContext sqlContext = SQLContext(sc) # Check if we are using Spark 2.0 if using_spark_2: reader = sc else: reader = sqlContext # Read the training dataset. raw_dataset_train = reader.read.format('com.databricks.spark.csv') \ .options(header='true', inferSchema='true') \ .load(path_train) # Read the testing dataset. raw_dataset_test = reader.read.format('com.databricks.spark.csv') \ .options(header='true', inferSchema='true') \ .load(path_test) # First, we would like to extract the desired features from the raw dataset. # We do this by constructing a list with all desired columns. # This is identical for the test set. features = raw_dataset_train.columns features.remove('label') # Next, we use Spark's VectorAssembler to "assemble" (create) a vector of all desired features. # http://spark.apache.org/docs/latest/ml-features.html#vectorassembler vector_assembler = VectorAssembler(inputCols=features, outputCol="features") # This transformer will take all columns specified in features, and create an additional column # "features" which will contain all the desired features aggregated into a single vector. 
dataset_train = vector_assembler.transform(raw_dataset_train) dataset_test = vector_assembler.transform(raw_dataset_test) # Define the number of output classes. nb_classes = 10 encoder = OneHotTransformer(nb_classes, input_col="label", output_col="label_encoded") dataset_train = encoder.transform(dataset_train) dataset_test = encoder.transform(dataset_test) # Allocate a MinMaxTransformer from Distributed Keras to normalize the features. # o_min -> original_minimum # n_min -> new_minimum transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \ o_min=0.0, o_max=255.0, \ input_col="features", \ output_col="features_normalized") # Transform the dataset. dataset_train = transformer.transform(dataset_train) dataset_test = transformer.transform(dataset_test) # Keras expects the vectors to be in a particular shape; we can reshape the # vectors using Spark. reshape_transformer = ReshapeTransformer("features_normalized", "matrix", (28, 28, 1)) dataset_train = reshape_transformer.transform(dataset_train) dataset_test = reshape_transformer.transform(dataset_test) # Now, create a Keras model. # Taken from Keras MNIST example. # Declare model parameters. img_rows, img_cols = 28, 28 # number of convolutional filters to use nb_filters = 32 # size of pooling area for max pooling pool_size = (2, 2) # convolution kernel size kernel_size = (3, 3) input_shape = (img_rows, img_cols, 1) # Construct the model. convnet = Sequential() convnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1], border_mode='valid', input_shape=input_shape)) convnet.add(Activation('relu')) convnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1])) convnet.add(Activation('relu')) convnet.add(MaxPooling2D(pool_size=pool_size)) convnet.add(Flatten()) convnet.add(Dense(225)) convnet.add(Activation('relu')) convnet.add(Dense(nb_classes)) convnet.add(Activation('softmax')) # Define the optimizer and the loss.
optimizer_convnet = 'adam'
loss_convnet = 'categorical_crossentropy'

# Print the summary.
convnet.summary()

# We can also evaluate the dataset in a distributed manner.
# However, for this we need to specify a procedure describing how to do so.
def evaluate_accuracy(model, test_set, features="matrix"):
    evaluator = AccuracyEvaluator(prediction_col="prediction_index", label_col="label")
    predictor = ModelPredictor(keras_model=model, features_col=features)
    transformer = LabelIndexTransformer(output_dim=nb_classes)
    test_set = test_set.select(features, "label")
    test_set = predictor.predict(test_set)
    test_set = transformer.transform(test_set)
    score = evaluator.evaluate(test_set)

    return score

# Select the desired columns; this will reduce network usage.
dataset_train = dataset_train.select("features_normalized", "matrix", "label", "label_encoded")
dataset_test = dataset_test.select("features_normalized", "matrix", "label", "label_encoded")

# Keras expects DenseVectors.
dense_transformer = DenseTransformer(input_col="features_normalized", output_col="features_normalized_dense")
dataset_train = dense_transformer.transform(dataset_train)
dataset_test = dense_transformer.transform(dataset_test)

# Assign the training and test set.
training_set = dataset_train.repartition(num_workers)
test_set = dataset_test.repartition(num_workers)
# Cache them.
training_set.cache()
test_set.cache()
# Precache the training set on the nodes using a simple count.
print(training_set.count())

# Use the ADAG optimizer. You can also use a SingleWorker for testing purposes
# (traditional, non-distributed gradient descent).
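The `communication_window` argument passed to the trainer controls how many mini-batches each worker processes before exchanging its accumulated update with the parameter server. A rough standalone sketch of that pattern (a hypothetical counting model of the worker loop, not the actual dist-keras worker code):

```python
def parameter_server_commits(num_batches, communication_window):
    # Count how often a worker would contact the parameter server:
    # gradients are staged locally and committed once per full window.
    commits = 0
    staged = 0
    for _ in range(num_batches):
        staged += 1            # gradient accumulated locally
        if staged == communication_window:
            commits += 1       # push accumulated update, pull fresh weights
            staged = 0
    return commits

print(parameter_server_commits(num_batches=100, communication_window=5))  # 20
```

A larger window means less network traffic but staler weights on the workers; the window of 5 used here is a compromise between the two.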
trainer = ADAG(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet,
               num_workers=num_workers, batch_size=16, communication_window=5,
               num_epoch=5, features_col="matrix", label_col="label_encoded")
trained_model = trainer.train(training_set)

print("Training time: " + str(trainer.get_training_time()))
print("Accuracy: " + str(evaluate_accuracy(trained_model, test_set)))
print("Number of parameter server updates: " + str(trainer.parameter_server.num_updates))



================================================
FILE: examples/mnist_analysis.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST Analysis with Distributed Keras\n", "\n", "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18 January 2017\r\n" ] } ], "source": [ "!(date +%d\\ %B\\ %G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will show you how to process the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using Distributed Keras. As in the [workflow](https://github.com/JoeriHermans/dist-keras/blob/master/examples/workflow.ipynb) notebook, we will guide you through the complete machine learning pipeline.\n", "\n", "## Preparation\n", "\n", "To get started, we first load all the required imports. Please make sure you installed `dist-keras`, and `seaborn`. Furthermore, we assume that you have access to an installation which provides Apache Spark.\n", "\n", "Before you start this notebook, please make sure you ran the \"MNIST preprocessing\" notebook first, since we will be evaluating a manually \"enlarged dataset\"."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "\n", "from keras.optimizers import *\n", "from keras.models import Sequential\n", "from keras.layers.core import *\n", "from keras.layers.convolutional import *\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from matplotlib import pyplot as plt\n", "\n", "from pyspark import StorageLevel\n", "\n", "from pyspark.ml.feature import StandardScaler\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import OneHotEncoder\n", "from pyspark.ml.feature import MinMaxScaler\n", "from pyspark.ml.feature import StringIndexer\n", "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "\n", "from distkeras.trainers import *\n", "from distkeras.predictors import *\n", "from distkeras.transformers import *\n", "from distkeras.evaluators import *\n", "from distkeras.utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following cell, adapt the parameters to fit your personal requirements." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Keras MNIST Analysis\"\n", "using_spark_2 = False\n", "local = False\n", "path = \"mnist.parquet\"\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_processes = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 30\n", " num_processes = 1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of desired executors: 30\n", "Number of desired processes / executor: 1\n", "Total number of workers: 30\n" ] } ], "source": [ "# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\n", "num_workers = num_executors * num_processes\n", "\n", "print(\"Number of desired executors: \" + str(num_executors))\n", "print(\"Number of desired processes / executor: \" + str(num_processes))\n", "print(\"Total number of workers: \" + str(num_workers))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", str(num_processes))\n", "conf.set(\"spark.executor.instances\", str(num_executors))\n", "conf.set(\"spark.locality.wait\", \"0\")\n", "conf.set(\"spark.executor.memory\", \"5g\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n", "\n", "# Check if the user is running Spark 2.0+.\n", "if using_spark_2:\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = 
SparkContext(conf=conf)\n", " # Add the missing imports\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "# Read the training and test set.\n", "training_set = reader.read.parquet('data/mnist_train_big.parquet') \\\n", " .select(\"features_normalized_dense\", \"label_encoded\", \"label\")\n", "test_set = reader.read.parquet('data/mnist_test_preprocessed.parquet') \\\n", " .select(\"features_normalized_dense\", \"label_encoded\", \"label\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- features_normalized_dense: vector (nullable = true)\n", " |-- label_encoded: vector (nullable = true)\n", " |-- label: long (nullable = true)\n", "\n" ] } ], "source": [ "# Print the schema of the dataset.\n", "training_set.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Development" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multilayer Perceptron" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mlp = Sequential()\n", "mlp.add(Dense(1000, input_shape=(784,)))\n", "mlp.add(Activation('relu'))\n", "mlp.add(Dropout(0.2))\n", "mlp.add(Dense(200))\n", "mlp.add(Activation('relu'))\n", "mlp.add(Dropout(0.2))\n", "mlp.add(Dense(10))\n", "mlp.add(Activation('softmax'))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "____________________________________________________________________________________________________\n", "Layer (type) Output Shape Param # Connected to \n", 
"====================================================================================================\n", "dense_1 (Dense) (None, 1000) 785000 dense_input_1[0][0] \n", "____________________________________________________________________________________________________\n", "activation_1 (Activation) (None, 1000) 0 dense_1[0][0] \n", "____________________________________________________________________________________________________\n", "dropout_1 (Dropout) (None, 1000) 0 activation_1[0][0] \n", "____________________________________________________________________________________________________\n", "dense_2 (Dense) (None, 200) 200200 dropout_1[0][0] \n", "____________________________________________________________________________________________________\n", "activation_2 (Activation) (None, 200) 0 dense_2[0][0] \n", "____________________________________________________________________________________________________\n", "dropout_2 (Dropout) (None, 200) 0 activation_2[0][0] \n", "____________________________________________________________________________________________________\n", "dense_3 (Dense) (None, 10) 2010 dropout_2[0][0] \n", "____________________________________________________________________________________________________\n", "activation_3 (Activation) (None, 10) 0 dense_3[0][0] \n", "====================================================================================================\n", "Total params: 987210\n", "____________________________________________________________________________________________________\n" ] } ], "source": [ "mlp.summary()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "optimizer_mlp = 'adam'\n", "loss_mlp = 'categorical_crossentropy'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "\n", "Prepare the training and test set for evaluation and training." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of training instances: 6060000\n", "Number of testing instances: 10000\n" ] } ], "source": [ "training_set = training_set.repartition(num_workers)\n", "test_set = test_set.repartition(num_workers)\n", "training_set.cache()\n", "test_set.cache()\n", "print(\"Number of training instances: \" + str(training_set.count()))\n", "print(\"Number of testing instances: \" + str(test_set.count()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation\n", "\n", "We define a utility function which will compute the accuracy for us." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate_accuracy(model, test_set, features=\"features_normalized_dense\"):\n", " evaluator = AccuracyEvaluator(prediction_col=\"prediction_index\", label_col=\"label\")\n", " predictor = ModelPredictor(keras_model=model, features_col=features)\n", " transformer = LabelIndexTransformer(output_dim=10)\n", " test_set = test_set.select(features, \"label\")\n", " test_set = predictor.predict(test_set)\n", " test_set = transformer.transform(test_set)\n", " score = evaluator.evaluate(test_set)\n", " \n", " return score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ADAG" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\n", " batch_size=4, communication_window=5, num_epoch=1,\n", " features_col=\"features_normalized_dense\", label_col=\"label_encoded\")\n", "# Modify the default parallelism factor.\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ 
"[array([[-0.02490237, -0.01861665,  0.03102627, ...,  0.01722135,\n", "          0.02223415, -0.04933412],\n", "        ...,\n", "        dtype=float32),\n", " ...]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the weights of the trained model.\n", "trained_model.get_weights()" ]
}, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 22619.2383449\n", "Accuracy: 0.9859\n" ] } ], "source": [ "print(\"Training time: \" + str(trainer.get_training_time()))\n", "print(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set)))" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 } ================================================ FILE: examples/mnist_preprocessing.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST Preprocessing\n", "\n", "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "07 February 2017\r\n" ] } ], "source": [ "!(date +%d\\ %B\\ %G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n", "\n", "To get started, we first load all the required imports. Please make sure you have installed dist-keras and seaborn. Furthermore, we assume that you have access to an installation which provides Apache Spark.\n", "\n", "Before you start this notebook, place the MNIST dataset (which is provided in a zip in examples/data within this repository) on HDFS. If HDFS is not available, place it on the local filesystem instead, and make sure the path to the file is identical on all computing nodes."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "\n", "import seaborn as sns\n", "\n", "import time\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from matplotlib import pyplot as plt\n", "\n", "from pyspark.ml.feature import StandardScaler\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import OneHotEncoder\n", "from pyspark.ml.feature import MinMaxScaler\n", "from pyspark.ml.feature import StringIndexer\n", "\n", "from distkeras.transformers import *\n", "from distkeras.utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following cell, adapt the parameters to fit your personal requirements." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"MNIST Preprocessing\"\n", "using_spark_2 = False\n", "local = False\n", "path_train = \"data/mnist_train.csv\"\n", "path_test = \"data/mnist_test.csv\"\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_processes = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 20\n", " num_processes = 1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of desired executors: 20\n", "Number of desired processes / executor: 1\n", "Total number of workers: 20\n" ] } ], "source": [ "# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\n", "num_workers = num_executors * num_processes\n", "\n", 
"print(\"Number of desired executors: \" + `num_executors`)\n", "print(\"Number of desired processes / executor: \" + `num_processes`)\n", "print(\"Total number of workers: \" + `num_workers`)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "# Use the DataBricks CSV reader, this has some nice functionality regarding invalid values.\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", `num_processes`)\n", "conf.set(\"spark.executor.instances\", `num_executors`)\n", "conf.set(\"spark.executor.memory\", \"20g\")\n", "conf.set(\"spark.yarn.executor.memoryOverhead\", \"2\")\n", "conf.set(\"spark.locality.wait\", \"0\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\");\n", "\n", "# Check if the user is running Spark 2.0 +\n", "if using_spark_2:\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Record time of starting point.\n", "time_start = time.time()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "# Read the training set.\n", "raw_dataset_train = reader.read.format('com.databricks.spark.csv') \\\n", " 
.options(header='true', inferSchema='true') \\\n", " .load(path_train)\n", "# Read the test set.\n", "raw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true') \\\n", " .load(path_test)\n", "# Repartition the datasets.\n", "raw_dataset_train = raw_dataset_train.repartition(num_workers)\n", "raw_dataset_test = raw_dataset_test.repartition(num_workers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown in the output of the cell above, every pixel is associated with a separate column. In order to ensure compatibility with Apache Spark, we vectorize the columns and add the resulting vectors as a separate column. To achieve this, we first need a list of the required columns. This is shown in the cell below." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# First, we would like to extract the desired features from the raw dataset.\n", "# We do this by constructing a list with all desired columns.\n", "features = raw_dataset_train.columns\n", "features.remove('label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have a list of column names, we can pass it to Spark's [VectorAssembler](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler). This VectorAssembler will take the list of features, vectorize them, and place them in the column defined in `outputCol`."
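What VectorAssembler conceptually does can be illustrated without Spark. The sketch below is plain Python/numpy; the row dicts, column names, and the `assemble_features` helper are purely illustrative, not part of Spark or dist-keras:

```python
import numpy as np

def assemble_features(rows, feature_columns):
    # Gather the named columns of each row into one dense vector,
    # preserving the column order -- conceptually what VectorAssembler does.
    return [np.array([row[c] for c in feature_columns], dtype=np.float64)
            for row in rows]

# Toy rows: three "pixel" columns plus a label column.
rows = [{"pixel0": 0, "pixel1": 128, "pixel2": 255, "label": 7},
        {"pixel0": 12, "pixel1": 0, "pixel2": 64, "label": 3}]
feature_columns = ["pixel0", "pixel1", "pixel2"]  # 'label' is excluded, as in the notebook

vectors = assemble_features(rows, feature_columns)
# vectors[0].tolist() -> [0.0, 128.0, 255.0]
```

In Spark, the same idea runs distributed over the DataFrame partitions and the assembled vector is appended as a new column rather than returned as a list.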
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Next, we use Spark's VectorAssembler to \"assemble\" (create) a vector of all desired features.\n", "# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\n", "vector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n", "# This transformer will take all columns specified in features, and create an additional column \"features\" which will contain all the desired features aggregated into a single vector.\n", "training_set = vector_assembler.transform(raw_dataset_train)\n", "test_set = vector_assembler.transform(raw_dataset_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the inputs for our Neural Network (the features column) after applying the VectorAssembler, we should also define the outputs. Since we are dealing with a classification task, the output of our Neural Network should be a one-hot encoded vector with 10 elements. We provide a `OneHotTransformer` which accomplishes exactly this." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define the number of output classes.\n", "nb_classes = 10\n", "encoder = OneHotTransformer(nb_classes, input_col=\"label\", output_col=\"label_encoded\")\n", "training_set = encoder.transform(training_set)\n", "test_set = encoder.transform(test_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MNIST\n", "\n", "[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits. Every image is a 28 by 28 pixel grayscale image. This means that every pixel has a value between 0 and 255. Some examples of instances within this dataset are shown in the cells below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization\n", "\n", "In this section, we normalize the feature vectors to the [0, 1] range."
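One-hot encoding itself is simple to sketch in numpy. The following illustrates the transformation applied to the `label` column; the `one_hot` helper is our own illustration, not the dist-keras API:

```python
import numpy as np

def one_hot(label, nb_classes):
    # A dense vector of zeros with a single 1.0 at the label's index.
    encoded = np.zeros(nb_classes, dtype=np.float32)
    encoded[label] = 1.0
    return encoded

# The digit 3 as a 10-class one-hot vector.
encoded = one_hot(3, 10)
# encoded.tolist() -> [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The network's softmax output layer then has the same shape, so the categorical cross-entropy loss can compare the two directly.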
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Clear the datasets in case you ran this cell before.\n", "training_set = training_set.select(\"features\", \"label\", \"label_encoded\")\n", "test_set = test_set.select(\"features\", \"label\", \"label_encoded\")\n", "# Allocate a MinMaxTransformer using Distributed Keras.\n", "# o_min -> original_minimum\n", "# n_min -> new_minimum\n", "# MNIST pixel values range from 0 to 255.\n", "transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\n", " o_min=0.0, o_max=255.0, \\\n", " input_col=\"features\", \\\n", " output_col=\"features_normalized\")\n", "# Transform the datasets.\n", "training_set = transformer.transform(training_set)\n", "test_set = transformer.transform(test_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convolutions\n", "\n", "In order to make the dense vectors compatible with convolution operations in Keras, we add another column which contains the matrix form of these images. We provide a utility class (ReshapeTransformer) which helps you with this." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "reshape_transformer = ReshapeTransformer(\"features_normalized\", \"matrix\", (28, 28, 1))\n", "training_set = reshape_transformer.transform(training_set)\n", "test_set = reshape_transformer.transform(test_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dense Transformation\n", "\n", "At the moment, dist-keras does not support SparseVectors due to the numpy dependency. As a result, we have to convert the SparseVector to a DenseVector. We added a simple utility transformer which does this for you."
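Min-max rescaling follows the standard formula x' = (x − o_min) / (o_max − o_min) · (n_max − n_min) + n_min. A small numpy sketch (the `min_max_scale` helper is our own, for illustration; dist-keras applies the same mapping per feature vector):

```python
import numpy as np

def min_max_scale(x, o_min, o_max, n_min, n_max):
    # Map values linearly from the original range [o_min, o_max]
    # to the new range [n_min, n_max].
    return (x - o_min) / (o_max - o_min) * (n_max - n_min) + n_min

pixels = np.array([0.0, 127.5, 255.0])
scaled = min_max_scale(pixels, o_min=0.0, o_max=255.0, n_min=0.0, n_max=1.0)
# scaled.tolist() -> [0.0, 0.5, 1.0]
```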
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dense_transformer = DenseTransformer(input_col=\"features_normalized\", output_col=\"features_normalized_dense\")\n", "training_set = dense_transformer.transform(training_set)\n", "test_set = dense_transformer.transform(test_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Artificial Enlargement\n", "\n", "We artificially enlarge the training set by repeatedly unioning it with itself, to simulate larger datasets and to evaluate optimizer performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DataFrame[features: vector, label: bigint, label_encoded: vector, features_normalized: vector, matrix: array<array<array<double>>>, features_normalized_dense: vector]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = training_set\n", "expansion = 10\n", "for i in range(0, expansion):\n", " df = df.unionAll(training_set)\n", "training_set = df\n", "training_set.cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing to HDFS\n", "\n", "In order to prevent constant preprocessing, and to ensure optimizer performance, we write the data to HDFS in Parquet format."
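Note the replication factor of the unionAll loop above: starting from the original and unioning the training set in `expansion` times yields `expansion + 1` copies, not `expansion`. A list-based sketch of the same pattern (the `enlarge` helper is our own, for illustration):

```python
def enlarge(dataset, expansion):
    # Mirror the notebook's unionAll loop: the result contains the
    # original dataset plus `expansion` additional copies of it.
    enlarged = list(dataset)
    for _ in range(expansion):
        enlarged.extend(dataset)
    return enlarged

rows = ["row-%d" % i for i in range(3)]
bigger = enlarge(rows, 10)
print(len(bigger))  # 33, i.e. (10 + 1) * 3
```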
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "training_set.write.parquet(\"data/mnist_train.parquet\")\n", "test_set.write.parquet(\"data/mnist_test.parquet\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Record end of transformation.\n", "time_end = time.time()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dt = time_end - time_start\n", "print(\"Took \" + str(dt) + \" seconds.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Remove the Parquet files from HDFS again (cleanup).\n", "!hdfs dfs -rm -r data/mnist_test.parquet\n", "!hdfs dfs -rm -r data/mnist_train.parquet" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 } ================================================ FILE: examples/workflow.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed Deep Learning with Apache Spark and Keras\n", "\n", "**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN) \n", "*Department of Knowledge Engineering* \n", "*Maastricht University, The Netherlands*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "06 April 2017\r\n" ] } ], "source": [ "!(date +%d\\ %B\\ %G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This presentation gives the reader an introduction to the topic of distributed deep learning
(DDL) and to the issues which need to be taken into consideration when applying this technique. We will also introduce a DDL framework based on a **fast and general engine for large-scale data processing** called [Apache Spark](https://spark.apache.org/) and the **neural network library** [Keras](https://keras.io). \n", "\n", "The project was initiated by the CMS experiment. CMS is exploring the possibility of using a deep learning model for the high level trigger in order to handle the data rates of LHC run 3 and up. Furthermore, they would like to be able to train their models faster using distributed algorithms, allowing them to tune their models more frequently. Another requirement was that these models should be trained on their complete dataset, which is in the order of a TB. At this point, production use-cases for ATLAS are also being evaluated. These focus more on the serving of models to classify instances." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "- [Introduction and problem statement](#Distributed-Deep-Learning,-an-introduction.)\n", " - [Model parallelism](#Model-parallelism2)\n", " - [Data parallelism](#Data-parallelism)\n", "- [Usage](#Distributed-Keras:-a-practicle-example)\n", "- [Acknowledgments](#Acknowledgments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributed Deep Learning, an introduction.\n", "\n", "Unsupervised feature learning and deep learning have shown that being able to train large models can dramatically improve performance. However, consider the problem of training a deep network with billions of parameters. How do we achieve this without waiting for days or even weeks, leaving more time to tune the model? Dean et al. [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf) proposed a training paradigm which allows us to train a model on multiple physical machines.
The authors describe two methods to achieve this, i.e., **data parallelism** and **model parallelism**1.\n", "\n", "### Model parallelism2\n", "\n", "In model parallelism a *single* model is distributed over multiple machines. The performance benefits of distributing a deep network across multiple machines depend mainly on the structure of the model. Models with a large number of parameters typically benefit from access to more CPUs and memory, up to the point where communication costs, i.e., propagation of the weight updates and synchronization mechanisms, dominate [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).\n", "\n", "\"Model\n", "\n", "### Data parallelism\n", "\n", "As stated in the introduction, in order to train a large network in a reasonable amount of time, we need to parallelize the optimization process (which is the learning of the model). In this setting, we take *several* model replicas and distribute them over multiple machines. Of course, it would also be possible to combine this with the model parallelism approach. However, for the sake of simplicity, let us assume that a model (or several models) can be contained on a single machine. In order to parallelize the training, and to improve the usage of the resources of the cluster, we distribute the models over several machines.\n", "\n", "In order to build a distributed learning scheme using data parallelism, in the simplest case you need at least one **parameter server**. A parameter server is basically a thread (or a collection of threads) which aggregates the incoming gradient updates of the workers into a so-called center variable, which acts as a *global consensus* variable. The weights stored in the center variable will eventually be used by the produced model.\n", "\n", "\"Data\n", "\n", "There are two general approaches to data parallelism. The most straightforward is a **synchronous** method. 
In short, a synchronous data parallel method will wait for all workers to finish the current mini-batch or stochastic sample before continuing to the next iteration. Synchronous methods have the advantage that all workers use the most recent center variable, i.e., a worker knows that all other workers use the same center variable. However, the main disadvantage of this method is the synchronization itself. A synchronous method will never be truly synchronous, because it runs on many, possibly heterogeneous, machines. Furthermore, every machine could have a different workload, which would influence the training speed of a worker. As a result, synchronous methods need additional waiting mechanisms to synchronize all workers. These locking mechanisms make sure that all workers compute the next gradient based on the same center variable. However, locking mechanisms introduce significant waiting, which significantly influences the training speed. For example, imagine a cluster node with an unusually high load. This high load will, due to CPU sharing, cause the training procedure to slow down, which in turn will cause the other workers to wait for this single node. Of course, this is a simple and possibly extreme example, but it shows how a single worker can significantly influence the training time of all workers.\n", "\n", "\"Data\n", "\n", "A very simple, but radical \"solution\" for this synchronization problem is to not synchronize the workers :) Workers simply fetch the center variable and update the parameter server with the computed gradient whenever a worker is ready. This approach is called an **asynchronous** data parallel method. Asynchronous methods, compared to synchronous methods, have a different set of problems. One of these is the so-called **stale gradient**. 
This is a gradient computed from an older version of the center variable, while the current center variable has already been updated by other workers. One approach to mitigating this is to apply an exponential decay factor to the gradient updates. This discounts stale updates, but wastes computational work; alternatively, a worker could simply fetch the most recent weights from the parameter server and start again. However, as we will show later, it is actually stale gradients (a result of asynchrony) that induce *implicit momentum* in the learning process [[2]](https://arxiv.org/pdf/1606.04487v4.pdf).\n", "\n", "\"Data\n", "\n", "At this point you probably ask the question: **why does this actually work?** A lot of people suggest this is due to the sparsity of the gradients. Intuitively, imagine having multiple workers processing different data (since every worker has its own data partition); the weight updates are likely to be largely dissimilar, since we are training a large network with a lot of tunable parameters. Furthermore, techniques such as dropout (if they are applied differently among the replicas) only increase the sparsity of the updates3.\n", "\n", "### Formalization\n", "\n", "We would also like to inform the reader that the general problem to be solved is the so-called **global consensus optimization** problem. A popular approach to solving it is the Alternating Direction Method of Multipliers (ADMM) [[3]](http://www.jmlr.org/proceedings/papers/v32/zhange14.pdf) [[4]](http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf). Since this is outside the scope of this notebook, we will not review it in depth. But we would like to note that the *Elastic Averaging* methods [[5]](https://arxiv.org/pdf/1412.6651.pdf) by Zhang et al., which we included in *Distributed Keras*, are based on ADMM.\n", "\n", "**1:** Hybrids are possible as well. 
\n", "**2:** This is mainly used for the computation of the network outputs [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf). \n", "**3:** A way to check the sparsity between two gradients is to put all the weights into a one-dimensional vector, and then compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributed Keras\n", "\n", "Distributed Keras is a framework which uses Apache Spark and Keras. We chose Spark because of its distributed environment. This allows us to preprocess the data in a distributed manner, and train our deep learning models on the same architecture, while still having the modeling simplicity of Keras.\n", "\n", "### Architecture\n", "\n", "Our architecture is very similar to the architecture discussed in [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf). However, we employ Apache Spark for data-parallel reading and for handling larger-than-memory datasets. The parameter server will always be created in the **Spark Driver**. This **is the program which creates the Spark Context**. For example, if the Jupyter installation of this notebook is running on the Spark cluster, then a cluster node will host the parameter server. However, if you run a Python script which connects to a remote Spark cluster, then your computer will run the Spark Driver, and as a result will run the parameter server. In that case, be sure your network connection is able to handle the load, else your computer will be the bottleneck in the learning process.\n", "\n", "\"Model\n", "\n", "### Implementation of a custom distributed optimizer\n", "\n", "In order to implement your own optimizer you need two classes. First, define your optimizer using the *Trainer* interface. We already supply an *AsynchronousDistributedTrainer* and a *SynchronousDistributedTrainer*. 
However, if you require a different procedure, feel free to implement it. Finally, you need a worker class. This class must have a *train* method with the required arguments, as specified by Apache Spark.\n", "\n", "### Usage\n", "\n", "In the following sections, we will give you an example of what a complete workflow looks like. This includes setting up a Spark context, reading, preprocessing, and normalizing the data. Finally, we create a relatively simple model (feel free to adjust the parameters) with Keras and optimize it using the different distributed optimizers which are included by default.\n", "\n", "#### Dataset\n", "\n", "We are using the ATLAS Higgs dataset constructed for the Kaggle machine learning challenge. This dataset is quite limited: it contains only **250000** instances, 40% of which we will use as a test set. For future experiments, it would be useful to integrate well-understood datasets such as CIFAR or MNIST to evaluate against other optimizers. However, it would be nice to have a \"well understood\" HEP (High Energy Physics) dataset for this task :) " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import numpy as np\n", "\n", "import time\n", "\n", "import requests\n", "\n", "from keras.optimizers import *\n", "from keras.models import Sequential\n", "from keras.layers.core import Dense, Dropout, Activation\n", "\n", "from pyspark import SparkContext\n", "from pyspark import SparkConf\n", "\n", "from pyspark.ml.feature import StandardScaler\n", "from pyspark.ml.feature import VectorAssembler\n", "from pyspark.ml.feature import StringIndexer\n", "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "from pyspark.mllib.evaluation import BinaryClassificationMetrics\n", "\n", "from distkeras.trainers import *\n", "from distkeras.predictors import *\n", "from 
distkeras.transformers import *\n", "from distkeras.evaluators import *\n", "from distkeras.utils import *" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Modify these variables according to your needs.\n", "application_name = \"Distributed Keras Notebook\"\n", "using_spark_2 = False\n", "local = False\n", "if local:\n", " # Tell master to use local resources.\n", " master = \"local[*]\"\n", " num_cores = 3\n", " num_executors = 1\n", "else:\n", " # Tell master to use YARN.\n", " master = \"yarn-client\"\n", " num_executors = 6\n", " num_cores = 2" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of desired executors: 6\n", "Number of desired cores / executor: 2\n", "Total number of workers: 12\n" ] } ], "source": [ "# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\n", "num_workers = num_executors * num_cores\n", "\n", "print(\"Number of desired executors: \" + `num_executors`)\n", "print(\"Number of desired cores / executor: \" + `num_cores`)\n", "print(\"Total number of workers: \" + `num_workers`)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "# Use the DataBricks CSV reader, this has some nice functionality regarding invalid values.\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preparing a Spark Context\n", "\n", "In order to read our (big) dataset into our Spark Cluster, we first need a Spark Context. However, since Spark 2.0 there are some changes regarding the initialization of a Spark Context. For example, SQLContext and HiveContext do not have to be initialized separately anymore." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conf = SparkConf()\n", "conf.set(\"spark.app.name\", application_name)\n", "conf.set(\"spark.master\", master)\n", "conf.set(\"spark.executor.cores\", str(num_cores))\n", "conf.set(\"spark.executor.instances\", str(num_executors))\n", "conf.set(\"spark.locality.wait\", \"0\")\n", "conf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n", "\n", "# Check if the user is running Spark 2.0+\n", "if using_spark_2:\n", " from pyspark.sql import SparkSession\n", " sc = SparkSession.builder.config(conf=conf) \\\n", " .appName(application_name) \\\n", " .getOrCreate()\n", "else:\n", " # Create the Spark context.\n", " sc = SparkContext(conf=conf)\n", " # Add the missing imports.\n", " from pyspark import SQLContext\n", " sqlContext = SQLContext(sc)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Check if we are using Spark 2.0\n", "if using_spark_2:\n", " reader = sc\n", "else:\n", " reader = sqlContext\n", "# Read the dataset.\n", "raw_dataset = reader.read.format('com.databricks.spark.csv') \\\n", " .options(header='true', inferSchema='true').load(\"data/atlas_higgs.csv\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- EventId: integer (nullable = true)\n", " |-- DER_mass_MMC: double (nullable = true)\n", " |-- DER_mass_transverse_met_lep: double (nullable = true)\n", " |-- DER_mass_vis: double (nullable = true)\n", " |-- DER_pt_h: double (nullable = true)\n", " |-- DER_deltaeta_jet_jet: double (nullable = true)\n", " |-- DER_mass_jet_jet: double (nullable = true)\n", " |-- DER_prodeta_jet_jet: double (nullable = true)\n", " |-- DER_deltar_tau_lep: double (nullable = true)\n", " |-- DER_pt_tot: double (nullable = true)\n", " |-- DER_sum_pt: double (nullable = true)\n", " |-- 
DER_pt_ratio_lep_tau: double (nullable = true)\n", " |-- DER_met_phi_centrality: double (nullable = true)\n", " |-- DER_lep_eta_centrality: double (nullable = true)\n", " |-- PRI_tau_pt: double (nullable = true)\n", " |-- PRI_tau_eta: double (nullable = true)\n", " |-- PRI_tau_phi: double (nullable = true)\n", " |-- PRI_lep_pt: double (nullable = true)\n", " |-- PRI_lep_eta: double (nullable = true)\n", " |-- PRI_lep_phi: double (nullable = true)\n", " |-- PRI_met: double (nullable = true)\n", " |-- PRI_met_phi: double (nullable = true)\n", " |-- PRI_met_sumet: double (nullable = true)\n", " |-- PRI_jet_num: integer (nullable = true)\n", " |-- PRI_jet_leading_pt: double (nullable = true)\n", " |-- PRI_jet_leading_eta: double (nullable = true)\n", " |-- PRI_jet_leading_phi: double (nullable = true)\n", " |-- PRI_jet_subleading_pt: double (nullable = true)\n", " |-- PRI_jet_subleading_eta: double (nullable = true)\n", " |-- PRI_jet_subleading_phi: double (nullable = true)\n", " |-- PRI_jet_all_pt: double (nullable = true)\n", " |-- Weight: double (nullable = true)\n", " |-- Label: string (nullable = true)\n", "\n" ] } ], "source": [ "# Double-check the inferred schema to see what the dataset looks like.\n", "raw_dataset.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dataset preprocessing and normalization\n", "\n", "Since Spark's MLlib has some nice features for distributed preprocessing, we made sure we comply with the DataFrame API in order to ensure compatibility. What it basically boils down to is that all the features (which can have different types) will be aggregated into a single column. More information on Spark MLlib (and other APIs) can be found here: [http://spark.apache.org/docs/latest/ml-guide.html](http://spark.apache.org/docs/latest/ml-guide.html)\n", "\n", "In the following steps we will show you how to extract the desired columns from the dataset and prepare them for further processing." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Row(features=DenseVector([138.47, 51.655, 97.827, 27.98, 0.91, 124.711, 2.666, 3.064, 41.928, 197.76, 1.582, 1.396, 0.2, 32.638, 1.017, 0.381, 51.626, 2.273, -2.414, 16.824, -0.277, 258.733, 2.0, 67.435, 2.15, 0.444, 46.062, 1.24, -2.475, 113.497]))]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First, we would like to extract the desired features from the raw dataset.\n", "# We do this by constructing a list with all desired columns.\n", "features = raw_dataset.columns\n", "features.remove('EventId')\n", "features.remove('Weight')\n", "features.remove('Label')\n", "# Next, we use Spark's VectorAssembler to \"assemble\" (create) a vector of all desired features.\n", "# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\n", "vector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n", "# This transformer will take all columns specified in features, and create an additional column \"features\" which will contain all the desired features aggregated into a single vector.\n", "dataset = vector_assembler.transform(raw_dataset)\n", "\n", "# Show what happened after applying the vector assembler.\n", "# Note: \"features\" column got appended to the end.\n", "dataset.select(\"features\").take(1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Apply feature normalization with standard scaling. 
This will transform a feature to have mean 0, and std 1.\n", "# http://spark.apache.org/docs/latest/ml-features.html#standardscaler\n", "standard_scaler = StandardScaler(inputCol=\"features\", outputCol=\"features_normalized\", withStd=True, withMean=True)\n", "standard_scaler_model = standard_scaler.fit(dataset)\n", "dataset = standard_scaler_model.transform(dataset)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Row(Label=u's', label_index=1.0),\n", " Row(Label=u'b', label_index=0.0),\n", " Row(Label=u'b', label_index=0.0),\n", " Row(Label=u'b', label_index=0.0),\n", " Row(Label=u'b', label_index=0.0)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If we look at the dataset, the Label column consists of 2 entries, i.e., b (background), and s (signal).\n", "# Our neural network will not be able to handle these characters, so instead, we convert it to an index so we can indicate that output neuron with index 0 is background, and 1 is signal.\n", "# http://spark.apache.org/docs/latest/ml-features.html#stringindexer\n", "label_indexer = StringIndexer(inputCol=\"Label\", outputCol=\"label_index\").fit(dataset)\n", "dataset = label_indexer.transform(dataset)\n", "\n", "# Show the result of the label transformation.\n", "dataset.select(\"Label\", \"label_index\").take(5)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define some properties of the neural network for later use.\n", "nb_classes = 2 # Number of output classes (signal and background)\n", "nb_features = len(features)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# We observe that Keras is not able to work with these indexes.\n", "# What it actually expects is a vector with an identical size to the output layer.\n", "# 
Our framework provides functionality to do this with ease.\n", "# Given an expected vector dimension, it prepares a zero vector\n", "# with the specified dimensionality, and sets the neuron with the\n", "# specified label index to one (one-hot encoding).\n", "\n", "# For example:\n", "# 1. Assume we have a label index: 3\n", "# 2. Output dimensionality: 5\n", "# With these parameters, we obtain the following vector in the DataFrame column: [0,0,0,1,0]\n", "\n", "transformer = OneHotTransformer(output_dim=nb_classes, input_col=\"label_index\", output_col=\"label\")\n", "dataset = transformer.transform(dataset)\n", "# Only select the columns we need (less data shuffling) while training.\n", "dataset = dataset.select(\"features_normalized\", \"label_index\", \"label\")\n", "\n", "# Show the expected output vectors of the neural network.\n", "dataset.select(\"label_index\", \"label\").take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: shuffling a large dataset will take some time.\n", "\n", "We recommend that users first preprocess and shuffle their data, as described in the data preprocessing notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Shuffle the dataset.\n", "dataset = shuffle(dataset)\n", "\n", "# Note: we also support shuffling in the trainers by default.\n", "# However, since this would require a shuffle for every training run, we only do it once here.\n", "# If you want, you can enable shuffling during training by specifying shuffle=True in the train() function." 
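The one-hot mapping described above can be sketched in plain NumPy. This is an illustrative helper (the name `one_hot` is an assumption of this sketch), not the actual `OneHotTransformer` implementation:

```python
import numpy as np

def one_hot(label_index, output_dim):
    # Prepare a zero vector of the requested dimensionality and set
    # the entry at the label index to one.
    encoding = np.zeros(output_dim)
    encoding[int(label_index)] = 1.0
    return encoding

# Label index 3 with output dimensionality 5 yields [0, 0, 0, 1, 0].
vector = one_hot(3, 5)
```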
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Finally, we create a training set and a test set.\n", "(training_set, test_set) = dataset.randomSplit([0.6, 0.4])\n", "training_set.cache()\n", "test_set.cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model construction\n", "\n", "We will now construct a relatively simple Keras model (without any modifications) which, hopefully, will be able to classify the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = Sequential()\n", "model.add(Dense(500, input_shape=(nb_features,)))\n", "model.add(Activation('relu'))\n", "model.add(Dropout(0.4))\n", "model.add(Dense(500))\n", "model.add(Activation('relu'))\n", "model.add(Dense(nb_classes))\n", "model.add(Activation('softmax'))\n", "\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Worker Optimizer and Loss\n", "\n", "In order to evaluate the gradient on the model replicas, we have to specify an optimizer and a loss function. For this, we just follow the Keras API as defined in the documentation: [https://keras.io/optimizers/](https://keras.io/optimizers/) and [https://keras.io/objectives/](https://keras.io/objectives/)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "optimizer = 'adagrad'\n", "loss = 'categorical_crossentropy'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training\n", "\n", "In the following cells we will train and evaluate the model using different distributed trainers. However, we will also provide a baseline metric using a **SingleTrainer**, which is basically an instance of the Adagrad optimizer running on Spark.\n", "\n", "Furthermore, we will also evaluate every training run using Spark's MulticlassClassificationEvaluator [https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Evaluation\n", "\n", "We will evaluate all algorithms using the F1 score [https://en.wikipedia.org/wiki/F1_score](https://en.wikipedia.org/wiki/F1_score) and accuracy metrics." 
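Both metrics can be computed from the confusion counts of the binary prediction. A minimal sketch, independent of Spark, for a positive class labeled `1` (the helper name is an assumption of this sketch):

```python
def binary_metrics(predictions, labels):
    # Count true positives, false positives, and false negatives for
    # the positive (signal) class, plus overall correct predictions.
    tp = fp = fn = correct = 0
    for p, y in zip(predictions, labels):
        correct += (p == y)
        tp += (p == 1 and y == 1)
        fp += (p == 1 and y == 0)
        fn += (p == 0 and y == 1)
    accuracy = float(correct) / len(labels)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

accuracy, f1 = binary_metrics([1, 0, 1, 1], [1, 0, 0, 1])
```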
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate_accuracy(model):\n", " global test_set\n", " \n", " # Allocate a Distributed Keras accuracy evaluator.\n", " evaluator = AccuracyEvaluator(prediction_col=\"prediction_index\", label_col=\"label_index\")\n", " # Clear the prediction column from the test set.\n", " test_set = test_set.select(\"features_normalized\", \"label_index\", \"label\")\n", " # Apply a prediction from the trained model.\n", " predictor = ModelPredictor(keras_model=model, features_col=\"features_normalized\")\n", " test_set = predictor.predict(test_set)\n", " # Allocate an index transformer.\n", " index_transformer = LabelIndexTransformer(output_dim=nb_classes)\n", " # Transform the prediction vector to an indexed label.\n", " test_set = index_transformer.transform(test_set)\n", " # Fetch the score.\n", " score = evaluator.evaluate(test_set)\n", " \n", " return score" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def add_result(trainer, accuracy, dt):\n", " global results\n", " \n", " # Store the metrics.\n", " results[trainer] = {}\n", " results[trainer]['accuracy'] = accuracy\n", " results[trainer]['time_spent'] = dt\n", " # Display the metrics.\n", " print(\"Trainer: \" + str(trainer))\n", " print(\" - Accuracy: \" + str(accuracy))\n", " print(\" - Training time: \" + str(dt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But first, we will allocate a simple data structure which will hold the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "results = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### SingleTrainer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A ***SingleTrainer*** is used as a benchmark to compare against the distributed trainers. 
However, one could also use this trainer if the dataset is too big to fit in memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = SingleTrainer(keras_model=model, worker_optimizer=optimizer,\n", " loss=loss, features_col=\"features_normalized\",\n", " label_col=\"label\", num_epoch=1, batch_size=32)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fetch the evaluation metrics.\n", "accuracy = evaluate_accuracy(trained_model)\n", "dt = trainer.get_training_time()\n", "# Add the metrics to the results.\n", "add_result('single', accuracy, dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Asynchronous EASGD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EASGD-based methods, proposed by Zhang et al., transmit the complete parametrization instead of the gradient. These methods \"average\" the difference between the center variable and the backpropagated worker variable, which is used to compute a new center (master) variable, on which the worker nodes will base their backpropagation in the next iteration.\n", "\n", "Asynchronous EASGD does this in an asynchronous fashion: whenever a worker node has processed its mini-batches for a certain number of iterations (the communication window), the computed parametrization is communicated to the parameter server, which updates the center (master) variable immediately, without waiting for other workers." 
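The elastic averaging idea can be illustrated with scalar weights. This is a toy sketch of the update rule from Zhang et al., not the dist-keras worker code; treating the coupling strength as the product of the learning rate and `rho` is an assumption of this sketch:

```python
LEARNING_RATE = 0.1
RHO = 5.0
ALPHA = LEARNING_RATE * RHO  # elastic coupling strength (assumed form)

def easgd_worker_step(worker, center, gradient):
    # The worker descends its local gradient, but is elastically
    # pulled towards the center (master) variable.
    return worker - LEARNING_RATE * gradient - ALPHA * (worker - center)

def easgd_center_step(center, workers):
    # The center variable moves towards the worker variables;
    # asynchronously, each worker's pull is applied as it arrives.
    for worker in workers:
        center += ALPHA * (worker - center)
    return center
```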
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = AEASGD(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers, \n", " batch_size=32, features_col=\"features_normalized\", label_col=\"label\", num_epoch=1,\n", " communication_window=32, rho=5.0, learning_rate=0.1)\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fetch the evaluation metrics.\n", "accuracy = evaluate_accuracy(trained_model)\n", "dt = trainer.get_training_time()\n", "# Add the metrics to the results.\n", "add_result('aeasgd', accuracy, dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Asynchronous EAMSGD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only difference between asynchronous EAMSGD and asynchronous EASGD is the possibility of specifying an explicit momentum term." 
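The momentum term only changes the worker-side step. A scalar sketch of how a momentum velocity could enter the update (names and ordering are assumptions of this illustration, not the dist-keras implementation):

```python
def eamsgd_worker_step(worker, velocity, center, gradient,
                       learning_rate=0.1, momentum=0.6, alpha=0.5):
    # The velocity accumulates past gradient steps; the elastic pull
    # towards the center variable is applied as in plain EASGD.
    velocity = momentum * velocity - learning_rate * gradient
    worker = worker + velocity - alpha * (worker - center)
    return worker, velocity

new_worker, new_velocity = eamsgd_worker_step(1.0, 0.0, 0.0, 0.0)
```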
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trainer = EAMSGD(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers,\n", " batch_size=32, features_col=\"features_normalized\", label_col=\"label\", num_epoch=1,\n", " communication_window=32, rho=5.0, learning_rate=0.1, momentum=0.6)\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fetch the evaluation metrics.\n", "accuracy = evaluate_accuracy(trained_model)\n", "dt = trainer.get_training_time()\n", "# Add the metrics to the results.\n", "add_result('eamsgd', accuracy, dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### DOWNPOUR SGD" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trainer = DOWNPOUR(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers,\n", " batch_size=32, communication_window=5, learning_rate=0.05, num_epoch=1,\n", " features_col=\"features_normalized\", label_col=\"label\")\n", "trainer.set_parallelism_factor(1)\n", "trained_model = trainer.train(training_set)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fetch the evaluation metrics.\n", "accuracy = evaluate_accuracy(trained_model)\n", "dt = trainer.get_training_time()\n", "# Add the metrics to the results.\n", "add_result('downpour', accuracy, dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experimental observations\n", "\n", "- DOWNPOUR converges well when a small communication window is used $< 5$.\n", "- EASGD based methods on the other hand, thrive using large communication windows $> 25$.\n", "- Asynchronous methods induce implicit momentum.\n", "\n", "## Summary\n", "\n", 
"Distributed Deep Learning can **significantly speed up** the learning process. We provide such a framework built on top of Keras and Apache Spark. The latter provides a nice framework for distributed data processing and model evaluation. We can easily integrate our workflows with Apache Spark, and thus speed up the **data preprocessing** and our **model optimization procedure** while still having the same **modelling simplicity**.\n", "\n", "Our group is always open to further collaboration on this work, and would like to assist the physics community in their machine learning efforts.\n", "\n", "**Contact**: [joeri.hermans@cern.ch](mailto:joeri.hermans@cern.ch)\n", " [luca.canali@cern.ch](mailto:luca.canali@cern.ch)\n", " [zbigniew.baranowski@cern.ch](mailto:zbigniew.baranowski@cern.ch)\n", "\n", "## Future work\n", "\n", "- Understanding the \"theoretical\" meaning of the communication window.\n", "- Apply compression to big weight updates when sending updates to the parameter server.\n", "- Keep track of a gradient residual (this will reduce the bandwidth due to sparsity).\n", "- Evaluation of algorithms with GPUs.\n", "- Optimization of parameter sharing (e.g., sockets instead of a REST API).\n", "- Use \"famous\" ConvNet architectures with well-known datasets in order to have a more sound evaluation.\n", "- Add a threaded queue to process asynchronous updates.\n", "- Report training accuracy during training.\n", "- Stop on a target loss.\n", "\n", "## Acknowledgments\n", "\n", "Many thanks to Zbigniew Baranowski and Luca Canali of the IT-DB group, and to Jean-Roch Vlimant, Maurizio Pierini, and Federico Presutti of the EP-UCM group for their collaboration on this work.\n", "\n", "## GitHub repository\n", "\n", "[https://github.com/JoeriHermans/dist-keras/](https://github.com/JoeriHermans/dist-keras/)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { 
"name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 0 } ================================================ FILE: mkdocs.yml ================================================ # Project information site_name: Distributed Keras site_description: Distributed Deep Learning with Apache Spark and Keras. site_author: Joeri Hermans site_url: 'http://dist-keras.joerihermans.com' # Page definitions. pages: - Home: index.md - Optimizers: optimizers.md - License: license.md # Documentation and theme configuration theme: readthedocs docs_dir: 'docs' markdown_extensions: - admonition extra: version: '0.1.0' palette: primary: 'grey' accent: 'light blue' author: github: 'JoeriHermans' twitter: 'joeri_hermans' # Copyright copyright: 'Copyright (c) 2016 Joeri Hermans' # Repository repo_name: 'GitHub' repo_url: 'https://github.com/JoeriHermans/dist-keras' ================================================ FILE: resources/blog-posts/css/main.css ================================================ /** * joerihermans.com main stylesheet. * * @author Joeri Hermans * @version 0,1 * @since 28 June 2016 */ /** BEGIN Imports. ************************************************************/ @import url('https://fonts.googleapis.com/css?family=Roboto:300'); /** END Imports. **************************************************************/ /** BEGIN General. 
************************************************************/ html, body { background: white; color: #1b1b1b; font-family: 'Roboto', sans-serif; height: 100%; margin: 0; font-size: 0.95em; } body { padding: 50px; } a { color: #4078c0; cursor: pointer; text-decoration: none; } a:hover { text-decoration: none; } a img { border: 0; } .left { float: left !important; } .right { float: right !important; } .inlined-code { border-radius: 2px; background: #f0f0f0; padding: 2px 4px; } .center-text { text-align: center; } .image-container { text-align: center; } /** END General. **************************************************************/ /** BEGIN Math. ***************************************************************/ .equation-math { text-align: center; margin: 40px 0; position: relative; } .equation-math span { text-align: left; } .equation-math-number { position: absolute; top: 0; bottom: 0; margin: 0; height: 100%; line-height: 40px; right: 20px; font-size: 0.9em; color: #1b1b1b; font-weight: bold; } /** END Math. *****************************************************************/ /** BEGIN Blog. ***************************************************************/ .blog-figure-container { text-align: center; } .blog-figure-container img { max-width: 100%; } .blog-figure-container p { text-align: left; } /** END Blog. ****************************************************************/ /** BEGIN Highlight. 
*********************************************************/ .hljs { display: block; overflow-x: auto; padding: 10px 30px; background: #f0f0f0; } .hljs, .hljs-subst { color: #444; } .hljs-comment { color: #888888; } .hljs-keyword, .hljs-attribute, .hljs-selector-tag, .hljs-meta-keyword, .hljs-doctag, .hljs-name { font-weight: bold; } .hljs-type, .hljs-string, .hljs-number, .hljs-selector-id, .hljs-selector-class, .hljs-quote, .hljs-template-tag, .hljs-deletion { color: #d4645c; } .hljs-title, .hljs-section { color: #d4645c; font-weight: bold; } .hljs-literal { color: #78A960; } .hljs-built_in, .hljs-bullet, .hljs-code, .hljs-addition { color: #397300; } .hljs-meta { color: #1f7199; } .hljs-meta-string { color: #4d99bf; } .hljs-emphasis { font-style: italic; } .hljs-strong { font-weight: bold; } /** END Highlight. ***********************************************************/ ================================================ FILE: resources/blog-posts/js/highlight.pack.js ================================================ /*! 
highlight.js v9.5.0 | BSD3 License | git.io/hljslicense */ !function(e){var n="object"==typeof window&&window||"object"==typeof self&&self;"undefined"!=typeof exports?e(exports):n&&(n.hljs=e({}),"function"==typeof define&&define.amd&&define([],function(){return n.hljs}))}(function(e){function n(e){return e.replace(/[&<>]/gm,function(e){return I[e]})}function t(e){return e.nodeName.toLowerCase()}function r(e,n){var t=e&&e.exec(n);return t&&0===t.index}function a(e){return k.test(e)}function i(e){var n,t,r,i,o=e.className+" ";if(o+=e.parentNode?e.parentNode.className:"",t=B.exec(o))return R(t[1])?t[1]:"no-highlight";for(o=o.split(/\s+/),n=0,r=o.length;r>n;n++)if(i=o[n],a(i)||R(i))return i}function o(e,n){var t,r={};for(t in e)r[t]=e[t];if(n)for(t in n)r[t]=n[t];return r}function u(e){var n=[];return function r(e,a){for(var i=e.firstChild;i;i=i.nextSibling)3===i.nodeType?a+=i.nodeValue.length:1===i.nodeType&&(n.push({event:"start",offset:a,node:i}),a=r(i,a),t(i).match(/br|hr|img|input/)||n.push({event:"stop",offset:a,node:i}));return a}(e,0),n}function c(e,r,a){function i(){return e.length&&r.length?e[0].offset!==r[0].offset?e[0].offset"}function u(e){l+=""}function c(e){("start"===e.event?o:u)(e.node)}for(var s=0,l="",f=[];e.length||r.length;){var g=i();if(l+=n(a.substr(s,g[0].offset-s)),s=g[0].offset,g===e){f.reverse().forEach(u);do c(g.splice(0,1)[0]),g=i();while(g===e&&g.length&&g[0].offset===s);f.reverse().forEach(o)}else"start"===g[0].event?f.push(g[0].node):f.pop(),c(g.splice(0,1)[0])}return l+n(a.substr(s))}function s(e){function n(e){return e&&e.source||e}function t(t,r){return new RegExp(n(t),"m"+(e.cI?"i":"")+(r?"g":""))}function r(a,i){if(!a.compiled){if(a.compiled=!0,a.k=a.k||a.bK,a.k){var u={},c=function(n,t){e.cI&&(t=t.toLowerCase()),t.split(" ").forEach(function(e){var t=e.split("|");u[t[0]]=[n,t[1]?Number(t[1]):1]})};"string"==typeof 
a.k?c("keyword",a.k):E(a.k).forEach(function(e){c(e,a.k[e])}),a.k=u}a.lR=t(a.l||/\w+/,!0),i&&(a.bK&&(a.b="\\b("+a.bK.split(" ").join("|")+")\\b"),a.b||(a.b=/\B|\b/),a.bR=t(a.b),a.e||a.eW||(a.e=/\B|\b/),a.e&&(a.eR=t(a.e)),a.tE=n(a.e)||"",a.eW&&i.tE&&(a.tE+=(a.e?"|":"")+i.tE)),a.i&&(a.iR=t(a.i)),null==a.r&&(a.r=1),a.c||(a.c=[]);var s=[];a.c.forEach(function(e){e.v?e.v.forEach(function(n){s.push(o(e,n))}):s.push("self"===e?a:e)}),a.c=s,a.c.forEach(function(e){r(e,a)}),a.starts&&r(a.starts,i);var l=a.c.map(function(e){return e.bK?"\\.?("+e.b+")\\.?":e.b}).concat([a.tE,a.i]).map(n).filter(Boolean);a.t=l.length?t(l.join("|"),!0):{exec:function(){return null}}}}r(e)}function l(e,t,a,i){function o(e,n){for(var t=0;t',i+n+o}function p(){var e,t,r,a;if(!E.k)return n(B);for(a="",t=0,E.lR.lastIndex=0,r=E.lR.exec(B);r;)a+=n(B.substr(t,r.index-t)),e=g(E,r),e?(M+=e[1],a+=h(e[0],n(r[0]))):a+=n(r[0]),t=E.lR.lastIndex,r=E.lR.exec(B);return a+n(B.substr(t))}function d(){var e="string"==typeof E.sL;if(e&&!x[E.sL])return n(B);var t=e?l(E.sL,B,!0,L[E.sL]):f(B,E.sL.length?E.sL:void 0);return E.r>0&&(M+=t.r),e&&(L[E.sL]=t.top),h(t.language,t.value,!1,!0)}function b(){k+=null!=E.sL?d():p(),B=""}function v(e){k+=e.cN?h(e.cN,"",!0):"",E=Object.create(e,{parent:{value:E}})}function m(e,n){if(B+=e,null==n)return b(),0;var t=o(n,E);if(t)return t.skip?B+=n:(t.eB&&(B+=n),b(),t.rB||t.eB||(B=n)),v(t,n),t.rB?0:n.length;var r=u(E,n);if(r){var a=E;a.skip?B+=n:(a.rE||a.eE||(B+=n),b(),a.eE&&(B=n));do E.cN&&(k+=C),E.skip||(M+=E.r),E=E.parent;while(E!==r.parent);return r.starts&&v(r.starts,""),a.rE?0:n.length}if(c(n,E))throw new Error('Illegal lexeme "'+n+'" for mode "'+(E.cN||"")+'"');return B+=n,n.length||1}var N=R(e);if(!N)throw new Error('Unknown language: "'+e+'"');s(N);var w,E=i||N,L={},k="";for(w=E;w!==N;w=w.parent)w.cN&&(k=h(w.cN,"",!0)+k);var B="",M=0;try{for(var 
I,j,O=0;;){if(E.t.lastIndex=O,I=E.t.exec(t),!I)break;j=m(t.substr(O,I.index-O),I[0]),O=I.index+j}for(m(t.substr(O)),w=E;w.parent;w=w.parent)w.cN&&(k+=C);return{r:M,value:k,language:e,top:E}}catch(T){if(T.message&&-1!==T.message.indexOf("Illegal"))return{r:0,value:n(t)};throw T}}function f(e,t){t=t||y.languages||E(x);var r={r:0,value:n(e)},a=r;return t.filter(R).forEach(function(n){var t=l(n,e,!1);t.language=n,t.r>a.r&&(a=t),t.r>r.r&&(a=r,r=t)}),a.language&&(r.second_best=a),r}function g(e){return y.tabReplace||y.useBR?e.replace(M,function(e,n){return y.useBR&&"\n"===e?"
":y.tabReplace?n.replace(/\t/g,y.tabReplace):void 0}):e}function h(e,n,t){var r=n?L[n]:t,a=[e.trim()];return e.match(/\bhljs\b/)||a.push("hljs"),-1===e.indexOf(r)&&a.push(r),a.join(" ").trim()}function p(e){var n,t,r,o,s,p=i(e);a(p)||(y.useBR?(n=document.createElementNS("http://www.w3.org/1999/xhtml","div"),n.innerHTML=e.innerHTML.replace(/\n/g,"").replace(//g,"\n")):n=e,s=n.textContent,r=p?l(p,s,!0):f(s),t=u(n),t.length&&(o=document.createElementNS("http://www.w3.org/1999/xhtml","div"),o.innerHTML=r.value,r.value=c(t,u(o),s)),r.value=g(r.value),e.innerHTML=r.value,e.className=h(e.className,p,r.language),e.result={language:r.language,re:r.r},r.second_best&&(e.second_best={language:r.second_best.language,re:r.second_best.r}))}function d(e){y=o(y,e)}function b(){if(!b.called){b.called=!0;var e=document.querySelectorAll("pre code");w.forEach.call(e,p)}}function v(){addEventListener("DOMContentLoaded",b,!1),addEventListener("load",b,!1)}function m(n,t){var r=x[n]=t(e);r.aliases&&r.aliases.forEach(function(e){L[e]=n})}function N(){return E(x)}function R(e){return e=(e||"").toLowerCase(),x[e]||x[L[e]]}var w=[],E=Object.keys,x={},L={},k=/^(no-?highlight|plain|text)$/i,B=/\blang(?:uage)?-([\w-]+)\b/i,M=/((^(<[^>]+>|\t|)+|(?:\n)))/gm,C="",y={classPrefix:"hljs-",tabReplace:null,useBR:!1,languages:void 0},I={"&":"&","<":"<",">":">"};return 
e.highlight=l,e.highlightAuto=f,e.fixMarkup=g,e.highlightBlock=p,e.configure=d,e.initHighlighting=b,e.initHighlightingOnLoad=v,e.registerLanguage=m,e.listLanguages=N,e.getLanguage=R,e.inherit=o,e.IR="[a-zA-Z]\\w*",e.UIR="[a-zA-Z_]\\w*",e.NR="\\b\\d+(\\.\\d+)?",e.CNR="(-?)(\\b0[xX][a-fA-F0-9]+|(\\b\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?)",e.BNR="\\b(0b[01]+)",e.RSR="!|!=|!==|%|%=|&|&&|&=|\\*|\\*=|\\+|\\+=|,|-|-=|/=|/|:|;|<<|<<=|<=|<|===|==|=|>>>=|>>=|>=|>>>|>>|>|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~",e.BE={b:"\\\\[\\s\\S]",r:0},e.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[e.BE]},e.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[e.BE]},e.PWM={b:/\b(a|an|the|are|I'm|isn't|don't|doesn't|won't|but|just|should|pretty|simply|enough|gonna|going|wtf|so|such|will|you|your|like)\b/},e.C=function(n,t,r){var a=e.inherit({cN:"comment",b:n,e:t,c:[]},r||{});return a.c.push(e.PWM),a.c.push({cN:"doctag",b:"(?:TODO|FIXME|NOTE|BUG|XXX):",r:0}),a},e.CLCM=e.C("//","$"),e.CBCM=e.C("/\\*","\\*/"),e.HCM=e.C("#","$"),e.NM={cN:"number",b:e.NR,r:0},e.CNM={cN:"number",b:e.CNR,r:0},e.BNM={cN:"number",b:e.BNR,r:0},e.CSSNM={cN:"number",b:e.NR+"(%|em|ex|ch|rem|vw|vh|vmin|vmax|cm|mm|in|pt|pc|px|deg|grad|rad|turn|s|ms|Hz|kHz|dpi|dpcm|dppx)?",r:0},e.RM={cN:"regexp",b:/\//,e:/\/[gimuy]*/,i:/\n/,c:[e.BE,{b:/\[/,e:/\]/,r:0,c:[e.BE]}]},e.TM={cN:"title",b:e.IR,r:0},e.UTM={cN:"title",b:e.UIR,r:0},e.METHOD_GUARD={b:"\\.\\s*"+e.UIR,r:0},e});hljs.registerLanguage("python",function(e){var r={cN:"meta",b:/^(>>>|\.\.\.) 
/},b={cN:"string",c:[e.BE],v:[{b:/(u|b)?r?'''/,e:/'''/,c:[r],r:10},{b:/(u|b)?r?"""/,e:/"""/,c:[r],r:10},{b:/(u|r|ur)'/,e:/'/,r:10},{b:/(u|r|ur)"/,e:/"/,r:10},{b:/(b|br)'/,e:/'/},{b:/(b|br)"/,e:/"/},e.ASM,e.QSM]},a={cN:"number",r:0,v:[{b:e.BNR+"[lLjJ]?"},{b:"\\b(0o[0-7]+)[lLjJ]?"},{b:e.CNR+"[lLjJ]?"}]},l={cN:"params",b:/\(/,e:/\)/,c:["self",r,a,b]};return{aliases:["py","gyp"],k:{keyword:"and elif is global as in if from raise for except finally print import pass return exec else break not with class assert yield try while continue del or def lambda async await nonlocal|10 None True False",built_in:"Ellipsis NotImplemented"},i:/(<\/|->|\?)/,c:[r,a,b,e.HCM,{v:[{cN:"function",bK:"def",r:10},{cN:"class",bK:"class"}],e:/:/,i:/[${=;\n,]/,c:[e.UTM,l,{b:/->/,eW:!0,k:"None"}]},{cN:"meta",b:/^[\t ]*@/,e:/$/},{b:/\b(print|exec)\(/}]}});hljs.registerLanguage("bash",function(e){var t={cN:"variable",v:[{b:/\$[\w\d#@][\w\d_]*/},{b:/\$\{(.*?)}/}]},s={cN:"string",b:/"/,e:/"/,c:[e.BE,t,{cN:"variable",b:/\$\(/,e:/\)/,c:[e.BE]}]},a={cN:"string",b:/'/,e:/'/};return{aliases:["sh","zsh"],l:/-?[a-z\.]+/,k:{keyword:"if then else elif fi for while in do done case esac function",literal:"true false",built_in:"break cd continue eval exec exit export getopts hash pwd readonly return shift test times trap umask unset alias bind builtin caller command declare echo enable help let local logout mapfile printf read readarray source type typeset ulimit unalias set shopt autoload bg bindkey bye cap chdir clone comparguments compcall compctl compdescribe compfiles compgroups compquote comptags comptry compvalues dirs disable disown echotc echoti emulate fc fg float functions getcap getln history integer jobs kill limit log noglob popd print pushd pushln rehash sched setcap setopt stat suspend ttyctl unfunction unhash unlimit unsetopt vared wait whence where which zcompile zformat zftp zle zmodload zparseopts zprof zpty zregexparse zsocket zstyle ztcp",_:"-ne -eq -lt -gt -f -d -e -s -l 
-a"},c:[{cN:"meta",b:/^#![^\n]+sh\s*$/,r:10},{cN:"function",b:/\w[\w\d_]*\s*\(\s*\)\s*\{/,rB:!0,c:[e.inherit(e.TM,{b:/\w[\w\d_]*/})],r:0},e.HCM,s,a,t]}});hljs.registerLanguage("json",function(e){var i={literal:"true false null"},n=[e.QSM,e.CNM],r={e:",",eW:!0,eE:!0,c:n,k:i},t={b:"{",e:"}",c:[{cN:"attr",b:/"/,e:/"/,c:[e.BE],i:"\\n"},e.inherit(r,{b:/:/})],i:"\\S"},c={b:"\\[",e:"\\]",c:[e.inherit(r)],i:"\\S"};return n.splice(n.length,0,t,c),{c:n,k:i,i:"\\S"}});hljs.registerLanguage("scala",function(e){var t={cN:"meta",b:"@[A-Za-z]+"},a={cN:"subst",v:[{b:"\\$[A-Za-z0-9_]+"},{b:"\\${",e:"}"}]},r={cN:"string",v:[{b:'"',e:'"',i:"\\n",c:[e.BE]},{b:'"""',e:'"""',r:10},{b:'[a-z]+"',e:'"',i:"\\n",c:[e.BE,a]},{cN:"string",b:'[a-z]+"""',e:'"""',c:[a],r:10}]},c={cN:"symbol",b:"'\\w[\\w\\d_]*(?!')"},i={cN:"type",b:"\\b[A-Z][A-Za-z0-9_]*",r:0},s={cN:"title",b:/[^0-9\n\t "'(),.`{}\[\]:;][^\n\t "'(),.`{}\[\]:;]+|[^0-9\n\t "'(),.`{}\[\]:;=]/,r:0},n={cN:"class",bK:"class object trait type",e:/[:={\[\n;]/,eE:!0,c:[{bK:"extends with",r:10},{b:/\[/,e:/\]/,eB:!0,eE:!0,r:0,c:[i]},{cN:"params",b:/\(/,e:/\)/,eB:!0,eE:!0,r:0,c:[i]},s]},l={cN:"function",bK:"def",e:/[:={\[(\n;]/,eE:!0,c:[s]};return{k:{literal:"true false null",keyword:"type yield lazy override def with val var sealed abstract private trait object if forSome for while throw finally protected extends import final return else break new catch super class case package default try this match continue throws implicit"},c:[e.CLCM,e.CBCM,r,c,i,l,n,e.CNM,t]}});hljs.registerLanguage("javascript",function(e){return{aliases:["js","jsx"],k:{keyword:"in of if for while finally var new function do return void else break catch instanceof with throw case default try this switch continue typeof delete let yield const export super debugger as async await static import from as",literal:"true false null undefined NaN Infinity",built_in:"eval isFinite isNaN parseFloat parseInt decodeURI decodeURIComponent encodeURI encodeURIComponent escape unescape 
Object Function Boolean Error EvalError InternalError RangeError ReferenceError StopIteration SyntaxError TypeError URIError Number Math Date String RegExp Array Float32Array Float64Array Int16Array Int32Array Int8Array Uint16Array Uint32Array Uint8Array Uint8ClampedArray ArrayBuffer DataView JSON Intl arguments require module console window document Symbol Set Map WeakSet WeakMap Proxy Reflect Promise"},c:[{cN:"meta",r:10,b:/^\s*['"]use (strict|asm)['"]/},{cN:"meta",b:/^#!/,e:/$/},e.ASM,e.QSM,{cN:"string",b:"`",e:"`",c:[e.BE,{cN:"subst",b:"\\$\\{",e:"\\}"}]},e.CLCM,e.CBCM,{cN:"number",v:[{b:"\\b(0[bB][01]+)"},{b:"\\b(0[oO][0-7]+)"},{b:e.CNR}],r:0},{b:"("+e.RSR+"|\\b(case|return|throw)\\b)\\s*",k:"return throw case",c:[e.CLCM,e.CBCM,e.RM,{b://,sL:"xml",c:[{b:/<\w+\s*\/>/,skip:!0},{b:/<\w+/,e:/(\/\w+|\w+\/)>/,skip:!0,c:["self"]}]}],r:0},{cN:"function",bK:"function",e:/\{/,eE:!0,c:[e.inherit(e.TM,{b:/[A-Za-z$_][0-9A-Za-z$_]*/}),{cN:"params",b:/\(/,e:/\)/,eB:!0,eE:!0,c:[e.CLCM,e.CBCM]}],i:/\[|%/},{b:/\$[(.]/},e.METHOD_GUARD,{cN:"class",bK:"class",e:/[{;=]/,eE:!0,i:/[:"\[\]]/,c:[{bK:"extends"},e.UTM]},{bK:"constructor",e:/\{/,eE:!0}],i:/#(?!!)/}});hljs.registerLanguage("cpp",function(t){var e={cN:"keyword",b:"\\b[a-z\\d_]*_t\\b"},r={cN:"string",v:[{b:'(u8?|U)?L?"',e:'"',i:"\\n",c:[t.BE]},{b:'(u8?|U)?R"',e:'"',c:[t.BE]},{b:"'\\\\?.",e:"'",i:"."}]},s={cN:"number",v:[{b:"\\b(0b[01'_]+)"},{b:"\\b([\\d'_]+(\\.[\\d'_]*)?|\\.[\\d'_]+)(u|U|l|L|ul|UL|f|F|b|B)"},{b:"(-?)(\\b0[xX][a-fA-F0-9'_]+|(\\b[\\d'_]+(\\.[\\d'_]*)?|\\.[\\d'_]+)([eE][-+]?[\\d'_]+)?)"}],r:0},i={cN:"meta",b:/#\s*[a-z]+\b/,e:/$/,k:{"meta-keyword":"if else elif endif define undef warning error line pragma ifdef ifndef include"},c:[{b:/\\\n/,r:0},t.inherit(r,{cN:"meta-string"}),{cN:"meta-string",b:"<",e:">",i:"\\n"},t.CLCM,t.CBCM]},a=t.IR+"\\s*\\(",c={keyword:"int float while private char catch export virtual operator sizeof dynamic_cast|10 typedef const_cast|10 const struct for static_cast|10 union namespace 
unsigned long volatile static protected bool template mutable if public friend do goto auto void enum else break extern using class asm case typeid short reinterpret_cast|10 default double register explicit signed typename try this switch continue inline delete alignof constexpr decltype noexcept static_assert thread_local restrict _Bool complex _Complex _Imaginary atomic_bool atomic_char atomic_schar atomic_uchar atomic_short atomic_ushort atomic_int atomic_uint atomic_long atomic_ulong atomic_llong atomic_ullong new throw return",built_in:"std string cin cout cerr clog stdin stdout stderr stringstream istringstream ostringstream auto_ptr deque list queue stack vector map set bitset multiset multimap unordered_set unordered_map unordered_multiset unordered_multimap array shared_ptr abort abs acos asin atan2 atan calloc ceil cosh cos exit exp fabs floor fmod fprintf fputs free frexp fscanf isalnum isalpha iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower toupper labs ldexp log10 log malloc realloc memchr memcmp memcpy memset modf pow printf putchar puts scanf sinh sin snprintf sprintf sqrt sscanf strcat strchr strcmp strcpy strcspn strlen strncat strncmp strncpy strpbrk strrchr strspn strstr tanh tan vfprintf vprintf vsprintf endl initializer_list unique_ptr",literal:"true false nullptr NULL"},n=[e,t.CLCM,t.CBCM,s,r];return{aliases:["c","cc","h","c++","h++","hpp"],k:c,i:"",k:c,c:["self",e]},{b:t.IR+"::",k:c},{v:[{b:/=/,e:/;/},{b:/\(/,e:/\)/},{bK:"new throw return else",e:/;/}],k:c,c:n.concat([{b:/\(/,e:/\)/,k:c,c:n.concat(["self"]),r:0}]),r:0},{cN:"function",b:"("+t.IR+"[\\*&\\s]+)+"+a,rB:!0,e:/[{;=]/,eE:!0,k:c,i:/[^\w\s\*&]/,c:[{b:a,rB:!0,c:[t.TM],r:0},{cN:"params",b:/\(/,e:/\)/,k:c,r:0,c:[t.CLCM,t.CBCM,r,s,e]},t.CLCM,t.CBCM,i]}]),exports:{preprocessor:i,strings:r,k:c}}});hljs.registerLanguage("sql",function(e){var t=e.C("--","$");return{cI:!0,i:/[<>{}*#]/,c:[{bK:"begin end start commit rollback savepoint lock alter create drop 
rename call delete do handler insert load replace select truncate update set show pragma grant merge describe use explain help declare prepare execute deallocate release unlock purge reset change stop analyze cache flush optimize repair kill install uninstall checksum restore check backup revoke",e:/;/,eW:!0,l:/[\w\.]+/,k:{keyword:"abort abs absolute acc acce accep accept access accessed accessible account acos action activate add addtime admin administer advanced advise aes_decrypt aes_encrypt after agent aggregate ali alia alias allocate allow alter always analyze ancillary and any anydata anydataset anyschema anytype apply archive archived archivelog are as asc ascii asin assembly assertion associate asynchronous at atan atn2 attr attri attrib attribu attribut attribute attributes audit authenticated authentication authid authors auto autoallocate autodblink autoextend automatic availability avg backup badfile basicfile before begin beginning benchmark between bfile bfile_base big bigfile bin binary_double binary_float binlog bit_and bit_count bit_length bit_or bit_xor bitmap blob_base block blocksize body both bound buffer_cache buffer_pool build bulk by byte byteordermark bytes cache caching call calling cancel capacity cascade cascaded case cast catalog category ceil ceiling chain change changed char_base char_length character_length characters characterset charindex charset charsetform charsetid check checksum checksum_agg child choose chr chunk class cleanup clear client clob clob_base clone close cluster_id cluster_probability cluster_set clustering coalesce coercibility col collate collation collect colu colum column column_value columns columns_updated comment commit compact compatibility compiled complete composite_limit compound compress compute concat concat_ws concurrent confirm conn connec connect connect_by_iscycle connect_by_isleaf connect_by_root connect_time connection consider consistent constant constraint constraints constructor container 
content contents context contributors controlfile conv convert convert_tz corr corr_k corr_s corresponding corruption cos cost count count_big counted covar_pop covar_samp cpu_per_call cpu_per_session crc32 create creation critical cross cube cume_dist curdate current current_date current_time current_timestamp current_user cursor curtime customdatum cycle data database databases datafile datafiles datalength date_add date_cache date_format date_sub dateadd datediff datefromparts datename datepart datetime2fromparts day day_to_second dayname dayofmonth dayofweek dayofyear days db_role_change dbtimezone ddl deallocate declare decode decompose decrement decrypt deduplicate def defa defau defaul default defaults deferred defi defin define degrees delayed delegate delete delete_all delimited demand dense_rank depth dequeue des_decrypt des_encrypt des_key_file desc descr descri describ describe descriptor deterministic diagnostics difference dimension direct_load directory disable disable_all disallow disassociate discardfile disconnect diskgroup distinct distinctrow distribute distributed div do document domain dotnet double downgrade drop dumpfile duplicate duration each edition editionable editions element ellipsis else elsif elt empty enable enable_all enclosed encode encoding encrypt end end-exec endian enforced engine engines enqueue enterprise entityescaping eomonth error errors escaped evalname evaluate event eventdata events except exception exceptions exchange exclude excluding execu execut execute exempt exists exit exp expire explain export export_set extended extent external external_1 external_2 externally extract failed failed_login_attempts failover failure far fast feature_set feature_value fetch field fields file file_name_convert filesystem_like_logging final finish first first_value fixed flash_cache flashback floor flush following follows for forall force form forma format found found_rows freelist freelists freepools fresh from from_base64 
from_days ftp full function general generated get get_format get_lock getdate getutcdate global global_name globally go goto grant grants greatest group group_concat group_id grouping grouping_id groups gtid_subtract guarantee guard handler hash hashkeys having hea head headi headin heading heap help hex hierarchy high high_priority hosts hour http id ident_current ident_incr ident_seed identified identity idle_time if ifnull ignore iif ilike ilm immediate import in include including increment index indexes indexing indextype indicator indices inet6_aton inet6_ntoa inet_aton inet_ntoa infile initial initialized initially initrans inmemory inner innodb input insert install instance instantiable instr interface interleaved intersect into invalidate invisible is is_free_lock is_ipv4 is_ipv4_compat is_not is_not_null is_used_lock isdate isnull isolation iterate java join json json_exists keep keep_duplicates key keys kill language large last last_day last_insert_id last_value lax lcase lead leading least leaves left len lenght length less level levels library like like2 like4 likec limit lines link list listagg little ln load load_file lob lobs local localtime localtimestamp locate locator lock locked log log10 log2 logfile logfiles logging logical logical_reads_per_call logoff logon logs long loop low low_priority lower lpad lrtrim ltrim main make_set makedate maketime managed management manual map mapping mask master master_pos_wait match matched materialized max maxextents maximize maxinstances maxlen maxlogfiles maxloghistory maxlogmembers maxsize maxtrans md5 measures median medium member memcompress memory merge microsecond mid migration min minextents minimum mining minus minute minvalue missing mod mode model modification modify module monitoring month months mount move movement multiset mutex name name_const names nan national native natural nav nchar nclob nested never new newline next nextval no no_write_to_binlog noarchivelog noaudit nobadfile nocheck 
nocompress nocopy nocycle nodelay nodiscardfile noentityescaping noguarantee nokeep nologfile nomapping nomaxvalue nominimize nominvalue nomonitoring none noneditionable nonschema noorder nopr nopro noprom nopromp noprompt norely noresetlogs noreverse normal norowdependencies noschemacheck noswitch not nothing notice notrim novalidate now nowait nth_value nullif nulls num numb numbe nvarchar nvarchar2 object ocicoll ocidate ocidatetime ociduration ociinterval ociloblocator ocinumber ociref ocirefcursor ocirowid ocistring ocitype oct octet_length of off offline offset oid oidindex old on online only opaque open operations operator optimal optimize option optionally or oracle oracle_date oradata ord ordaudio orddicom orddoc order ordimage ordinality ordvideo organization orlany orlvary out outer outfile outline output over overflow overriding package pad parallel parallel_enable parameters parent parse partial partition partitions pascal passing password password_grace_time password_lock_time password_reuse_max password_reuse_time password_verify_function patch path patindex pctincrease pctthreshold pctused pctversion percent percent_rank percentile_cont percentile_disc performance period period_add period_diff permanent physical pi pipe pipelined pivot pluggable plugin policy position post_transaction pow power pragma prebuilt precedes preceding precision prediction prediction_cost prediction_details prediction_probability prediction_set prepare present preserve prior priority private private_sga privileges procedural procedure procedure_analyze processlist profiles project prompt protection public publishingservername purge quarter query quick quiesce quota quotename radians raise rand range rank raw read reads readsize rebuild record records recover recovery recursive recycle redo reduced ref reference referenced references referencing refresh regexp_like register regr_avgx regr_avgy regr_count regr_intercept regr_r2 regr_slope regr_sxx regr_sxy reject rekey 
relational relative relaylog release release_lock relies_on relocate rely rem remainder rename repair repeat replace replicate replication required reset resetlogs resize resource respect restore restricted result result_cache resumable resume retention return returning returns reuse reverse revoke right rlike role roles rollback rolling rollup round row row_count rowdependencies rowid rownum rows rtrim rules safe salt sample save savepoint sb1 sb2 sb4 scan schema schemacheck scn scope scroll sdo_georaster sdo_topo_geometry search sec_to_time second section securefile security seed segment select self sequence sequential serializable server servererror session session_user sessions_per_user set sets settings sha sha1 sha2 share shared shared_pool short show shrink shutdown si_averagecolor si_colorhistogram si_featurelist si_positionalcolor si_stillimage si_texture siblings sid sign sin size size_t sizes skip slave sleep smalldatetimefromparts smallfile snapshot some soname sort soundex source space sparse spfile split sql sql_big_result sql_buffer_result sql_cache sql_calc_found_rows sql_small_result sql_variant_property sqlcode sqldata sqlerror sqlname sqlstate sqrt square standalone standby start starting startup statement static statistics stats_binomial_test stats_crosstab stats_ks_test stats_mode stats_mw_test stats_one_way_anova stats_t_test_ stats_t_test_indep stats_t_test_one stats_t_test_paired stats_wsr_test status std stddev stddev_pop stddev_samp stdev stop storage store stored str str_to_date straight_join strcmp strict string struct stuff style subdate subpartition subpartitions substitutable substr substring subtime subtring_index subtype success sum suspend switch switchoffset switchover sync synchronous synonym sys sys_xmlagg sysasm sysaux sysdate sysdatetimeoffset sysdba sysoper system system_user sysutcdatetime table tables tablespace tan tdo template temporary terminated tertiary_weights test than then thread through tier ties time time_format 
time_zone timediff timefromparts timeout timestamp timestampadd timestampdiff timezone_abbr timezone_minute timezone_region to to_base64 to_date to_days to_seconds todatetimeoffset trace tracking transaction transactional translate translation treat trigger trigger_nestlevel triggers trim truncate try_cast try_convert try_parse type ub1 ub2 ub4 ucase unarchived unbounded uncompress under undo unhex unicode uniform uninstall union unique unix_timestamp unknown unlimited unlock unpivot unrecoverable unsafe unsigned until untrusted unusable unused update updated upgrade upped upper upsert url urowid usable usage use use_stored_outlines user user_data user_resources users using utc_date utc_timestamp uuid uuid_short validate validate_password_strength validation valist value values var var_samp varcharc vari varia variab variabl variable variables variance varp varraw varrawc varray verify version versions view virtual visible void wait wallet warning warnings week weekday weekofyear wellformed when whene whenev wheneve whenever where while whitespace with within without work wrapped xdb xml xmlagg xmlattributes xmlcast xmlcolattval xmlelement xmlexists xmlforest xmlindex xmlnamespaces xmlpi xmlquery xmlroot xmlschema xmlserialize xmltable xmltype xor year year_to_month years yearweek",literal:"true false null",built_in:"array bigint binary bit blob boolean char character date dec decimal float int int8 integer interval number numeric real record serial serial8 smallint text varchar varying void"},c:[{cN:"string",b:"'",e:"'",c:[e.BE,{b:"''"}]},{cN:"string",b:'"',e:'"',c:[e.BE,{b:'""'}]},{cN:"string",b:"`",e:"`",c:[e.BE]},e.CNM,e.CBCM,t]},e.CBCM,t]}});hljs.registerLanguage("php",function(e){var c={b:"\\$+[a-zA-Z_-ÿ][a-zA-Z0-9_-ÿ]*"},i={cN:"meta",b:/<\?(php)?|\?>/},t={cN:"string",c:[e.BE,i],v:[{b:'b"',e:'"'},{b:"b'",e:"'"},e.inherit(e.ASM,{i:null}),e.inherit(e.QSM,{i:null})]},a={v:[e.BNM,e.CNM]};return{aliases:["php3","php4","php5","php6"],cI:!0,k:"and include_once list 
abstract global private echo interface as static endswitch array null if endwhile or const for endforeach self var while isset public protected exit foreach throw elseif include __FILE__ empty require_once do xor return parent clone use __CLASS__ __LINE__ else break print eval new catch __METHOD__ case exception default die require __FUNCTION__ enddeclare final try switch continue endfor endif declare unset true false trait goto instanceof insteadof __DIR__ __NAMESPACE__ yield finally",c:[e.HCM,e.C("//","$",{c:[i]}),e.C("/\\*","\\*/",{c:[{cN:"doctag",b:"@[A-Za-z]+"}]}),e.C("__halt_compiler.+?;",!1,{eW:!0,k:"__halt_compiler",l:e.UIR}),{cN:"string",b:/<<<['"]?\w+['"]?$/,e:/^\w+;?$/,c:[e.BE,{cN:"subst",v:[{b:/\$\w+/},{b:/\{\$/,e:/\}/}]}]},i,{cN:"keyword",b:/\$this\b/},c,{b:/(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/},{cN:"function",bK:"function",e:/[;{]/,eE:!0,i:"\\$|\\[|%",c:[e.UTM,{cN:"params",b:"\\(",e:"\\)",c:["self",c,e.CBCM,t,a]}]},{cN:"class",bK:"class interface",e:"{",eE:!0,i:/[:\(\$"]/,c:[{bK:"extends implements"},e.UTM]},{bK:"namespace",e:";",i:/[\.']/,c:[e.UTM]},{bK:"use",e:";",c:[e.UTM]},{b:"=>"},t,a]}});hljs.registerLanguage("java",function(e){var t=e.UIR+"(<"+e.UIR+"(\\s*,\\s*"+e.UIR+")*>)?",a="false synchronized int abstract float private char boolean static null if const for true while long strictfp finally protected import native final void enum else break transient catch instanceof byte super volatile case assert short package default double public try this switch continue throws protected public private module requires exports",r="\\b(0[bB]([01]+[01_]+[01]+|[01]+)|0[xX]([a-fA-F0-9]+[a-fA-F0-9_]+[a-fA-F0-9]+|[a-fA-F0-9]+)|(([\\d]+[\\d_]+[\\d]+|[\\d]+)(\\.([\\d]+[\\d_]+[\\d]+|[\\d]+))?|\\.([\\d]+[\\d_]+[\\d]+|[\\d]+))([eE][-+]?\\d+)?)[lLfF]?",s={cN:"number",b:r,r:0};return{aliases:["jsp"],k:a,i:/<\/|#/,c:[e.C("/\\*\\*","\\*/",{r:0,c:[{b:/\w+@/,r:0},{cN:"doctag",b:"@[A-Za-z]+"}]}),e.CLCM,e.CBCM,e.ASM,e.QSM,{cN:"class",bK:"class 
interface",e:/[{;=]/,eE:!0,k:"class interface",i:/[:"\[\]]/,c:[{bK:"extends implements"},e.UTM]},{bK:"new throw return else",r:0},{cN:"function",b:"("+t+"\\s+)+"+e.UIR+"\\s*\\(",rB:!0,e:/[{;=]/,eE:!0,k:a,c:[{b:e.UIR+"\\s*\\(",rB:!0,r:0,c:[e.UTM]},{cN:"params",b:/\(/,e:/\)/,k:a,r:0,c:[e.ASM,e.QSM,e.CNM,e.CBCM]},e.CLCM,e.CBCM]},s,{cN:"meta",b:"@[A-Za-z]+"}]}});hljs.registerLanguage("css",function(e){var c="[a-zA-Z-][a-zA-Z0-9_-]*",t={b:/[A-Z\_\.\-]+\s*:/,rB:!0,e:";",eW:!0,c:[{cN:"attribute",b:/\S/,e:":",eE:!0,starts:{eW:!0,eE:!0,c:[{b:/[\w-]+\(/,rB:!0,c:[{cN:"built_in",b:/[\w-]+/},{b:/\(/,e:/\)/,c:[e.ASM,e.QSM]}]},e.CSSNM,e.QSM,e.ASM,e.CBCM,{cN:"number",b:"#[0-9A-Fa-f]+"},{cN:"meta",b:"!important"}]}}]};return{cI:!0,i:/[=\/|'\$]/,c:[e.CBCM,{cN:"selector-id",b:/#[A-Za-z0-9_-]+/},{cN:"selector-class",b:/\.[A-Za-z0-9_-]+/},{cN:"selector-attr",b:/\[/,e:/\]/,i:"$"},{cN:"selector-pseudo",b:/:(:)?[a-zA-Z0-9\_\-\+\(\)"'.]+/},{b:"@(font-face|page)",l:"[a-z-]+",k:"font-face page"},{b:"@",e:"[{;]",i:/:/,c:[{cN:"keyword",b:/\w+/},{b:/\s/,eW:!0,eE:!0,r:0,c:[e.ASM,e.QSM,e.CSSNM]}]},{cN:"selector-tag",b:c,r:0},{b:"{",e:"}",i:/\S/,c:[e.CBCM,t]}]}});hljs.registerLanguage("xml",function(s){var e="[A-Za-z0-9\\._:-]+",t={eW:!0,i:/`]+/}]}]}]};return{aliases:["html","xhtml","rss","atom","xjb","xsd","xsl","plist"],cI:!0,c:[{cN:"meta",b:"",r:10,c:[{b:"\\[",e:"\\]"}]},s.C("",{r:10}),{b:"<\\!\\[CDATA\\[",e:"\\]\\]>",r:10},{b:/<\?(php)?/,e:/\?>/,sL:"php",c:[{b:"/\\*",e:"\\*/",skip:!0}]},{cN:"tag",b:"|$)",e:">",k:{name:"style"},c:[t],starts:{e:"",rE:!0,sL:["css","xml"]}},{cN:"tag",b:"|$)",e:">",k:{name:"script"},c:[t],starts:{e:"",rE:!0,sL:["actionscript","javascript","handlebars","xml"]}},{cN:"meta",v:[{b:/<\?xml/,e:/\?>/,r:10},{b:/<\?\w+/,e:/\?>/}]},{cN:"tag",b:"",c:[{cN:"name",b:/[^\/><\s]+/,r:0},t]}]}});hljs.initHighlightingOnLoad(); ================================================ FILE: resources/blog-posts/js/main.js ================================================ /** * Main JavaScript 
file for additional main site functionality. * * @author Joeri Hermans * @version 0.1 * @since 8 July 2016 */ var rippleEffect = function(e) { var target = e.target; var rectangle = target.getBoundingClientRect(); var ripple = target.querySelector('.ripple'); if( !ripple ) { ripple = document.createElement('span'); ripple.className = 'ripple'; ripple.style.height = ripple.style.width = Math.max(rectangle.width, rectangle.height) + 'px'; // Check if the target has a first child. if( target.firstChild ) target.insertBefore(ripple, target.firstChild); else target.appendChild(ripple); } // Check if we need to add a red ripple. if( target.classList.contains('red') ) ripple.classList.add('ripple-red'); ripple.classList.remove('show'); var top = e.pageY - rectangle.top - ripple.offsetHeight / 2 - document.body.scrollTop; var left = e.pageX - rectangle.left - ripple.offsetWidth / 2 - document.body.scrollLeft; ripple.style.top = top + 'px'; ripple.style.left = left + 'px'; ripple.classList.add('show'); return false; }; function addRippleEffects() { // Add the ripple effect to all buttons in the page. var elements = document.getElementsByClassName("ripple-button"); for(var i = 0; i < elements.length; ++i) { elements[i].addEventListener('click', rippleEffect, false); } }; function renderMath() { var currentEquation = 1; // Render all the math-elements. var elements = document.getElementsByClassName("math"); for(var i = 0; i < elements.length; ++i) { var e = elements[i]; var tex = e.innerHTML; katex.render(tex, e); // Check if the element is an equation. if( e.classList.contains("equation-math") ) { // Set the unique id of the equation. e.id = "equation-" + currentEquation; // Add the equation number. 
e.innerHTML += '(' + currentEquation + ')'; ++currentEquation; } } }; addRippleEffects(); renderMath(); ================================================ FILE: resources/blog-posts/part-1-an-introduction.html ================================================ Distributed Deep Learning with Apache Spark and Keras - Part 1 - An introduction

In the following blog posts we study the topic of Distributed Deep Learning, or rather, how to parallelize gradient descent using data parallel methods. We start by laying out the theory, while supplying you with some intuition into the techniques we applied. At the end of this blog post, we conduct some experiments to evaluate how different optimization schemes perform in identical situations. We also introduce dist-keras, our distributed deep learning framework built on top of Apache Spark and Keras. For this, we provide several notebooks and examples. This framework is mainly used to test our distributed optimization schemes; however, it also has several practical applications at CERN, not only for distributed learning, but also for model serving purposes. For example, we provide several examples which show you how to integrate this framework with Spark Streaming and Apache Kafka. Finally, this series will contain parts of my master's thesis research, and as a result it will mainly track my research progress. However, some might find the approaches I present here useful to apply in their own work.

Introduction

Unsupervised feature learning and deep learning have shown that being able to train large models on vast amounts of data can drastically improve model performance. However, consider the problem of training a deep network with millions, or even billions, of parameters. How do we achieve this without waiting for days, or even multiple weeks? Dean et al. [2] propose a different training paradigm which allows us to train and serve a model on multiple physical machines. The authors propose two novel methodologies to accomplish this, namely, model parallelism and data parallelism. In this blog post we only briefly mention model parallelism, since we will mainly focus on data parallel approaches.

Sidenote: In order to simplify the figures, and to make them more intuitive, we negate the gradient \nabla f without adding a - sign in front. Thus, all gradient symbols in the following figures are negated by default, unless stated otherwise. (I actually forgot to negate the gradients in the figures, so mentioning it here is the easy fix; this will be corrected in the final version of the master's thesis.)

Model parallelism

In model parallelism, a single model is distributed over multiple machines. The performance benefit of distributing a deep network across multiple machines mainly depends on the structure of the model. Models with a large number of parameters typically benefit from access to more CPU cores and memory; for such models, parallelization produces a significant performance increase and thereby reduces the training time.

Let us start with a simple example to illustrate this concept more clearly. Imagine having a perceptron, as depicted in Figure 1. To parallelize this efficiently, we can view a neural network as a dependency graph, where the goal is to minimize the number of synchronization mechanisms, assuming we have unlimited resources. A synchronization mechanism is only required when a node has more than one variable dependency, i.e., a dependency whose value can change over time. A bias, for example, would be a static dependency, because the value of a bias remains constant over time. In the case of the perceptron shown in Figure 1, the parallelization is quite straightforward: the only synchronization mechanism resides in the output neuron, since y \triangleq \sigma(\sum_i w_ix_i), where \sigma is the activation function of the output neuron.

Model Parallelism

Figure 1: A perceptron partitioned using the model parallelism paradigm. In this approach, every input node is responsible for accepting the input x_i from some source and multiplying it with the associated weight w_i. After the multiplication, the result is sent to the node responsible for computing y. Of course, this node requires a synchronization mechanism to ensure that the result is consistent; it does this by waiting for all the results y depends on.
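To make the dependency-graph view concrete, here is a minimal Python sketch (purely illustrative, not part of dist-keras): each input node computes its product w_i x_i as an independent task, and the output node acts as the single synchronization point that waits for all partial results before applying \sigma.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def input_node(w_i, x_i):
    # Each input node independently multiplies its input with its weight.
    return w_i * x_i

def output_node(partial_results):
    # Synchronization point: wait for all partial results, then apply
    # the activation function sigma (here a sigmoid).
    z = sum(partial_results)
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.5, -1.0, 2.0]
inputs  = [1.0,  2.0, 0.5]

# The executor plays the role of the "unlimited resources" assumption:
# every input node runs as its own task.
with ThreadPoolExecutor(max_workers=len(weights)) as pool:
    partials = list(pool.map(input_node, weights, inputs))

y = output_node(partials)  # sigma(0.5*1 - 1*2 + 2*0.5) = sigma(-0.5)
```

Note that `pool.map` returns the partial results in input order, so the output node implicitly waits on every input node, which is exactly the single synchronization mechanism described above.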

Data parallelism

Data parallelism is an inherently different methodology of optimizing parameters. The general idea is to reduce the training time by having n workers optimize a central model by processing n different shards (partitions) of the dataset in parallel. In this setting we distribute n model replicas over n processing nodes, i.e., every node (or process) holds one model replica. The workers then train their local replicas using their assigned data shards. However, it is possible to coordinate the workers in such a way that, together, they optimize a single objective. There are several approaches to achieve this, and they will be discussed in greater detail in the coming sections and blog posts.

Nevertheless, a popular approach to optimizing this objective is to employ a centralized parameter server. A parameter server is responsible for aggregating model updates and for handling parameter requests coming from different workers. The distributed learning process starts by partitioning the dataset into n shards, each of which is assigned to a particular worker. Next, a worker samples mini-batches from its shard in order to train its local model replica. After every mini-batch (or every few mini-batches), the worker communicates a variable with the parameter server; in most implementations this variable is the gradient \nabla f_i(x). Finally, the parameter server integrates this variable by applying an update procedure which knows how to handle it. This process repeats itself until all workers have sampled all mini-batches from their shards. This high-level description is summarized in Figure 2.

Data Parallelism

Figure 2: Schematic representation of a data parallel approach. In this methodology we spawn n workers (not necessarily on different machines), and assign a data shard (partition) of the dataset to every worker. Using this data shard, a worker i iterates through all mini-batches to produce a gradient \nabla f_i(x) for every mini-batch x. Next, \nabla f_i(x) is sent to the parameter server, which will incorporate the gradient using an update mechanism.
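The pull/push cycle just described can be sketched in a few lines of Python. This is a single-process, purely illustrative sketch: the `ParameterServer`, `pull`, and `push` names are my own, not the dist-keras API, and the toy per-instance loss 0.5 (w - x)^2 (with gradient w - x) is chosen only so the committed variable is trivially computable.

```python
import numpy as np

class ParameterServer:
    """Minimal sketch of a centralized parameter server (hypothetical,
    not the actual dist-keras implementation)."""

    def __init__(self, initial_weights, lr=0.1):
        self.center_variable = np.asarray(initial_weights, dtype=float)
        self.lr = lr

    def pull(self):
        # Handle a parameter request: return the current center variable.
        return self.center_variable.copy()

    def push(self, gradient):
        # Update procedure: plain gradient descent on the committed
        # gradient (other update mechanisms are possible).
        self.center_variable -= self.lr * np.asarray(gradient)

def worker(ps, shard):
    # Toy loss per instance: 0.5 * (w - x)^2, so the gradient is (w - x).
    for x in shard:
        w = ps.pull()   # fetch the most recent center variable
        ps.push(w - x)  # commit the gradient for this "mini-batch"

ps = ParameterServer(initial_weights=[0.0])
shards = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]  # n = 2 shards
for shard in shards:  # workers run sequentially here, for clarity
    worker(ps, shard)
# The center variable moves towards the minimizer (the data mean, 1.0).
```

In a real deployment the workers run concurrently on different machines and `pull`/`push` are network calls; the sequential loop above only illustrates the communication pattern.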

Approaches

In this section we discuss several approaches towards parallelizing gradient descent (GD). This is not an intuitive task, since gradient descent is an inherently sequential algorithm in which every data point (instance) provides a direction towards a minimum. However, training a model with many parameters on a very large dataset results in a long training time. If one would like to reduce the training time, the obvious choice would be to buy better, or rather, more suitable hardware (e.g., a GPU). However, this is not always possible. For this reason, several attempts have been made to parallelize gradient descent. In the following subsections, we examine some of the popular approaches, provide some intuition into how these techniques work, and discuss how they should be used.
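To fix the baseline that the following approaches parallelize, here is sequential mini-batch gradient descent for a linear least-squares model (an illustrative sketch; the function name and toy problem are my own). The sequential dependency is explicit: each mini-batch update starts from the weights produced by the previous one.

```python
import numpy as np

def minibatch_gd(X, y, w, lr=0.1, batch_size=32, epochs=200):
    """Sequential mini-batch gradient descent on a linear
    least-squares model."""
    n = len(X)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            Xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            # Gradient of 0.5 * mean((Xb @ w - yb)^2) over the mini-batch.
            grad = Xb.T @ (Xb @ w - yb) / len(Xb)
            # Inherently sequential: this batch uses the weights
            # produced by the previous batch.
            w = w - lr * grad
    return w

# Toy problem: recover w_true from noiseless linear measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 2))
w_true = np.array([2.0, -1.0])
w = minibatch_gd(X, X @ w_true, np.zeros(2))
```

The data parallel methods below break exactly this chain of dependent updates, by letting several workers compute gradients at the same time.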

Synchronous Data Parallel Methods

There are two distinct approaches towards data parallelism. Personally, I find synchronous data parallelism the most conceptually straightforward one. In synchronous data parallelism, as depicted in Figure 3, all workers compute their gradients based on the same center variable. Whenever a worker is done computing a gradient for the current batch, it commits a parameter (i.e., the gradient or the parametrization of the model) to the parameter server. However, before incorporating this information into the center variable, the parameter server stores all commits until every worker has committed its work. Only then does the parameter server apply a specific update mechanism (depending on the algorithm) to incorporate the commits into the center variable. In essence, one can see synchronous data parallelism as a way to parallelize the computation of a mini-batch.

Synchronous Data Parallelism

Figure 3: In a synchronous data parallel setting, one has n workers (not necessarily on different machines). At the start of the training procedure, every worker fetches the most recent center variable. Next, every worker starts its training procedure. After the computation of the gradient, a worker commits the computed information (gradient or parametrization, depending on the algorithm) to the parameter server. However, due to unmodeled system behaviour, some workers might induce a significant delay, which leaves the other workers idle while they still consume the same memory resources.

However, due to unmodeled system behaviour of the workers, workers might commit their results with a certain delay. Depending on the system load, this delay can be quite significant. As a result, this data parallel method is a case of the age-old saying "a synchronous data parallel method is only as strong as the weakest worker in the cluster" :-).

Model Averaging

In essence, this is a data parallel approach as mentioned in the introduction. However, in contrast to more conventional data parallel approaches, there is no parameter server. In model averaging, every worker gets a copy of the model at the start of the training procedure. One could use different weight initializations for the workers in order to cover more of the parameter space after several iterations, as shown in Figure 4. However, this is not recommended, since it results in very different solutions for every worker, thus wasting the initial iterations on converging to a "good solution" on which all workers "agree". This problem affects most of the distributed optimization algorithms discussed here, and will be treated in more detail in the following blog posts.

After every worker is initialized with a copy of the model, all workers start the training procedure independently of each other. This means that no communication between the workers occurs during training, which eliminates the communication overhead that is present in approaches with parameter servers. At the end of an epoch, i.e., a full pass over the dataset, the models are aggregated and averaged on a single worker. The resulting averaged model is then distributed to all workers, and the training process repeats until the averaged model converges.
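As a rough illustration of the procedure above, the sketch below implements model averaging with NumPy on a toy least-squares model. The loss, data layout, and hyperparameters are made up for the example and are not part of dist-keras.

```python
import numpy as np

def grad(theta, x, y):
    # Gradient of the toy least-squares loss 0.5 * (theta . x - y)^2.
    return (theta @ x - y) * x

def model_averaging(shards, theta0, eta=0.05, epochs=10):
    """Each worker trains independently on its own shard for one epoch;
    the resulting models are then averaged into the next central model."""
    center = np.array(theta0, dtype=float)
    for _ in range(epochs):
        workers = []
        for shard in shards:
            theta = center.copy()          # every worker starts from the central model
            for x, y in shard:             # one full pass over the local data shard
                theta -= eta * grad(theta, x, y)
            workers.append(theta)
        center = np.mean(workers, axis=0)  # aggregate: plain average of the models
    return center
```

On synthetic linear data split over two shards, the averaged model recovers the true weights after a handful of epochs, without any communication during an epoch.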

Model Averaging

Figure 4: In this setting we have 4 independent workers, each having a randomly initialized model. In order to simplify the situation, let us assume we can obtain the gradient directly from E(\theta), which is our loss function. In model averaging, every worker only applies gradient descent to its own model without communicating with other workers. After the end of an epoch, as shown in the center plot, the models are averaged in order to produce a central model. In the following epoch, the central model will be used as a starting point for all workers.

EASGD

EASGD, or Elastic Averaging SGD, introduced by Zhang et al. [3], is a distributed optimization scheme designed to reduce the communication overhead with the parameter server. This is in contrast to approaches such as DOWNPOUR, which most of the time require a small communication window in order to converge properly. The issue with a small communication window is that the learning process has to be paused in order to synchronize the model with the parameter server, which limits the throughput of the training process. Of course, the number of parameters in a model is also an important factor: one can imagine that a model with 100 MB worth of parameters would severely hurt training performance if a synchronization with the parameter server occurred every 5 mini-batches. Furthermore, the authors state that due to the distributed nature of the scheme, exploration of the nearby parameter space by the workers actually improves the statistical performance of the model with respect to sequential gradient descent. At the moment, we have no evidence to support this claim, nor to deny it. What we do observe is that the statistical performance of a model after a single epoch is usually (significantly) worse than after a single epoch of Adam (sequential training) or ADAG (distributed training). Moreover, if we let EASGD run for the same amount of wallclock training time, we still obtain an identical or slightly worse model. So there is evidence to suggest that this claim is not completely true, at least in the case of EASGD. This, however, requires more investigation.

The authors address the communication constraint by applying an "elastic force" between the parameters of the workers and the center variable. Due to this elasticity and the reduction in communication with the parameter server, the workers are allowed to explore the surrounding parameter space. As stated above, the authors claim that allowing for more exploration can be beneficial for the statistical performance of the model. However, we argue that, as in model averaging, this only works well when the workers stay in the neighbourhood of the center variable; we will show this empirically in the Experiments section. In contrast to model averaging, the workers are never fully synchronized with the center variable. This begs the question: how does EASGD ensure that the workers remain in the "neighbourhood" of the center variable? As in model averaging, too much exploration of the parameter space actually deteriorates the performance of the center variable, and may even prevent convergence, because the workers cover inherently different parts of the parameter space, as shown in Figure 4. Conversely, if the elasticity parameter is too high, no exploration takes place at all.

\theta^i_{t+1} = \theta^i_t - \eta\nabla f(\theta^i_t) - \eta\rho(\theta^i_t - \theta^c_t)
\theta^c_{t+1} = (1 - n\eta\rho) \theta^c_t + n\eta\rho\left(\frac{1}{n} \sum_{i = 1}^{n} \theta^i_t\right)

To fully understand the implications of the EASGD equations, shown in Equation 1 and Equation 2, we refer the reader to Figure 5, which illustrates the intuition behind the elastic force. We have two vectors: the gradient \nabla f, and the elastic difference \eta\rho(\theta^i_t - \theta^c_t), where \eta is the learning rate and \rho is the elasticity parameter. The authors state that a small \rho allows for more exploration of the parameter space. This can be observed from Figure 5: when \rho is small, the vector \eta\rho(\theta^i_t - \theta^c_t) will be small as well (unless the distance between the worker and the center variable is large). As a result, the attraction between the center variable and the worker is small, thus allowing for more exploration of the parameter space.
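To make the two updates concrete, here is a minimal NumPy sketch of one synchronous EASGD round. The quadratic objective used in the usage example is only for illustration and is not part of the original paper or framework.

```python
import numpy as np

def easgd_round(workers, center, eta, rho, grad):
    """Apply Equation 1 to every worker and Equation 2 to the center variable."""
    n = len(workers)
    new_workers = [
        w - eta * grad(w) - eta * rho * (w - center)   # Equation 1
        for w in workers
    ]
    # Equation 2: the center variable is pulled towards the worker average.
    center = (1 - n * eta * rho) * center \
        + n * eta * rho * np.mean(workers, axis=0)
    return new_workers, center
```

On a simple quadratic loss f(\theta) = ½‖\theta‖² (so \nabla f(\theta) = \theta), repeated rounds pull both the workers and the center variable towards the minimum, with the elastic term keeping the workers near the center variable along the way.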

Analogously, imagine that you are walking your dog, and the dog is responsible for getting you home (guiding you to a minimum). If you let your dog drift too far away from you (because your leash is very flexible, i.e., small \rho), then in the most extreme case the dog will get home without you, simply because the leash was too flexible to pull you along. At this point you think: maybe I should buy more dogs, assuming that together they will pull you home. However, due to the nature of these creatures, you soon realize that instead of going home, they all go to different places (multiple workers in the parameter space receiving different inputs, e.g., one dog sees a particular tree, while another dog sees a bush, etc.). From this experience, you conclude that the problem is the leash: it is way too flexible, because the dogs are all over the place. As a result, you buy a less flexible leash, with the effect that the dogs stay closer to you, and eventually "pull" together to bring you home faster.

EASGD

Figure 5: The worker variable w is exploring the parameter space in order to optimize C. However, the amount of exploration is proportional to the elasticity parameter \rho and the difference (\theta^w_t - \theta^C_t). In general, a small \rho allows for more exploration to occur. Note that, as in model averaging, too much exploration actually deteriorates the statistical performance of the model (as shown in the first subfigure of Figure 4), because the workers do not agree on a good solution. This is especially relevant when you take into account that the center variable is updated using an average of the worker variables, as shown in Equation 2.

Now, from Equation 1 and the intuition above, we can expect that at some worker update i within a communication window, the accumulated gradient becomes larger than or equal to the elastic force. At that point, the elastic force prevents the worker from exploring any further (as intended). However, a significant side-effect is that all following gradient computations are wasted, since they are countered by the elastic difference, as shown in Figure 5. Using the analogy from above, this is equivalent to a situation where, no matter how hard a dog pulls, you just don't let it go any further; the efforts of the dog are wasted. This condition is described by Equation 3.

\left\| \sum_i \eta\nabla f(x_{t + i};\theta_{t + i}) \right\| \geq \left\| \eta\rho(\theta_{t + i + 1} - \theta^c_t) \right\|

A straightforward technique to prevent this squandering of computations is to simply check for the condition described by Equation 3 after the computation of every mini-batch. When the condition is met, the term \sum_i - \eta\nabla f(x_{t + i};\theta_{t + i}) is communicated to the parameter server. As a result, we do not waste any computations, and furthermore, we lose a hyperparameter, since the communication window is now controlled (indirectly) by the hyperparameter \rho, which controls the elastic force. In essence, the core idea of ADAG (which will be discussed later in this blog post) can also be applied to this scheme, to further improve the quality of the gradient updates and make the optimization scheme less sensitive to other hyperparameters, e.g., the number of parallel workers.
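A minimal sketch of this early-communication rule follows. The function name, the toy gradient, and the use of vector norms to compare the accumulated gradient against the elastic force are all assumptions made for the illustration.

```python
import numpy as np

def worker_loop(theta, center, num_batches, eta, rho, grad):
    """Flush the accumulated gradient to the parameter server as soon as it
    matches the elastic force, instead of using a fixed communication window."""
    accumulated = np.zeros_like(theta)
    commits = []                               # updates sent to the parameter server
    for _ in range(num_batches):
        g = eta * grad(theta)
        elastic = eta * rho * (theta - center)
        theta = theta - g - elastic            # Equation 1
        accumulated = accumulated + g
        # Equation 3: the accumulated gradient now (at least) matches the
        # elastic force, so further exploration would be wasted; communicate.
        if np.linalg.norm(accumulated) >= np.linalg.norm(elastic):
            commits.append(accumulated)
            accumulated = np.zeros_like(theta)
    return theta, commits
```

Note how the commit frequency now emerges from \rho and the distance to the center variable rather than from a hand-tuned window size.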

Asynchronous Data Parallel Methods

In order to overcome the significant delays induced by loaded workers in synchronous data parallelism, and thereby decrease the training time even further, let us simply remove the synchronization constraint. However, this has several side-effects, some of which are not very obvious. The conceptually simplest one is parameter staleness. Parameter staleness is the number of commits other workers performed between the last pull (center variable synchronization) and the last commit (parameter update) of the current worker. Intuitively, this means that a worker is updating a "newer" model using gradients based on an older parametrization of that model. This is shown in Figure 6.

Asynchronous Data Parallelism

Figure 6: In asynchronous data parallelism, training time is (on average) reduced even further due to the removal of the synchronization mechanism present in synchronous data parallelism. However, this induces several effects, such as parameter staleness and asynchrony-induced momentum.

Note: It is not required to read the paragraphs below, unless you really want to. However, the take-away point is: increasing the number of parallel workers behaves like adding more momentum.

The other, less intuitive side-effect is asynchrony-induced momentum [1]. Roughly stated, this means that adding more workers to the problem also adds more implicit momentum to the optimization process. This implicit momentum is a result of the queuing model imposed by asynchrony. Note that some approaches, such as Hogwild!, do not require locking mechanisms, since they assume sparse gradient updates; distributed SGD, however, has to work with dense gradient updates as well. We also confirm the authors' observation that adding more asynchronous workers actually deteriorates the statistical performance of the model when using algorithms which do not take staleness and asynchrony into account. Furthermore, they state that the behaviour of an asynchronous algorithm is roughly described by Equation 4, which implies that the implicit momentum produced by asynchrony is (1 - \frac{1}{n}).

E[\theta_{t+1} - \theta_t] = \left(1 - \frac{1}{n}\right)E[\theta_t - \theta_{t-1}] - \frac{\eta}{n} E[\nabla f_i(\theta_t)]
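To get a feel for Equation 4, the toy recursion below iterates the expected update deterministically on a one-dimensional quadratic (so \nabla f(\theta) = \theta); the learning rate and step count are arbitrary choices for the illustration. A larger n yields a larger momentum term (1 - \frac{1}{n}) and hence a slower, more oscillatory decay towards the minimum:

```python
def async_sgd_mean_path(n, eta=0.5, steps=50, theta0=1.0):
    """Iterate Equation 4 on f(theta) = theta^2 / 2: the expected step carries
    a momentum term (1 - 1/n) and a rescaled learning rate eta / n."""
    theta, delta = theta0, 0.0
    mu = 1.0 - 1.0 / n               # implicit momentum induced by n workers
    path = []
    for _ in range(steps):
        delta = mu * delta - (eta / n) * theta
        theta = theta + delta
        path.append(theta)
    return path
```

Comparing the trajectories for n = 2 and n = 20 shows the momentum-like behaviour: with many workers the iterates overshoot and take visibly longer to settle.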

But personally, I think this is not the complete story. I agree with the nicely formalized queueing model, and that, in general, an increase in the number of asynchronous workers decreases the statistical performance of a model (we also observe this in our experiments). However, I would say that the effect behaves like momentum, but cannot necessarily be defined as such (with ADAG, we do not observe this effect, at least not for 30 parallel processes). We will go more in-depth into this topic in the following blog posts, since it still requires some more research on my part.

Asynchronous EASGD

The update scheme of asynchronous EASGD is quite similar to the synchronous version; however, there are some important differences. In the following paragraphs we call the vector - \eta\rho(\theta^i_t - \theta^c_t) the elastic difference, following the notation of the paper. Remember that in the synchronous version this vector is used to enforce the exploration policy; in Equation 1 its task is to prevent a worker from drifting too "far" from the center variable. Repeating the analogy with the dogs: imagine having a dog on an elastic leash. The further the dog walks away from you (the center variable), the stronger the force pulling it back. At some point, the force the dog exerts will equal the force the elastic leash exerts in the opposite direction, and the dog cannot move any further. This is exactly what happens when the elastic difference is applied to a worker, as shown in Figure 5.

In the asynchronous version, the elastic difference serves the same function. However, it is also used to update the center variable. As stated in the paragraph above, the elastic difference is used to limit exploration. However, if we negate the elastic difference, i.e., + \eta\rho(\theta^i_t - \theta^c_t), then it can be used to optimize the center variable (reverse the arrow in Figure 5), while still respecting the communication constraints EASGD is trying to meet.

DOWNPOUR

In DOWNPOUR, whenever a worker computes a gradient (or a sequence of gradients), the gradient is communicated to the parameter server. When the parameter server receives a gradient update from a worker, it incorporates the update into the center variable, as shown in Figure 7. Contrary to EASGD, DOWNPOUR does not assume any communication constraints. Even more, if frequent communication with the parameter server does not take place (in order to keep the worker variance small), DOWNPOUR will not converge (this is also related to the asynchrony-induced momentum issue, see Figure 8). This is because of the same issues discussed in the sections above: if we allow the workers to explore "too much" of the parameter space, they will not work together on finding a good solution for the center variable. Furthermore, DOWNPOUR has no intrinsic mechanism for staying in the neighbourhood of the center variable. As a result, increasing the communication window proportionally increases the length of the gradient which is sent to the parameter server; thus, the center variable has to be updated more aggressively in order to keep the variance of the workers in the parameter space "small".
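A stripped-down sketch of this interaction is shown below for a single worker and a toy objective; the class and function names are made up for the illustration and are not the dist-keras implementation.

```python
import numpy as np

class ParameterServer:
    """Applies every incoming accumulated gradient to the center variable
    as soon as it arrives; there is no synchronization between workers."""
    def __init__(self, center):
        self.center = np.array(center, dtype=float)

    def push(self, update):
        self.center -= update              # incorporate the commit immediately

    def pull(self):
        return self.center.copy()

def downpour_worker(server, num_batches, eta, grad, communication_window):
    theta = server.pull()
    accumulated = np.zeros_like(theta)
    for i in range(1, num_batches + 1):
        g = eta * grad(theta)
        theta -= g
        accumulated += g
        if i % communication_window == 0:
            server.push(accumulated)       # commit the accumulated gradient
            theta = server.pull()          # resynchronize with the center variable
            accumulated = np.zeros_like(theta)
```

Note that the worker only resynchronizes every `communication_window` mini-batches; in between, the center variable can drift due to commits from other workers, which is exactly where staleness and implicit momentum come from.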

DOWNPOUR animated

Figure 7: Animation of DOWNPOUR with 20 parallel workers (blue) with identical learning rates which are trying to optimize a single objective (center variable, red) compared to regular sequential gradient descent (green). From this animation we can observe the momentum induced by the asynchrony of the parallel workers, as discussed above.

DOWNPOUR with too much implicit momentum

Figure 8: Animation of DOWNPOUR with 40 parallel workers. In this case, the implicit momentum produced by the number of workers causes the algorithm to diverge.

ADAG

We noticed that a large communication window is correlated with a decline in model performance. Using some simulations (like the DOWNPOUR animations shown above), we noticed that this effect can be mitigated by normalizing the accumulated gradient with the communication window. This has several positive effects. For one, you are not normalizing with respect to the number of parallel workers, so you do not lose the (convergence speed) benefit of parallelizing gradient descent. As a side-effect, the variance of the workers with respect to the center variable also remains small, thus contributing positively to the central objective! Furthermore, because of the normalization, you are less sensitive to the hyperparametrization, especially regarding the communication window. That said, a large communication window typically still degrades the performance of the model, because you allow the workers to explore more of the parameter space using the samples from their data shards. In our first prototype, we adapted DOWNPOUR to fit this idea, and observed the following results. First, a significant increase in model performance, even compared to a sequential optimization scheme such as Adam. Second, compared to DOWNPOUR, we can increase the communication window by a factor of 3, which allows us to utilize the CPU resources more efficiently and decreases the total training time even further. Finally, normalizing the accumulated gradient allows us to increase the communication window to the point where we match the training time of EASGD, while achieving roughly the same (sometimes better, sometimes worse) model performance.

\Large \frac{\sum_{i=0}^{\tau}-\eta\nabla f(x_{t + i};\theta_{t + i})}{\tau}

To conclude, the core idea of ADAG, or asynchronous distributed adaptive gradients, can be applied to any distributed optimization scheme. Using our observations, and intuition (especially with respect to implicit momentum due to asynchrony), we can make a calculated guess that the idea of normalized accumulated gradients can be applied to any distributed optimization scheme. However, we need to conduct several experiments in order to verify this claim. ADAG will be discussed in detail in the following blog posts.
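The normalization itself is essentially a one-line change to a DOWNPOUR-style worker commit. The sketch below shows one communication window with a toy gradient; the function name and arguments are chosen for the illustration, not taken from the dist-keras API.

```python
import numpy as np

def adag_commit(theta, tau, eta, grad):
    """Run one communication window of tau mini-batches and return the
    accumulated gradient normalized by the window size (the equation above)."""
    accumulated = np.zeros_like(theta)
    for _ in range(tau):
        g = eta * grad(theta)
        theta = theta - g
        accumulated += g
    # Normalizing by tau keeps the magnitude of the commit roughly
    # independent of the communication window.
    return accumulated / tau
```

The parameter server then subtracts the returned commit from the center variable, just as in DOWNPOUR, but without the aggressive updates that large windows would otherwise cause.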

Distributed Keras

Distributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with the goal to significantly reduce training time using distributed machine learning algorithms, and to allow for larger-than-memory datasets. The project initially started as a prototype in collaboration with CMS, and has seen several iterations since its start in August 2016.

Architecture

In essence, a training procedure is passed on to the Spark workers as a lambda function. However, in order to pass on multiple parameters, e.g., the port number of the parameter server, we wrap everything in an object and define a train function which accepts the parameters required by Spark. To give a complete overview, let us assume that a user has just called the train method shown in the code block below. The trainer object first allocates and starts a parameter server on the Spark driver. Next, it allocates the worker procedure, which holds all parameters and procedures needed to train the Keras model. Furthermore, in order to comply with the requested number of parallel workers, we partition the dataset according to this specific amount. When processing big datasets, however, it is recommended to increase the parallelism factor; this prevents some workers from remaining idle while other (slower) workers are still processing their old batch (in the literature, this is known as the straggler problem). In such cases, we recommend a parallelism factor of 3, as suggested by the Spark documentation.

However, one needs to consider the implications of a large parallelism factor. Basically, the parallelism factor is proportional to the number of partitions you will create. So let's say you assign 20 workers to a specific task, with a parallelism factor of 3. Spark will then repartition the dataset into 60 shards. Now, before a worker starts processing a partition, it first has to load all the Python libraries required to process the task, and next it has to deserialize and compile the Keras model. This induces a significant overhead. So this technique is only effective on heterogeneous systems (meaning, different hardware, or variable load per worker) and large datasets, due to the large "warmup" overhead.

Distributed Keras architecture

Figure 9: Imagine we have a Spark Context with k executors, and l cores per executor. Using these parameters, there will be n = k \cdot l workers allocated by dist-keras. However, if you would like to use a smaller amount of parallel workers, you can simply parameterize the training algorithm without having to reinitialize the Spark Context.

        
        # Allocate the parameter server.
        self.parameter_server = self.allocate_parameter_server()
        # Start the communication service.
        self.start_service()
        # Allocate a worker.
        worker = self.allocate_worker()
        # Repartition in order to fit the number of workers.
        num_partitions = dataframe.rdd.getNumPartitions()
        # Assign the dataset.
        dataset = dataframe
        if shuffle:
            dataset = shuffle(dataset)
        if num_partitions > self.num_workers:
            dataset = dataset.coalesce(self.num_workers)
        else:
            dataset = dataset.repartition(self.num_workers)
        dataset.cache()
        # Iterate through the epochs (some trainers require a result).
        dataset.rdd.mapPartitionsWithIndex(worker.train).collect()
        # Stop the communication service.
        self.stop_service()

        return self.parameter_server.get_model()

Experiments

In the following experiments we pit the different optimization schemes against each other, i.e., (sequential) Adam, (distributed) DOWNPOUR, asynchronous EASGD, and ADAG, and evaluate them on the MNIST dataset (samples are shown in Figure 10). We use the following parameters during our experiments:

  • Multilayer perceptron with 1 000 000 trainable parameters (~4 MB model) (complete model summarized below)
  • Mini-batch size: 4 samples
  • 1 epoch
  • Parallelism factor: 1
  • Adam as worker optimizer
  • Communication windows:
    • DOWNPOUR: 5
    • ADAG: 5
    • Asynchronous EASGD: 32
  • 20 parallel workers:
    • 10 compute nodes with 10 Gbps network cards
    • 2 processes per compute node (32 cores)
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential

mlp = Sequential()
mlp.add(Dense(1000, input_shape=(784,)))
mlp.add(Activation('relu'))
mlp.add(Dropout(0.2))
mlp.add(Dense(200))
mlp.add(Activation('relu'))
mlp.add(Dropout(0.2))
mlp.add(Dense(10))
mlp.add(Activation('softmax'))
MNIST normalized

Figure 10: The MNIST dataset is a collection of handwritten digits. This dataset is usually used as a "unit test" for optimization algorithms. Every sample consists of 784 pixels, with values ranging between 0 and 255. We normalize these using our framework, dist-keras, which is built on top of Apache Spark and thus profits from the parallelization.

In the following experiments we evaluate the accuracy of the center variable, and the training time (wallclock) as a function of the number of parallel workers. Although this is a relatively small dataset, it gives us some indication of the scaling abilities of the optimization schemes. In the following blog posts we will also focus on large-scale deep learning, meaning we will handle very large datasets and train models in a data parallel setting.

DOWNPOUR

DOWNPOUR experiment MNIST

Figure 11: A key observation in this experiment is that DOWNPOUR actually diverges when it reaches a critical amount of implicit momentum, as shown in Figure 8. We made this observation on several other datasets as well. However, the steadily declining performance reported by the authors of [1] is not observed; rather, we see a sudden decline in model performance. This is rather contradictory to the claims made in [1]: according to their theory, we should not see a sudden decline in model performance, but a steady one. As a result, we think their statement that "there exists a limit to asynchrony" is false as well, though their intuition is correct! Furthermore, on the left, we see the scaling of the algorithm. We actually expected the scaling to be better; however, this could be due to the unbalanced partitions (we are experimenting with other partitioners to correct for this) and the relatively small dataset.

Asynchronous EASGD

Asynchronous EASGD experiment MNIST

Figure 12: As stated above, EASGD is an algorithm designed with communication constraints in mind, which is a realistic constraint. The authors incorporate an elastic force which allows a worker to explore a certain area of the neighbouring parameter space w.r.t. the center variable. As a result, it does not show an immediate decline in model performance, as observed with DOWNPOUR, but rather a steady decline. This decline (with respect to the number of workers) is due to the increased amount of staleness (the center variable will have covered more distance because of the queuing model) compared to the worker. As a result, the positive information a worker can contribute is proportional to the elastic difference, and this elastic difference will be smaller when the number of parallel workers is higher (due to parameter staleness). However, since EASGD scales very well with the number of workers, we can simply match the training time of ADAG or DOWNPOUR. Yet even when we match the training time, EASGD usually results in a lower accuracy compared to ADAG. This phenomenon is subject to further study, as it is not yet completely understood why this is happening. Furthermore, EASGD also consumes more CPU compared to ADAG when we match ADAG's model performance (ADAG spends a lot of time waiting for network operations).

ADAG

ADAG experiment MNIST

Figure 13: If we assume no communication constraints, then how would we solve the problem DOWNPOUR has? Averaging the gradients over the number of workers would work, but it is not very desirable, since the gradients would then act as if they were produced by a sequential optimization algorithm. So what if we normalize with respect to the communication window instead? After all, this is the parameter which induces parameter staleness, as can be observed from Figure 12 (declining model performance). An interesting observation we can make here is the absence of any decline in model performance (compared to DOWNPOUR and EASGD). We think this is due to the following reasons: for one, we keep the variance of the workers small (limited exploration), and we normalize the accumulated gradient on the workers with the communication window (which is a prime factor in implicit momentum).

Influence of the communication window on accuracy and training time

In the following set of experiments we investigate the influence of the communication window on accuracy and training time. The communication window is a hyperparameter which defines the frequency of communication with the parameter server: a communication window of 35 means that a worker accumulates the updates of 35 mini-batches before synchronizing with the parameter server. In these experiments, all optimization schemes use identical hyperparameters; the only variable between tests is the communication window. As before, we use MNIST as the dataset, a mini-batch size of 4, and Adam as the worker optimizer.

Influence of communication window on accuracy

Figure 14: As expected, DOWNPOUR is not able to handle large communication windows. EASGD, on the other hand, is not able to handle small communication windows! As stated above, this is because the elastic force (due to the number of workers) is stronger than the exploration of the parameter space, causing EASGD not to converge. ADAG, on the other hand, is able to handle the varying communication window, although a slight decline in model performance is observed. This is expected, due to the increased exploration of the parameter space by the workers.

Influence of communication window on training time

Figure 15: Again, the training time of all optimization schemes decreases significantly when the communication window is increased. However, we think we can further decrease the training time by allocating a thread in every worker whose sole responsibility is to send the parameters to the parameter server. This is an idea that has yet to be explored. To conclude, we suggest making a trade-off between training time and accuracy. In the case of ADAG, we recommend a communication window of 10-15, since this hyperparametrization achieves similar model performance. However, when applying this to a different dataset, we recommend that you test these settings for yourself, since they can differ.

Summary

In this work we gave the reader an introduction to the problem of distributed deep learning, and some of the aspects which one needs to consider when applying it, such as, for example, implicit momentum. We also suggested some techniques which are able to significantly improve existing distributed optimization schemes. Furthermore, we introduced our framework, dist-keras, and applied different distributed optimization schemes to the MNIST dataset. Finally, we also provided several production-ready examples and notebooks.

Acknowledgements

This work was done as part of my Technical Student contract at CERN IT. I would like to thank Zbigniew Baranowski and Luca Canali of the IT-DB group, Volodimir Begy of the University of Vienna, and Jean-Roch Vlimant, Maurizio Pierini, and Federico Presutti (Caltech) of the EP-UCM group for their collaboration on this work.

References

  1. Mitliagkas, I., Zhang, C., Hadjis, S., & Ré, C. (2016). Asynchrony begets Momentum, with an Application to Deep Learning. arXiv preprint arXiv:1605.09774.
  2. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231).
  3. Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693).
  4. The MNIST database of handwritten digits.
================================================ FILE: scripts/generate_secret.py ================================================ """Generates a JSON structure that needs to be added to the secrets file. Author: Joeri Hermans """ ## BEGIN Imports. ############################################################## import json import optparse import random import string ## END Imports. ################################################################ def generate_secret(identity): secret = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(64)) d = {} d['secret'] = secret d['identity'] = identity print(json.dumps(d)) def parse_arguments(): parser = optparse.OptionParser() parser.set_defaults(identity=None) parser.add_option('--identity', action='store', dest='identity', type='string') (options, args) = parser.parse_args() return options def main(): # Parse the options. options = parse_arguments() # Check if an identity has been provided. if options.identity is not None: generate_secret(options.identity) else: print("Please specify an identity (--identity).") if __name__ == '__main__': main() ================================================ FILE: scripts/punchcard.py ================================================ """Script which starts the Punchcard daemon. Punchcard will accept remote job requests and execute them on the local cluster. Author: Joeri Hermans """ ## BEGIN Imports. ############################################################## from distkeras.job_deployment import Job from distkeras.job_deployment import Punchcard import os import sys import optparse ## END Imports. 
################################################################ def parse_arguments(): parser = optparse.OptionParser() parser.set_defaults(port=8000, secrets_path='secrets.json') parser.add_option('--port', action='store', dest='port', type='int') parser.add_option('--secrets', action='store', dest='secrets_path', type='string') (options, args) = parser.parse_args() return options def start_punchcard(port, secrets): punchcard = Punchcard(secrets, port) punchcard.run() def main(): # Parse the program arguments. options = parse_arguments() port = options.port secrets_path = options.secrets_path # Start the Punchcard instance. start_punchcard(port, secrets_path) if __name__ == '__main__': main() ================================================ FILE: setup.py ================================================ """Setup-module for DistKeras. This software enables distrubuted Machine Learning on Apache Spark using Keras. See: https://github.com/JoeriHermans/dist-keras/ http://joerihermans.com/ """ from setuptools import setup from setuptools import find_packages setup(name='dist-keras', description='Distributed Deep learning with Apache Spark with Keras.', url='https://github.com/JoeriHermans/dist-keras', author='Joeri Hermans', version='0.2.1', author_email='joeri@joerihermans.com', license='GPLv3', install_requires=['theano', 'tensorflow', 'keras', 'flask'], packages=['distkeras'], package_data={'distkeras': ['distkeras/*.py']}, # Keywords related to the project. keywords=['Keras', 'Deep Learning', 'Machine Learning', 'Theano', 'Tensorflow', 'Distributed', 'Apache Spark'], )