[
  {
    "path": "LICENSE",
    "content": "                    GNU GENERAL PUBLIC LICENSE\n                       Version 3, 29 June 2007\n\n Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>\n Everyone is permitted to copy and distribute verbatim copies\n of this license document, but changing it is not allowed.\n\n                            Preamble\n\n  The GNU General Public License is a free, copyleft license for\nsoftware and other kinds of works.\n\n  The licenses for most software and other practical works are designed\nto take away your freedom to share and change the works.  By contrast,\nthe GNU General Public License is intended to guarantee your freedom to\nshare and change all versions of a program--to make sure it remains free\nsoftware for all its users.  We, the Free Software Foundation, use the\nGNU General Public License for most of our software; it applies also to\nany other work released this way by its authors.  You can apply it to\nyour programs, too.\n\n  When we speak of free software, we are referring to freedom, not\nprice.  Our General Public Licenses are designed to make sure that you\nhave the freedom to distribute copies of free software (and charge for\nthem if you wish), that you receive source code or can get it if you\nwant it, that you can change the software or use pieces of it in new\nfree programs, and that you know you can do these things.\n\n  To protect your rights, we need to prevent others from denying you\nthese rights or asking you to surrender the rights.  Therefore, you have\ncertain responsibilities if you distribute copies of the software, or if\nyou modify it: responsibilities to respect the freedom of others.\n\n  For example, if you distribute copies of such a program, whether\ngratis or for a fee, you must pass on to the recipients the same\nfreedoms that you received.  You must make sure that they, too, receive\nor can get the source code.  
And you must show them these terms so they\nknow their rights.\n\n  Developers that use the GNU GPL protect your rights with two steps:\n(1) assert copyright on the software, and (2) offer you this License\ngiving you legal permission to copy, distribute and/or modify it.\n\n  For the developers' and authors' protection, the GPL clearly explains\nthat there is no warranty for this free software.  For both users' and\nauthors' sake, the GPL requires that modified versions be marked as\nchanged, so that their problems will not be attributed erroneously to\nauthors of previous versions.\n\n  Some devices are designed to deny users access to install or run\nmodified versions of the software inside them, although the manufacturer\ncan do so.  This is fundamentally incompatible with the aim of\nprotecting users' freedom to change the software.  The systematic\npattern of such abuse occurs in the area of products for individuals to\nuse, which is precisely where it is most unacceptable.  Therefore, we\nhave designed this version of the GPL to prohibit the practice for those\nproducts.  If such problems arise substantially in other domains, we\nstand ready to extend this provision to those domains in future versions\nof the GPL, as needed to protect the freedom of users.\n\n  Finally, every program is threatened constantly by software patents.\nStates should not allow patents to restrict development and use of\nsoftware on general-purpose computers, but in those that do, we wish to\navoid the special danger that patents applied to a free program could\nmake it effectively proprietary.  To prevent this, the GPL assures that\npatents cannot be used to render the program non-free.\n\n  The precise terms and conditions for copying, distribution and\nmodification follow.\n\n                       TERMS AND CONDITIONS\n\n  0. 
Definitions.\n\n  \"This License\" refers to version 3 of the GNU General Public License.\n\n  \"Copyright\" also means copyright-like laws that apply to other kinds of\nworks, such as semiconductor masks.\n\n  \"The Program\" refers to any copyrightable work licensed under this\nLicense.  Each licensee is addressed as \"you\".  \"Licensees\" and\n\"recipients\" may be individuals or organizations.\n\n  To \"modify\" a work means to copy from or adapt all or part of the work\nin a fashion requiring copyright permission, other than the making of an\nexact copy.  The resulting work is called a \"modified version\" of the\nearlier work or a work \"based on\" the earlier work.\n\n  A \"covered work\" means either the unmodified Program or a work based\non the Program.\n\n  To \"propagate\" a work means to do anything with it that, without\npermission, would make you directly or secondarily liable for\ninfringement under applicable copyright law, except executing it on a\ncomputer or modifying a private copy.  Propagation includes copying,\ndistribution (with or without modification), making available to the\npublic, and in some countries other activities as well.\n\n  To \"convey\" a work means any kind of propagation that enables other\nparties to make or receive copies.  Mere interaction with a user through\na computer network, with no transfer of a copy, is not conveying.\n\n  An interactive user interface displays \"Appropriate Legal Notices\"\nto the extent that it includes a convenient and prominently visible\nfeature that (1) displays an appropriate copyright notice, and (2)\ntells the user that there is no warranty for the work (except to the\nextent that warranties are provided), that licensees may convey the\nwork under this License, and how to view a copy of this License.  If\nthe interface presents a list of user commands or options, such as a\nmenu, a prominent item in the list meets this criterion.\n\n  1. 
Source Code.\n\n  The \"source code\" for a work means the preferred form of the work\nfor making modifications to it.  \"Object code\" means any non-source\nform of a work.\n\n  A \"Standard Interface\" means an interface that either is an official\nstandard defined by a recognized standards body, or, in the case of\ninterfaces specified for a particular programming language, one that\nis widely used among developers working in that language.\n\n  The \"System Libraries\" of an executable work include anything, other\nthan the work as a whole, that (a) is included in the normal form of\npackaging a Major Component, but which is not part of that Major\nComponent, and (b) serves only to enable use of the work with that\nMajor Component, or to implement a Standard Interface for which an\nimplementation is available to the public in source code form.  A\n\"Major Component\", in this context, means a major essential component\n(kernel, window system, and so on) of the specific operating system\n(if any) on which the executable work runs, or a compiler used to\nproduce the work, or an object code interpreter used to run it.\n\n  The \"Corresponding Source\" for a work in object code form means all\nthe source code needed to generate, install, and (for an executable\nwork) run the object code and to modify the work, including scripts to\ncontrol those activities.  However, it does not include the work's\nSystem Libraries, or general-purpose tools or generally available free\nprograms which are used unmodified in performing those activities but\nwhich are not part of the work.  
For example, Corresponding Source\nincludes interface definition files associated with source files for\nthe work, and the source code for shared libraries and dynamically\nlinked subprograms that the work is specifically designed to require,\nsuch as by intimate data communication or control flow between those\nsubprograms and other parts of the work.\n\n  The Corresponding Source need not include anything that users\ncan regenerate automatically from other parts of the Corresponding\nSource.\n\n  The Corresponding Source for a work in source code form is that\nsame work.\n\n  2. Basic Permissions.\n\n  All rights granted under this License are granted for the term of\ncopyright on the Program, and are irrevocable provided the stated\nconditions are met.  This License explicitly affirms your unlimited\npermission to run the unmodified Program.  The output from running a\ncovered work is covered by this License only if the output, given its\ncontent, constitutes a covered work.  This License acknowledges your\nrights of fair use or other equivalent, as provided by copyright law.\n\n  You may make, run and propagate covered works that you do not\nconvey, without conditions so long as your license otherwise remains\nin force.  You may convey covered works to others for the sole purpose\nof having them make modifications exclusively for you, or provide you\nwith facilities for running those works, provided that you comply with\nthe terms of this License in conveying all material for which you do\nnot control copyright.  Those thus making or running the covered works\nfor you must do so exclusively on your behalf, under your direction\nand control, on terms that prohibit them from making any copies of\nyour copyrighted material outside their relationship with you.\n\n  Conveying under any other circumstances is permitted solely under\nthe conditions stated below.  Sublicensing is not allowed; section 10\nmakes it unnecessary.\n\n  3. 
Protecting Users' Legal Rights From Anti-Circumvention Law.\n\n  No covered work shall be deemed part of an effective technological\nmeasure under any applicable law fulfilling obligations under article\n11 of the WIPO copyright treaty adopted on 20 December 1996, or\nsimilar laws prohibiting or restricting circumvention of such\nmeasures.\n\n  When you convey a covered work, you waive any legal power to forbid\ncircumvention of technological measures to the extent such circumvention\nis effected by exercising rights under this License with respect to\nthe covered work, and you disclaim any intention to limit operation or\nmodification of the work as a means of enforcing, against the work's\nusers, your or third parties' legal rights to forbid circumvention of\ntechnological measures.\n\n  4. Conveying Verbatim Copies.\n\n  You may convey verbatim copies of the Program's source code as you\nreceive it, in any medium, provided that you conspicuously and\nappropriately publish on each copy an appropriate copyright notice;\nkeep intact all notices stating that this License and any\nnon-permissive terms added in accord with section 7 apply to the code;\nkeep intact all notices of the absence of any warranty; and give all\nrecipients a copy of this License along with the Program.\n\n  You may charge any price or no price for each copy that you convey,\nand you may offer support or warranty protection for a fee.\n\n  5. Conveying Modified Source Versions.\n\n  You may convey a work based on the Program, or the modifications to\nproduce it from the Program, in the form of source code under the\nterms of section 4, provided that you also meet all of these conditions:\n\n    a) The work must carry prominent notices stating that you modified\n    it, and giving a relevant date.\n\n    b) The work must carry prominent notices stating that it is\n    released under this License and any conditions added under section\n    7.  
This requirement modifies the requirement in section 4 to\n    \"keep intact all notices\".\n\n    c) You must license the entire work, as a whole, under this\n    License to anyone who comes into possession of a copy.  This\n    License will therefore apply, along with any applicable section 7\n    additional terms, to the whole of the work, and all its parts,\n    regardless of how they are packaged.  This License gives no\n    permission to license the work in any other way, but it does not\n    invalidate such permission if you have separately received it.\n\n    d) If the work has interactive user interfaces, each must display\n    Appropriate Legal Notices; however, if the Program has interactive\n    interfaces that do not display Appropriate Legal Notices, your\n    work need not make them do so.\n\n  A compilation of a covered work with other separate and independent\nworks, which are not by their nature extensions of the covered work,\nand which are not combined with it such as to form a larger program,\nin or on a volume of a storage or distribution medium, is called an\n\"aggregate\" if the compilation and its resulting copyright are not\nused to limit the access or legal rights of the compilation's users\nbeyond what the individual works permit.  Inclusion of a covered work\nin an aggregate does not cause this License to apply to the other\nparts of the aggregate.\n\n  6. 
Conveying Non-Source Forms.\n\n  You may convey a covered work in object code form under the terms\nof sections 4 and 5, provided that you also convey the\nmachine-readable Corresponding Source under the terms of this License,\nin one of these ways:\n\n    a) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by the\n    Corresponding Source fixed on a durable physical medium\n    customarily used for software interchange.\n\n    b) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by a\n    written offer, valid for at least three years and valid for as\n    long as you offer spare parts or customer support for that product\n    model, to give anyone who possesses the object code either (1) a\n    copy of the Corresponding Source for all the software in the\n    product that is covered by this License, on a durable physical\n    medium customarily used for software interchange, for a price no\n    more than your reasonable cost of physically performing this\n    conveying of source, or (2) access to copy the\n    Corresponding Source from a network server at no charge.\n\n    c) Convey individual copies of the object code with a copy of the\n    written offer to provide the Corresponding Source.  This\n    alternative is allowed only occasionally and noncommercially, and\n    only if you received the object code with such an offer, in accord\n    with subsection 6b.\n\n    d) Convey the object code by offering access from a designated\n    place (gratis or for a charge), and offer equivalent access to the\n    Corresponding Source in the same way through the same place at no\n    further charge.  You need not require recipients to copy the\n    Corresponding Source along with the object code.  
If the place to\n    copy the object code is a network server, the Corresponding Source\n    may be on a different server (operated by you or a third party)\n    that supports equivalent copying facilities, provided you maintain\n    clear directions next to the object code saying where to find the\n    Corresponding Source.  Regardless of what server hosts the\n    Corresponding Source, you remain obligated to ensure that it is\n    available for as long as needed to satisfy these requirements.\n\n    e) Convey the object code using peer-to-peer transmission, provided\n    you inform other peers where the object code and Corresponding\n    Source of the work are being offered to the general public at no\n    charge under subsection 6d.\n\n  A separable portion of the object code, whose source code is excluded\nfrom the Corresponding Source as a System Library, need not be\nincluded in conveying the object code work.\n\n  A \"User Product\" is either (1) a \"consumer product\", which means any\ntangible personal property which is normally used for personal, family,\nor household purposes, or (2) anything designed or sold for incorporation\ninto a dwelling.  In determining whether a product is a consumer product,\ndoubtful cases shall be resolved in favor of coverage.  For a particular\nproduct received by a particular user, \"normally used\" refers to a\ntypical or common use of that class of product, regardless of the status\nof the particular user or of the way in which the particular user\nactually uses, or expects or is expected to use, the product.  
A product\nis a consumer product regardless of whether the product has substantial\ncommercial, industrial or non-consumer uses, unless such uses represent\nthe only significant mode of use of the product.\n\n  \"Installation Information\" for a User Product means any methods,\nprocedures, authorization keys, or other information required to install\nand execute modified versions of a covered work in that User Product from\na modified version of its Corresponding Source.  The information must\nsuffice to ensure that the continued functioning of the modified object\ncode is in no case prevented or interfered with solely because\nmodification has been made.\n\n  If you convey an object code work under this section in, or with, or\nspecifically for use in, a User Product, and the conveying occurs as\npart of a transaction in which the right of possession and use of the\nUser Product is transferred to the recipient in perpetuity or for a\nfixed term (regardless of how the transaction is characterized), the\nCorresponding Source conveyed under this section must be accompanied\nby the Installation Information.  But this requirement does not apply\nif neither you nor any third party retains the ability to install\nmodified object code on the User Product (for example, the work has\nbeen installed in ROM).\n\n  The requirement to provide Installation Information does not include a\nrequirement to continue to provide support service, warranty, or updates\nfor a work that has been modified or installed by the recipient, or for\nthe User Product in which it has been modified or installed.  
Access to a\nnetwork may be denied when the modification itself materially and\nadversely affects the operation of the network or violates the rules and\nprotocols for communication across the network.\n\n  Corresponding Source conveyed, and Installation Information provided,\nin accord with this section must be in a format that is publicly\ndocumented (and with an implementation available to the public in\nsource code form), and must require no special password or key for\nunpacking, reading or copying.\n\n  7. Additional Terms.\n\n  \"Additional permissions\" are terms that supplement the terms of this\nLicense by making exceptions from one or more of its conditions.\nAdditional permissions that are applicable to the entire Program shall\nbe treated as though they were included in this License, to the extent\nthat they are valid under applicable law.  If additional permissions\napply only to part of the Program, that part may be used separately\nunder those permissions, but the entire Program remains governed by\nthis License without regard to the additional permissions.\n\n  When you convey a copy of a covered work, you may at your option\nremove any additional permissions from that copy, or from any part of\nit.  (Additional permissions may be written to require their own\nremoval in certain cases when you modify the work.)  
You may place\nadditional permissions on material, added by you to a covered work,\nfor which you have or can give appropriate copyright permission.\n\n  Notwithstanding any other provision of this License, for material you\nadd to a covered work, you may (if authorized by the copyright holders of\nthat material) supplement the terms of this License with terms:\n\n    a) Disclaiming warranty or limiting liability differently from the\n    terms of sections 15 and 16 of this License; or\n\n    b) Requiring preservation of specified reasonable legal notices or\n    author attributions in that material or in the Appropriate Legal\n    Notices displayed by works containing it; or\n\n    c) Prohibiting misrepresentation of the origin of that material, or\n    requiring that modified versions of such material be marked in\n    reasonable ways as different from the original version; or\n\n    d) Limiting the use for publicity purposes of names of licensors or\n    authors of the material; or\n\n    e) Declining to grant rights under trademark law for use of some\n    trade names, trademarks, or service marks; or\n\n    f) Requiring indemnification of licensors and authors of that\n    material by anyone who conveys the material (or modified versions of\n    it) with contractual assumptions of liability to the recipient, for\n    any liability that these contractual assumptions directly impose on\n    those licensors and authors.\n\n  All other non-permissive additional terms are considered \"further\nrestrictions\" within the meaning of section 10.  If the Program as you\nreceived it, or any part of it, contains a notice stating that it is\ngoverned by this License along with a term that is a further\nrestriction, you may remove that term.  
If a license document contains\na further restriction but permits relicensing or conveying under this\nLicense, you may add to a covered work material governed by the terms\nof that license document, provided that the further restriction does\nnot survive such relicensing or conveying.\n\n  If you add terms to a covered work in accord with this section, you\nmust place, in the relevant source files, a statement of the\nadditional terms that apply to those files, or a notice indicating\nwhere to find the applicable terms.\n\n  Additional terms, permissive or non-permissive, may be stated in the\nform of a separately written license, or stated as exceptions;\nthe above requirements apply either way.\n\n  8. Termination.\n\n  You may not propagate or modify a covered work except as expressly\nprovided under this License.  Any attempt otherwise to propagate or\nmodify it is void, and will automatically terminate your rights under\nthis License (including any patent licenses granted under the third\nparagraph of section 11).\n\n  However, if you cease all violation of this License, then your\nlicense from a particular copyright holder is reinstated (a)\nprovisionally, unless and until the copyright holder explicitly and\nfinally terminates your license, and (b) permanently, if the copyright\nholder fails to notify you of the violation by some reasonable means\nprior to 60 days after the cessation.\n\n  Moreover, your license from a particular copyright holder is\nreinstated permanently if the copyright holder notifies you of the\nviolation by some reasonable means, this is the first time you have\nreceived notice of violation of this License (for any work) from that\ncopyright holder, and you cure the violation prior to 30 days after\nyour receipt of the notice.\n\n  Termination of your rights under this section does not terminate the\nlicenses of parties who have received copies or rights from you under\nthis License.  
If your rights have been terminated and not permanently\nreinstated, you do not qualify to receive new licenses for the same\nmaterial under section 10.\n\n  9. Acceptance Not Required for Having Copies.\n\n  You are not required to accept this License in order to receive or\nrun a copy of the Program.  Ancillary propagation of a covered work\noccurring solely as a consequence of using peer-to-peer transmission\nto receive a copy likewise does not require acceptance.  However,\nnothing other than this License grants you permission to propagate or\nmodify any covered work.  These actions infringe copyright if you do\nnot accept this License.  Therefore, by modifying or propagating a\ncovered work, you indicate your acceptance of this License to do so.\n\n  10. Automatic Licensing of Downstream Recipients.\n\n  Each time you convey a covered work, the recipient automatically\nreceives a license from the original licensors, to run, modify and\npropagate that work, subject to this License.  You are not responsible\nfor enforcing compliance by third parties with this License.\n\n  An \"entity transaction\" is a transaction transferring control of an\norganization, or substantially all assets of one, or subdividing an\norganization, or merging organizations.  If propagation of a covered\nwork results from an entity transaction, each party to that\ntransaction who receives a copy of the work also receives whatever\nlicenses to the work the party's predecessor in interest had or could\ngive under the previous paragraph, plus a right to possession of the\nCorresponding Source of the work from the predecessor in interest, if\nthe predecessor has it or can get it with reasonable efforts.\n\n  You may not impose any further restrictions on the exercise of the\nrights granted or affirmed under this License.  
For example, you may\nnot impose a license fee, royalty, or other charge for exercise of\nrights granted under this License, and you may not initiate litigation\n(including a cross-claim or counterclaim in a lawsuit) alleging that\nany patent claim is infringed by making, using, selling, offering for\nsale, or importing the Program or any portion of it.\n\n  11. Patents.\n\n  A \"contributor\" is a copyright holder who authorizes use under this\nLicense of the Program or a work on which the Program is based.  The\nwork thus licensed is called the contributor's \"contributor version\".\n\n  A contributor's \"essential patent claims\" are all patent claims\nowned or controlled by the contributor, whether already acquired or\nhereafter acquired, that would be infringed by some manner, permitted\nby this License, of making, using, or selling its contributor version,\nbut do not include claims that would be infringed only as a\nconsequence of further modification of the contributor version.  For\npurposes of this definition, \"control\" includes the right to grant\npatent sublicenses in a manner consistent with the requirements of\nthis License.\n\n  Each contributor grants you a non-exclusive, worldwide, royalty-free\npatent license under the contributor's essential patent claims, to\nmake, use, sell, offer for sale, import and otherwise run, modify and\npropagate the contents of its contributor version.\n\n  In the following three paragraphs, a \"patent license\" is any express\nagreement or commitment, however denominated, not to enforce a patent\n(such as an express permission to practice a patent or covenant not to\nsue for patent infringement).  
To \"grant\" such a patent license to a\nparty means to make such an agreement or commitment not to enforce a\npatent against the party.\n\n  If you convey a covered work, knowingly relying on a patent license,\nand the Corresponding Source of the work is not available for anyone\nto copy, free of charge and under the terms of this License, through a\npublicly available network server or other readily accessible means,\nthen you must either (1) cause the Corresponding Source to be so\navailable, or (2) arrange to deprive yourself of the benefit of the\npatent license for this particular work, or (3) arrange, in a manner\nconsistent with the requirements of this License, to extend the patent\nlicense to downstream recipients.  \"Knowingly relying\" means you have\nactual knowledge that, but for the patent license, your conveying the\ncovered work in a country, or your recipient's use of the covered work\nin a country, would infringe one or more identifiable patents in that\ncountry that you have reason to believe are valid.\n\n  If, pursuant to or in connection with a single transaction or\narrangement, you convey, or propagate by procuring conveyance of, a\ncovered work, and grant a patent license to some of the parties\nreceiving the covered work authorizing them to use, propagate, modify\nor convey a specific copy of the covered work, then the patent license\nyou grant is automatically extended to all recipients of the covered\nwork and works based on it.\n\n  A patent license is \"discriminatory\" if it does not include within\nthe scope of its coverage, prohibits the exercise of, or is\nconditioned on the non-exercise of one or more of the rights that are\nspecifically granted under this License.  
You may not convey a covered\nwork if you are a party to an arrangement with a third party that is\nin the business of distributing software, under which you make payment\nto the third party based on the extent of your activity of conveying\nthe work, and under which the third party grants, to any of the\nparties who would receive the covered work from you, a discriminatory\npatent license (a) in connection with copies of the covered work\nconveyed by you (or copies made from those copies), or (b) primarily\nfor and in connection with specific products or compilations that\ncontain the covered work, unless you entered into that arrangement,\nor that patent license was granted, prior to 28 March 2007.\n\n  Nothing in this License shall be construed as excluding or limiting\nany implied license or other defenses to infringement that may\notherwise be available to you under applicable patent law.\n\n  12. No Surrender of Others' Freedom.\n\n  If conditions are imposed on you (whether by court order, agreement or\notherwise) that contradict the conditions of this License, they do not\nexcuse you from the conditions of this License.  If you cannot convey a\ncovered work so as to satisfy simultaneously your obligations under this\nLicense and any other pertinent obligations, then as a consequence you may\nnot convey it at all.  For example, if you agree to terms that obligate you\nto collect a royalty for further conveying from those to whom you convey\nthe Program, the only way you could satisfy both those terms and this\nLicense would be to refrain entirely from conveying the Program.\n\n  13. Use with the GNU Affero General Public License.\n\n  Notwithstanding any other provision of this License, you have\npermission to link or combine any covered work with a work licensed\nunder version 3 of the GNU Affero General Public License into a single\ncombined work, and to convey the resulting work.  
The terms of this\nLicense will continue to apply to the part which is the covered work,\nbut the special requirements of the GNU Affero General Public License,\nsection 13, concerning interaction through a network will apply to the\ncombination as such.\n\n  14. Revised Versions of this License.\n\n  The Free Software Foundation may publish revised and/or new versions of\nthe GNU General Public License from time to time.  Such new versions will\nbe similar in spirit to the present version, but may differ in detail to\naddress new problems or concerns.\n\n  Each version is given a distinguishing version number.  If the\nProgram specifies that a certain numbered version of the GNU General\nPublic License \"or any later version\" applies to it, you have the\noption of following the terms and conditions either of that numbered\nversion or of any later version published by the Free Software\nFoundation.  If the Program does not specify a version number of the\nGNU General Public License, you may choose any version ever published\nby the Free Software Foundation.\n\n  If the Program specifies that a proxy can decide which future\nversions of the GNU General Public License can be used, that proxy's\npublic statement of acceptance of a version permanently authorizes you\nto choose that version for the Program.\n\n  Later license versions may give you additional or different\npermissions.  However, no additional obligations are imposed on any\nauthor or copyright holder as a result of your choosing to follow a\nlater version.\n\n  15. Disclaimer of Warranty.\n\n  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY\nAPPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT\nHOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \"AS IS\" WITHOUT WARRANTY\nOF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,\nTHE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\nPURPOSE.  
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM\nIS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF\nALL NECESSARY SERVICING, REPAIR OR CORRECTION.\n\n  16. Limitation of Liability.\n\n  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING\nWILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS\nTHE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY\nGENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE\nUSE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF\nDATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD\nPARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),\nEVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF\nSUCH DAMAGES.\n\n  17. Interpretation of Sections 15 and 16.\n\n  If the disclaimer of warranty and limitation of liability provided\nabove cannot be given local legal effect according to their terms,\nreviewing courts shall apply local law that most closely approximates\nan absolute waiver of all civil liability in connection with the\nProgram, unless a warranty or assumption of liability accompanies a\ncopy of the Program in return for a fee.\n\n                     END OF TERMS AND CONDITIONS\n\n            How to Apply These Terms to Your New Programs\n\n  If you develop a new program, and you want it to be of the greatest\npossible use to the public, the best way to achieve this is to make it\nfree software which everyone can redistribute and change under these terms.\n\n  To do so, attach the following notices to the program.  
It is safest\nto attach them to the start of each source file to most effectively\nstate the exclusion of warranty; and each file should have at least\nthe \"copyright\" line and a pointer to where the full notice is found.\n\n    Distributed Deep Learning with Keras and Apache Spark.\n    Copyright (C) 2016  Joeri Hermans\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see <http://www.gnu.org/licenses/>.\n\nAlso add information on how to contact you by electronic and paper mail.\n\n  If the program does terminal interaction, make it output a short\nnotice like this when it starts in an interactive mode:\n\n    Distributed Keras  Copyright (C) 2016  Joeri Hermans\n    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.\n    This is free software, and you are welcome to redistribute it\n    under certain conditions; type `show c' for details.\n\nThe hypothetical commands `show w' and `show c' should show the appropriate\nparts of the General Public License.  
Of course, your program's commands\nmight be different; for a GUI interface, you would use an \"about box\".\n\n  You should also get your employer (if you work as a programmer) or school,\nif any, to sign a \"copyright disclaimer\" for the program, if necessary.\nFor more information on this, and how to apply and follow the GNU GPL, see\n<http://www.gnu.org/licenses/>.\n\n  The GNU General Public License does not permit incorporating your program\ninto proprietary programs.  If your program is a subroutine library, you\nmay consider it more useful to permit linking proprietary applications with\nthe library.  If this is what you want to do, use the GNU Lesser General\nPublic License instead of this License.  But first, please read\n<http://www.gnu.org/philosophy/why-not-lgpl.html>.\n"
  },
  {
    "path": "README.md",
    "content": "# Distributed Keras\n\nDistributed Deep Learning with Apache Spark and Keras.\n\n\n## Introduction\n\nDistributed Keras is a distributed deep learning framework built op top of Apache Spark and Keras, with a focus on \"state-of-the-art\" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer could be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of **ensembles** and models using **data parallel** methods.\n\nMost of the distributed optimizers we provide, are based on data parallel methods. A data parallel method, as described in [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf), is a learning paradigm where multiple replicas of a single model are used to optimize a single objective. Using this approach, we are able to dignificantly reduce the training time of a model. Depending on the parametrization, we also observed that it is possible to achieve better statistical model performance compared to a more traditional approach (e.g., like the [SingleTrainer](#single-trainer) implementation), and yet, spending less wallclock time on the training of the model. However, this is subject to further research.\n\n**Attention**: A rather complete introduction to the problem of Distributed Deep Learning is presented in my Master Thesis [http://github.com/JoeriHermans/master-thesis](http://github.com/JoeriHermans/master-thesis). Furthermore, the thesis describes includes several *novel* insights, such as a redefinition of parameter staleness, and several new distributed optimizers such as AGN and ADAG.\n\n\n## Installation\n\nWe will guide you how to install Distributed Keras. However, we will assume that an Apache Spark installation is available. 
In the following subsections, we describe two approaches to achieve this.\n\n### pip\n\nWhen you only require the framework for development purposes, just use `pip` to install dist-keras.\n\n```bash\npip install --upgrade dist-keras\n\n# OR\n\npip install --upgrade git+https://github.com/JoeriHermans/dist-keras.git\n```\n\n### git & pip\n\nHowever, if you would like to contribute or run some of the examples, it is probably best to clone the repository directly from GitHub and install it afterwards using `pip`. This will also resolve possibly missing dependencies.\n\n```bash\ngit clone https://github.com/JoeriHermans/dist-keras\ncd dist-keras\npip install -e .\n```\n\n### General notes\n\n#### .bashrc\n\nMake sure the following variables are set in your `.bashrc`. Depending on your system configuration, it is possible that this configuration **doesn't have to be applied**.\n\n```bash\n# Example of a .bashrc configuration.\nexport SPARK_HOME=/usr/lib/spark\nexport PYTHONPATH=\"$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH\"\n```\n\n\n## Running an example\n\nWe would like to refer the reader to the `workflow.ipynb` notebook in the examples folder. This will give you a complete introduction to the problem of distributed deep learning, and will guide you through the steps that have to be executed.\n\nFurthermore, we would also like to show how exactly you should process \"big\" datasets. This is shown in the examples starting with the prefix ```example_```. Please execute them in the provided sequence.\n\n### Spark 2.0\n\nIf you want to run the examples using Apache Spark 2.0.0 or higher, you will need to remove the line containing `sqlContext = SQLContext(sc)`. 
This is required because in Spark 2.0+ the SQLContext and HiveContext have been merged into the SparkSession.\n\n\n## Optimization Algorithms\n\n### Single Trainer\n\nThis optimizer follows the traditional scheme of training a model, i.e., it uses sequential gradient updates to optimize the parameters. It does this by executing the training procedure on a single Spark executor.\n\n```python\nSingleTrainer(model, features_col, label_col, batch_size, optimizer, loss, metrics=[\"accuracy\"])\n```\n\n### ADAG (Currently Recommended)\n\nA DOWNPOUR variant which is able to achieve significantly better statistical performance while being less sensitive to hyperparameters. This optimizer was developed using insights gained while developing this framework. More research regarding parameter staleness is still being conducted to further improve this optimizer.\n\n```python\nADAG(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n     features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=12)\n```\n\n### Dynamic SGD\n\nDynamic SGD dynamically maintains a learning rate for every worker by incorporating parameter staleness. This optimization scheme was introduced in \"Heterogeneity-aware Distributed Parameter Servers\" at the SIGMOD 2017 conference [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf).\n\n```python\nDynSGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n       features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=10)\n```\n\n### Asynchronous Elastic Averaging SGD (AEASGD)\n\nThe distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) and the master to perform exploitation. 
This approach differs from other settings explored in the literature, and focuses on how fast the center variable converges [[2]](https://arxiv.org/pdf/1412.6651.pdf).\n\nIn this section we show the asynchronous version of EASGD. Instead of waiting on the synchronization of other trainers, this method communicates the elastic difference (as described in the paper) with the parameter server. The only synchronization mechanism that has been implemented is to ensure no race conditions occur when updating the center variable.\n\n\n```python\nAEASGD(keras_model, worker_optimizer, loss, num_workers, batch_size, features_col,\n       label_col, num_epoch, communication_window, rho, learning_rate,\n       metrics=[\"accuracy\"])\n```\n\n### Asynchronous Elastic Averaging Momentum SGD (AEAMSGD)\n\nAsynchronous EAMSGD is a variant of asynchronous EASGD. It is based on Nesterov's momentum scheme, where the update of the local worker is modified to incorporate a momentum term [[2]](https://arxiv.org/pdf/1412.6651.pdf).\n\n```python\nEAMSGD(keras_model, worker_optimizer, loss, num_workers, batch_size,\n       features_col, label_col, num_epoch, communication_window, rho,\n       learning_rate, momentum, metrics=[\"accuracy\"])\n```\n\n### DOWNPOUR\n\nAn asynchronous stochastic gradient descent procedure introduced by Dean et al., which supports a large number of model replicas and leverages adaptive learning rates. This implementation is based on the pseudocode provided by [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).\n\n```python\nDOWNPOUR(keras_model, worker_optimizer, loss, num_workers, batch_size,\n         features_col, label_col, num_epoch, learning_rate, communication_window,\n         metrics=[\"accuracy\"])\n```\n\n### Ensemble Training\n\nIn ensemble training, we train `n` models in parallel on the same dataset. All models are trained in parallel, but the training of a single model is done in a sequential manner using Keras optimizers. 
After the training process, one can combine the models and, for example, average their outputs.\n\n```python\nEnsembleTrainer(keras_model, worker_optimizer, loss, features_col,\n                label_col, batch_size, num_ensembles, metrics=[\"accuracy\"])\n```\n\n### Model Averaging\n\nModel averaging is a data parallel technique which averages the trainable parameters of the model replicas after every epoch.\n\n```python\nAveragingTrainer(keras_model, worker_optimizer, loss, features_col,\n                 label_col, num_epoch, batch_size, num_workers, metrics=[\"accuracy\"])\n```\n\n## Job deployment\n\nWe also support remote job deployment. For example, imagine you are developing your model on a local notebook using a small development set. In order to run your job on a remote cluster, you would normally first have to develop a cluster job and submit it there. To simplify this process, we have developed a simple interface for large-scale machine learning jobs.\n\nTo submit a job to a remote cluster, you simply run the following code:\n\n```python\n# Define the distributed optimization procedure, and its parameters.\ntrainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, metrics=[\"accuracy\"], num_workers=20,\n               batch_size=32, communication_window=15, num_epoch=1,\n               features_col=\"features_normalized_dense\", label_col=\"label_encoded\")\n\n# Define the job parameters.\njob = Job(secret, job_name, data_path, num_executors, num_processes, trainer)\njob.send('http://yourcluster:[port]')\njob.wait_completion()\n# Fetch the trained model, and history for training evaluation.\ntrained_model = job.get_trained_model()\nhistory = job.get_history()\n```\n\n### Punchcard Server\n\nJob scheduling and execution is handled by our `Punchcard` server. This server will accept requests from a remote location given a specific `secret`, which is basically a long identification string of a specific user. 
However, a user can have multiple secrets. At the moment, a job is only executed if there are no other jobs running for the specified secret.\n\nIn order to submit jobs to `Punchcard`, we need to specify a secrets file. This file is a JSON document with the following structure:\n\n```json\n[\n    {\n        \"secret\": \"secret_of_user_1\",\n        \"identity\": \"user1\"\n    },\n    {\n        \"secret\": \"secret_of_user_2\",\n        \"identity\": \"user2\"\n    }\n]\n```\n\nAfter the secrets file has been constructed, the Punchcard server can be started by issuing the following command.\n\n```sh\npython scripts/punchcard.py --secrets /path/to/secrets.json\n```\n\n#### Secret Generation\n\nIn order to simplify secret generation, we have added a custom script which will generate a unique key for the specified identity. The secret can be generated by running the following command.\n\n```sh\npython scripts/generate_secret.py --identity userX\n```\n\n## Optimization Schemes\n\nTODO\n\n## General note\n\nIt is known that adding more asynchronous workers deteriorates the statistical performance of the model. There have been some studies which examine this particular effect. However, some of them conclude that adding more asynchronous workers actually contributes to something they call **implicit momentum** [[3]](https://arxiv.org/pdf/1605.09774.pdf). However, this is subject to further investigation.\n\n\n## Known issues\n\n- Python 3 compatibility.\n\n\n## TODO's\n\nList of possible future additions.\n\n- Save Keras model to HDFS.\n- Load Keras model from HDFS.\n- Compression / decompression of network transmissions.\n- Stop on target loss.\n- Multiple parameter servers for large Deep Networks.\n- Python 3 compatibility.\n- For every worker, spawn an additional thread which is responsible for sending updates to the parameter server. 
The actual worker thread will just submit tasks to this queue.\n\n\n## Citing\n\nIf you use this framework in any academic work, please use the following BibTex code.\n\n```latex\n@misc{dist_keras_joerihermans,\n  author = {Joeri R. Hermans, CERN IT-DB},\n  title = {Distributed Keras: Distributed Deep Learning with Apache Spark and Keras},\n  year = {2016},\n  publisher = {GitHub},\n  journal = {GitHub Repository},\n  howpublished = {\\url{https://github.com/JoeriHermans/dist-keras/}},\n}\n```\n\n## References\n\n* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231). [[1]](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf)\n\n* Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693). [[2]](https://arxiv.org/pdf/1412.6651.pdf)\n\n* Mitliagkas, Ioannis, et al. \"Asynchrony begets Momentum, with an Application to Deep Learning.\" arXiv preprint arXiv:1605.09774 (2016). [[3]](https://arxiv.org/pdf/1605.09774.pdf)\n\n<!-- @misc{pumperla2015, -->\n<!-- author = {Max Pumperla}, -->\n<!-- title = {elephas}, -->\n<!-- year = {2015}, -->\n<!-- publisher = {GitHub}, -->\n<!-- journal = {GitHub repository}, -->\n<!-- howpublished = {\\url{https://github.com/maxpumperla/elephas}} -->\n<!-- } -->\n* Pumperla, M. (2015). Elephas. Github Repository https://github.com/maxpumperla/elephas/. [4]\n* Jiawei Jiang, Bin Cui, Ce Zhang and Lele Yu (2017). Heterogeneity-aware Distributed Parameter Servers [[5]](http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf)\n\n\n## Licensing\n\n![GPLv3](resources/gpl_v3.png) ![CERN](resources/cern_logo.jpg)\n"
  },
  {
    "path": "distkeras/__init__.py",
    "content": ""
  },
  {
    "path": "distkeras/evaluators.py",
    "content": "\"\"\"Evaluation module.\n\nAn evaluator will evaluate a dataframe according to specific requirements.\n\"\"\"\n\nclass Evaluator(object):\n    \"\"\"An evaluator is an abstract class which will, given a label and a prediction,\n       will compute an evaluation metric.\n\n    # Arguments\n        label_col: string. Column name of the label.\n        prediction_col: string. Column name of the prediction.\n    \"\"\"\n\n    def __init__(self, label_col=\"label\", prediction_col=\"prediction\"):\n        self.label_column = label_col\n        self.prediction_column = prediction_col\n\n    def evaluate(self, dataframe):\n        \"\"\"Evalutes the specified dataframe.\n\n        # Arguments\n            dataframe: dataframe. Spark Dataframe.\n        \"\"\"\n        raise NotImplementedError\n\n\nclass AccuracyEvaluator(Evaluator):\n    \"\"\"Computes the accuracy of the prediction based on the label.\n\n    # Arguments\n        label_col: string. Label column.\n        prediction_col: string. Prediction column.\n    \"\"\"\n\n    def __init__(self, label_col=\"label\", prediction_col=\"prediction\"):\n        # Initialize the parent structure.\n        super(AccuracyEvaluator, self).__init__(label_col, prediction_col)\n\n    def evaluate(self, dataframe):\n        # Count the total number of instances.\n        num_instances = dataframe.count()\n        # Extract the matching indexes.\n        cleaned = dataframe.where(dataframe[self.prediction_column] == dataframe[self.label_column])\n        # Fetch the number of correctly guessed instances.\n        validated_instances = cleaned.count()\n\n        return float(validated_instances) / float(num_instances)\n"
  },
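The accuracy computed by `AccuracyEvaluator` above is simply the fraction of rows whose prediction equals the label. As a minimal illustration of that metric without a Spark cluster, the following sketch applies the same logic to a plain list of dictionaries (`rows` is a hypothetical stand-in for a Spark DataFrame, not part of the library):

```python
def accuracy(rows, label_col="label", prediction_col="prediction"):
    # Mirrors AccuracyEvaluator.evaluate: correct predictions / total instances.
    num_instances = len(rows)
    validated_instances = sum(
        1 for row in rows if row[prediction_col] == row[label_col]
    )
    return float(validated_instances) / float(num_instances)

rows = [
    {"label": 1.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 1.0},
    {"label": 1.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 0.0},
]
print(accuracy(rows))  # 3 of the 4 predictions match the label -> 0.75
```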
  {
    "path": "distkeras/job_deployment.py",
    "content": "\"\"\"Module which facilitates job deployment on remote Spark clusters.\nThis allows you to build models and architectures on, for example, remote\nnotebook servers, and submit the large scale training job on remote\nHadoop / Spark clusters.\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nfrom distkeras.utils import deserialize_keras_model\nfrom distkeras.utils import get_os_username\nfrom distkeras.utils import pickle_object\nfrom distkeras.utils import serialize_keras_model\nfrom distkeras.utils import unpickle_object\n\nfrom flask import Flask\nfrom flask import request\n\nfrom os.path import expanduser\n\nfrom threading import Lock\n\nimport base64\n\nimport json\n\nimport os\n\nimport subprocess\n\nimport threading\n\nimport time\n\nimport urllib2\n\n## END Imports. ################################################################\n\nclass Punchcard(object):\n\n    def __init__(self, secrets_path=\"secrets.json\", port=80):\n        self.application = Flask(__name__)\n        self.secrets_path = secrets_path\n        self.port = port\n        self.mutex = threading.Lock()\n        self.jobs = {}\n\n    def read_secrets(self):\n        with open(self.secrets_path) as f:\n            secrets_raw = f.read()\n        secrets = json.loads(secrets_raw)\n\n        return secrets\n\n    def valid_secret(self, secret, secrets):\n        num_secrets = len(secrets)\n        for i in range(0, num_secrets):\n            description = secrets[i]\n            if description['secret'] == secret:\n                return True\n        return False\n\n    def secret_in_use(self, secret):\n        return secret in self.jobs\n\n    def set_trained_model(self, job, model):\n        with self.mutex:\n            self.models[job.get_secret()] = model\n\n    def get_submitted_job(self, secret):\n        with self.mutex:\n            if self.secret_in_use(secret):\n                job = self.jobs[secret]\n            else:\n   
             job = None\n\n        return job\n\n    def define_routes(self):\n\n        ## BEGIN Route definitions. ############################################\n\n        @self.application.route('/api/submit', methods=['POST'])\n        def submit_job():\n            # Parse the incoming JSON data.\n            data = json.loads(request.data)\n            # Fetch the required job arguments.\n            secret = data['secret']\n            job_name = data['job_name']\n            num_executors = data['num_executors']\n            num_processes = data['num_processes']\n            data_path = data['data_path']\n            trainer = unpickle_object(data['trainer'].decode('hex_codec'))\n            # Fetch the parameters for the job.\n            secrets = self.read_secrets()\n            with self.mutex:\n                if self.valid_secret(secret, secrets) and not self.secret_in_use(secret):\n                    job = PunchcardJob(secret, job_name, data_path, num_executors, num_processes, trainer)\n                    self.jobs[secret] = job\n                    job.start()\n                    return '', 200\n\n            return '', 403\n\n        @self.application.route('/api/state')\n        def job_state():\n            secret = request.args.get('secret')\n            job = self.get_submitted_job(secret)\n            # Check if the job exists.\n            if job is not None:\n                d = {}\n                d['job_name'] = job.get_job_name()\n                d['running'] = job.running()\n                return json.dumps(d), 200\n\n            return '', 404\n\n        @self.application.route('/api/cancel')\n        def cancel():\n            secret = request.args.get('secret')\n            job = self.get_submitted_job(secret)\n            if job is not None and job.running():\n                with self.mutex:\n                    job.cancel()\n                    del self.jobs[secret]\n\n            return '', 200\n\n        
@self.application.route('/api/destroy')\n        def destroy_job():\n            secret = request.args.get('secret')\n            job = self.get_submitted_job(secret)\n            if job is not None and not job.running():\n                with self.mutex:\n                    model = self.jobs[secret].get_trained_model()\n                    history = self.jobs[secret].get_history()\n                    model = pickle_object(serialize_keras_model(model)).encode('hex_codec')\n                    history = pickle_object(history).encode('hex_codec')\n                    d = {}\n                    d['model'] = model\n                    d['history'] = history\n                    del self.jobs[secret]\n                return json.dumps(d), 200\n\n            return '', 400\n\n        ## END Route definitions. ##############################################\n\n    def run(self):\n        self.define_routes()\n        self.application.run('0.0.0.0', self.port)\n\n\nclass PunchcardJob(object):\n\n    def __init__(self, secret, job_name, data_path, num_executors, num_processes, trainer):\n        self.secret = secret\n        self.job_name = job_name\n        self.data_path = data_path\n        self.num_executors = num_executors\n        self.num_processes = num_processes\n        self.trainer = trainer\n        self.is_running = True\n        self.thread = None\n        self.trained_model = None\n        self.history = None\n\n    def get_job_name(self):\n        return self.job_name\n\n    def get_secret(self):\n        return self.secret\n\n    def get_history(self):\n        return self.history\n\n    def get_trained_model(self):\n        return self.trained_model\n\n    def start(self):\n        self.trainer.determine_new_master()\n        self.thread = threading.Thread(target=self.run)\n        self.thread.setDaemon(True)\n        self.thread.start()\n\n    def cancel(self):\n        # Python threads cannot be terminated from the outside; mark the job\n        # as no longer running and let the daemon thread be reaped on exit.\n        self.is_running = False\n\n    def running(self):\n        return self.is_running\n\n    
def join(self):\n        self.thread.join()\n\n    def run_job(self):\n        os.system(\"python ~/jobs/\" + self.secret + \".py\")\n\n    def clean_up(self):\n        home = expanduser(\"~\")\n        os.remove(home + \"/models/\" + self.secret)\n        os.remove(home + \"/histories/\" + self.secret)\n        os.remove(home + \"/trainers/\" + self.secret)\n\n    def read_trained_model(self):\n        home = expanduser(\"~\")\n        with open(home + \"/models/\" + self.secret, \"r\") as f:\n            self.trained_model = deserialize_keras_model(unpickle_object(f.read()))\n\n    def read_history(self):\n        home = expanduser(\"~\")\n        with open(home + \"/histories/\" + self.secret, \"r\") as f:\n            self.history = unpickle_object(f.read())\n\n    def serialize_trainer(self):\n        trainer = pickle_object(self.trainer)\n        home = expanduser(\"~\")\n        with open(home + \"/trainers/\" + self.secret, \"w\") as f:\n            f.write(trainer)\n\n    def generate_code(self):\n        source = \"\"\"\nfrom distkeras.evaluators import *\nfrom distkeras.predictors import *\nfrom distkeras.trainers import *\nfrom distkeras.transformers import *\nfrom distkeras.utils import *\nfrom keras import *\nfrom pyspark import SparkConf\nfrom pyspark import SparkContext\nfrom pyspark import SQLContext\nfrom os.path import expanduser\nsecret = '{secret}'\napplication_name = '{job_name}'\nnum_executors = {num_executors}\nnum_processes = {num_processes}\npath_data = '{data_path}'\nnum_workers = num_processes * num_executors\n# Allocate a Spark Context, and a Spark SQL context.\nconf = SparkConf()\nconf.set(\"spark.app.name\", application_name)\nconf.set(\"spark.master\", \"yarn-client\")\nconf.set(\"spark.executor.cores\", num_processes)\nconf.set(\"spark.executor.instances\", num_executors)\nconf.set(\"spark.executor.memory\", \"5g\")\nconf.set(\"spark.locality.wait\", \"0\")\nconf.set(\"spark.serializer\", 
\"org.apache.spark.serializer.KryoSerializer\");\nsc = SparkContext(conf=conf)\nsqlContext = SQLContext(sc)\n# Read the dataset from HDFS. For now we assume Parquet files.\ndataset = sqlContext.read.parquet(path_data).repartition(num_workers)\n# Deserialize the trainer object.\nhome = expanduser(\"~\")\nwith open(home + \"/trainers/\" + secret, \"r\") as f:\n    trainer = unpickle_object(f.read())\n# Train the model, and save it afterwards.\ntrained_model = trainer.train(dataset)\nwith open(home + \"/models/\" + secret, \"w\") as f:\n    f.write(pickle_object(serialize_keras_model(trained_model)))\n# Save the history of the training process.\nhistories = trainer.get_history()\nwith open(home + \"/histories/\" + secret, \"w\") as f:\n    f.write(pickle_object(histories))\nsc.stop()\n        \"\"\".format(\n            secret=self.secret,\n            job_name=self.job_name,\n            num_executors=self.num_executors,\n            num_processes=self.num_processes,\n            data_path=self.data_path\n        )\n        home = expanduser(\"~\")\n        with open(home + \"/jobs/\" + self.secret + \".py\", \"w\") as f:\n            f.write(source)\n\n    def run(self):\n        self.serialize_trainer()\n        self.generate_code()\n        self.run_job()\n        self.read_trained_model()\n        self.read_history()\n        self.clean_up()\n        self.is_running = False\n\n\nclass Job(object):\n\n    def __init__(self, secret, job_name, data_path, num_executors, num_processes, trainer):\n        self.secret = secret\n        self.job_name = job_name\n        self.num_executors = 20\n        self.num_processes = 1\n        self.data_path = data_path\n        self.trainer = trainer\n        self.trained_model = None\n        self.history = None\n        self.address = None\n\n    def set_num_executors(self, num_executors):\n        self.num_executors = num_executors\n\n    def set_num_processes(self, num_processes):\n        self.num_processes = 
num_processes\n\n    def get_trained_model(self):\n        return self.trained_model\n\n    def get_history(self):\n        return self.history\n\n    def is_finished(self):\n        address = self.address + '/api/state?secret=' + self.secret\n        request = urllib2.Request(address)\n        response = urllib2.urlopen(request)\n        data = json.load(response)\n\n        return not data['running']\n\n    def destroy_remote_job(self):\n        address = self.address + '/api/destroy?secret=' + self.secret\n        request = urllib2.Request(address)\n        response = urllib2.urlopen(request)\n        data = json.load(response)\n        model = unpickle_object(data['model'].decode('hex_codec'))\n        self.trained_model = deserialize_keras_model(model)\n        self.history = unpickle_object(data['history'].decode('hex_codec'))\n\n    def start(self):\n        self.thread = threading.Thread(target=self.run)\n        self.thread.start()\n\n    def wait_completion(self):\n        self.thread.join()\n\n    def cancel(self):\n        address = self.address + '/api/cancel?secret=' + self.secret\n        request = urllib2.Request(address)\n        urllib2.urlopen(request)\n\n    def send(self, address):\n        data = {}\n        data['secret'] = self.secret\n        data['job_name'] = self.job_name\n        data['num_executors'] = self.num_executors\n        data['num_processes'] = self.num_processes\n        data['data_path'] = self.data_path\n        data['trainer'] = pickle_object(self.trainer).encode('hex_codec')\n        request = urllib2.Request(address + \"/api/submit\")\n        request.add_header('Content-Type', 'application/json')\n        urllib2.urlopen(request, json.dumps(data))\n        self.address = address\n        self.start()\n\n    def run(self):\n        time.sleep(1)\n        while not self.is_finished():\n            time.sleep(10)\n        self.destroy_remote_job()\n"
  },
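The `Job` and `Punchcard` classes above transport the trainer and the trained model as hex-encoded pickles (`pickle_object(...).encode('hex_codec')` in the Python 2 code). A Python 3 equivalent of that round trip, using `binascii` in place of the removed `hex_codec` codec, might look like the following sketch (`pickle_object`/`unpickle_object` are stand-ins for the `distkeras.utils` helpers, not the helpers themselves):

```python
import binascii
import pickle

def pickle_object(obj):
    # Stand-in for distkeras.utils.pickle_object.
    return pickle.dumps(obj, protocol=2)

def unpickle_object(blob):
    # Stand-in for distkeras.utils.unpickle_object.
    return pickle.loads(blob)

# Hex-encode the pickled payload for transport in a JSON request body.
payload = {"job_name": "mnist_mlp", "num_executors": 20}
encoded = binascii.hexlify(pickle_object(payload)).decode("ascii")
# ... transmitted over HTTP ...
decoded = unpickle_object(binascii.unhexlify(encoded))
assert decoded == payload
```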
  {
    "path": "distkeras/networking.py",
    "content": "\"\"\"Networking utility functions.\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport pickle\n\nimport socket\n\n## END Imports. ################################################################\n\ndef determine_host_address():\n    \"\"\"Determines the human-readable host address of the local machine.\"\"\"\n    host_address = socket.gethostbyname(socket.gethostname())\n\n    return host_address\n\n\ndef recvall(connection, num_bytes):\n    \"\"\"Reads `num_bytes` bytes from the specified connection.\n\n    # Arguments\n        connection: socket. Opened socket.\n        num_bytes: int. Number of bytes to read.\n    \"\"\"\n    byte_buffer = b''\n    buffer_size = 0\n    bytes_left = num_bytes\n    # Iterate until we received all data.\n    while buffer_size < num_bytes:\n        # Fetch the next frame from the network.\n        data = connection.recv(bytes_left)\n        # Compute the size of the frame.\n        delta = len(data)\n        buffer_size += delta\n        bytes_left -= delta\n        # Append the data to the buffer.\n        byte_buffer += data\n\n    return byte_buffer\n\n\ndef recv_data(connection):\n    \"\"\"Will fetch the next data frame from the connection.\n\n    The protocol for reading is structured as follows:\n    1. The first 20 bytes represents a string which holds the next number of bytes to read.\n    2. We convert the 20 byte string to an integer (e.g. '00000000000000000011' -> 11).\n    3. We read `num_bytes` from the socket (which is in our example 11).\n    4. Deserialize the retrieved string.\n\n    # Arguments\n        connection: socket. 
Opened socket.\n    \"\"\"\n    data = b''\n    # Fetch the serialized data length.\n    length = int(recvall(connection, 20).decode())\n    # Fetch the serialized data.\n    serialized_data = recvall(connection, length)\n    # Deserialize the data.\n    data = pickle.loads(serialized_data)\n\n    return data\n\n\ndef send_data(connection, data):\n    \"\"\"Sends the data to the other endpoint of the socket using our protocol.\n\n    The protocol for sending is structured as follows:\n    1. Serialize the data.\n    2. Obtain the buffer-size of the serialized data.\n    3. Serialize the buffer-size in 20 bytes (e.g. 11 -> '00000000000000000011').\n    4. Send the serialized buffer size.\n    5. Send the serialized data.\n\n    # Arguments\n        connection: socket. Opened socket.\n        data: any. Data to send.\n    \"\"\"\n    # Serialize the data.\n    serialized_data = pickle.dumps(data, -1)\n    length = len(serialized_data)\n    # Serialize the number of bytes in the data.\n    serialized_length = str(length).zfill(20)\n    # Send the data over the provided socket.\n    connection.sendall(serialized_length.encode())\n    connection.sendall(serialized_data)\n\n\ndef connect(host, port, disable_nagle=True):\n    fd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n    # Check if Nagle's algorithm needs to be disabled.\n    if disable_nagle:\n        fd.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)\n    else:\n        fd.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)\n    # Connect to the specified URI.\n    fd.connect((host, port))\n\n    return fd\n"
  },
  {
    "path": "distkeras/parameter_servers.py",
    "content": "\"\"\"Parameter servers.\n\nA parameter server is a process which will aggregate all the incoming gradient\nor parameter updates of the workers and incorperate it into a single center variable.\nThis center variable will eventually be the produced model of the trainer.\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport copy\n\nimport math\n\nimport numpy as np\n\nimport socket\n\nimport threading\n\nfrom distkeras.networking import recv_data\nfrom distkeras.networking import send_data\nfrom distkeras.utils import deserialize_keras_model\n\n## END Imports. ################################################################\n\nclass ParameterServer(object):\n    \"\"\"Abstract class which provides basic attributed and methods for all\n       parameter servers.\n\n    # Arguments\n        model: string. Serialized Keras model.\n               See: distkeras.utils.serialize_keras_model\n    \"\"\"\n\n    def __init__(self, model):\n        self.model = deserialize_keras_model(model)\n        self.num_updates = 1\n\n    def initialize(self):\n        \"\"\"Initializes the parameter server.\n\n        This method is called after self.start().\n        \"\"\"\n        raise NotImplementedError\n\n    def start(self):\n        \"\"\"Starts the parameter server in a new thread.\"\"\"\n        raise NotImplementedError\n\n    def run(self):\n        \"\"\"Main event loop of the parameter server.\"\"\"\n        raise NotImplementedError\n\n    def stop(self):\n        \"\"\"Notifies the parameter server thread to stop.\"\"\"\n        raise NotImplementedError\n\n    def get_model(self):\n        \"\"\"Returns the Keras model which will be trained by the workers.\"\"\"\n        return self.model\n\n    def next_update(self):\n        \"\"\"Increments the number of model updates by 1.\"\"\"\n        self.num_updates += 1\n\n    def reset_update_counter(self):\n        \"\"\"Resets the model update counter.\"\"\"\n  
      self.num_updates = 0\n\n    def get_num_updates(self):\n        \"\"\"Returns the number of model updates the parameter server has performed.\"\"\"\n        return self.num_updates\n\n\nclass SocketParameterServer(ParameterServer):\n    \"\"\"Abstract class of a parameter server which is based on a socket implementation.\n\n    This means that this parameter server accepts multiple TCP connections from multiple\n    workers, and uses a costum protocol to transmit and receive the model parameters. This\n    is done by implementing a custom protocol. Which is fully described in the\n    distkeras.networking module.\n\n    # Arguments\n        model: string. Serialized Keras model.\n               See: distkeras.utils.serialize_keras_model\n        port: int. Listing port number.\n    \"\"\"\n\n    def __init__(self, model, port=5000):\n        super(SocketParameterServer, self).__init__(model)\n        self.master_port = port\n        self.socket = None\n        self.running = False\n        self.connections = []\n        self.mutex = threading.Lock()\n\n    def initialize(self):\n        \"\"\"Sets up the listing port.\"\"\"\n        # Reset the running flag.\n        self.running = True\n        # Prepare a socket.\n        file_descriptor = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n        # Disable Nagle's algorithm.\n        file_descriptor.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)\n        # Check if the master port needs to be assigned by the OS.\n        if self.master_port is None:\n            file_descriptor.bind(('0.0.0.0', 0))\n            # Retrieve the port assigned by the OS.\n            self.master_port = int(file_descriptor.getsockname()[1])\n        else:\n            file_descriptor.bind(('0.0.0.0', self.master_port))\n        # Listen to the socket.\n        file_descriptor.listen(5)\n        # Assign the socket.\n        self.socket = file_descriptor\n\n    def handle_commit(self, conn, addr):\n        \"\"\"Handles 
parameter updates coming from the workers.\n\n        # Arguments:\n            conn: socket. The opened connection.\n            addr: addr. Address of the remote host.\n        \"\"\"\n        raise NotImplementedError\n\n    def handle_pull(self, conn, addr):\n        \"\"\"Handles parameter requests coming from the workers. This will\n        actually send the model parameters to the requesting host.\n\n        # Arguments:\n            conn: socket. The opened connection.\n            addr: addr. Address of the remote host.\n        \"\"\"\n        # Fetch the raw center variables.\n        with self.mutex:\n            center_variable = self.model.get_weights()\n            cv = copy.deepcopy(center_variable)\n        # Send the data over the socket.\n        send_data(conn, cv)\n\n    def cancel_accept(self):\n        \"\"\"This method will cancel the accept procedure. The method\n        is meant to be executed by the stop() procedure.\n        \"\"\"\n        file_descriptor = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n        try:\n            # Connect to the listening socket to cancel the accept.\n            file_descriptor.connect((\"localhost\", self.master_port))\n            file_descriptor.close()\n        except Exception as e:\n            print(e)\n\n    def handle_connection(self, conn, addr):\n        \"\"\"\n        A parameter server has two main functionalities. Nodes are able to\n        pull (p) the current state, or 'commit' a state. This is implemented\n        in the following functionality. 
Classes which implement these interfaces\n        should not worry about connection handling.\n        \"\"\"\n        try:\n            while self.running:\n                # Fetch the current action.\n                action = conn.recv(1).decode()\n                # An empty read means the remote host closed the connection.\n                if not action:\n                    break\n                # Check if the action is a commit (most of the cases).\n                if action == 'c':\n                    # Handle the commit.\n                    self.handle_commit(conn, addr)\n                elif action == 'p':\n                    # Handle the pull.\n                    self.handle_pull(conn, addr)\n        except Exception as e:\n            print(e)\n\n    def start(self):\n        \"\"\"Starts the parameter server.\"\"\"\n        # Set the running flag.\n        self.running = True\n\n    def run(self):\n        \"\"\"Main event loop of the parameter server.\"\"\"\n        # Listen for incoming connections.\n        while self.running:\n            try:\n                # Accept incoming connections.\n                conn, addr = self.socket.accept()\n                # Handle the connection.\n                thread = threading.Thread(target=self.handle_connection, args=(conn, addr))\n                thread.start()\n                # Keep track of the handler thread.\n                self.connections.append(thread)\n            except Exception as e:\n                print(e)\n\n    def stop(self):\n        \"\"\"Stops the parameter server. This will also clean up all existing connections.\"\"\"\n        self.running = False\n        # Check if a socket is allocated.\n        if self.socket:\n            # Unblock the accept() call so the event loop can terminate.\n            self.cancel_accept()\n            self.cleanup_connections()\n            self.finalize()\n            self.socket.close()\n            self.socket = None\n        self.connections = []\n\n    def finalize(self):\n        \"\"\"Hook that is called when the parameter server stops.\"\"\"\n        pass\n\n    def cleanup_connections(self):\n        \"\"\"Cleans up all existing connections.\"\"\"\n        # Wait for all connection handler threads to terminate.\n        for thread in self.connections:\n            thread.join()\n\n\nclass DeltaParameterServer(SocketParameterServer):\n    \"\"\"A parameter server which integrates all incoming deltas into the model.\n\n    # Arguments\n        model: string. Serialized Keras model.\n               See: distkeras.utils.serialize_keras_model\n        master_port: int. Port number of the parameter server.\n    \"\"\"\n\n    def __init__(self, model, master_port):\n        super(DeltaParameterServer, self).__init__(model, master_port)\n        self.center_variable = np.asarray(self.model.get_weights())\n\n    def handle_commit(self, conn, addr):\n        # Receive the parameters from the remote node.\n        data = recv_data(conn)\n        # Extract the delta from the dictionary.\n        delta = data['delta']\n        # Update the center variable with the delta.\n        with self.mutex:\n            self.center_variable = self.center_variable + delta\n        # Next iteration.\n        self.next_update()\n\n    def handle_pull(self, conn, addr):\n        \"\"\"Handles parameter requests coming from the workers. This will\n        actually send the model parameters to the requesting host.\n\n        # Arguments:\n            conn: socket. The opened connection.\n            addr: addr. 
Address of the remote host.\n        \"\"\"\n        # Fetch the raw center variables.\n        with self.mutex:\n            cv = copy.deepcopy(self.center_variable)\n        # Send the data over the socket.\n        send_data(conn, cv)\n\n    def finalize(self):\n        # Set the final weights of the model.\n        self.model.set_weights(self.center_variable)\n\n\nclass ADAGParameterServer(SocketParameterServer):\n    \"\"\"A parameter server which integrates the incoming gradient residuals into\n       the model, and integrates them using the ADAG scheme.\n\n    # Arguments\n        model: string. Keras model.\n               See: distkeras.utils.serialize_keras_model\n        master_port: int. Port number of the parameter server.\n    \"\"\"\n\n    def __init__(self, model, master_port):\n        super(ADAGParameterServer, self).__init__(model, master_port)\n        self.center_variable = np.asarray(self.model.get_weights())\n\n    def handle_commit(self, conn, addr):\n        # Receive the parameters from the remote node.\n        data = recv_data(conn)\n        # Extract the data from the dictionary.\n        r = data['residual']\n        with self.mutex:\n            # Update the center variable.\n            self.center_variable = self.center_variable + r\n        # Increment the number of parameter server updates.\n        self.next_update()\n\n    def handle_pull(self, conn, addr):\n        \"\"\"Handles parameter requests coming from the workers. This will\n        actually send the model parameters to the requesting host.\n\n        # Arguments:\n            conn: socket. The opened connection.\n            addr: addr. 
Address of the remote host.\n        \"\"\"\n        # Fetch the raw center variables.\n        with self.mutex:\n            cv = copy.deepcopy(self.center_variable)\n        # Send the data over the socket.\n        send_data(conn, cv)\n\n    def finalize(self):\n        # Set the weights of the model.\n        self.model.set_weights(self.center_variable)\n\n\nclass DynSGDParameterServer(SocketParameterServer):\n    \"\"\"DynSGD parameter server, keeps track of the staleness between updates\n    to maintain dynamic worker learning rates based on staleness.\n\n    # Arguments\n        model: string. Keras model\n               See: distkeras.utils.serialize_keras_model\n        master_port: int. Port number of the parameter server.\n    \"\"\"\n\n    def __init__(self, model, master_port):\n        super(DynSGDParameterServer, self).__init__(model, master_port)\n\n    def handle_pull(self, conn, addr):\n        \"\"\"Handles parameter requests coming from the workers. This will\n        actually send the model parameters to the requesting host.\n\n        This is a specific implementation for DynSGD.\n\n        # Arguments:\n            conn: socket. The opened connection.\n            addr: addr. 
Address of the remote host.\n        \"\"\"\n        # Allocate a new dictionary.\n        data = {}\n        # Fetch the raw center variables.\n        with self.mutex:\n            center_variable = self.model.get_weights()\n            cv = copy.deepcopy(center_variable)\n            # Store the number of updates (u) the PS executed.\n            data['update'] = self.num_updates\n        # Store the model (m).\n        data['model'] = cv\n        # Send the data over the socket.\n        send_data(conn, data)\n\n    def handle_commit(self, conn, addr):\n        data = recv_data(conn)\n        r = data['residual']\n        # Fetch the update number the worker last saw, and derive the staleness.\n        last_update = data['last_update']\n        du = (self.num_updates - last_update) + 1\n        # Scale the residual down by the staleness.\n        r /= du\n        with self.mutex:\n            center_variable = self.model.get_weights()\n            center_variable = center_variable + r\n            self.model.set_weights(center_variable)\n        # Increment the number of parameter server updates.\n        self.next_update()\n\n\nclass ExperimentalParameterServer(SocketParameterServer):\n    \"\"\"An experimental parameter server which integrates the incoming gradient\n       residuals into the model, scaling every residual by the divergence between\n       the current center variable and the worker's stale copy.\n\n    # Arguments\n        model: string. Keras model.\n               See: distkeras.utils.serialize_keras_model\n        master_port: int. 
Port number of the parameter server.\n    \"\"\"\n\n    def __init__(self, model, master_port, learning_rate):\n        super(ExperimentalParameterServer, self).__init__(model, master_port)\n        self.center_variable = np.asarray(self.model.get_weights())\n        self.inverse_learning_rate = 1.0 / learning_rate\n\n    def handle_commit(self, conn, addr):\n        # Receive the parameters from the remote node.\n        data = recv_data(conn)\n        # Extract the data from the dictionary.\n        r = data['residual']\n        worker_id = data['worker_id']\n        stale_cv = data['stale_center_variable']\n        with self.mutex:\n            diff_cv = np.subtract(self.center_variable, stale_cv)\n            d = 1 / (self.inverse_learning_rate * np.power(diff_cv, 2) + 1)\n            r = np.multiply(d, r)\n            # Update the center variable.\n            self.center_variable = self.center_variable + r\n        # Increment the number of parameter server updates.\n        self.next_update()\n\n    def handle_pull(self, conn, addr):\n        \"\"\"Handles parameter requests coming from the workers. This will\n        actually send the model parameters to the requesting host.\n\n        # Arguments:\n            conn: socket. The opened connection.\n            addr: addr. Address of the remote host.\n        \"\"\"\n        # Fetch the raw center variables.\n        with self.mutex:\n            cv = copy.deepcopy(self.center_variable)\n        # Send the data over the socket.\n        send_data(conn, cv)\n\n    def finalize(self):\n        # Set the weights of the model.\n        self.model.set_weights(self.center_variable)\n"
  },
  {
    "path": "distkeras/predictors.py",
    "content": "\"\"\"Predictors take a model and will transform the Dataframe by adding a prediction column.\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport numpy as np\n\nfrom pyspark.mllib.linalg import DenseVector\n\nfrom distkeras.utils import serialize_keras_model\nfrom distkeras.utils import deserialize_keras_model\nfrom distkeras.utils import new_dataframe_row\n\n## END Imports. ################################################################\n\nclass Predictor(object):\n    \"\"\"Abstract predictor class.\n\n    # Arguments\n        keras_model: Keras Model.\n    \"\"\"\n\n    def __init__(self, keras_model):\n        self.model = serialize_keras_model(keras_model)\n\n    def predict(self, dataframe):\n        \"\"\"Transforms the dataframe to add a prediction.\n\n        # Arguments\n            dataframe: dataframe. Spark Dataframe.\n        \"\"\"\n        raise NotImplementedError\n\n\nclass ModelPredictor(Predictor):\n    \"\"\"Takes a Keras model and adds a prediction column to the dataframe\n       given a features column.\n\n    # Arguments\n        keras_model: Keras model.\n        features_col: string. Name of the features column.\n        output_col: string. Name of the prediction column.\n    \"\"\"\n\n    def __init__(self, keras_model, features_col=\"features\", output_col=\"prediction\"):\n        super(ModelPredictor, self).__init__(keras_model)\n        assert isinstance(features_col, (str, list)), \"'features_col' must be a string or a list of strings\"\n        self.features_column = [features_col] if isinstance(features_col, str) else features_col\n        self.output_column = output_col\n\n    def _predict(self, iterator):\n        \"\"\"Lambda method which will append a prediction column to the provided rows.\n\n        # Arguments:\n            iterator: iterator. 
Spark Row iterator.\n        \"\"\"\n        model = deserialize_keras_model(self.model)\n        for row in iterator:\n            features = [np.asarray([row[c]]) for c in self.features_column]\n            prediction = model.predict(features)\n            dense_prediction = DenseVector(prediction[0])\n            new_row = new_dataframe_row(row, self.output_column, dense_prediction)\n            yield new_row\n\n    def predict(self, dataframe):\n        \"\"\"Returns a dataframe which is the old dataframe with an additional\n        prediction column.\n        \"\"\"\n        return dataframe.rdd.mapPartitions(self._predict).toDF()\n"
  },
  {
    "path": "distkeras/schemes.py",
    "content": "\"\"\"Schemes module.\n\nModule with schemes to automatize a distributed learning process. These schemes will automatically\nadjust the hyperparameters to improve training performance.\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport math\n\n## END Imports. ################################################################\n\nclass Scheme(object):\n    \"\"\"A 'Scheme' is way to describe how a distributed optimization sequence\n    should perform. For example, it is responsible for adjusting the learning\n    rate of the parameter server if it notices that the loss doesn't decay.\n    However, this is only one of the possible solutions. Others include the\n    optimization of other hyperparameters such as the number of workers.\n\n    # Arguments\n        optimizer: trainer. A distributed optimizer.\n        num_epoch: int. Total number of epoch.\n        evaluation_frequency: int. Frequency of hyperparameter evaluation.\n    \"\"\"\n\n    def __init__(self, optimizer, num_epoch=15, evaluation_frequency=5):\n        self.optimizer = optimizer\n        self.num_epoch = num_epoch\n        self.evaluation_frequency = evaluation_frequency\n        self.epoch_over_eval_frequency = int(self.num_epoch / self.evaluation_frequency)\n        self.initialize()\n\n    def initialize(self):\n        \"\"\"Initializes the hyperparameters to follow the scheme parameters.\"\"\"\n        self.optimizer.set_num_epoch(self.get_epoch_over_evaluation_frequency())\n\n    def get_epoch_over_evaluation_frequency(self):\n        \"\"\"Returns the number of epochs per evaluation frequency.\"\"\"\n        return self.epoch_over_eval_frequency\n\n    def optimize(self, training_set, validation_set):\n        raise NotImplementedError\n\n\nclass Emperor(Scheme):\n    \"\"\"The 'Emporor' optimization schema will make hyperparameter changes based\n    on the loss derrivatives of the validation set.\n\n    # Arguments\n        
optimizer: trainer. A distributed optimizer.\n        evaluate_loss: function. Function which evaluates the loss. This\n                       function should accept a model, and a dataframe.\n        num_epoch: int. Total number of epoch.\n        evaluation_frequency: int. Frequency of hyperparameter evaluation.\n    \"\"\"\n\n    def __init__(self, optimizer, evaluate_loss, num_epoch=15, evaluation_frequency=5,\n                 loss_threshold=0.005):\n        super(Emperor, self).__init__(optimizer, num_epoch, evaluation_frequency)\n        self.previous_loss = float('inf')\n        self.loss_threshold = loss_threshold\n        self.evaluate_loss = evaluate_loss\n\n    def optimize(self, training_set, validation_set):\n        trained_model = None\n\n        # Fetch the number of evaluations, to match the number of epochs.\n        num_evaluations = self.get_epoch_over_evaluation_frequency() + 1\n        # Iterate over the number of evaluation epochs.\n        for i in range(0, num_evaluations):\n            # Train the model.\n            trained_model = self.optimizer.train(training_set)\n            self.optimizer.set_model(trained_model)\n            # Evaluate the training set, and fetch the loss.\n            loss = self.evaluate_loss(trained_model, validation_set)\n            print(\"Current loss: \" + str(loss))\n            dl = math.fabs(loss - self.previous_loss)\n            self.previous_loss = loss\n            if dl <= self.loss_threshold:\n                print(\"Lowering learning rate.\")\n                print(\"Old learning rate: \" + str(self.optimizer.get_learning_rate()))\n                # Modify the learning rate.\n                learning_rate = self.optimizer.get_learning_rate()\n                learning_rate /= 10\n                self.optimizer.set_learning_rate(learning_rate)\n                print(\"New learning rate: \"+ str(self.optimizer.get_learning_rate()))\n\n        return trained_model\n"
  },
  {
    "path": "distkeras/trainers.py",
    "content": "\"\"\"Model optimizers. Depending on the implementation, these classes will optimize the\nKeras model in a distributed manner (with exception of the SingleTrainer).\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport numpy as np\n\nimport threading\n\nimport time\n\nfrom distkeras.parameter_servers import ADAGParameterServer\nfrom distkeras.parameter_servers import DeltaParameterServer\nfrom distkeras.parameter_servers import DynSGDParameterServer\nfrom distkeras.parameter_servers import ExperimentalParameterServer\n\nfrom distkeras.utils import deserialize_keras_model\nfrom distkeras.utils import history_executor\nfrom distkeras.utils import history_executors_average\nfrom distkeras.utils import pickle_object\nfrom distkeras.utils import serialize_keras_model\nfrom distkeras.utils import set_keras_base_directory\nfrom distkeras.utils import unpickle_object\n\nfrom distkeras.networking import determine_host_address\n\nfrom distkeras.workers import ADAGWorker\nfrom distkeras.workers import AEASGDWorker\nfrom distkeras.workers import DOWNPOURWorker\nfrom distkeras.workers import DynSGDWorker\nfrom distkeras.workers import ExperimentalWorker\nfrom distkeras.workers import EAMSGDWorker\nfrom distkeras.workers import SequentialWorker\n\nfrom keras import backend as K\n\n## END Imports. ################################################################\n\nclass Trainer(object):\n    \"\"\"Abstract trainer class. This class provides all base functionality which\n    all optimizers need to implement.\n\n    # Arguments\n        keras_model: Keras model.\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        metrics: list of strings representing model evaluation metrics. 
Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, loss, worker_optimizer, metrics=[\"accuracy\"], loss_weights=None):\n        set_keras_base_directory()\n        self.master_model = serialize_keras_model(keras_model)\n        self.loss = loss\n        self.loss_weights = loss_weights\n        self.worker_optimizer = worker_optimizer\n        self.metrics = metrics\n        self.history = []\n        self.training_time_start = 0\n        self.training_time_end = 0\n        self.training_time = 0\n        self.max_mini_batches_prefetch = 100\n\n    def set_max_prefetch(self, max_mini_batches):\n        \"\"\"Sets the maximum amount of mini-batches that can be prefetched by a worker.\"\"\"\n        self.max_mini_batches_prefetch = max_mini_batches\n\n    def set_model(self, model):\n        \"\"\"Sets the master model to be used by the trainer.\"\"\"\n        self.master_model = serialize_keras_model(model)\n\n    def record_training_start(self):\n        \"\"\"Records the start of the training.\n\n        This private function is called when the training process starts.\n        \"\"\"\n        self.training_time = 0\n        self.training_time_start = time.time()\n\n    def record_training_end(self):\n        \"\"\"Records the end of the traing.\n\n        This private function is called when the training process is terminated.\n        \"\"\"\n        self.training_time_end = time.time()\n        self.training_time = self.training_time_end - self.training_time_start\n\n    def get_training_time(self):\n        \"\"\"Returns the told training time.\"\"\"\n        return self.training_time\n\n    def get_history(self):\n        \"\"\"Returns all history object aggregated during training.\"\"\"\n        return self.history\n\n    def get_averaged_history(self):\n        \"\"\"Returns the averaged history of 
the center variable.\"\"\"\n        return history_executors_average(self.history)\n\n    def get_executor_history(self, executor_id):\n        \"\"\"Returns the history of a specific executor.\"\"\"\n        return history_executor(self.history, executor_id)\n\n    def train(self, dataframe, shuffle=False):\n        \"\"\"Trains the specified model using the specified dataframe.\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe containing the training data.\n            shuffle: boolean. Tells to shuffle the dataframe before training.\n                     Warning: this will tell Spark to shuffle all partitions over\n                     the network. It is recommended to shuffle the dataframe before\n                     training and store it.\n        \"\"\"\n        raise NotImplementedError\n\n    def serialize(self):\n        return pickle_object(self)\n\n\nclass SingleTrainer(Trainer):\n    \"\"\"An optimizer which will train a network on a single machine.\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. 
Mini-batch size.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], features_col=\"features\",\n                 label_col=\"label\", num_epoch=1, batch_size=32, loss_weights=None):\n        super(SingleTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)\n        self.features_column = features_col\n        self.label_column = label_col\n        self.num_epoch = num_epoch\n        self.batch_size = batch_size\n\n    def allocate_worker(self):\n        \"\"\"Allocates a worker for the Single Trainer instance.\n\n        Only for internal use.\n        \"\"\"\n        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,\n                                  label_col=self.label_column, batch_size=self.batch_size, num_epoch = self.num_epoch,\n                                  optimizer=self.worker_optimizer, loss=self.loss, loss_weights=self.loss_weights, \n                                  metrics = self.metrics)\n\n        return worker\n\n    def train(self, dataframe, shuffle=False):\n        \"\"\"See distkeras.trainers.Trainer.train\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe containing the training data.\n            shuffle: boolean. Tells to shuffle the dataframe before training.\n                     Warning: this will tell Spark to shuffle all partitions over\n                     the network. 
It is recommended to shuffle the dataframe before\n                     training and store it.\n        \"\"\"\n        # Check if the data needs to be shuffled.\n        if shuffle:\n            # The `shuffle` argument shadows distkeras.utils.shuffle, so import it under an alias.\n            from distkeras.utils import shuffle as shuffle_dataframe\n            dataframe = shuffle_dataframe(dataframe)\n        # Collect the dataframe on a single worker node.\n        dataframe = dataframe.coalesce(1)\n        # Cache the dataframe.\n        dataframe.cache()\n        # Allocate a worker.\n        worker = self.allocate_worker()\n        # Set the maximum number of mini-batches.\n        worker.set_max_prefetch(self.max_mini_batches_prefetch)\n        # Start recording training time.\n        self.record_training_start()\n        # Fetch the trained model.\n        self.master_model = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()[0]\n        # Stop recording of training time.\n        self.record_training_end()\n\n        return deserialize_keras_model(self.master_model)\n\n\nclass AveragingTrainer(Trainer):\n    \"\"\"A trainer which implements a data parallel technique using model averaging.\n\n    In this implementation, the model replicas are averaged after every epoch.\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. 
Number of model replicas to train in parallel.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], features_col=\"features\",\n                 label_col=\"label\", num_epoch=1, batch_size=32, num_workers=2, loss_weights=None):\n        super(AveragingTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)\n        self.features_column = features_col\n        self.label_column = label_col\n        self.num_epoch = num_epoch\n        self.batch_size = batch_size\n        self.num_workers = num_workers\n        self.parameter_buffer = np.asarray(keras_model.get_weights())\n        self.parameter_buffer.fill(0.0)\n\n    def average_models(self, models):\n        \"\"\"Averages the specified list of Keras models, and assigns the\n        averaged model as the master model.\n\n        # Arguments:\n            models: list. A list of serialized Keras models.\n        \"\"\"\n        num_models = len(models)\n        # Get all weights of the models.\n        for i in range(0, num_models):\n            weights = np.asarray(deserialize_keras_model(models[i]).get_weights())\n            self.parameter_buffer += weights\n        # Average the parameters.\n        self.parameter_buffer /= num_models\n        temp_model = deserialize_keras_model(self.master_model)\n        temp_model.set_weights(self.parameter_buffer)\n        self.master_model = serialize_keras_model(temp_model)\n\n\n    def allocate_worker(self):\n        \"\"\"Allocates the AveragingWorker for internal use.\"\"\"\n        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,\n                                  label_col=self.label_column, batch_size=self.batch_size, num_epoch = 1,\n                                  optimizer=self.worker_optimizer, loss=self.loss, loss_weights=self.loss_weights, metrics = 
self.metrics)\n\n        return worker\n\n    def train(self, dataframe, shuffle=False):\n        \"\"\"Applies model averaging to the model replicas distributed over the specified\n        number of Spark executors.\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe containing the training data.\n            shuffle: boolean. Whether to shuffle the dataframe before training.\n                     Warning: this will tell Spark to shuffle all partitions over\n                     the network. It is recommended to shuffle the dataframe before\n                     training and store it.\n        \"\"\"\n        # Repartition the data in order to fit the number of workers.\n        num_partitions = dataframe.rdd.getNumPartitions()\n        # Check if the dataframe needs to be shuffled.\n        if shuffle:\n            # The boolean argument shadows the shuffle utility, so import it under an alias.\n            from distkeras.utils import shuffle as shuffle_dataframe\n            dataframe = shuffle_dataframe(dataframe)\n        # Check if we need to repartition the dataframe.\n        if num_partitions >= self.num_workers:\n            dataframe = dataframe.coalesce(self.num_workers)\n        else:\n            dataframe = dataframe.repartition(self.num_workers)\n        # Start the training procedure.\n        self.record_training_start()\n        for i in range(0, self.num_epoch):\n            worker = self.allocate_worker()\n            # Set the maximum number of mini-batches.\n            worker.set_max_prefetch(self.max_mini_batches_prefetch)\n            models = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()\n            self.average_models(models)\n        # End the training procedure.\n        self.record_training_end()\n\n        return deserialize_keras_model(self.master_model)\n\n\nclass EnsembleTrainer(Trainer):\n    \"\"\"Utility trainer which will train ensemble methods in parallel.\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. 
String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        batch_size: int. Mini-batch size.\n        num_ensembles: int. Number of ensembles to train.\n        loss_weights: optional list or dict specifying weights for different losses.\n\n    # Note\n        This will not employ a data-parallel approach for the ensembles.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], features_col=\"features\",\n                 label_col=\"label\", batch_size=32, num_ensembles=2, loss_weights=None):\n        super(EnsembleTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)\n        self.features_column = features_col\n        self.label_column = label_col\n        self.batch_size = batch_size\n        self.num_ensembles = num_ensembles\n        # Every ensemble model is trained for a single epoch by its worker.\n        self.num_epoch = 1\n\n    def allocate_worker(self):\n        \"\"\"Allocates the EnsembleWorker for internal use.\"\"\"\n        worker = SequentialWorker(model=self.master_model, features_col=self.features_column,\n                                  label_col=self.label_column, batch_size=self.batch_size, num_epoch=self.num_epoch,\n                                  optimizer=self.worker_optimizer, loss=self.loss, loss_weights=self.loss_weights, metrics=self.metrics)\n\n        return worker\n\n    def train(self, dataframe, shuffle=False):\n        \"\"\"Trains the specified number of ensemble models using the specified dataframe.\n\n        # Arguments\n            dataframe: dataframe. 
A Spark Dataframe containing the training data.\n            shuffle: boolean. Whether to shuffle the dataframe before training.\n                     Warning: this will tell Spark to shuffle all partitions over\n                     the network. It is recommended to shuffle the dataframe before\n                     training and store it.\n        \"\"\"\n        # Allocate a worker.\n        worker = self.allocate_worker()\n        # Set the maximum number of mini-batches.\n        worker.set_max_prefetch(self.max_mini_batches_prefetch)\n        # Repartition in order to fit the number of ensemble models.\n        num_partitions = dataframe.rdd.getNumPartitions()\n        # Check if the dataframe needs to be shuffled before training.\n        if shuffle:\n            # The boolean argument shadows the shuffle utility, so import it under an alias.\n            from distkeras.utils import shuffle as shuffle_dataframe\n            dataframe = shuffle_dataframe(dataframe)\n        # Check if we need to repartition the dataframe.\n        if num_partitions >= self.num_ensembles:\n            dataframe = dataframe.coalesce(self.num_ensembles)\n        else:\n            dataframe = dataframe.repartition(self.num_ensembles)\n        # Start the training procedure.\n        self.record_training_start()\n        # Train the models in parallel.\n        models = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()\n        # End the training procedure.\n        self.record_training_end()\n\n        return models\n\n\nclass DistributedTrainer(Trainer):\n    \"\"\"Abstract class which describes the properties of a distributed optimizer.\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. 
Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        master_port: int. port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, master_port=5000, loss_weights=None):\n        super(DistributedTrainer, self).__init__(keras_model, loss, worker_optimizer, metrics, loss_weights)\n        self.num_workers = num_workers\n        self.batch_size = batch_size\n        self.features_column = features_col\n        self.label_column = label_col\n        self.num_epoch = num_epoch\n        self.parameter_server = None\n        self.parameter_server_thread = None\n        self.master_host = determine_host_address()\n        self.master_port = master_port\n        self.learning_rate = 1.0\n\n    def set_minibatch_size(self, size):\n        \"\"\"Sets the size of the mini-batch.\"\"\"\n        self.batch_size = size\n\n    def get_minibatch_size(self):\n        \"\"\"Returns the size of the mini-batch.\"\"\"\n        return self.batch_size\n\n    def get_features_column(self):\n        \"\"\"Returns the name of the features column.\"\"\"\n        return self.features_column\n\n    def get_label_column(self):\n        \"\"\"Returns the name of the label column.\"\"\"\n        return self.label_column\n\n    def get_learning_rate(self):\n        \"\"\"Returns the learning rate of the worker which can be tuned by\n        the parameter server, or optimization scheme.\n\n        Note: this learning rate is independent of the learning rate of the optimizer.\n        \"\"\"\n        return 
self.learning_rate\n\n    def set_learning_rate(self, learning_rate):\n        \"\"\"Sets the learning rate which can be tuned by the parameter server,\n        or optimization scheme.\n\n        Note: this learning rate is independent of the learning rate of the optimizer.\n        \"\"\"\n        self.learning_rate = learning_rate\n\n    def set_num_epoch(self, num_epoch):\n        \"\"\"Sets the number of epochs.\"\"\"\n        self.num_epoch = num_epoch\n\n    def get_num_epoch(self):\n        \"\"\"Returns the number of epochs.\"\"\"\n        return self.num_epoch\n\n    def allocate_worker(self):\n        \"\"\"Allocates the worker implementation.\n\n        Implement this method in subclasses.\n        \"\"\"\n        raise NotImplementedError\n\n    def set_master(self, master):\n        \"\"\"Sets the master address of the parameter server.\"\"\"\n        self.master_host = master\n\n    def determine_new_master(self):\n        \"\"\"Sets the new master address to the current host.\"\"\"\n        self.master_host = determine_host_address()\n\n    def allocate_parameter_server(self):\n        \"\"\"Allocates the parameter server.\n\n        If another type of parameter server is required, you can override\n        this implementation.\n        \"\"\"\n        parameter_server = DeltaParameterServer(self.master_model, self.master_port)\n\n        return parameter_server\n\n    def set_num_workers(self, num_workers):\n        \"\"\"Sets the number of parallel workers to use.\"\"\"\n        self.num_workers = num_workers\n\n    def get_num_workers(self):\n        \"\"\"Returns the number of parallel workers.\"\"\"\n        return self.num_workers\n\n    def num_updates(self):\n        \"\"\"Returns the number of model updates the parameter server performed.\"\"\"\n        return self.parameter_server.num_updates()\n\n    def service(self):\n        \"\"\"Executes the parameter server service.\"\"\"\n        self.parameter_server.start()\n        
self.parameter_server.initialize()\n        self.parameter_server.run()\n\n    def stop_service(self):\n        \"\"\"Stops the parameter server service.\"\"\"\n        self.parameter_server.stop()\n        self.parameter_server_thread.join()\n        self.parameter_server_thread = None\n\n    def start_service(self):\n        \"\"\"Starts the parameter server service.\"\"\"\n        # Check if a parameter server thread is already allocated.\n        if self.parameter_server_thread is not None:\n            # Stop the parameter server service.\n            self.stop_service()\n        # Allocate a new parameter server thread.\n        self.parameter_server_thread = threading.Thread(target=self.service)\n        self.parameter_server_thread.start()\n\n    def train(self, dataframe, shuffle=False):\n        \"\"\"Training procedure of a distributed optimization process.\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe containing the training data.\n            shuffle: boolean. Whether to shuffle the dataframe before training.\n                     Warning: this will tell Spark to shuffle all partitions over\n                     the network. 
It is recommended to shuffle the dataframe before\n                     training and store it.\n        \"\"\"\n        # Check if a parameter server has been allocated.\n        if self.parameter_server is not None:\n            # Clean up the old parameter server.\n            self.parameter_server.stop()\n            self.parameter_server = None\n        # Allocate the parameter server.\n        self.parameter_server = self.allocate_parameter_server()\n        # Start the communication service.\n        self.start_service()\n        # Allocate a worker.\n        worker = self.allocate_worker()\n        # Set the maximum number of mini-batches.\n        worker.set_max_prefetch(self.max_mini_batches_prefetch)\n        # Repartition in order to fit the number of workers.\n        num_partitions = dataframe.rdd.getNumPartitions()\n        # Check if the dataframe needs to be shuffled before training.\n        if shuffle:\n            # The boolean argument shadows the shuffle utility, so import it under an alias.\n            from distkeras.utils import shuffle as shuffle_dataframe\n            dataframe = shuffle_dataframe(dataframe)\n        # Check if we need to repartition the dataframe.\n        if num_partitions >= self.num_workers:\n            dataframe = dataframe.coalesce(self.num_workers)\n        else:\n            dataframe = dataframe.repartition(self.num_workers)\n        # Cache the dataframe.\n        dataframe.cache()\n        # Start the training procedure.\n        self.record_training_start()\n        # Iterate through the epochs.\n        self.history = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()\n        # End the training procedure.\n        self.record_training_end()\n        # Stop the communication service.\n        self.stop_service()\n\n        return self.parameter_server.get_model()\n\n\nclass AsynchronousDistributedTrainer(DistributedTrainer):\n    \"\"\"Abstract class for an asynchronous distributed trainer.\n\n    This trainer also allows us to set a parallelism factor. This parallelism factor allows\n    us to further parallelize the Spark job. 
For example, imagine n machines optimizing\n    a model in an asynchronous distributed setting. If, for whatever reason, some machines\n    perform worse than others, the complete learning procedure will be held back by those\n    machines, since every machine is assigned a single partition.\n    To resolve this, we added a parallelization factor, which indicates the number of\n    jobs per machine (executor). For small dataframes, we recommend setting this factor\n    to 1. The effect is most prominent when the dataframe is large; in that case\n    we recommend a factor of 2 or 3.\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        master_port: int. 
port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n\n    # Note\n        By default, the parallelization factor is set to 1.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, master_port=5000, loss_weights=None):\n        super(AsynchronousDistributedTrainer, self).__init__(keras_model, worker_optimizer, loss, metrics,\n                                                             num_workers, batch_size, features_col,\n                                                             label_col, num_epoch, master_port, loss_weights)\n        # Initialize asynchronous method variables.\n        self.parallelism_factor = 1\n\n    def allocate_worker(self):\n        \"\"\"Allocates the worker implementation.\n\n        Implement this method in subclasses.\n        \"\"\"\n        raise NotImplementedError\n\n    def set_parallelism_factor(self, factor):\n        \"\"\"Sets the parallelization factor.\n\n        # Arguments\n            factor: int. The new parallelization factor.\n        \"\"\"\n        self.parallelism_factor = factor\n\n    def get_parallelism_factor(self):\n        \"\"\"Returns the parallelization factor.\"\"\"\n        return self.parallelism_factor\n\n    def train(self, dataframe, shuffle=False):\n        \"\"\"Training procedure of an asynchronous distributed optimization process.\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe containing the training data.\n            shuffle: boolean. Whether to shuffle the dataframe before training.\n                     Warning: this will tell Spark to shuffle all partitions over\n                     the network. 
It is recommended to shuffle the dataframe before\n                     training and store it.\n        \"\"\"\n        # Check if a parameter server has been allocated.\n        if self.parameter_server is not None:\n            # Clean up the old parameter server.\n            self.parameter_server.stop()\n            self.parameter_server = None\n        # Allocate the parameter server.\n        self.parameter_server = self.allocate_parameter_server()\n        # Start the communication service.\n        self.start_service()\n        # Allocate a worker.\n        worker = self.allocate_worker()\n        # Set the maximum number of mini-batches.\n        worker.set_max_prefetch(self.max_mini_batches_prefetch)\n        # Repartition in order to fit the number of workers.\n        num_partitions = dataframe.rdd.getNumPartitions()\n        # Check if the dataframe needs to be shuffled before training.\n        if shuffle:\n            # The boolean argument shadows the shuffle utility, so import it under an alias.\n            from distkeras.utils import shuffle as shuffle_dataframe\n            dataframe = shuffle_dataframe(dataframe)\n        # Indicate the parallelism (number of workers times parallelism factor).\n        parallelism = self.parallelism_factor * self.num_workers\n        # Check if we need to repartition the dataframe.\n        if num_partitions >= parallelism:\n            dataframe = dataframe.coalesce(parallelism)\n        else:\n            dataframe = dataframe.repartition(parallelism)\n        # Start the training procedure.\n        self.record_training_start()\n        # Iterate through the epochs.\n        self.history = dataframe.rdd.mapPartitionsWithIndex(worker.train).collect()\n        # End the training procedure.\n        self.record_training_end()\n        # Stop the communication service.\n        self.stop_service()\n\n        return self.parameter_server.get_model()\n\n\nclass AEASGD(AsynchronousDistributedTrainer):\n    \"\"\"Asynchronous Elastic Averaging SGD optimizer.\n\n    Introduced by Zhang et al.\n    https://arxiv.org/pdf/1412.6651.pdf\n\n    # Arguments\n        keras_model: model. 
Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        communication_window: int. Staleness parameter.\n                              This parameter describes the number of mini-batches that will be\n                              computed before updating the center variable. For EASGD based\n                              algorithms we recommend large communication windows.\n        learning_rate: float. Learning rate.\n        rho: float. Elastic \"exploration\" variable.\n                    Higher values mean that the model is allowed to \"explore\" its surroundings.\n                    Smaller values are correlated with less exploration. We use the value\n                    recommended by the authors.\n        master_port: int. 
port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=32,\n                 rho=5.0, learning_rate=0.1, master_port=5000, loss_weights=None):\n        super(AEASGD, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,\n                                     batch_size, features_col, label_col, num_epoch, master_port, loss_weights)\n        self.communication_window = communication_window\n        self.rho = rho\n        self.learning_rate = learning_rate\n\n    def allocate_worker(self):\n        \"\"\"Allocates the asynchronous EASGD worker.\"\"\"\n        # Allocate an AEASGD worker.\n        worker = AEASGDWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,\n                              self.features_column, self.label_column, self.batch_size, self.num_epoch,\n                              self.master_host, self.master_port, self.rho, self.learning_rate,\n                              self.communication_window)\n\n        return worker\n\n\nclass DOWNPOUR(AsynchronousDistributedTrainer):\n    \"\"\"DOWNPOUR Optimizer.\n\n    Asynchronous data-parallel optimizer introduced by Dean et al.\n    http://static.googleusercontent.com/media/research.google.com/en/archive/large_deep_networks_nips2012.pdf\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. 
Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        communication_window: int. Staleness parameter.\n                              This parameter describes the number of mini-batches that will be\n                              computed before updating the center variable. For DOWNPOUR we\n                              recommend small communication windows.\n        learning_rate: float. Learning rate.\n        master_port: int. port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=5, master_port=5000, loss_weights=None):\n        super(DOWNPOUR, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,\n                                       batch_size, features_col, label_col, num_epoch, master_port, loss_weights)\n        self.communication_window = communication_window\n\n    def allocate_worker(self):\n        \"\"\"Allocates the DOWNPOUR worker.\"\"\"\n        # Allocate DOWNPOUR worker.\n        worker = DOWNPOURWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,\n                                self.features_column, self.label_column, self.batch_size, self.num_epoch,\n                                self.master_host, self.master_port, self.communication_window)\n\n        return worker\n\n\nclass EAMSGD(AsynchronousDistributedTrainer):\n    \"\"\"Asynchronous Elastic 
Averaging w/ Momentum SGD optimizer.\n\n    Introduced by Zhang et al.\n    https://arxiv.org/pdf/1412.6651.pdf\n\n    # Arguments\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See https://keras.io/optimizers/\n        loss: string. String representing the loss.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        communication_window: int. Staleness parameter.\n                              This parameter describes the number of mini-batches that will be\n                              computed before updating the center variable. For EASGD based\n                              algorithms we recommend large communication windows.\n        learning_rate: float. Learning rate.\n        rho: float. Elastic \"exploration\" variable.\n                    Higher values mean that the model is allowed to \"explore\" its surroundings.\n                    Smaller values are correlated with less exploration. We use the value\n                    recommended by the authors.\n        momentum: float. Momentum term.\n        master_port: int. 
port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=32,\n                 rho=5.0, learning_rate=0.1, momentum=0.9, master_port=5000, loss_weights=None):\n        super(EAMSGD, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,\n                                     batch_size, features_col, label_col, num_epoch, master_port, loss_weights)\n        self.communication_window = communication_window\n        self.rho = rho\n        self.learning_rate = learning_rate\n        self.momentum = momentum\n\n    def allocate_worker(self):\n        \"\"\"Allocates the asynchronous EAMSGD worker.\"\"\"\n        # Allocate an EAMSGD worker.\n        worker = EAMSGDWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,\n                              self.features_column, self.label_column, self.batch_size, self.num_epoch,\n                              self.master_host, self.master_port, self.rho, self.learning_rate,\n                              self.momentum, self.communication_window)\n\n        return worker\n\n\nclass ADAG(AsynchronousDistributedTrainer):\n    \"\"\"Asynchronous Distributed Adaptive Gradient (Stochastic Gradient Descent).\n\n    Introduced by Hermans et al.\n\n    # Arguments:\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See: https://keras.io/optimizers/\n        loss: string. String representing the loss function.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. 
Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        communication_window: int. Staleness parameter.\n                              This parameter describes the number of mini-batches that will be\n                              computed before updating the center variable. For DOWNPOUR-based\n                              algorithms we recommend large communication windows.\n        master_port: int. port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=12, master_port=5000, loss_weights=None):\n        # Initialize the parent object.\n        super(ADAG, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,\n                                   batch_size, features_col, label_col, num_epoch, master_port, loss_weights)\n        # Set algorithm parameters.\n        self.communication_window = communication_window\n\n    def allocate_worker(self):\n        \"\"\"Allocates an ADAG worker.\"\"\"\n        worker = ADAGWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,\n                            self.features_column, self.label_column, self.batch_size, self.num_epoch,\n                            self.master_host, self.master_port, self.communication_window)\n\n        return worker\n\n    def allocate_parameter_server(self):\n        \"\"\"Allocates the ADAG parameter server.\"\"\"\n        parameter_server = ADAGParameterServer(self.master_model, 
self.master_port)\n\n        return parameter_server\n\n\nclass DynSGD(AsynchronousDistributedTrainer):\n    \"\"\"Dynamic SGD, which dynamically maintains a learning rate for every worker\n    and incorporates staleness.\n\n    Introduced in SIGMOD 2017 \"Heterogeneity-aware Parameter Servers\"\n    http://net.pku.edu.cn/~cuibin/Papers/2017SIGMOD.pdf\n\n    # Arguments:\n        keras_model: model. Keras model to train.\n        worker_optimizer: string. String representing worker optimizer.\n                          See: https://keras.io/optimizers/\n        loss: string. String representing the loss function.\n              See: https://keras.io/objectives/\n        metrics: list of strings representing model evaluation metrics. Default is [\"accuracy\"].\n                 See: https://keras.io/metrics/\n        features_col: string or list of strings. Name(s) of the features column(s).\n        label_col: string or list of strings. Name(s) of the label column(s).\n        num_epoch: int. Number of epochs.\n        batch_size: int. Mini-batch size.\n        num_workers: int. Number of distributed workers.\n        communication_window: int. Staleness parameter.\n                              This parameter describes the number of mini-batches that will be\n                              computed before updating the center variable. For DOWNPOUR-based\n                              algorithms we recommend large communication windows.\n        master_port: int. 
port number for the parameter server.\n        loss_weights: optional list or dict specifying weights for different losses.\n    \"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=5, master_port=5000, loss_weights=None):\n        # Initialize the parent object.\n        super(DynSGD, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,\n                                     batch_size, features_col, label_col, num_epoch, master_port, loss_weights)\n        # Set algorithm parameters.\n        self.communication_window = communication_window\n\n    def allocate_worker(self):\n        \"\"\"Allocate DYNSGD worker.\"\"\"\n        worker = DynSGDWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,\n                              self.features_column, self.label_column, self.batch_size, self.num_epoch,\n                              self.master_host, self.master_port, self.communication_window)\n\n        return worker\n\n    def allocate_parameter_server(self):\n        \"\"\"Allocate DYNSGD parameter server.\"\"\"\n        parameter_server = DynSGDParameterServer(self.master_model, self.master_port)\n\n        return parameter_server\n\n\nclass Experimental(AsynchronousDistributedTrainer):\n    \"\"\"Experimental optimization scheme for development purposes.\"\"\"\n\n    def __init__(self, keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                 features_col=\"features\", label_col=\"label\", num_epoch=1, communication_window=5,\n                 learning_rate=1.0, master_port=5000, loss_weights=None):\n        # Initialize the parent object.\n        super(Experimental, self).__init__(keras_model, worker_optimizer, loss, metrics, num_workers,\n                                           
batch_size, features_col, label_col, num_epoch, master_port, loss_weights)\n        # Set the algorithm parameters.\n        self.communication_window = communication_window\n        self.learning_rate = learning_rate\n\n    def allocate_worker(self):\n        \"\"\"Allocate experimental worker.\"\"\"\n        worker = ExperimentalWorker(self.master_model, self.worker_optimizer, self.loss, self.loss_weights, self.metrics,\n                                    self.features_column, self.label_column, self.batch_size, self.num_epoch,\n                                    self.master_host, self.master_port, self.communication_window,\n                                    self.num_workers, self.learning_rate)\n\n        return worker\n\n    def allocate_parameter_server(self):\n        \"\"\"Allocate experimental parameter server.\"\"\"\n        parameter_server = ExperimentalParameterServer(self.master_model, self.master_port, self.learning_rate)\n\n        return parameter_server\n"
  },
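The DynSGD docstring above describes staleness-aware per-worker learning rates. As a rough illustration of the idea (a toy numpy sketch, not the dist-keras parameter-server code; the function name `dynsgd_apply` and the damping factor `1 / (staleness + 1)` are illustrative assumptions following the paper's general scheme):

```python
import numpy as np

def dynsgd_apply(center, delta, staleness):
    """Toy DynSGD-style update: scale a worker's delta by 1 / (staleness + 1),
    so contributions computed against an older center variable move it less."""
    lr = 1.0 / (staleness + 1)
    return center + lr * delta

# A fresh worker (staleness 0) applies its delta fully...
center = np.zeros(3)
center = dynsgd_apply(center, np.array([1.0, 1.0, 1.0]), staleness=0)
# ...while a stale worker's identical delta is damped.
center = dynsgd_apply(center, np.array([1.0, 1.0, 1.0]), staleness=3)
```

The actual staleness bookkeeping lives in the DynSGD parameter server; this only shows why stale deltas should carry less weight.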
  {
    "path": "distkeras/transformers.py",
    "content": "\"\"\"Commonly used Dataframe transformers.\n\nA transformer will \"transform\" a Spark dataframe from one form into\nthe other. For example, mapping the column to an other value, or adding\na column to a dataframe based on a collection of specified values.\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport numpy as np\n\nfrom distkeras.utils import new_dataframe_row\nfrom distkeras.utils import to_one_hot_encoded_dense\n\nfrom pyspark.mllib.linalg import DenseMatrix\nfrom pyspark.mllib.linalg import DenseVector\n\nfrom pyspark.sql.functions import mean\nfrom pyspark.sql.functions import stddev_pop\n\n## END Imports. ################################################################\n\nclass Transformer(object):\n    \"\"\"Interface which defines a transformer object.\"\"\"\n\n    def transform(self, dataframe):\n        \"\"\"Transforms the dataframe into an other dataframe.\n\n        # Returns\n            The transformed dataframe.\n        \"\"\"\n        raise NotImplementedError\n\n\nclass MinMaxTransformer(Transformer):\n    \"\"\"Will transform every feature of an instance between a specified range.\n\n    # Arguments\n        o_min: float. Original minimum of dataset.\n        o_max: float. Original maximum of dataset.\n        n_min: float. New minimum of dataset.\n        n_max: float. New maximum of dataset.\n        input_col: string. Name of input column.\n        output_col: string. Name of output column.\n        is_vector. boolean. 
Indicates if the data element is a vector or\n                            a singular value.\n\n    # Summary\n        Old range: [o_min; o_max]\n        New range: [n_min; n_max]\n    \"\"\"\n\n    def __init__(self, o_min, o_max, n_min, n_max, input_col, output_col, is_vector=True):\n        self.o_min = float(o_min)\n        self.o_max = float(o_max)\n        self.n_min = float(n_min)\n        self.n_max = float(n_max)\n        self.scale = (self.n_max - self.n_min) / (self.o_max - self.o_min)\n        self.input_column = input_col\n        self.output_column = output_col\n        self.is_vector = is_vector\n\n    def _transform(self, row):\n        \"\"\"Rescale every instance like this:\n\n        x' = \\frac{n_max - n_min}{o_max - o_min} (x - o_max) + n_max\n        \"\"\"\n        if self.is_vector:\n            vector = row[self.input_column].toArray()\n            vector = self.scale * (vector - self.o_max) + self.n_max\n            new_value = DenseVector(vector)\n        else:\n            value = row[self.input_column]\n            new_value = self.scale * (value - self.o_max) + self.n_max\n        # Construct a new row with the normalized vector.\n        new_row = new_dataframe_row(row, self.output_column, new_value)\n\n        return new_row\n\n    def transform(self, dataframe):\n        \"\"\"Applies the min-max transformation to every row in the dataframe.\n\n        # Arguments\n            dataframe: dataframe. Spark Dataframe.\n        \"\"\"\n        return dataframe.rdd.map(self._transform).toDF()\n\n\nclass BinaryLabelTransformer(Transformer):\n    \"\"\"Transforms the specified column to a binary label, i.e., [0, 1], given\n    a specific label name. Given the specified label, this transformer will generate\n    [1, 0]; in the other case, [0, 1].\n\n    # Arguments:\n        input_column: string. Column name of the label identifier.\n        output_column: string. Name of the new label which contains the binary label.\n        label: string. 
Name of the label which needs to serve as 1.\n    \"\"\"\n\n    def __init__(self, input_column, output_column, label):\n        self.input_column = input_column\n        self.output_column = output_column\n        self.label = label\n\n    def _transform(self, row):\n        \"\"\"Appends the desired binary label column.\"\"\"\n        value = row[self.input_column]\n        vector = np.zeros(2)\n        # Check if the name matches.\n        if value == self.label:\n            vector[0] = 1.0\n        else:\n            vector[1] = 1.0\n        # Convert to a Spark DenseVector.\n        vector = DenseVector(vector)\n\n        return new_dataframe_row(row, self.output_column, vector)\n\n    def transform(self, dataframe):\n        \"\"\"Applies the binary label transformation to the specified dataframe.\n\n        # Arguments\n            dataframe: dataframe. Spark Dataframe.\n        \"\"\"\n        return dataframe.rdd.map(self._transform).toDF()\n\n\nclass StandardTransformer(Transformer):\n    \"\"\"Transforms the specified columns to unit standard deviation and centers\n    the data to mean 0.\n\n    # Arguments\n        columns: list. List of columns.\n        suffix: string. 
Suffix name of the column after processing.\n    # Note\n        We assume equal probability of the rows.\n    \"\"\"\n\n    def __init__(self, columns, suffix=\"_normalized\"):\n        self.columns = columns\n        self.column_suffix = suffix\n        self.current_column = None\n        self.means = {}\n        self.stddevs = {}\n\n    def clean_mean_keys(self, means):\n        \"\"\"Strips the aggregation wrapper from the mean column names.\"\"\"\n        new_means = {}\n\n        for k in means:\n            # \"avg(column)\" -> \"column\"\n            new_means[k[4:-1]] = means[k]\n\n        return new_means\n\n    def clean_stddev_keys(self, stddevs):\n        \"\"\"Strips the aggregation wrapper from the stddev column names.\"\"\"\n        new_stddevs = {}\n\n        for k in stddevs:\n            # \"stddev_pop(column)\" -> \"column\"\n            new_stddevs[k[11:-1]] = stddevs[k]\n\n        return new_stddevs\n\n    def _transform(self, row):\n        \"\"\"Takes the current column, and normalizes it with the computed mean and stddev.\"\"\"\n        mean = self.means[self.current_column]\n        stddev = self.stddevs[self.current_column]\n        x = row[self.current_column]\n        x_normalized = (x - mean) / stddev\n        output_column = self.current_column + self.column_suffix\n        new_row = new_dataframe_row(row, output_column, x_normalized)\n\n        return new_row\n\n    def transform(self, dataframe):\n        \"\"\"Applies standardization to the specified columns.\n\n        # Arguments\n            dataframe: dataframe. 
Spark Dataframe.\n        \"\"\"\n        # Compute the means of the specified columns.\n        means = [mean(x) for x in self.columns]\n        means = dataframe.select(means).collect()[0].asDict()\n        self.means = self.clean_mean_keys(means)\n        # Compute the standard deviation of the specified columns.\n        stddevs = [stddev_pop(x) for x in self.columns]\n        stddevs = dataframe.select(stddevs).collect()[0].asDict()\n        self.stddevs = self.clean_stddev_keys(stddevs)\n        # For every feature, add a new column to the dataframe.\n        for column in self.columns:\n            self.current_column = column\n            dataframe = dataframe.rdd.map(self._transform).toDF()\n\n        return dataframe\n\n\nclass DenseTransformer(Transformer):\n    \"\"\"Transforms sparse vectors into dense vectors.\n\n    # Arguments\n        input_col: string. Name of the input column of the sparse vector.\n        output_col: string. Name of the output column.\n    \"\"\"\n\n    def __init__(self, input_col, output_col):\n        self.input_column = input_col\n        self.output_column = output_col\n\n    def _transform(self, row):\n        \"\"\"Transforms the sparse vector to a dense vector while putting it in a new column.\"\"\"\n        sparse_vector = row[self.input_column]\n        dense_vector = DenseVector(sparse_vector.toArray())\n        new_row = new_dataframe_row(row, self.output_column, dense_vector)\n\n        return new_row\n\n    def transform(self, dataframe):\n        \"\"\"Transforms every sparse vector in the input column to a dense vector.\n\n        # Arguments\n            dataframe: dataframe. 
Spark Dataframe.\n        # Returns\n            A transformed Spark Dataframe.\n        \"\"\"\n        return dataframe.rdd.map(self._transform).toDF()\n\n\nclass ReshapeTransformer(Transformer):\n    \"\"\"Transforms vectors into other dense shapes.\n\n    # Note:\n        Only use this transformer in the last stage of the processing pipeline,\n        since the arbitrary vector shapes will be passed on directly to the models.\n\n    # Arguments:\n        input_col: string. Name of the input column containing the vector.\n        output_col: string. Name of the output column.\n        shape: tuple. Shape of the matrix.\n    \"\"\"\n\n    def __init__(self, input_col, output_col, shape):\n        self.input_column = input_col\n        self.output_column = output_col\n        self.shape = shape\n\n    def _transform(self, row):\n        \"\"\"Transforms the vector to a dense matrix while putting it in a new column.\"\"\"\n        vector = row[self.input_column]\n        vector = np.asarray(vector)\n        reshaped = vector.reshape(self.shape).tolist()\n        new_row = new_dataframe_row(row, self.output_column, reshaped)\n\n        return new_row\n\n    def transform(self, dataframe):\n        \"\"\"Transforms every vector in the input column to the specified dense shape.\n\n        # Arguments\n            dataframe: dataframe. Spark Dataframe.\n        # Returns\n            A transformed Spark Dataframe.\n        \"\"\"\n        return dataframe.rdd.map(self._transform).toDF()\n\n\nclass OneHotTransformer(Transformer):\n    \"\"\"Transformer which transforms an integer index into a vector using one-hot-encoding.\n\n    # Arguments\n        output_dim: int. Dimension of output vector.\n        input_col: string. Name of input column.\n        output_col: string. 
Name of output column.\n    \"\"\"\n\n    def __init__(self, output_dim, input_col, output_col):\n        self.input_column = input_col\n        self.output_column = output_col\n        self.output_dimensionality = output_dim\n\n    def _transform(self, row):\n        \"\"\"Transforms every individual row.\n\n        Only for internal use.\n        \"\"\"\n        label = row[self.input_column]\n        vector = to_one_hot_encoded_dense(label, self.output_dimensionality)\n        new_row = new_dataframe_row(row, self.output_column, vector.tolist())\n\n        return new_row\n\n    def transform(self, dataframe):\n        \"\"\"Applies One-Hot encoding to every row in the dataframe.\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe.\n        # Returns\n            A Spark Dataframe with one-hot encoded features.\n        \"\"\"\n        return dataframe.rdd.map(self._transform).toDF()\n\n\nclass LabelIndexTransformer(Transformer):\n    \"\"\"Transformer which will transform a prediction vector into an integer label.\n\n    # Arguments\n        output_dim: int. Dimension of output vector.\n        input_col: string. Name of the input column.\n        output_col: string. Name of the output column.\n        default_index: int. Index to return when no prediction activation is positive.\n        activation_threshold: float. 
Threshold of immediate activation.\n    \"\"\"\n\n    def __init__(self, output_dim, input_col=\"prediction\", output_col=\"prediction_index\",\n                 default_index=0, activation_threshold=0.55):\n        self.input_column = input_col\n        self.output_column = output_col\n        self.output_dimensionality = output_dim\n        self.activation_threshold = activation_threshold\n        self.default_index = default_index\n\n    def get_index(self, vector):\n        \"\"\"Returns the first index that exceeds the activation threshold,\n        or otherwise the index with the highest value.\"\"\"\n        max_value = 0.0\n        max_index = self.default_index\n        for index in range(0, self.output_dimensionality):\n            if vector[index] >= self.activation_threshold:\n                return index\n            if vector[index] > max_value:\n                max_value = vector[index]\n                max_index = index\n\n        return max_index\n\n    def _transform(self, row):\n        \"\"\"Transforms every row by adding a \"predicted index\" column to the dataframe.\"\"\"\n        prediction = row[self.input_column]\n        index = float(self.get_index(prediction))\n        new_row = new_dataframe_row(row, self.output_column, index)\n\n        return new_row\n\n    def transform(self, dataframe):\n        \"\"\"Transforms the dataframe by adding a predicted index.\n\n        # Arguments\n            dataframe: dataframe. A Spark Dataframe.\n        # Returns\n            A Spark Dataframe with a \"predicted\" index.\n        \"\"\"\n        return dataframe.rdd.map(self._transform).toDF()\n"
  },
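All the transformers above are plain Row-mapping functions; the MinMaxTransformer, for instance, reduces to the arithmetic below (a self-contained numpy sketch with made-up ranges, not Spark code — it only mirrors the `scale * (x - o_max) + n_max` expression from `_transform`):

```python
import numpy as np

# Map values from the original range [o_min, o_max] to the new range
# [n_min, n_max], using the same (o_max -> n_max) anchor as MinMaxTransformer.
o_min, o_max, n_min, n_max = 0.0, 255.0, 0.0, 1.0
scale = (n_max - n_min) / (o_max - o_min)

x = np.array([0.0, 127.5, 255.0])
x_new = scale * (x - o_max) + n_max
```

Anchoring at the maximum is algebraically equivalent to the more common `scale * (x - o_min) + n_min`, since `scale * (o_max - o_min) = n_max - n_min`.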
  {
    "path": "distkeras/utils.py",
    "content": "\"\"\"Utility functions used throughout Distributed Keras.\"\"\"\n\n## BEGIN Import. ###############################################################\n\nfrom keras import backend as K\n\nfrom keras.models import model_from_json\n\nfrom keras import backend as K\n\nfrom pyspark.mllib.linalg import DenseVector\nfrom pyspark.sql import Row\nfrom pyspark.sql.functions import rand\n\nimport pickle\n\nimport json\n\nimport numpy as np\n\nimport os\n\nimport pwd\n\n## END Import. #################################################################\n\n\ndef get_os_username():\n    \"\"\"Returns the username of user on the operating system.\n\n    From: http://stackoverflow.com/questions/842059/is-there-a-portable-way-to-get-the-current-username-in-python\n    \"\"\"\n    return pwd.getpwuid(os.getuid())[0]\n\n\ndef set_keras_base_directory(base_dir='/tmp/' + get_os_username()):\n    \"\"\"Sets the base directory of Keras.\"\"\"\n    K._keras_base_dir = base_dir\n\n\ndef to_one_hot_encoded_dense(value, n_dim=2):\n    \"\"\"Converts the value to a one-hot encoded vector.\n\n    # Arguments\n        value: float. Value of the single \"hot\" value.\n        n_dim: int. 
Dimension of the output vector.\n    \"\"\"\n    value = int(value)\n    vector = np.zeros(n_dim)\n    vector[value] = 1.0\n\n    return vector\n\n\ndef new_dataframe_row(old_row, column_name, column_value):\n    \"\"\"Constructs a new Spark Row based on the old row, and a new column name and value.\"\"\"\n    row = Row(*(old_row.__fields__ + [column_name]))(*(old_row + (column_value, )))\n\n    return row\n\n\ndef json_to_dataframe_row(string):\n    \"\"\"Converts a JSON String to a Spark Dataframe row.\"\"\"\n    dictionary = json.loads(string)\n    row = Row(**dictionary)\n\n    return row\n\n\ndef pickle_object(o):\n    \"\"\"Pickles the specified object with the highest protocol available.\"\"\"\n    return pickle.dumps(o, -1)\n\n\ndef unpickle_object(string):\n    \"\"\"Unpickles the specified byte string into an object.\"\"\"\n    return pickle.loads(string)\n\n\ndef serialize_keras_model(model):\n    \"\"\"Serializes the specified Keras model into a dictionary.\"\"\"\n    dictionary = {}\n    dictionary['model'] = model.to_json()\n    dictionary['weights'] = model.get_weights()\n\n    return dictionary\n\n\ndef history_executors_average(history):\n    \"\"\"Returns the averaged training metrics for all the executors.\"\"\"\n    max_iteration = max(history, key=lambda x: x['iteration'])['iteration']\n    max_executor = max(history, key=lambda x: x['worker_id'])['worker_id']\n    histories = []\n    averaged_history = []\n    # Fetch the histories of the individual executors (worker ids start at 0).\n    for i in range(0, max_executor + 1):\n        histories.append(history_executor(history, i))\n    # Construct the averaged history.\n    for i in range(0, max_iteration):\n        num_executors = 0\n        total = np.zeros(2)\n        for j in range(0, max_executor + 1):\n            if len(histories[j]) - 1 >= i:\n                num_executors += 1\n                total += histories[j][i]['history']\n        # Average the history.\n        total /= num_executors\n        averaged_history.append(total)\n\n    return 
averaged_history\n\n\ndef history_executor(history, id):\n    \"\"\"Returns the history of a specific executor.\"\"\"\n    executor_history = [h for h in history if h['worker_id'] == id]\n    executor_history.sort(key=lambda x: x['iteration'])\n\n    return executor_history\n\n\ndef deserialize_keras_model(dictionary):\n    \"\"\"Deserializes the Keras model using the specified dictionary.\"\"\"\n    architecture = dictionary['model']\n    weights = dictionary['weights']\n    model = model_from_json(architecture)\n    model.set_weights(weights)\n\n    return model\n\n\ndef uniform_weights(model, constraints=[-0.5, 0.5]):\n    \"\"\"Initializes the parameters of the specified Keras model with uniform\n    weights within the specified range.\n\n    # Arguments\n        model: Keras model.\n        constraints: array. An array with two elements which defines the range\n                     of the uniform initialization.\n    \"\"\"\n    # We assume the following: Keras will return a list of weight matrices.\n    # All layers, even the activation layers, will be randomly initialized.\n    weights = model.get_weights()\n    for layer in weights:\n        # Fill every parameter array (matrix or vector) with uniform random numbers.\n        layer[...] = np.random.uniform(low=constraints[0], high=constraints[1], size=layer.shape)\n    # Set the new weights in the model.\n    model.set_weights(weights)\n\n\ndef shuffle(dataset):\n    \"\"\"Shuffles the rows in the specified Spark Dataframe.\n\n    # Arguments\n        dataset: dataframe. 
A Spark Dataframe.\n    \"\"\"\n    dataset = dataset.orderBy(rand())\n    dataset.cache()\n\n    return dataset\n\n\ndef precache(dataset, num_workers):\n    \"\"\"Precaches the specified dataset.\n\n    Make sure the specified dataframe has the desired partitioning scheme.\n\n    # Arguments\n        dataset: dataframe. A Spark Dataframe.\n        num_workers: int. Number of workers you are going to use.\n    \"\"\"\n    dataset = dataset.repartition(num_workers)\n    dataset.cache()\n    dataset.count()\n\n    return dataset\n"
  },
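`history_executors_average` consumes the dicts that `NetworkWorker.add_history` produces (keys `worker_id`, `iteration`, `history`). The averaging it performs can be checked standalone (a toy mirror with two hypothetical workers and fabricated `[loss, accuracy]` pairs, not the dist-keras function itself):

```python
import numpy as np

# Two workers (ids 0 and 1); iterations start at 1, as in NetworkWorker.
# Worker 1 stopped after its first iteration, so iteration 2 averages one entry.
history = [
    {'worker_id': 0, 'iteration': 1, 'history': np.array([0.9, 0.5])},
    {'worker_id': 1, 'iteration': 1, 'history': np.array([0.7, 0.7])},
    {'worker_id': 0, 'iteration': 2, 'history': np.array([0.5, 0.8])},
]

# Average the [loss, accuracy] pairs per iteration over the workers present.
averaged = []
for i in (1, 2):
    entries = [h['history'] for h in history if h['iteration'] == i]
    averaged.append(sum(entries) / len(entries))
```

The real function groups by `worker_id` first and divides by the number of executors that actually reached each iteration, which matches the per-iteration averaging shown here.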
  {
    "path": "distkeras/workers.py",
    "content": "\"\"\"Workers module.\n\nThis module contains all worker specific implementations for different optimization\nalgorithms.\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nfrom distkeras.networking import connect\nfrom distkeras.networking import recv_data\nfrom distkeras.networking import send_data\n\nfrom distkeras.utils import deserialize_keras_model\nfrom distkeras.utils import serialize_keras_model\nfrom distkeras.utils import set_keras_base_directory\nfrom distkeras.utils import shuffle\nfrom distkeras.utils import uniform_weights\n\nfrom keras.optimizers import Optimizer, serialize, deserialize\nimport keras.backend as K\n\nfrom itertools import tee\n\nfrom multiprocessing import Pool\n\nimport numpy as np\n\nimport threading\n\nimport tensorflow as tf\n\nimport sys\n\n# \"queue\" module in python 3 is named \"Queue\" in python 2\nuse_python3 = sys.version_info[0] == 3\nif use_python3:\n    import queue\nelse:\n    import Queue as queue\n\nimport random\n\nimport socket\n\nimport time\n\n## END Imports. 
################################################################\n\nclass Worker(object):\n    \"\"\"Abstract class of a worker.\n\n    This class provides basic functionality and properties all workers share.\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, learning_rate=1.0):\n        assert isinstance(optimizer, (str, Optimizer)), \"'optimizer' must be a string or a Keras Optimizer instance\"\n        assert isinstance(features_col, (str, list)), \"'features_col' must be a string or a list of strings\"\n        assert isinstance(label_col, (str, list)), \"'label_col' must be a string or a list of strings\"\n        self.model = model\n        self.optimizer = {'class_name': optimizer, 'config': {}} if isinstance(optimizer, str) else serialize(optimizer)\n        self.loss = loss\n        self.loss_weights = loss_weights\n        self.metrics= metrics\n        self.features_column = [features_col] if isinstance(features_col, str) else features_col\n        self.label_column = [label_col] if isinstance(label_col, str) else label_col\n        self.batch_size = batch_size\n        self.num_epoch = num_epoch\n        self.max_mini_batches = 100\n        self.prefetching_thread = None\n        self.mini_batches = None\n        self.is_prefetching = True\n        self.worker_id = -1\n        self.learning_rate = learning_rate\n        self.num_inputs = len(self.features_column)\n        self.num_outputs = len(self.label_column)\n        self.current_epoch = 0\n\n    def set_max_prefetch(self, max_mini_batches):\n        \"\"\"Sets the maximum number of mini-batches that can be prefetched.\"\"\"\n        self.max_mini_batches = max_mini_batches\n\n    def set_learning_rate(self, learning_rate):\n        \"\"\"Sets the learning rate of the worker.\"\"\"\n        self.learning_rate = learning_rate\n\n    def 
get_learning_rate(self):\n        \"\"\"Returns the learning rate of the worker.\"\"\"\n        return self.learning_rate\n\n    def set_worker_id(self, worker_id):\n        \"\"\"Sets the worker id.\n\n        # Arguments\n            worker_id: int. Worker identifier.\n        \"\"\"\n        self.worker_id = worker_id\n\n    def get_worker_id(self):\n        \"\"\"Returns the worker id.\"\"\"\n        return self.worker_id\n\n    def prepare_model(self):\n        \"\"\"Prepares the model for training.\"\"\"\n        # Set the Keras directory.\n        set_keras_base_directory()\n        if K.backend() == 'tensorflow':\n            # set GPU option allow_growth to False for GPU-enabled tensorflow\n            config = tf.ConfigProto()\n            config.gpu_options.allow_growth = False\n            sess = tf.Session(config=config)\n            K.set_session(sess)\n\n        # Deserialize the Keras model.\n        self.model = deserialize_keras_model(self.model)\n        self.optimizer = deserialize(self.optimizer)\n        # Compile the model with the specified loss and optimizer.\n        self.model.compile(loss=self.loss, loss_weights = self.loss_weights, \n            optimizer=self.optimizer, metrics=self.metrics)\n\n    def get_next_minibatch(self):\n        \"\"\"Returns the next mini-batch.\"\"\"\n        return self.mini_batches.get(timeout=10)\n\n    def start_prefetching_thread(self, iterator):\n        \"\"\"Starts the data prefetching thread.\"\"\"\n        self.mini_batches = queue.Queue()\n        self.iterator = iterator\n        self.prefetching_thread = threading.Thread(target=self.prefetching)\n        self.prefetching_thread.start()\n\n    def prefetching(self):\n        partition_iterators_all_epochs = tee(self.iterator, self.num_epoch)\n        for iter_one_epoch in partition_iterators_all_epochs:\n            self.current_epoch += 1\n            self.is_prefetching = True\n            try:\n                while self.is_prefetching:\n       
             if self.mini_batches.qsize() < self.max_mini_batches:\n                        batch = [next(iter_one_epoch) for _ in range(self.batch_size)]\n                        batch_iterator_copies = tee(batch, self.num_inputs + self.num_outputs)\n                        feature_iterators = batch_iterator_copies[:self.num_inputs]\n                        label_iterators = batch_iterator_copies[self.num_inputs:]\n                        X = [np.asarray([x[self.features_column[i]] for x in iterator]) \n                            for i, iterator in enumerate(feature_iterators)]\n                        Y = [np.asarray([x[self.label_column[i]] for x in iterator])\n                            for i, iterator in enumerate(label_iterators)]\n                        self.mini_batches.put([X, Y])\n            except Exception as e:\n                print(e)\n                self.is_prefetching = False\n\n    def optimize(self):\n        \"\"\"Optimization procedure of a worker.\"\"\"\n        raise NotImplementedError\n\n    def train(self, worker_id, iterator):\n        \"\"\"Training procedure for the worker node.\n\n        # Arguments\n            worker_id: int. Partition index provided by Spark. Can be used as a worker_id.\n            iterator: iterator. 
Data iterator.\n        \"\"\"\n        # Prepare the optimization procedure.\n        self.start_prefetching_thread(iterator)\n        self.set_worker_id(worker_id)\n        self.prepare_model()\n        # Start the optimization procedure.\n        try:\n            self.optimize()\n        except Exception as e:\n            # Stop the prefetching process.\n            self.is_prefetching = False\n            print(e)\n        # Wait for the prefetching thread to stop.\n        self.prefetching_thread.join()\n\n        return iter([serialize_keras_model(self.model)])\n\n\nclass SequentialWorker(Worker):\n    \"\"\"Implementation for sequential gradient updates on a single worker.\n\n    Will train a model on a single worker node.\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"],\n                 features_col=\"features\", label_col=\"label\", batch_size=32, num_epoch=1):\n        # Initialize the parent class.\n        super(SequentialWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col,\n                                               label_col, batch_size, num_epoch)\n\n    def optimize(self):\n        \"\"\"Training procedure with sequential gradient updates.\n\n        Runs until the mini-batch queue is exhausted; the trained model itself\n        is returned by train().\n        \"\"\"\n        while True:\n            X, Y = self.get_next_minibatch()\n            self.model.train_on_batch(X, Y)\n\n\nclass NetworkWorker(Worker):\n    \"\"\"Abstract class of a worker who shares the variables using the network.\"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, learning_rate=1.0):\n        super(NetworkWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col,\n                             
                label_col, batch_size, num_epoch, learning_rate)\n        self.master_host = master_host\n        self.master_port = master_port\n        self.socket = None\n        self.center_variable = None\n        self.disable_nagle = True\n        self.training_history = []\n        self.worker_id = 0\n\n    def connect(self):\n        \"\"\"Connects with the remote parameter server.\"\"\"\n        self.socket = connect(self.master_host, self.master_port, self.disable_nagle)\n\n    def pull(self):\n        \"\"\"Requests the center variable from the parameter server.\"\"\"\n        # Request a pull from the parameter server.\n        self.socket.sendall(b'p')\n        # Fetch the center variable from the parameter server.\n        self.center_variable = np.asarray(recv_data(self.socket))\n\n    def commit(self, residual):\n        \"\"\"Sends the gradient residual to the parameter server.\"\"\"\n        # Prepare the data structure.\n        data = {}\n        data['worker_id'] = self.get_worker_id()\n        data['delta'] = residual\n        # Request a commit from the parameter server.\n        self.socket.sendall(b'c')\n        # Send the data to the parameter server.\n        send_data(self.socket, data)\n\n    def set_tcp_no_delay(self, flag):\n        \"\"\"Disables or enables Nagle's algorithm.\n        (True -> TCP_NODELAY = 1)\n        (False -> TCP_NODELAY = 0)\n\n        # Arguments:\n            flag: boolean. Indicates if Nagle's algorithm should be disabled.\n        \"\"\"\n        self.disable_nagle = flag\n\n    def tcp_no_delay(self):\n        \"\"\"Returns the value of the TCP_NODELAY flag (Nagle's algorithm).\n\n        # Returns\n            True, if Nagle's algorithm is disabled. 
False otherwise.\n        \"\"\"\n        return self.disable_nagle\n\n    def get_master_host(self):\n        \"\"\"Returns the host address of the master parameter server.\"\"\"\n        return self.master_host\n\n    def get_master_port(self):\n        \"\"\"Returns the port of the master parameter server.\"\"\"\n        return self.master_port\n\n    def add_history(self, h):\n        \"\"\"Appends the specified history data.\"\"\"\n        d = {}\n        d['history'] = h\n        d['worker_id'] = self.worker_id\n        d['iteration'] = self.iteration\n        d['timestamp'] = time.time()\n        self.training_history.append(d)\n\n    def optimize(self):\n        \"\"\"Optimization procedure of a network worker.\"\"\"\n        raise NotImplementedError\n\n    def train(self, worker_id, iterator):\n        \"\"\"Training procedure of a networked worker with a parameter server.\"\"\"\n        self.start_prefetching_thread(iterator)\n        self.set_worker_id(worker_id)\n        self.prepare_model()\n        self.connect()\n        self.pull()\n        self.model.set_weights(self.center_variable)\n        try:\n            self.optimize()\n        except Exception as e:\n            # Stop the prefetching process.\n            self.is_prefetching = False\n            print(e)\n        self.socket.close()\n        self.prefetching_thread.join(timeout=1)\n\n        return iter(self.training_history)\n\n\nclass ADAGWorker(NetworkWorker):\n    \"\"\"Implements the training procedure for ADAG.\n\n    Introduced by Hermans et al.\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, communication_window=5):\n        # Initialize the parent object.\n        super(ADAGWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,\n                         
                 batch_size, num_epoch, master_host, master_port)\n        # Initialize ADAG parameters.\n        self.communication_window = communication_window\n        self.iteration = 1\n\n    def commit(self, residual):\n        \"\"\"Sends the gradient residual to the parameter server.\"\"\"\n        # Prepare the data structure.\n        data = {}\n        data['worker_id'] = self.get_worker_id()\n        data['residual'] = residual\n        # Request a commit from the parameter server.\n        self.socket.sendall(b'c')\n        # Send the data to the parameter server.\n        send_data(self.socket, data)\n\n    def optimize(self):\n        \"\"\"Optimization procedure of ADAG.\"\"\"\n        W1 = np.asarray(self.model.get_weights())\n        while True:\n            X, Y = self.get_next_minibatch()\n            h = self.model.train_on_batch(X, Y)\n            self.add_history(h)\n            if self.iteration % self.communication_window == 0:\n                W2 = np.asarray(self.model.get_weights())\n                delta = W2 - W1\n                delta /= self.communication_window\n                self.commit(delta)\n                self.pull()\n                self.model.set_weights(self.center_variable)\n                W1 = self.center_variable\n            self.iteration += 1\n\n\nclass DOWNPOURWorker(NetworkWorker):\n    \"\"\"Implements the training procedure for the distributed DOWNPOUR optimizer.\n\n    Introduced by Dean et al.\n    http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, communication_window=3):\n        # Initialize the parent object.\n        super(DOWNPOURWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, 
label_col,\n                                             batch_size, num_epoch, master_host, master_port)\n        self.communication_window = communication_window\n        self.iteration = 1\n\n    def optimize(self):\n        \"\"\"Specific optimization procedure for DOWNPOUR.\"\"\"\n        W1 = np.asarray(self.model.get_weights())\n        while True:\n            X, Y = self.get_next_minibatch()\n            if self.iteration % self.communication_window == 0:\n                W2 = np.asarray(self.model.get_weights())\n                delta = W2 - W1\n                self.commit(delta)\n                self.pull()\n                self.model.set_weights(self.center_variable)\n                W1 = self.center_variable\n            h = self.model.train_on_batch(X, Y)\n            self.add_history(h)\n            self.iteration += 1\n\n\nclass AEASGDWorker(NetworkWorker):\n    \"\"\"Implementation of asynchronous EASGD worker.\n\n    Introduced by Zhang et al.\n    https://arxiv.org/pdf/1412.6651.pdf\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=['accuracy'], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, rho=5.0,\n                 learning_rate=0.01, communication_window=32):\n        # Initialize the parent object.\n        super(AEASGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,\n                                           batch_size, num_epoch, master_host, master_port)\n        # Initialize AEASGD specific variables.\n        self.rho = rho\n        self.learning_rate = learning_rate\n        self.communication_window = communication_window\n        self.alpha = self.rho * self.learning_rate\n        self.iteration = 1\n\n    def optimize(self):\n        \"\"\"Specific training procedure for AEASGD.\"\"\"\n        while True:\n            X, Y = self.get_next_minibatch()\n            
if self.iteration % self.communication_window == 0:\n                self.pull()\n                W = np.asarray(self.model.get_weights())\n                E = self.alpha * (W - self.center_variable)\n                W = W - E\n                self.model.set_weights(W)\n                self.commit(E)\n            h = self.model.train_on_batch(X, Y)\n            self.add_history(h)\n            self.iteration += 1\n\n\nclass EAMSGDWorker(NetworkWorker):\n    \"\"\"Worker implementation of Asynchronous EA Momentum SGD.\n\n    Introduced by Zhang et al.\n    https://arxiv.org/pdf/1412.6651.pdf\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=['accuracy'], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, rho=5.0,\n                 learning_rate=0.01, momentum=0.9, communication_window=32):\n        # Initialize the parent object.\n        super(EAMSGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,\n                                           batch_size, num_epoch, master_host, master_port)\n        # Initialize EAMSGD specific variables.\n        self.rho = rho\n        self.learning_rate = learning_rate\n        self.momentum = momentum\n        self.communication_window = communication_window\n        self.alpha = self.learning_rate * self.rho\n        self.iteration = 1\n\n    def optimize(self):\n        \"\"\"Specific training procedure of asynchronous EAMSGD.\"\"\"\n        r = np.asarray(self.model.get_weights())\n        r.fill(0.0)\n        while True:\n            X, Y = self.get_next_minibatch()\n            if self.iteration % self.communication_window == 0:\n                self.pull()\n                W = np.asarray(self.model.get_weights())\n                E = self.alpha * (W - self.center_variable)\n                W = W - E\n                self.model.set_weights(W)\n                
self.commit(E)\n            r_t = self.momentum * r\n            W_copy = np.asarray(self.model.get_weights())\n            W = np.asarray(self.model.get_weights())\n            W += r_t\n            self.model.set_weights(W)\n            h = self.model.train_on_batch(X, Y)\n            self.add_history(h)\n            gradient = np.asarray(self.model.get_weights()) - W\n            r = r_t - self.learning_rate * gradient\n            W_copy -= r\n            self.model.set_weights(W_copy)\n            self.iteration += 1\n\n\nclass DynSGDWorker(NetworkWorker):\n    \"\"\"Implements the training procedure for DynSGD.\"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, communication_window=5):\n        # Initialize the parent object.\n        super(DynSGDWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,\n                                           batch_size, num_epoch, master_host, master_port)\n        # Initialize DynSGD parameters.\n        self.communication_window = communication_window\n        self.iteration = 1\n        self.last_update = 0\n\n    def pull(self):\n        \"\"\"Requests the center variable and last update from the parameter server.\"\"\"\n        # Request a pull from the parameter server.\n        self.socket.sendall(b'p')\n        # Fetch the dictionary from the parameter server.\n        data = recv_data(self.socket)\n        self.center_variable = np.asarray(data['model'])\n        self.last_update = data['update']\n\n    def commit(self, residual):\n        \"\"\"Sends the gradient residual to the parameter server.\"\"\"\n        # Prepare the datastructure.\n        data = {}\n        data['worker_id'] = self.get_worker_id()\n        data['residual'] = residual\n        data['last_update'] = self.last_update\n     
   # Request a commit from the parameter server.\n        self.socket.sendall(b'c')\n        # Send the data to the parameter server.\n        send_data(self.socket, data)\n\n    def optimize(self):\n        \"\"\"Optimization procedure of DynSGD.\"\"\"\n        W1 = np.asarray(self.model.get_weights())\n        while True:\n            X, Y = self.get_next_minibatch()\n            h = self.model.train_on_batch(X, Y)\n            self.add_history(h)\n            if self.iteration % self.communication_window == 0:\n                W2 = np.asarray(self.model.get_weights())\n                delta = W2 - W1\n                self.commit(delta)\n                self.pull()\n                self.model.set_weights(self.center_variable)\n                W1 = self.center_variable\n            self.iteration += 1\n\n\nclass ExperimentalWorker(NetworkWorker):\n    \"\"\"Implements an experimental training procedure based on ADAG.\n\n    ADAG was introduced by Hermans et al.\n    \"\"\"\n\n    def __init__(self, model, optimizer, loss, loss_weights, metrics=[\"accuracy\"], features_col=\"features\", label_col=\"label\",\n                 batch_size=32, num_epoch=1, master_host=\"localhost\", master_port=5000, communication_window=5,\n                 num_workers=2, learning_rate=1.0):\n        # Initialize the parent object.\n        super(ExperimentalWorker, self).__init__(model, optimizer, loss, loss_weights, metrics, features_col, label_col,\n                                                 batch_size, num_epoch, master_host, master_port, learning_rate)\n        # Initialize ADAG parameters.\n        self.communication_window = communication_window\n        self.num_workers = num_workers\n        self.current_num_workers = self.num_workers\n        self.inverse_learning_rate = 1.0 / self.learning_rate\n        self.iteration = 1\n\n    def commit(self, residual):\n        \"\"\"Sends the gradient residual to the parameter server.\"\"\"\n        # Prepare the data structure.\n        data = {}\n        
data['worker_id'] = self.get_worker_id()\n        data['residual'] = residual\n        data['stale_center_variable'] = self.center_variable\n        # Request a commit from the parameter server.\n        self.socket.sendall(b'c')\n        # Send the data to the parameter server.\n        send_data(self.socket, data)\n\n    def pull(self):\n        \"\"\"Requests the center variable from the parameter server.\"\"\"\n        # Request a pull from the parameter server.\n        self.socket.sendall(b'p')\n        # Fetch the center variable from the parameter server.\n        self.center_variable = np.asarray(recv_data(self.socket))\n\n    def optimize(self):\n        \"\"\"Optimization procedure of the experimental ADAG variant.\"\"\"\n        W1 = np.asarray(self.model.get_weights())\n        while True:\n            X, Y = self.get_next_minibatch()\n            h = self.model.train_on_batch(X, Y)\n            self.add_history(h)\n            if self.iteration % self.communication_window == 0:\n                W2 = np.asarray(self.model.get_weights())\n                delta = W2 - W1\n                delta /= self.communication_window\n                self.commit(delta)\n                self.pull()\n                self.model.set_weights(self.center_variable)\n                W1 = self.center_variable\n            self.iteration += 1\n"
  },
  {
    "path": "docs/index.md",
    "content": "# Distributed Keras\n\nDistributed Keras (DK) is a **distributed deep learning framework** built on top of Apache Spark and Keras, with the goal of significantly reducing training time through distributed machine learning algorithms. We designed the framework in such a way that a developer can implement a new distributed optimizer with ease, enabling them to focus on research and model development.\n\nMost of our methods follow the data parallel approach as described in the paper on [Large Scale Distributed Deep Networks](http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf). In this paradigm, replicas of a model are distributed over several \"trainers\", and every model replica is trained on a different partition of the dataset. The gradient (or all network weights, depending on the implementation details) is communicated with the parameter server after every gradient update. The parameter server is responsible for handling the gradient updates of all workers and incorporating them into a single master model, which is returned to the user after the training procedure is complete.\n\n## Installation\n\nWe rely on [Keras](https://keras.io) for the construction of models, and thus inherit the Keras dependencies. 
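The worker/parameter-server exchange described above can be sketched as follows. This is a toy, single-process illustration under stated assumptions: `ToyParameterServer` and `toy_worker` are hypothetical names, not part of the DK API, and the real workers communicate over sockets while the server may apply staleness-aware aggregation.\n\n```python\nimport numpy as np\n\nclass ToyParameterServer:\n    \"\"\"Minimal in-process stand-in for the master parameter server.\"\"\"\n\n    def __init__(self, center_variable):\n        self.center_variable = np.asarray(center_variable, dtype=float)\n\n    def pull(self):\n        # A worker requests the current center variable (master model).\n        return self.center_variable.copy()\n\n    def commit(self, delta):\n        # A worker sends its accumulated update; the server folds it in.\n        self.center_variable += delta\n\ndef toy_worker(server, steps, gradient, learning_rate=0.1):\n    # One \"trainer\": pull the center variable, take local SGD steps\n    # on its data partition, and commit the accumulated difference.\n    w = server.pull()\n    start = w.copy()\n    for _ in range(steps):\n        w -= learning_rate * gradient  # stand-in for a local gradient step\n    server.commit(w - start)\n\nserver = ToyParameterServer(np.zeros(3))\ntoy_worker(server, steps=5, gradient=np.ones(3))\nprint(server.center_variable)  # each entry moved by -5 * 0.1 = -0.5\n```\n\nIn the actual framework, several such workers run concurrently on Spark executors, so commits arrive asynchronously and may be based on a stale copy of the center variable.\n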
Furthermore, PySpark is also a dependency for this project, since DK uses Apache Spark for the distribution of the data and the model replicas.\n\n### Pip\n\nYou can use `pip` if you only need the DK framework, without the examples.\n\n```bash\npip install git+https://github.com/JoeriHermans/dist-keras.git\n```\n\n### Git\n\nHowever, if you would like to play with the examples and notebooks, simply install the framework using the approach described below.\n\n```bash\ngit clone https://github.com/JoeriHermans/dist-keras\ncd dist-keras\npip install -e .\n```\n\n## Getting Started\n\nWe recommend starting with the `workflow` notebook located in the `examples` directory. This notebook will guide you through all the general steps you need to perform: setting up a Spark Context, reading the data, applying preprocessing, and training and evaluating your model in a distributed way.\n\n!!! Note\n    The **workflow.ipynb** notebook can be run on your local machine. However, we recommend running it on a Spark cluster, since the distributed trainers only start to outperform the *SingleTrainer* when the number of workers (cores multiplied by executors) exceeds roughly 10.\n\n## Support\n\nFor issues, bugs, questions, and suggestions, please use the appropriate channels on [GitHub](https://github.com/JoeriHermans/dist-keras/).\n\nAfter the installation process is complete, you can start exploring the functionality by browsing the examples. We have also prepared a notebook which compares the different distributed optimizers with each other. This notebook is located at `examples/experiment.ipynb`. Other examples show you how to use the different distributed optimizers with Apache Spark for distributed pre-processing.\n\n## References\n\n* Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 
685-693).\n\n* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.\n\n* Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231).\n\n* Pumperla, M. (2015). Elephas. GitHub repository: https://github.com/maxpumperla/elephas/.\n\n## Licensing\n\n![GPLv3](images/gpl_v3.png) ![CERN](images/cern_logo.jpg)\n"
  },
  {
    "path": "docs/license.md",
    "content": "# GNU General Public License\n**Version 3, 29 June 2007**\n\n Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>\n Everyone is permitted to copy and distribute verbatim copies\n of this license document, but changing it is not allowed.\n\n## Preamble\n\n  The GNU General Public License is a free, copyleft license for\nsoftware and other kinds of works.\n\n  The licenses for most software and other practical works are designed\nto take away your freedom to share and change the works.  By contrast,\nthe GNU General Public License is intended to guarantee your freedom to\nshare and change all versions of a program--to make sure it remains free\nsoftware for all its users.  We, the Free Software Foundation, use the\nGNU General Public License for most of our software; it applies also to\nany other work released this way by its authors.  You can apply it to\nyour programs, too.\n\n  When we speak of free software, we are referring to freedom, not\nprice.  Our General Public Licenses are designed to make sure that you\nhave the freedom to distribute copies of free software (and charge for\nthem if you wish), that you receive source code or can get it if you\nwant it, that you can change the software or use pieces of it in new\nfree programs, and that you know you can do these things.\n\n  To protect your rights, we need to prevent others from denying you\nthese rights or asking you to surrender the rights.  Therefore, you have\ncertain responsibilities if you distribute copies of the software, or if\nyou modify it: responsibilities to respect the freedom of others.\n\n  For example, if you distribute copies of such a program, whether\ngratis or for a fee, you must pass on to the recipients the same\nfreedoms that you received.  You must make sure that they, too, receive\nor can get the source code.  
And you must show them these terms so they\nknow their rights.\n\n  Developers that use the GNU GPL protect your rights with two steps:\n(1) assert copyright on the software, and (2) offer you this License\ngiving you legal permission to copy, distribute and/or modify it.\n\n  For the developers' and authors' protection, the GPL clearly explains\nthat there is no warranty for this free software.  For both users' and\nauthors' sake, the GPL requires that modified versions be marked as\nchanged, so that their problems will not be attributed erroneously to\nauthors of previous versions.\n\n  Some devices are designed to deny users access to install or run\nmodified versions of the software inside them, although the manufacturer\ncan do so.  This is fundamentally incompatible with the aim of\nprotecting users' freedom to change the software.  The systematic\npattern of such abuse occurs in the area of products for individuals to\nuse, which is precisely where it is most unacceptable.  Therefore, we\nhave designed this version of the GPL to prohibit the practice for those\nproducts.  If such problems arise substantially in other domains, we\nstand ready to extend this provision to those domains in future versions\nof the GPL, as needed to protect the freedom of users.\n\n  Finally, every program is threatened constantly by software patents.\nStates should not allow patents to restrict development and use of\nsoftware on general-purpose computers, but in those that do, we wish to\navoid the special danger that patents applied to a free program could\nmake it effectively proprietary.  To prevent this, the GPL assures that\npatents cannot be used to render the program non-free.\n\n  The precise terms and conditions for copying, distribution and\nmodification follow.\n\n## Terms And Conditions\n\n0. 
Definitions.\n\n  \"This License\" refers to version 3 of the GNU General Public License.\n\n  \"Copyright\" also means copyright-like laws that apply to other kinds of\nworks, such as semiconductor masks.\n\n  \"The Program\" refers to any copyrightable work licensed under this\nLicense.  Each licensee is addressed as \"you\".  \"Licensees\" and\n\"recipients\" may be individuals or organizations.\n\n  To \"modify\" a work means to copy from or adapt all or part of the work\nin a fashion requiring copyright permission, other than the making of an\nexact copy.  The resulting work is called a \"modified version\" of the\nearlier work or a work \"based on\" the earlier work.\n\n  A \"covered work\" means either the unmodified Program or a work based\non the Program.\n\n  To \"propagate\" a work means to do anything with it that, without\npermission, would make you directly or secondarily liable for\ninfringement under applicable copyright law, except executing it on a\ncomputer or modifying a private copy.  Propagation includes copying,\ndistribution (with or without modification), making available to the\npublic, and in some countries other activities as well.\n\n  To \"convey\" a work means any kind of propagation that enables other\nparties to make or receive copies.  Mere interaction with a user through\na computer network, with no transfer of a copy, is not conveying.\n\n  An interactive user interface displays \"Appropriate Legal Notices\"\nto the extent that it includes a convenient and prominently visible\nfeature that (1) displays an appropriate copyright notice, and (2)\ntells the user that there is no warranty for the work (except to the\nextent that warranties are provided), that licensees may convey the\nwork under this License, and how to view a copy of this License.  If\nthe interface presents a list of user commands or options, such as a\nmenu, a prominent item in the list meets this criterion.\n\n  1. 
Source Code.\n\n  The \"source code\" for a work means the preferred form of the work\nfor making modifications to it.  \"Object code\" means any non-source\nform of a work.\n\n  A \"Standard Interface\" means an interface that either is an official\nstandard defined by a recognized standards body, or, in the case of\ninterfaces specified for a particular programming language, one that\nis widely used among developers working in that language.\n\n  The \"System Libraries\" of an executable work include anything, other\nthan the work as a whole, that (a) is included in the normal form of\npackaging a Major Component, but which is not part of that Major\nComponent, and (b) serves only to enable use of the work with that\nMajor Component, or to implement a Standard Interface for which an\nimplementation is available to the public in source code form.  A\n\"Major Component\", in this context, means a major essential component\n(kernel, window system, and so on) of the specific operating system\n(if any) on which the executable work runs, or a compiler used to\nproduce the work, or an object code interpreter used to run it.\n\n  The \"Corresponding Source\" for a work in object code form means all\nthe source code needed to generate, install, and (for an executable\nwork) run the object code and to modify the work, including scripts to\ncontrol those activities.  However, it does not include the work's\nSystem Libraries, or general-purpose tools or generally available free\nprograms which are used unmodified in performing those activities but\nwhich are not part of the work.  
For example, Corresponding Source\nincludes interface definition files associated with source files for\nthe work, and the source code for shared libraries and dynamically\nlinked subprograms that the work is specifically designed to require,\nsuch as by intimate data communication or control flow between those\nsubprograms and other parts of the work.\n\n  The Corresponding Source need not include anything that users\ncan regenerate automatically from other parts of the Corresponding\nSource.\n\n  The Corresponding Source for a work in source code form is that\nsame work.\n\n  2. Basic Permissions.\n\n  All rights granted under this License are granted for the term of\ncopyright on the Program, and are irrevocable provided the stated\nconditions are met.  This License explicitly affirms your unlimited\npermission to run the unmodified Program.  The output from running a\ncovered work is covered by this License only if the output, given its\ncontent, constitutes a covered work.  This License acknowledges your\nrights of fair use or other equivalent, as provided by copyright law.\n\n  You may make, run and propagate covered works that you do not\nconvey, without conditions so long as your license otherwise remains\nin force.  You may convey covered works to others for the sole purpose\nof having them make modifications exclusively for you, or provide you\nwith facilities for running those works, provided that you comply with\nthe terms of this License in conveying all material for which you do\nnot control copyright.  Those thus making or running the covered works\nfor you must do so exclusively on your behalf, under your direction\nand control, on terms that prohibit them from making any copies of\nyour copyrighted material outside their relationship with you.\n\n  Conveying under any other circumstances is permitted solely under\nthe conditions stated below.  Sublicensing is not allowed; section 10\nmakes it unnecessary.\n\n  3. 
Protecting Users' Legal Rights From Anti-Circumvention Law.\n\n  No covered work shall be deemed part of an effective technological\nmeasure under any applicable law fulfilling obligations under article\n11 of the WIPO copyright treaty adopted on 20 December 1996, or\nsimilar laws prohibiting or restricting circumvention of such\nmeasures.\n\n  When you convey a covered work, you waive any legal power to forbid\ncircumvention of technological measures to the extent such circumvention\nis effected by exercising rights under this License with respect to\nthe covered work, and you disclaim any intention to limit operation or\nmodification of the work as a means of enforcing, against the work's\nusers, your or third parties' legal rights to forbid circumvention of\ntechnological measures.\n\n  4. Conveying Verbatim Copies.\n\n  You may convey verbatim copies of the Program's source code as you\nreceive it, in any medium, provided that you conspicuously and\nappropriately publish on each copy an appropriate copyright notice;\nkeep intact all notices stating that this License and any\nnon-permissive terms added in accord with section 7 apply to the code;\nkeep intact all notices of the absence of any warranty; and give all\nrecipients a copy of this License along with the Program.\n\n  You may charge any price or no price for each copy that you convey,\nand you may offer support or warranty protection for a fee.\n\n  5. Conveying Modified Source Versions.\n\n  You may convey a work based on the Program, or the modifications to\nproduce it from the Program, in the form of source code under the\nterms of section 4, provided that you also meet all of these conditions:\n\n    a) The work must carry prominent notices stating that you modified\n    it, and giving a relevant date.\n\n    b) The work must carry prominent notices stating that it is\n    released under this License and any conditions added under section\n    7.  
This requirement modifies the requirement in section 4 to\n    \"keep intact all notices\".\n\n    c) You must license the entire work, as a whole, under this\n    License to anyone who comes into possession of a copy.  This\n    License will therefore apply, along with any applicable section 7\n    additional terms, to the whole of the work, and all its parts,\n    regardless of how they are packaged.  This License gives no\n    permission to license the work in any other way, but it does not\n    invalidate such permission if you have separately received it.\n\n    d) If the work has interactive user interfaces, each must display\n    Appropriate Legal Notices; however, if the Program has interactive\n    interfaces that do not display Appropriate Legal Notices, your\n    work need not make them do so.\n\n  A compilation of a covered work with other separate and independent\nworks, which are not by their nature extensions of the covered work,\nand which are not combined with it such as to form a larger program,\nin or on a volume of a storage or distribution medium, is called an\n\"aggregate\" if the compilation and its resulting copyright are not\nused to limit the access or legal rights of the compilation's users\nbeyond what the individual works permit.  Inclusion of a covered work\nin an aggregate does not cause this License to apply to the other\nparts of the aggregate.\n\n  6. 
Conveying Non-Source Forms.\n\n  You may convey a covered work in object code form under the terms\nof sections 4 and 5, provided that you also convey the\nmachine-readable Corresponding Source under the terms of this License,\nin one of these ways:\n\n    a) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by the\n    Corresponding Source fixed on a durable physical medium\n    customarily used for software interchange.\n\n    b) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by a\n    written offer, valid for at least three years and valid for as\n    long as you offer spare parts or customer support for that product\n    model, to give anyone who possesses the object code either (1) a\n    copy of the Corresponding Source for all the software in the\n    product that is covered by this License, on a durable physical\n    medium customarily used for software interchange, for a price no\n    more than your reasonable cost of physically performing this\n    conveying of source, or (2) access to copy the\n    Corresponding Source from a network server at no charge.\n\n    c) Convey individual copies of the object code with a copy of the\n    written offer to provide the Corresponding Source.  This\n    alternative is allowed only occasionally and noncommercially, and\n    only if you received the object code with such an offer, in accord\n    with subsection 6b.\n\n    d) Convey the object code by offering access from a designated\n    place (gratis or for a charge), and offer equivalent access to the\n    Corresponding Source in the same way through the same place at no\n    further charge.  You need not require recipients to copy the\n    Corresponding Source along with the object code.  
If the place to\n    copy the object code is a network server, the Corresponding Source\n    may be on a different server (operated by you or a third party)\n    that supports equivalent copying facilities, provided you maintain\n    clear directions next to the object code saying where to find the\n    Corresponding Source.  Regardless of what server hosts the\n    Corresponding Source, you remain obligated to ensure that it is\n    available for as long as needed to satisfy these requirements.\n\n    e) Convey the object code using peer-to-peer transmission, provided\n    you inform other peers where the object code and Corresponding\n    Source of the work are being offered to the general public at no\n    charge under subsection 6d.\n\n  A separable portion of the object code, whose source code is excluded\nfrom the Corresponding Source as a System Library, need not be\nincluded in conveying the object code work.\n\n  A \"User Product\" is either (1) a \"consumer product\", which means any\ntangible personal property which is normally used for personal, family,\nor household purposes, or (2) anything designed or sold for incorporation\ninto a dwelling.  In determining whether a product is a consumer product,\ndoubtful cases shall be resolved in favor of coverage.  For a particular\nproduct received by a particular user, \"normally used\" refers to a\ntypical or common use of that class of product, regardless of the status\nof the particular user or of the way in which the particular user\nactually uses, or expects or is expected to use, the product.  
A product\nis a consumer product regardless of whether the product has substantial\ncommercial, industrial or non-consumer uses, unless such uses represent\nthe only significant mode of use of the product.\n\n  \"Installation Information\" for a User Product means any methods,\nprocedures, authorization keys, or other information required to install\nand execute modified versions of a covered work in that User Product from\na modified version of its Corresponding Source.  The information must\nsuffice to ensure that the continued functioning of the modified object\ncode is in no case prevented or interfered with solely because\nmodification has been made.\n\n  If you convey an object code work under this section in, or with, or\nspecifically for use in, a User Product, and the conveying occurs as\npart of a transaction in which the right of possession and use of the\nUser Product is transferred to the recipient in perpetuity or for a\nfixed term (regardless of how the transaction is characterized), the\nCorresponding Source conveyed under this section must be accompanied\nby the Installation Information.  But this requirement does not apply\nif neither you nor any third party retains the ability to install\nmodified object code on the User Product (for example, the work has\nbeen installed in ROM).\n\n  The requirement to provide Installation Information does not include a\nrequirement to continue to provide support service, warranty, or updates\nfor a work that has been modified or installed by the recipient, or for\nthe User Product in which it has been modified or installed.  
Access to a\nnetwork may be denied when the modification itself materially and\nadversely affects the operation of the network or violates the rules and\nprotocols for communication across the network.\n\n  Corresponding Source conveyed, and Installation Information provided,\nin accord with this section must be in a format that is publicly\ndocumented (and with an implementation available to the public in\nsource code form), and must require no special password or key for\nunpacking, reading or copying.\n\n  7. Additional Terms.\n\n  \"Additional permissions\" are terms that supplement the terms of this\nLicense by making exceptions from one or more of its conditions.\nAdditional permissions that are applicable to the entire Program shall\nbe treated as though they were included in this License, to the extent\nthat they are valid under applicable law.  If additional permissions\napply only to part of the Program, that part may be used separately\nunder those permissions, but the entire Program remains governed by\nthis License without regard to the additional permissions.\n\n  When you convey a copy of a covered work, you may at your option\nremove any additional permissions from that copy, or from any part of\nit.  (Additional permissions may be written to require their own\nremoval in certain cases when you modify the work.)  
You may place\nadditional permissions on material, added by you to a covered work,\nfor which you have or can give appropriate copyright permission.\n\n  Notwithstanding any other provision of this License, for material you\nadd to a covered work, you may (if authorized by the copyright holders of\nthat material) supplement the terms of this License with terms:\n\n    a) Disclaiming warranty or limiting liability differently from the\n    terms of sections 15 and 16 of this License; or\n\n    b) Requiring preservation of specified reasonable legal notices or\n    author attributions in that material or in the Appropriate Legal\n    Notices displayed by works containing it; or\n\n    c) Prohibiting misrepresentation of the origin of that material, or\n    requiring that modified versions of such material be marked in\n    reasonable ways as different from the original version; or\n\n    d) Limiting the use for publicity purposes of names of licensors or\n    authors of the material; or\n\n    e) Declining to grant rights under trademark law for use of some\n    trade names, trademarks, or service marks; or\n\n    f) Requiring indemnification of licensors and authors of that\n    material by anyone who conveys the material (or modified versions of\n    it) with contractual assumptions of liability to the recipient, for\n    any liability that these contractual assumptions directly impose on\n    those licensors and authors.\n\n  All other non-permissive additional terms are considered \"further\nrestrictions\" within the meaning of section 10.  If the Program as you\nreceived it, or any part of it, contains a notice stating that it is\ngoverned by this License along with a term that is a further\nrestriction, you may remove that term.  
If a license document contains\na further restriction but permits relicensing or conveying under this\nLicense, you may add to a covered work material governed by the terms\nof that license document, provided that the further restriction does\nnot survive such relicensing or conveying.\n\n  If you add terms to a covered work in accord with this section, you\nmust place, in the relevant source files, a statement of the\nadditional terms that apply to those files, or a notice indicating\nwhere to find the applicable terms.\n\n  Additional terms, permissive or non-permissive, may be stated in the\nform of a separately written license, or stated as exceptions;\nthe above requirements apply either way.\n\n  8. Termination.\n\n  You may not propagate or modify a covered work except as expressly\nprovided under this License.  Any attempt otherwise to propagate or\nmodify it is void, and will automatically terminate your rights under\nthis License (including any patent licenses granted under the third\nparagraph of section 11).\n\n  However, if you cease all violation of this License, then your\nlicense from a particular copyright holder is reinstated (a)\nprovisionally, unless and until the copyright holder explicitly and\nfinally terminates your license, and (b) permanently, if the copyright\nholder fails to notify you of the violation by some reasonable means\nprior to 60 days after the cessation.\n\n  Moreover, your license from a particular copyright holder is\nreinstated permanently if the copyright holder notifies you of the\nviolation by some reasonable means, this is the first time you have\nreceived notice of violation of this License (for any work) from that\ncopyright holder, and you cure the violation prior to 30 days after\nyour receipt of the notice.\n\n  Termination of your rights under this section does not terminate the\nlicenses of parties who have received copies or rights from you under\nthis License.  
If your rights have been terminated and not permanently\nreinstated, you do not qualify to receive new licenses for the same\nmaterial under section 10.\n\n  9. Acceptance Not Required for Having Copies.\n\n  You are not required to accept this License in order to receive or\nrun a copy of the Program.  Ancillary propagation of a covered work\noccurring solely as a consequence of using peer-to-peer transmission\nto receive a copy likewise does not require acceptance.  However,\nnothing other than this License grants you permission to propagate or\nmodify any covered work.  These actions infringe copyright if you do\nnot accept this License.  Therefore, by modifying or propagating a\ncovered work, you indicate your acceptance of this License to do so.\n\n  10. Automatic Licensing of Downstream Recipients.\n\n  Each time you convey a covered work, the recipient automatically\nreceives a license from the original licensors, to run, modify and\npropagate that work, subject to this License.  You are not responsible\nfor enforcing compliance by third parties with this License.\n\n  An \"entity transaction\" is a transaction transferring control of an\norganization, or substantially all assets of one, or subdividing an\norganization, or merging organizations.  If propagation of a covered\nwork results from an entity transaction, each party to that\ntransaction who receives a copy of the work also receives whatever\nlicenses to the work the party's predecessor in interest had or could\ngive under the previous paragraph, plus a right to possession of the\nCorresponding Source of the work from the predecessor in interest, if\nthe predecessor has it or can get it with reasonable efforts.\n\n  You may not impose any further restrictions on the exercise of the\nrights granted or affirmed under this License.  
For example, you may\nnot impose a license fee, royalty, or other charge for exercise of\nrights granted under this License, and you may not initiate litigation\n(including a cross-claim or counterclaim in a lawsuit) alleging that\nany patent claim is infringed by making, using, selling, offering for\nsale, or importing the Program or any portion of it.\n\n  11. Patents.\n\n  A \"contributor\" is a copyright holder who authorizes use under this\nLicense of the Program or a work on which the Program is based.  The\nwork thus licensed is called the contributor's \"contributor version\".\n\n  A contributor's \"essential patent claims\" are all patent claims\nowned or controlled by the contributor, whether already acquired or\nhereafter acquired, that would be infringed by some manner, permitted\nby this License, of making, using, or selling its contributor version,\nbut do not include claims that would be infringed only as a\nconsequence of further modification of the contributor version.  For\npurposes of this definition, \"control\" includes the right to grant\npatent sublicenses in a manner consistent with the requirements of\nthis License.\n\n  Each contributor grants you a non-exclusive, worldwide, royalty-free\npatent license under the contributor's essential patent claims, to\nmake, use, sell, offer for sale, import and otherwise run, modify and\npropagate the contents of its contributor version.\n\n  In the following three paragraphs, a \"patent license\" is any express\nagreement or commitment, however denominated, not to enforce a patent\n(such as an express permission to practice a patent or covenant not to\nsue for patent infringement).  
To \"grant\" such a patent license to a\nparty means to make such an agreement or commitment not to enforce a\npatent against the party.\n\n  If you convey a covered work, knowingly relying on a patent license,\nand the Corresponding Source of the work is not available for anyone\nto copy, free of charge and under the terms of this License, through a\npublicly available network server or other readily accessible means,\nthen you must either (1) cause the Corresponding Source to be so\navailable, or (2) arrange to deprive yourself of the benefit of the\npatent license for this particular work, or (3) arrange, in a manner\nconsistent with the requirements of this License, to extend the patent\nlicense to downstream recipients.  \"Knowingly relying\" means you have\nactual knowledge that, but for the patent license, your conveying the\ncovered work in a country, or your recipient's use of the covered work\nin a country, would infringe one or more identifiable patents in that\ncountry that you have reason to believe are valid.\n\n  If, pursuant to or in connection with a single transaction or\narrangement, you convey, or propagate by procuring conveyance of, a\ncovered work, and grant a patent license to some of the parties\nreceiving the covered work authorizing them to use, propagate, modify\nor convey a specific copy of the covered work, then the patent license\nyou grant is automatically extended to all recipients of the covered\nwork and works based on it.\n\n  A patent license is \"discriminatory\" if it does not include within\nthe scope of its coverage, prohibits the exercise of, or is\nconditioned on the non-exercise of one or more of the rights that are\nspecifically granted under this License.  
You may not convey a covered\nwork if you are a party to an arrangement with a third party that is\nin the business of distributing software, under which you make payment\nto the third party based on the extent of your activity of conveying\nthe work, and under which the third party grants, to any of the\nparties who would receive the covered work from you, a discriminatory\npatent license (a) in connection with copies of the covered work\nconveyed by you (or copies made from those copies), or (b) primarily\nfor and in connection with specific products or compilations that\ncontain the covered work, unless you entered into that arrangement,\nor that patent license was granted, prior to 28 March 2007.\n\n  Nothing in this License shall be construed as excluding or limiting\nany implied license or other defenses to infringement that may\notherwise be available to you under applicable patent law.\n\n  12. No Surrender of Others' Freedom.\n\n  If conditions are imposed on you (whether by court order, agreement or\notherwise) that contradict the conditions of this License, they do not\nexcuse you from the conditions of this License.  If you cannot convey a\ncovered work so as to satisfy simultaneously your obligations under this\nLicense and any other pertinent obligations, then as a consequence you may\nnot convey it at all.  For example, if you agree to terms that obligate you\nto collect a royalty for further conveying from those to whom you convey\nthe Program, the only way you could satisfy both those terms and this\nLicense would be to refrain entirely from conveying the Program.\n\n  13. Use with the GNU Affero General Public License.\n\n  Notwithstanding any other provision of this License, you have\npermission to link or combine any covered work with a work licensed\nunder version 3 of the GNU Affero General Public License into a single\ncombined work, and to convey the resulting work.  
The terms of this\nLicense will continue to apply to the part which is the covered work,\nbut the special requirements of the GNU Affero General Public License,\nsection 13, concerning interaction through a network will apply to the\ncombination as such.\n\n  14. Revised Versions of this License.\n\n  The Free Software Foundation may publish revised and/or new versions of\nthe GNU General Public License from time to time.  Such new versions will\nbe similar in spirit to the present version, but may differ in detail to\naddress new problems or concerns.\n\n  Each version is given a distinguishing version number.  If the\nProgram specifies that a certain numbered version of the GNU General\nPublic License \"or any later version\" applies to it, you have the\noption of following the terms and conditions either of that numbered\nversion or of any later version published by the Free Software\nFoundation.  If the Program does not specify a version number of the\nGNU General Public License, you may choose any version ever published\nby the Free Software Foundation.\n\n  If the Program specifies that a proxy can decide which future\nversions of the GNU General Public License can be used, that proxy's\npublic statement of acceptance of a version permanently authorizes you\nto choose that version for the Program.\n\n  Later license versions may give you additional or different\npermissions.  However, no additional obligations are imposed on any\nauthor or copyright holder as a result of your choosing to follow a\nlater version.\n\n  15. Disclaimer of Warranty.\n\n  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY\nAPPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT\nHOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \"AS IS\" WITHOUT WARRANTY\nOF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,\nTHE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\nPURPOSE.  
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM\nIS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF\nALL NECESSARY SERVICING, REPAIR OR CORRECTION.\n\n  16. Limitation of Liability.\n\n  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING\nWILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS\nTHE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY\nGENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE\nUSE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF\nDATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD\nPARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),\nEVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF\nSUCH DAMAGES.\n\n  17. Interpretation of Sections 15 and 16.\n\n  If the disclaimer of warranty and limitation of liability provided\nabove cannot be given local legal effect according to their terms,\nreviewing courts shall apply local law that most closely approximates\nan absolute waiver of all civil liability in connection with the\nProgram, unless a warranty or assumption of liability accompanies a\ncopy of the Program in return for a fee.\n\n                     END OF TERMS AND CONDITIONS\n\n            How to Apply These Terms to Your New Programs\n\n  If you develop a new program, and you want it to be of the greatest\npossible use to the public, the best way to achieve this is to make it\nfree software which everyone can redistribute and change under these terms.\n\n  To do so, attach the following notices to the program.  
It is safest\nto attach them to the start of each source file to most effectively\nstate the exclusion of warranty; and each file should have at least\nthe \"copyright\" line and a pointer to where the full notice is found.\n\n    Distributed Deep Learning using Keras and Apache Spark.\n    Copyright (C) 2016  Joeri Hermans\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see <http://www.gnu.org/licenses/>.\n\nAlso add information on how to contact you by electronic and paper mail.\n\n  If the program does terminal interaction, make it output a short\nnotice like this when it starts in an interactive mode:\n\n    Distributed Keras  Copyright (C) 2016  Joeri Hermans\n    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.\n    This is free software, and you are welcome to redistribute it\n    under certain conditions; type `show c' for details.\n\nThe hypothetical commands `show w' and `show c' should show the appropriate\nparts of the General Public License.  
Of course, your program's commands\nmight be different; for a GUI interface, you would use an \"about box\".\n\n  You should also get your employer (if you work as a programmer) or school,\nif any, to sign a \"copyright disclaimer\" for the program, if necessary.\nFor more information on this, and how to apply and follow the GNU GPL, see\n<http://www.gnu.org/licenses/>.\n\n  The GNU General Public License does not permit incorporating your program\ninto proprietary programs.  If your program is a subroutine library, you\nmay consider it more useful to permit linking proprietary applications with\nthe library.  If this is what you want to do, use the GNU Lesser General\nPublic License instead of this License.  But first, please read\n<http://www.gnu.org/philosophy/why-not-lgpl.html>.\n"
  },
  {
    "path": "docs/optimizers.md",
    "content": "# Optimizers\n\nOptimizers, or trainers, are the main component in Distributed Keras (DK). All trainers share a single interface, which is the `Trainer` class, defined in `distkeras/distributed.py`. This class also contains the `serialized model`, the `loss`, and the `Keras optimizer` the workers need to use. Generally, a trainer will run on a single worker. In the context of Apache Spark, this means that the thread which is responsible for doing the foreachPartition or mapPartitions will have been assigned a trainer. In reality however, the training of the model itself will utilise more physical cores. In fact, it will employ all available cores, and thus bypassing resource managers such as YARN.\n\n## Single Trainer\n\nA single trainer is in all simplicity a trainer which will use a single thread (as discussed above) to train a model. This trainer is usually used as a baseline metric for new distributed optimizers.\n\n```python\nSingleTrainer(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_epoch=1,\n              batch_size=32, features_col=\"features\", label_col=\"label\")\n```\n**Parameters**:\n\n- **keras_model**:            The Keras model which should be trained.\n- **worker_optmizer**:        Keras optimizer for workers.\n- **num_epoch**:              Number of epoch iterations over the data.\n- **batch_size**:             Mini-batch size.\n- **features_col**:           Column of the feature vector in the Spark Dataframe.\n- **label_col**:              Column of the label in the Spark Dataframe.\n\n## EASGD\n\nThe distinctive idea of EASGD is to allow the local workers to perform more exploration (small rho) and the master to perform exploitation. 
This approach differs from other settings explored in the literature, and focuses on how fast the center variable converges [(paper)](https://arxiv.org/pdf/1412.6651.pdf).\n\nNote that the basic version of EASGD is a synchronous algorithm, i.e., once a worker has processed a batch of the data, it will wait until all other workers have submitted their variables (including the weight parameterization, iteration number, and worker id) to the parameter server before starting the next data batch.\n\n```python\nEASGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2,\n      features_col=\"features\", label_col=\"label\", rho=5.0, learning_rate=0.01,\n      batch_size=32, num_epoch=1, master_port=5000)\n```\n\n**Parameters**:\n\nTODO\n\n## Asynchronous EASGD\n\nIn this section we propose the asynchronous version of EASGD. Instead of waiting on the synchronization of other trainers, this method communicates the elastic difference (as described in the paper) with the parameter server. The only synchronization mechanism that has been implemented is to ensure that no race conditions occur when updating the center variable.\n\n```python\nAsynchronousEASGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=1000,\n                  features_col=\"features\", label_col=\"label\", communication_window=3,\n                  rho=0.01, learning_rate=0.01, master_port=5000, num_epoch=1)\n```\n\n**Parameters**:\n\nTODO\n\n## Asynchronous EAMSGD\n\nAsynchronous EAMSGD is a variant of asynchronous EASGD. 
It is based on Nesterov's momentum scheme, where the update of the local worker is modified to incorporate a momentum term.\n\n```python\nAsynchronousEAMSGD(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=32,\n                   features_col=\"features\", label_col=\"label\", communication_window=10,\n                   rho=5.0, learning_rate=0.01, momentum=0.9, master_port=5000, num_epoch=1)\n```\n\n**Parameters**:\n\nTODO\n\n## DOWNPOUR\n\nAn asynchronous stochastic gradient descent procedure which supports a large number of model replicas and leverages adaptive learning rates. This implementation is based on the pseudocode provided by [Zhang et al.](https://arxiv.org/pdf/1412.6651.pdf).\n\n```python\nDOWNPOUR(keras_model, worker_optimizer, loss, metrics=[\"accuracy\"], num_workers=2, batch_size=1000,\n         features_col=\"features\", label_col=\"label\", communication_window=5,\n         master_port=5000, num_epoch=1, learning_rate=0.01)\n```\n\n**Parameters**:\n\nTODO\n\n## Custom distributed optimizer\n\nTODO\n\n### Synchronized Distributed Trainer\n\nTODO\n\n### Asynchronous Distributed Trainer\n\nTODO\n\n### Implementing a custom worker\n\nTODO\n"
  },
  {
    "path": "examples/cifar-10-preprocessing.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CIFAR-10 Preprocessing\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Departement of Data Science & Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this notebook we download the CIFAR-10 dataset, and prepare it in such a way it can be processed by Spark.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import cPickle as pickle\\n\",\n    \"\\n\",\n    \"import csv\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import OneHotEncoder\\n\",\n    \"\\n\",\n    \"from distkeras.trainers import *\\n\",\n    \"from distkeras.predictors import *\\n\",\n    \"from distkeras.transformers import *\\n\",\n    \"from distkeras.evaluators import *\\n\",\n    \"from distkeras.utils import *\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Downloading and decompressing the dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"--2017-01-26 
15:42:04--  https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\\n\",\n      \"Resolving www.cs.toronto.edu... 128.100.3.30\\n\",\n      \"Connecting to www.cs.toronto.edu|128.100.3.30|:443... connected.\\n\",\n      \"HTTP request sent, awaiting response... 200 OK\\n\",\n      \"Length: 170498071 (163M) [application/x-gzip]\\n\",\n      \"Saving to: “cifar-10-python.tar.gz”\\n\",\n      \"\\n\",\n      \"100%[======================================>] 170,498,071 4.88M/s   in 33s     \\n\",\n      \"\\n\",\n      \"2017-01-26 15:42:40 (4.89 MB/s) - “cifar-10-python.tar.gz” saved [170498071/170498071]\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!rm cifar-10-python.tar.gz\\n\",\n    \"!rm -r cifar-10-batches-py\\n\",\n    \"!wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"cifar-10-batches-py/\\n\",\n      \"cifar-10-batches-py/data_batch_4\\n\",\n      \"cifar-10-batches-py/readme.html\\n\",\n      \"cifar-10-batches-py/test_batch\\n\",\n      \"cifar-10-batches-py/data_batch_3\\n\",\n      \"cifar-10-batches-py/batches.meta\\n\",\n      \"cifar-10-batches-py/data_batch_2\\n\",\n      \"cifar-10-batches-py/data_batch_5\\n\",\n      \"cifar-10-batches-py/data_batch_1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!tar -xvzf cifar-10-python.tar.gz\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Loading the dataset in memory for further processing\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of training instances: 50000\\n\"\n    
 ]\n    }\n   ],\n   \"source\": [\n    \"# Define the required datastructures.\\n\",\n    \"training_instances = []\\n\",\n    \"training_labels = []\\n\",\n    \"\\n\",\n    \"# Iterate through all training batches, and load them in memory.\\n\",\n    \"for i in range(1, 6):\\n\",\n    \"    path = \\\"cifar-10-batches-py/data_batch_\\\" + str(i)\\n\",\n    \"    fd = open(path, \\\"rb\\\")\\n\",\n    \"    d = pickle.load(fd)\\n\",\n    \"    fd.close()\\n\",\n    \"    # Add the training data to our datastructures.\\n\",\n    \"    num_instances = len(d['data'])\\n\",\n    \"    for j in range(0, num_instances):\\n\",\n    \"        training_instances.append(d['data'][j])\\n\",\n    \"        training_labels.append(d['labels'][j])\\n\",\n    \"        \\n\",\n    \"print(\\\"Number of training instances: \\\" + str(len(training_instances)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of test instances: 10000\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Define the reuiqred datastructures.\\n\",\n    \"test_instances = []\\n\",\n    \"test_labels = []\\n\",\n    \"\\n\",\n    \"# Load the test batch.\\n\",\n    \"path = \\\"cifar-10-batches-py/test_batch\\\"\\n\",\n    \"fd = open(path, \\\"rb\\\")\\n\",\n    \"d = pickle.load(fd)\\n\",\n    \"fd.close()\\n\",\n    \"# Add the testset to our datastructures.\\n\",\n    \"num_instances = len(d['data'])\\n\",\n    \"for j in range(0, num_instances):\\n\",\n    \"    test_instances.append(d['data'][j])\\n\",\n    \"    test_labels.append(d['labels'][j])\\n\",\n    \"    \\n\",\n    \"print(\\\"Number of test instances: \\\" + str(len(test_instances)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"At this point we have the training and test set in 
memory. From here, we basically have two options to prepare the data for Apache Spark. First, we could simply \\\"parallelize\\\" the data, and continue from there. However, this requires some additional logic. The second approach is to write the data to a file which Spark is able to read (CSV, Parquet, Avro...). Due to the simplicity of the second approach, we choose to write the contents of our datastructures to a CSV file.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of columns: 3073\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# First, prepare the column names.\\n\",\n    \"columns = ['label']\\n\",\n    \"# Now, add the pixel column names. Note: the first 1024 pixels are red, then green, and finally blue.\\n\",\n    \"for c in ['r','g','b']:\\n\",\n    \"    for i in range(0, 1024):\\n\",\n    \"        column_name = \\\"p_\\\" + str(i) + \\\"_\\\" + c\\n\",\n    \"        columns.append(column_name)\\n\",\n    \"        \\n\",\n    \"# Now, we should have 3072 (data) + 1 (label) column names.\\n\",\n    \"print(\\\"Number of columns: \\\" + str(len(columns)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Size training set: 50000\\n\",\n      \"Size test set: 10000\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"training_set = []\\n\",\n    \"test_set = []\\n\",\n    \"\\n\",\n    \"# Prepare the training set.\\n\",\n    \"for i in range(0, len(training_instances)):\\n\",\n    \"    row = np.insert(training_instances[i], 0, training_labels[i])\\n\",\n    \"    training_set.append(row)\\n\",\n    \"\\n\",\n    \"# Prepare the test set.\\n\",\n    \"for i in 
range(0, len(test_instances)):\\n\",\n    \"    row = np.insert(test_instances[i], 0, test_labels[i])\\n\",\n    \"    test_set.append(row)\\n\",\n    \"    \\n\",\n    \"print(\\\"Size training set: \\\" + str(len(training_set)))\\n\",\n    \"print(\\\"Size test set: \\\" + str(len(test_set)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def save(path, columns, dataset):\\n\",\n    \"    with open(path, 'wb') as f:\\n\",\n    \"        w = csv.writer(f)\\n\",\n    \"        # Write the columns.\\n\",\n    \"        w.writerow(columns)\\n\",\n    \"        # Iterate through all instances in the training set.\\n\",\n    \"        n = len(dataset)\\n\",\n    \"        for i in range(0, n):\\n\",\n    \"            w.writerow(dataset[i].tolist())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Save the datasets to disk.\\n\",\n    \"save(\\\"cifar-10-training.csv\\\", columns, training_set)\\n\",\n    \"save(\\\"cifar-10-test.csv\\\", columns, test_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"cifar-10-test.csv\\r\\n\",\n      \"cifar-10-training.csv\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Confirming that produced CSV's are present\\n\",\n    \"!ls | grep cifar | grep csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Deleted data/cifar-10-training.csv\\n\",\n      \"Deleted 
data/cifar-10-test.csv\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Remove the old training and test set from HDFS.\\n\",\n    \"!hdfs dfs -rm data/cifar-10-training.csv\\n\",\n    \"!hdfs dfs -rm data/cifar-10-test.csv\\n\",\n    \"# Copy the training and test set to HDFS.\\n\",\n    \"!hdfs dfs -copyFromLocal cifar-10-training.csv data/cifar-10-training.csv\\n\",\n    \"!hdfs dfs -copyFromLocal cifar-10-test.csv data/cifar-10-test.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Further distributed preprocessing with Apache Spark\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Setting up a Spark Context\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"CIFAR-10 Preprocessing Notebook\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"path_train = \\\"data/cifar-10-training.csv\\\"\\n\",\n    \"path_test = \\\"data/cifar-10-test.csv\\\"\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_processes = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 20\\n\",\n    \"    num_processes = 1\\n\",\n    \"    \\n\",\n    \"num_workers = num_executors * num_processes\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"# Use the DataBricks CSV reader, this has some nice functionality regarding invalid values.\\n\",\n    
\"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_processes`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\", \\\"4g\\\")\\n\",\n    \"conf.set(\\\"spark.locality.wait\\\", \\\"0\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\");\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Reading the raw CSV files\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"# Read the training set.\\n\",\n    \"raw_dataset_train = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                          .options(header='true', inferSchema='true') \\\\\\n\",\n    \"                          
.load(path_train)\\n\",\n    \"# Read the testing set.\\n\",\n    \"raw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                         .options(header='true', inferSchema='true') \\\\\\n\",\n    \"                         .load(path_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Training set size: 50000\\n\",\n      \"Test set size: 10000\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Count the number of instances in the training and test set (to check).\\n\",\n    \"print(\\\"Training set size: \\\" + str(raw_dataset_train.count()))\\n\",\n    \"print(\\\"Test set size: \\\" + str(raw_dataset_test.count()))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Preparing for further preprocessing, training and testing\\n\",\n    \"\\n\",\n    \"In order to ensure compatibility with Apache Spark, we vectorize the columns, and add the resulting vectors as a separate column. However, in order to achieve this, we first need a list of the required columns. This is shown in the cell below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"features = raw_dataset_train.columns\\n\",\n    \"features.remove('label')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once we have a list of column names, we can pass this to Spark's [VectorAssembler](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler). 
This VectorAssembler will take a list of features, vectorize them, and place them in a column defined in `outputCol`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Assemble the columns.\\n\",\n    \"vector_assembler = VectorAssembler(inputCols=features, outputCol=\\\"features\\\")\\n\",\n    \"dataset_train = vector_assembler.transform(raw_dataset_train)\\n\",\n    \"dataset_test = vector_assembler.transform(raw_dataset_test)\\n\",\n    \"# Repartition the dataset.\\n\",\n    \"dataset_train = dataset_train.repartition(num_workers)\\n\",\n    \"dataset_test = dataset_test.repartition(num_workers)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once we have the inputs for our Neural Network (features column) after applying the VectorAssembler, we should also define the outputs. Since we are dealing with a classification task, the output of our Neural Network should be a one-hot encoded vector with 10 elements. 
For this, we provide a `OneHotTransformer` which accomplishes this exact task.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"nb_classes = 10\\n\",\n    \"encoder = OneHotTransformer(nb_classes, input_col=\\\"label\\\", output_col=\\\"label_encoded\\\")\\n\",\n    \"dataset_train = encoder.transform(dataset_train)\\n\",\n    \"dataset_test = encoder.transform(dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Finally, normalize the pixel intensities to the range [0, 1].\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Allocate a MinMaxTransformer.\\n\",\n    \"transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\\\\n\",\n    \"                                o_min=0.0, o_max=255.0, \\\\\\n\",\n    \"                                input_col=\\\"features\\\", \\\\\\n\",\n    \"                                output_col=\\\"features_normalized\\\")\\n\",\n    \"# Transform the datasets.\\n\",\n    \"dataset_train = transformer.transform(dataset_train)\\n\",\n    \"dataset_test = transformer.transform(dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Saving the datasets to Parquet.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Deleted data/cifar-10-train-preprocessed.parquet\\n\",\n      \"Deleted data/cifar-10-test-preprocessed.parquet\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Delete the old preprocessed Parquet files.\\n\",\n    \"!hdfs dfs -rm -r 
data/cifar-10-train-preprocessed.parquet\\n\",\n    \"!hdfs dfs -rm -r data/cifar-10-test-preprocessed.parquet\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dataset_train.write.parquet(\\\"data/cifar-10-train-preprocessed.parquet\\\")\\n\",\n    \"dataset_test.write.parquet(\\\"data/cifar-10-test-preprocessed.parquet\\\")\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [conda root]\",\n   \"language\": \"python\",\n   \"name\": \"conda-root-py\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}\n"
  },
  {
    "path": "examples/distributed_numpy_parsing.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Distributed Numpy Parsing\\n\",\n    \"\\n\",\n    \"Joeri R. Hermans                    \\n\",\n    \"*Department of Data Science & Knowledge Engineering*          \\n\",\n    \"*Maastricht University, The Netherlands*           \\n\",\n    \"\\n\",\n    \"This notebook will show you how to parse a collection of Numpy files straight from HDFS into a Spark Dataframe.\\n\",\n    \"\\n\",\n    \"## Cluster Configuration\\n\",\n    \"\\n\",\n    \"In the following sections, we set up the cluster properties.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"import os\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from pyspark.sql.types import *\\n\",\n    \"\\n\",\n    \"from pyspark.sql import Row\\n\",\n    \"\\n\",\n    \"from pyspark.storagelevel import StorageLevel\\n\",\n    \"\\n\",\n    \"# Use the DataBricks AVRO reader.\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of desired executors: 20\\n\",\n      \"Number of desired processes / executor: 1\\n\",\n      \"Total number of workers: 20\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Numpy Parsing\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"\\n\",\n 
   \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_processes = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 20\\n\",\n    \"    num_processes = 1\\n\",\n    \"\\n\",\n    \"# This variable is derived from the number of cores and executors,\\n\",\n    \"# and will be used to assign the number of model trainers.\\n\",\n    \"num_workers = num_executors * num_processes\\n\",\n    \"\\n\",\n    \"print(\\\"Number of desired executors: \\\" + `num_executors`)\\n\",\n    \"print(\\\"Number of desired processes / executor: \\\" + `num_processes`)\\n\",\n    \"print(\\\"Total number of workers: \\\" + `num_workers`)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Do not change anything here.\\n\",\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_processes`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\", \\\"5g\\\")\\n\",\n    \"conf.set(\\\"spark.locality.wait\\\", \\\"0\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\")\\n\",\n    \"conf.set(\\\"spark.kryoserializer.buffer.max\\\", \\\"2000\\\")\\n\",\n    \"conf.set(\\\"spark.executor.heartbeatInterval\\\", \\\"6000s\\\")\\n\",\n    \"conf.set(\\\"spark.network.timeout\\\", \\\"10000000s\\\")\\n\",\n    \"conf.set(\\\"spark.shuffle.spill\\\", \\\"true\\\")\\n\",\n    \"conf.set(\\\"spark.driver.memory\\\", \\\"10g\\\")\\n\",\n    \"conf.set(\\\"spark.driver.maxResultSize\\\", \\\"10g\\\")\\n\",\n    
\"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"                     .appName(application_name) \\\\\\n\",\n    \"                     .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\\n\",\n    \"\\n\",\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Obtaining the required file-paths\\n\",\n    \"\\n\",\n    \"We are now going to obtain a list of file paths (*.npy), which we will map with a custom lambda function to read all the data into a DataFrame.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Define the command that needs to be executed; this will list all the numpy files in the specified directory.\\n\",\n    \"cmd = \\\"hdfs dfs -ls /user/jhermans/data/cms/RelValWjet_Pt_3000_3500_13_GEN-SIM-RECO_evt3150/*.npy | awk '{print $NF}'\\\"\\n\",\n    \"# Fetch the output of the command, and construct a list.\\n\",\n    \"output = os.popen(cmd).read()\\n\",\n    \"file_paths = output.split(\\\"\\\\n\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating a Spark Dataframe from the specified list\\n\",\n    \"\\n\",\n    \"Before we convert the list to a Spark DataFrame, we first need to specify the schema. We do this by converting every element in the list to a Spark row. 
Afterwards, Spark will be able to automatically infer the schema of the dataframe.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"rows = []\\n\",\n    \"\\n\",\n    \"for path in file_paths:\\n\",\n    \"    row = Row(**{'path': path})\\n\",\n    \"    rows.append(row)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are able to create the Spark DataFrame. Note, for Spark 2.0 use `spark.` instead of `sqlContext.`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of paths to be parsed: 393\\n\",\n      \"root\\n\",\n      \" |-- path: string (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df = sqlContext.createDataFrame(rows)\\n\",\n    \"# Repartition the dataset for increased parallelism.\\n\",\n    \"df = df.repartition(20)\\n\",\n    \"\\n\",\n    \"print(\\\"Number of paths to be parsed: \\\" + str(df.count()))\\n\",\n    \"df.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[Row(path=u'/user/jhermans/data/cms/RelValWjet_Pt_3000_3500_13_GEN-SIM-RECO_evt3150/trackparams220.npy')]\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# Example content of the dataframe.\\n\",\n    \"df.take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Parsing your Numpy files\\n\",\n    \"\\n\",\n    \"This is a fairly straightforward 
operation where we basically map all the file paths using a custom lambda function to read the numpy files from HDFS.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of columns: 190\\n\",\n      \"First five columns: \\n\",\n      \"sis_25_x\\n\",\n      \"normalizedChi2\\n\",\n      \"sis_25_z\\n\",\n      \"sis_25_y\\n\",\n      \"sis_48_x\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Development cell, this will be executed in the lambdas.\\n\",\n    \"\\n\",\n    \"import pydoop.hdfs as hdfs\\n\",\n    \"\\n\",\n    \"with hdfs.open(file_paths[0]) as f:\\n\",\n    \"    data = np.load(f)\\n\",\n    \"\\n\",\n    \"# Obtain the fields (columns) of your numpy data.\\n\",\n    \"fields = []\\n\",\n    \"for k in data[0].dtype.fields:\\n\",\n    \"    fields.append(k)\\n\",\n    \"    \\n\",\n    \"print(\\\"Number of columns: \\\" + str(len(data.dtype.fields)))\\n\",\n    \"\\n\",\n    \"print(\\\"First five columns: \\\")\\n\",\n    \"i = 0\\n\",\n    \"for k in data.dtype.fields:\\n\",\n    \"    print(k)\\n\",\n    \"    i += 1\\n\",\n    \"    if i == 5:\\n\",\n    \"        break\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we have a working prototype, let's construct a Spark mapper which will fetch the data in a distributed manner from HDFS. 
Note that if you would like to adjust the data in any way after reading, you can do so by modifying the lambda function, or executing another map after the data has been read.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- TrackId: long (nullable = true)\\n\",\n      \" |-- charge: long (nullable = true)\\n\",\n      \" |-- chi2: double (nullable = true)\\n\",\n      \" |-- d0: double (nullable = true)\\n\",\n      \" |-- dsz: double (nullable = true)\\n\",\n      \" |-- dxy: double (nullable = true)\\n\",\n      \" |-- dz: double (nullable = true)\\n\",\n      \" |-- eta: double (nullable = true)\\n\",\n      \" |-- evt: long (nullable = true)\\n\",\n      \" |-- lambda: double (nullable = true)\\n\",\n      \" |-- lumi: long (nullable = true)\\n\",\n      \" |-- ndof: double (nullable = true)\\n\",\n      \" |-- normalizedChi2: double (nullable = true)\\n\",\n      \" |-- p: double (nullable = true)\\n\",\n      \" |-- phi: double (nullable = true)\\n\",\n      \" |-- pix_0_x: double (nullable = true)\\n\",\n      \" |-- pix_0_y: double (nullable = true)\\n\",\n      \" |-- pix_0_z: double (nullable = true)\\n\",\n      \" |-- pix_1_x: double (nullable = true)\\n\",\n      \" |-- pix_1_y: double (nullable = true)\\n\",\n      \" |-- pix_1_z: double (nullable = true)\\n\",\n      \" |-- pix_2_x: double (nullable = true)\\n\",\n      \" |-- pix_2_y: double (nullable = true)\\n\",\n      \" |-- pix_2_z: double (nullable = true)\\n\",\n      \" |-- pix_3_x: double (nullable = true)\\n\",\n      \" |-- pix_3_y: double (nullable = true)\\n\",\n      \" |-- pix_3_z: double (nullable = true)\\n\",\n      \" |-- pix_4_x: double (nullable = true)\\n\",\n      \" |-- pix_4_y: double (nullable = true)\\n\",\n      \" |-- pix_4_z: double (nullable = 
true)\\n\",\n      \" |-- pt: double (nullable = true)\\n\",\n      \" |-- px: double (nullable = true)\\n\",\n      \" |-- py: double (nullable = true)\\n\",\n      \" |-- pz: double (nullable = true)\\n\",\n      \" |-- qoverp: double (nullable = true)\\n\",\n      \" |-- run: long (nullable = true)\\n\",\n      \" |-- sis_0_x: double (nullable = true)\\n\",\n      \" |-- sis_0_y: double (nullable = true)\\n\",\n      \" |-- sis_0_z: double (nullable = true)\\n\",\n      \" |-- sis_10_x: double (nullable = true)\\n\",\n      \" |-- sis_10_y: double (nullable = true)\\n\",\n      \" |-- sis_10_z: double (nullable = true)\\n\",\n      \" |-- sis_11_x: double (nullable = true)\\n\",\n      \" |-- sis_11_y: double (nullable = true)\\n\",\n      \" |-- sis_11_z: double (nullable = true)\\n\",\n      \" |-- sis_12_x: double (nullable = true)\\n\",\n      \" |-- sis_12_y: double (nullable = true)\\n\",\n      \" |-- sis_12_z: double (nullable = true)\\n\",\n      \" |-- sis_13_x: double (nullable = true)\\n\",\n      \" |-- sis_13_y: double (nullable = true)\\n\",\n      \" |-- sis_13_z: double (nullable = true)\\n\",\n      \" |-- sis_14_x: double (nullable = true)\\n\",\n      \" |-- sis_14_y: double (nullable = true)\\n\",\n      \" |-- sis_14_z: double (nullable = true)\\n\",\n      \" |-- sis_15_x: double (nullable = true)\\n\",\n      \" |-- sis_15_y: double (nullable = true)\\n\",\n      \" |-- sis_15_z: double (nullable = true)\\n\",\n      \" |-- sis_16_x: double (nullable = true)\\n\",\n      \" |-- sis_16_y: double (nullable = true)\\n\",\n      \" |-- sis_16_z: double (nullable = true)\\n\",\n      \" |-- sis_17_x: double (nullable = true)\\n\",\n      \" |-- sis_17_y: double (nullable = true)\\n\",\n      \" |-- sis_17_z: double (nullable = true)\\n\",\n      \" |-- sis_18_x: double (nullable = true)\\n\",\n      \" |-- sis_18_y: double (nullable = true)\\n\",\n      \" |-- sis_18_z: double (nullable = true)\\n\",\n      \" |-- sis_19_x: double (nullable = 
true)\\n\",\n      \" |-- sis_19_y: double (nullable = true)\\n\",\n      \" |-- sis_19_z: double (nullable = true)\\n\",\n      \" |-- sis_1_x: double (nullable = true)\\n\",\n      \" |-- sis_1_y: double (nullable = true)\\n\",\n      \" |-- sis_1_z: double (nullable = true)\\n\",\n      \" |-- sis_20_x: double (nullable = true)\\n\",\n      \" |-- sis_20_y: double (nullable = true)\\n\",\n      \" |-- sis_20_z: double (nullable = true)\\n\",\n      \" |-- sis_21_x: double (nullable = true)\\n\",\n      \" |-- sis_21_y: double (nullable = true)\\n\",\n      \" |-- sis_21_z: double (nullable = true)\\n\",\n      \" |-- sis_22_x: double (nullable = true)\\n\",\n      \" |-- sis_22_y: double (nullable = true)\\n\",\n      \" |-- sis_22_z: double (nullable = true)\\n\",\n      \" |-- sis_23_x: double (nullable = true)\\n\",\n      \" |-- sis_23_y: double (nullable = true)\\n\",\n      \" |-- sis_23_z: double (nullable = true)\\n\",\n      \" |-- sis_24_x: double (nullable = true)\\n\",\n      \" |-- sis_24_y: double (nullable = true)\\n\",\n      \" |-- sis_24_z: double (nullable = true)\\n\",\n      \" |-- sis_25_x: double (nullable = true)\\n\",\n      \" |-- sis_25_y: double (nullable = true)\\n\",\n      \" |-- sis_25_z: double (nullable = true)\\n\",\n      \" |-- sis_26_x: double (nullable = true)\\n\",\n      \" |-- sis_26_y: double (nullable = true)\\n\",\n      \" |-- sis_26_z: double (nullable = true)\\n\",\n      \" |-- sis_27_x: double (nullable = true)\\n\",\n      \" |-- sis_27_y: double (nullable = true)\\n\",\n      \" |-- sis_27_z: double (nullable = true)\\n\",\n      \" |-- sis_28_x: double (nullable = true)\\n\",\n      \" |-- sis_28_y: double (nullable = true)\\n\",\n      \" |-- sis_28_z: double (nullable = true)\\n\",\n      \" |-- sis_29_x: double (nullable = true)\\n\",\n      \" |-- sis_29_y: double (nullable = true)\\n\",\n      \" |-- sis_29_z: double (nullable = true)\\n\",\n      \" |-- sis_2_x: double (nullable = true)\\n\",\n      \" 
|-- sis_2_y: double (nullable = true)\\n\",\n      \" |-- sis_2_z: double (nullable = true)\\n\",\n      \" |-- sis_30_x: double (nullable = true)\\n\",\n      \" |-- sis_30_y: double (nullable = true)\\n\",\n      \" |-- sis_30_z: double (nullable = true)\\n\",\n      \" |-- sis_31_x: double (nullable = true)\\n\",\n      \" |-- sis_31_y: double (nullable = true)\\n\",\n      \" |-- sis_31_z: double (nullable = true)\\n\",\n      \" |-- sis_32_x: double (nullable = true)\\n\",\n      \" |-- sis_32_y: double (nullable = true)\\n\",\n      \" |-- sis_32_z: double (nullable = true)\\n\",\n      \" |-- sis_33_x: double (nullable = true)\\n\",\n      \" |-- sis_33_y: double (nullable = true)\\n\",\n      \" |-- sis_33_z: double (nullable = true)\\n\",\n      \" |-- sis_34_x: double (nullable = true)\\n\",\n      \" |-- sis_34_y: double (nullable = true)\\n\",\n      \" |-- sis_34_z: double (nullable = true)\\n\",\n      \" |-- sis_35_x: double (nullable = true)\\n\",\n      \" |-- sis_35_y: double (nullable = true)\\n\",\n      \" |-- sis_35_z: double (nullable = true)\\n\",\n      \" |-- sis_36_x: double (nullable = true)\\n\",\n      \" |-- sis_36_y: double (nullable = true)\\n\",\n      \" |-- sis_36_z: double (nullable = true)\\n\",\n      \" |-- sis_37_x: double (nullable = true)\\n\",\n      \" |-- sis_37_y: double (nullable = true)\\n\",\n      \" |-- sis_37_z: double (nullable = true)\\n\",\n      \" |-- sis_38_x: double (nullable = true)\\n\",\n      \" |-- sis_38_y: double (nullable = true)\\n\",\n      \" |-- sis_38_z: double (nullable = true)\\n\",\n      \" |-- sis_39_x: double (nullable = true)\\n\",\n      \" |-- sis_39_y: double (nullable = true)\\n\",\n      \" |-- sis_39_z: double (nullable = true)\\n\",\n      \" |-- sis_3_x: double (nullable = true)\\n\",\n      \" |-- sis_3_y: double (nullable = true)\\n\",\n      \" |-- sis_3_z: double (nullable = true)\\n\",\n      \" |-- sis_40_x: double (nullable = true)\\n\",\n      \" |-- sis_40_y: double 
(nullable = true)\\n\",\n      \" |-- sis_40_z: double (nullable = true)\\n\",\n      \" |-- sis_41_x: double (nullable = true)\\n\",\n      \" |-- sis_41_y: double (nullable = true)\\n\",\n      \" |-- sis_41_z: double (nullable = true)\\n\",\n      \" |-- sis_42_x: double (nullable = true)\\n\",\n      \" |-- sis_42_y: double (nullable = true)\\n\",\n      \" |-- sis_42_z: double (nullable = true)\\n\",\n      \" |-- sis_43_x: double (nullable = true)\\n\",\n      \" |-- sis_43_y: double (nullable = true)\\n\",\n      \" |-- sis_43_z: double (nullable = true)\\n\",\n      \" |-- sis_44_x: double (nullable = true)\\n\",\n      \" |-- sis_44_y: double (nullable = true)\\n\",\n      \" |-- sis_44_z: double (nullable = true)\\n\",\n      \" |-- sis_45_x: double (nullable = true)\\n\",\n      \" |-- sis_45_y: double (nullable = true)\\n\",\n      \" |-- sis_45_z: double (nullable = true)\\n\",\n      \" |-- sis_46_x: double (nullable = true)\\n\",\n      \" |-- sis_46_y: double (nullable = true)\\n\",\n      \" |-- sis_46_z: double (nullable = true)\\n\",\n      \" |-- sis_47_x: double (nullable = true)\\n\",\n      \" |-- sis_47_y: double (nullable = true)\\n\",\n      \" |-- sis_47_z: double (nullable = true)\\n\",\n      \" |-- sis_48_x: double (nullable = true)\\n\",\n      \" |-- sis_48_y: double (nullable = true)\\n\",\n      \" |-- sis_48_z: double (nullable = true)\\n\",\n      \" |-- sis_49_x: double (nullable = true)\\n\",\n      \" |-- sis_49_y: double (nullable = true)\\n\",\n      \" |-- sis_49_z: double (nullable = true)\\n\",\n      \" |-- sis_4_x: double (nullable = true)\\n\",\n      \" |-- sis_4_y: double (nullable = true)\\n\",\n      \" |-- sis_4_z: double (nullable = true)\\n\",\n      \" |-- sis_5_x: double (nullable = true)\\n\",\n      \" |-- sis_5_y: double (nullable = true)\\n\",\n      \" |-- sis_5_z: double (nullable = true)\\n\",\n      \" |-- sis_6_x: double (nullable = true)\\n\",\n      \" |-- sis_6_y: double (nullable = true)\\n\",\n   
   \" |-- sis_6_z: double (nullable = true)\\n\",\n      \" |-- sis_7_x: double (nullable = true)\\n\",\n      \" |-- sis_7_y: double (nullable = true)\\n\",\n      \" |-- sis_7_z: double (nullable = true)\\n\",\n      \" |-- sis_8_x: double (nullable = true)\\n\",\n      \" |-- sis_8_y: double (nullable = true)\\n\",\n      \" |-- sis_8_z: double (nullable = true)\\n\",\n      \" |-- sis_9_x: double (nullable = true)\\n\",\n      \" |-- sis_9_y: double (nullable = true)\\n\",\n      \" |-- sis_9_z: double (nullable = true)\\n\",\n      \" |-- theta: double (nullable = true)\\n\",\n      \" |-- vx: double (nullable = true)\\n\",\n      \" |-- vy: double (nullable = true)\\n\",\n      \" |-- vz: double (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"def parse(iterator):\\n\",\n    \"    rows = []\\n\",\n    \"    \\n\",\n    \"    # MODIFY TO YOUR NEEDS IF NECESSARY\\n\",\n    \"    for row in iterator:\\n\",\n    \"        path = row['path']\\n\",\n    \"        # Load the file from HFDS.\\n\",\n    \"        with hdfs.open(path) as f:\\n\",\n    \"            data = np.load(f)\\n\",\n    \"        # Add all rows in current path.\\n\",\n    \"        for r in data:\\n\",\n    \"            d = {}\\n\",\n    \"            for f in fields:\\n\",\n    \"                d[f] = r[f].item()\\n\",\n    \"            rows.append(Row(**d))\\n\",\n    \"        \\n\",\n    \"    return iter(rows)\\n\",\n    \"\\n\",\n    \"# Apply the lambda function.\\n\",\n    \"dataset = df.rdd.mapPartitions(parse).toDF()\\n\",\n    \"dataset.printSchema()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 2\",\n   \"language\": \"python\",\n   \"name\": \"python2\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   
\"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.13\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/example_0_data_preprocessing.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Preprocessing\\n\",\n    \"\\n\",\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Department of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"07 December 2016\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!(date +%d\\\\ %B\\\\ %G)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this notebook we will be preprocessing a **4.6 GB** CSV file containing simulated ATLAS events. Afterwards we will save the processed data to the [Parquet](https://parquet.apache.org/) format for further analysis. 
After the completion of this notebook, we will have a processed dataset ready for model development, training and evaluation.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"import time\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import StandardScaler\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import StringIndexer\\n\",\n    \"\\n\",\n    \"from distkeras.utils import shuffle\\n\",\n    \"from distkeras.transformers import OneHotTransformer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Spark preparation and configuration\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Edit the variables in the cell below. If you are running Spark in local mode, please set the `local` flag to true and adjust the resources you wish to use on your local machine. 
The same applies if you are running Spark 2.0 or higher.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Deep Learning: Data Preprocessing\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_cores = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 8\\n\",\n    \"    num_cores = 2\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the following cells you are not required to change anything. Adjusting the configuration in the cell above should be sufficient for running this notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"# Use the Databricks CSV reader; it has some nice functionality for handling invalid values.\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_cores`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\",\\\"2g\\\")\\n\",\n    
\"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\");\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Dataset preprocessing\\n\",\n    \"\\n\",\n    \"### Reading\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"\\n\",\n    \"# Read the dataset.\\n\",\n    \"raw_dataset = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                    .options(header='true', inferSchema='true').load(\\\"data/atlas_higgs.csv\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- EventId: integer (nullable = true)\\n\",\n      \" |-- DER_mass_MMC: double (nullable = true)\\n\",\n      \" |-- DER_mass_transverse_met_lep: double (nullable = true)\\n\",\n      \" |-- DER_mass_vis: double (nullable = true)\\n\",\n      \" |-- DER_pt_h: double (nullable = true)\\n\",\n      \" |-- DER_deltaeta_jet_jet: double (nullable = true)\\n\",\n      \" |-- DER_mass_jet_jet: double (nullable 
= true)\\n\",\n      \" |-- DER_prodeta_jet_jet: double (nullable = true)\\n\",\n      \" |-- DER_deltar_tau_lep: double (nullable = true)\\n\",\n      \" |-- DER_pt_tot: double (nullable = true)\\n\",\n      \" |-- DER_sum_pt: double (nullable = true)\\n\",\n      \" |-- DER_pt_ratio_lep_tau: double (nullable = true)\\n\",\n      \" |-- DER_met_phi_centrality: double (nullable = true)\\n\",\n      \" |-- DER_lep_eta_centrality: double (nullable = true)\\n\",\n      \" |-- PRI_tau_pt: double (nullable = true)\\n\",\n      \" |-- PRI_tau_eta: double (nullable = true)\\n\",\n      \" |-- PRI_tau_phi: double (nullable = true)\\n\",\n      \" |-- PRI_lep_pt: double (nullable = true)\\n\",\n      \" |-- PRI_lep_eta: double (nullable = true)\\n\",\n      \" |-- PRI_lep_phi: double (nullable = true)\\n\",\n      \" |-- PRI_met: double (nullable = true)\\n\",\n      \" |-- PRI_met_phi: double (nullable = true)\\n\",\n      \" |-- PRI_met_sumet: double (nullable = true)\\n\",\n      \" |-- PRI_jet_num: integer (nullable = true)\\n\",\n      \" |-- PRI_jet_leading_pt: double (nullable = true)\\n\",\n      \" |-- PRI_jet_leading_eta: double (nullable = true)\\n\",\n      \" |-- PRI_jet_leading_phi: double (nullable = true)\\n\",\n      \" |-- PRI_jet_subleading_pt: double (nullable = true)\\n\",\n      \" |-- PRI_jet_subleading_eta: double (nullable = true)\\n\",\n      \" |-- PRI_jet_subleading_phi: double (nullable = true)\\n\",\n      \" |-- PRI_jet_all_pt: double (nullable = true)\\n\",\n      \" |-- Weight: double (nullable = true)\\n\",\n      \" |-- Label: string (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Double-check the inferred schema, and fetch a row to show what the dataset looks like.\\n\",\n    \"raw_dataset.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Feature processing\\n\",\n    \"\\n\",\n    \"Next, we will take all the columns in the CSV 
except the *EventId*, *Weight*, and *Label* column since they are not relevant features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Record the starting time of the data preprocessing.\\n\",\n    \"time_start = time.time()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[Row(features=DenseVector([138.47, 51.655, 97.827, 27.98, 0.91, 124.711, 2.666, 3.064, 41.928, 197.76, 1.582, 1.396, 0.2, 32.638, 1.017, 0.381, 51.626, 2.273, -2.414, 16.824, -0.277, 258.733, 2.0, 67.435, 2.15, 0.444, 46.062, 1.24, -2.475, 113.497]))]\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# First, we would like to extract the desired features from the raw dataset.\\n\",\n    \"# We do this by constructing a list with all desired columns.\\n\",\n    \"features = raw_dataset.columns\\n\",\n    \"features.remove('EventId')\\n\",\n    \"features.remove('Weight')\\n\",\n    \"features.remove('Label')\\n\",\n    \"\\n\",\n    \"# Next, we use Spark's VectorAssembler to \\\"assemble\\\" (create) a vector of all desired features.\\n\",\n    \"# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\\n\",\n    \"vector_assembler = VectorAssembler(inputCols=features, outputCol=\\\"features\\\")\\n\",\n    \"\\n\",\n    \"# This transformer will take all columns specified in features, and create an additional column \\\"features\\\" which will contain all the desired features aggregated into a single vector.\\n\",\n    \"dataset = vector_assembler.transform(raw_dataset)\\n\",\n    \"\\n\",\n    \"# Show what happened after applying the vector assembler.\\n\",\n    \"# Note: \\\"features\\\" 
column got appended to the end.\\n\",\n    \"dataset.select(\\\"features\\\").take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Feature normalization\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Apply feature normalization with standard scaling using Spark's [StandardScaler](http://spark.apache.org/docs/latest/ml-features.html#standardscaler). This will transform every feature to have mean 0 and standard deviation 1.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"standard_scaler = StandardScaler(inputCol=\\\"features\\\", outputCol=\\\"features_normalized\\\", withStd=True, withMean=True)\\n\",\n    \"standard_scaler_model = standard_scaler.fit(dataset)\\n\",\n    \"dataset = standard_scaler_model.transform(dataset)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Label transformation\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The dataset is divided into 2 classes, i.e., Signal (s) and Background (b). To make our lives easier later on, we need to map every label to a one-hot encoded vector (of course, this is a design decision). We achieve this by applying a [StringIndexer](http://spark.apache.org/docs/latest/ml-features.html#stringindexer). 
Again, a StringIndexer is an internal feature transformer provided by Apache Spark.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[Row(Label=u's', label_index=1.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0)]\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"label_indexer = StringIndexer(inputCol=\\\"Label\\\", outputCol=\\\"label_index\\\").fit(dataset)\\n\",\n    \"dataset = label_indexer.transform(dataset)\\n\",\n    \"\\n\",\n    \"# Show the result of the label transformation.\\n\",\n    \"dataset.select(\\\"Label\\\", \\\"label_index\\\").take(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"DataFrame[features_normalized: vector, label_index: double, label: vector]\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# We observe that Keras is not able to work with these indexes.\\n\",\n    \"# What it actually expects is a vector with an identical size to the output layer.\\n\",\n    \"# Our framework provides functionality to do this with ease. What it basically does,\\n\",\n    \"# given an expected vector dimension, it prepares zero vector with the specified dimensionality,\\n\",\n    \"# and will set the neuron with a specific label index to one.\\n\",\n    \"\\n\",\n    \"# For example:\\n\",\n    \"# 1. Assume we have a label index: 3\\n\",\n    \"# 2. 
Output dimensionality: 5\\n\",\n    \"# With these parameters, we obtain the following vector in the DataFrame column: [0,0,0,1,0]\\n\",\n    \"\\n\",\n    \"# First, we fetch the columns of interest.\\n\",\n    \"dataset = dataset.select(\\\"features_normalized\\\", \\\"label_index\\\")\\n\",\n    \"\\n\",\n    \"# Number of classes (signal and background).\\n\",\n    \"nb_classes = 2\\n\",\n    \"\\n\",\n    \"# Construct a one-hot encoded vector using the provided index.\\n\",\n    \"transformer = OneHotTransformer(output_dim=nb_classes, input_col=\\\"label_index\\\", output_col=\\\"label\\\")\\n\",\n    \"dataset = transformer.transform(dataset)\\n\",\n    \"\\n\",\n    \"# Only select the columns we need (less data shuffling) while training.\\n\",\n    \"dataset = dataset.select(\\\"features_normalized\\\", \\\"label_index\\\", \\\"label\\\")\\n\",\n    \"dataset.cache()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Dataset randomization\\n\",\n    \"\\n\",\n    \"We shuffle the complete dataset in order to be able to draw stochastic samples from the dataframe.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Randomize the dataset.\\n\",\n    \"dataset = shuffle(dataset)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Dataset saving\\n\",\n    \"\\n\",\n    \"Finally, we save the shuffled and processed dataset to disk for later use.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Store the preprocessed dataset as a Parquet file.\\n\",\n    \"dataset.write.save(\\\"data/processed.parquet\\\", format=\\\"parquet\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   
\"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Total time: 694.612912178 seconds.\\n\",\n      \"Total time: 11.5768818696 minutes.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"time_end = time.time()\\n\",\n    \"dt = time_end - time_start\\n\",\n    \"print(\\\"Total time: \\\" + str(dt) + \\\" seconds.\\\")\\n\",\n    \"print(\\\"Total time: \\\" + str(dt / 60) + \\\" minutes.\\\")\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [conda root]\",\n   \"language\": \"python\",\n   \"name\": \"conda-root-py\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "examples/example_1_analysis.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Model Development and Evaluation\\n\",\n    \"\\n\",\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Department of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This notebook is dedicated to the development and evaluation of a Keras model based on a large [preprocessed dataset](https://github.com/JoeriHermans/dist-keras/blob/master/examples/data_preprocessing.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline  \\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"\\n\",\n    \"from keras.models import Sequential\\n\",\n    \"from keras.layers.core import Dense, Dropout, Activation\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import StandardScaler\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import StringIndexer\\n\",\n    \"from pyspark.ml.evaluation import MulticlassClassificationEvaluator\\n\",\n    \"\\n\",\n    \"from distkeras.transformers import LabelIndexTransformer\\n\",\n    \"from distkeras.predictors import ModelPredictor\\n\",\n    \"from distkeras.trainers import SingleTrainer\\n\",\n    \"from distkeras.trainers import AEASGD\\n\",\n    \"from distkeras.trainers import DOWNPOUR\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Spark Configuration and Preparation\\n\",\n    \"\\n\",\n    \"Edit the variables in the cell below. 
If you are running Spark in local mode, please set the `local` flag to true and adjust the resources you wish to use on your local machine. The same goes for the case when you are running Spark 2.0 and higher.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Deep Learning: Analysis\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_cores = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 8\\n\",\n    \"    num_cores = 2\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of desired executors: 8\\n\",\n      \"Number of desired cores / executor: 2\\n\",\n      \"Total number of workers: 16\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\\n\",\n    \"num_workers = num_executors * num_cores\\n\",\n    \"\\n\",\n    \"print(\\\"Number of desired executors: \\\" + `num_executors`)\\n\",\n    \"print(\\\"Number of desired cores / executor: \\\" + `num_cores`)\\n\",\n    \"print(\\\"Total number of workers: \\\" + `num_workers`)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    
\"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_cores`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\",\\\"2g\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\");\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Data Preparation\\n\",\n    \"\\n\",\n    \"After the Spark Context (or Spark Session if you are using Spark 2.0) has been set up, we can start reading the preprocessed dataset from storage.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"# Read the dataset.\\n\",\n    \"raw_dataset = reader.read.parquet(\\\"data/processed.parquet\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- features_normalized: vector (nullable = true)\\n\",\n      \" |-- label_index: 
double (nullable = true)\\n\",\n      \" |-- label: vector (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Check the schema.\\n\",\n    \"raw_dataset.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After reading the dataset from storage, we will extract several metrics such as `nb_features`, which basically is the number of input neurons, and `nb_classes`, which is the number of classes (signal and background).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of features: 30\\n\",\n      \"Number of classes: 2\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"nb_features = len(raw_dataset.select(\\\"features_normalized\\\").take(1)[0][\\\"features_normalized\\\"])\\n\",\n    \"nb_classes = len(raw_dataset.select(\\\"label\\\").take(1)[0][\\\"label\\\"])\\n\",\n    \"\\n\",\n    \"print(\\\"Number of features: \\\" + str(nb_features))\\n\",\n    \"print(\\\"Number of classes: \\\" + str(nb_classes))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Finally, we split up the dataset for training and testing purposes, and fetch some additional statistics on the number of training and testing instances.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"DataFrame[features_normalized: vector, label_index: double, label: vector]\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# Finally, we create a trainingset and a testset.\\n\",\n    \"(training_set, 
test_set) = raw_dataset.randomSplit([0.7, 0.3])\\n\",\n    \"training_set.cache()\\n\",\n    \"test_set.cache()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of testset instances: 6377863\\n\",\n      \"Number of trainingset instances: 14872137\\n\",\n      \"Total number of instances: 21250000\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Distribute the training and test set to the workers.\\n\",\n    \"test_set = test_set.repartition(num_workers)\\n\",\n    \"training_set = training_set.repartition(num_workers)\\n\",\n    \"\\n\",\n    \"num_test_set = test_set.count()\\n\",\n    \"num_training_set = training_set.count()\\n\",\n    \"\\n\",\n    \"print(\\\"Number of testset instances: \\\" + str(num_test_set))\\n\",\n    \"print(\\\"Number of trainingset instances: \\\" + str(num_training_set))\\n\",\n    \"print(\\\"Total number of instances: \\\" + str(num_test_set + num_training_set))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Model construction\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"model = Sequential()\\n\",\n    \"model.add(Dense(500, input_shape=(nb_features,)))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dropout(0.4))\\n\",\n    \"model.add(Dense(500))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dropout(0.6))\\n\",\n    \"model.add(Dense(500))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dense(nb_classes))\\n\",\n    \"model.add(Activation('softmax'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   
\"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"____________________________________________________________________________________________________\\n\",\n      \"Layer (type)                     Output Shape          Param #     Connected to                     \\n\",\n      \"====================================================================================================\\n\",\n      \"dense_1 (Dense)                  (None, 500)           15500       dense_input_1[0][0]              \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_1 (Activation)        (None, 500)           0           dense_1[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dropout_1 (Dropout)              (None, 500)           0           activation_1[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_2 (Dense)                  (None, 500)           250500      dropout_1[0][0]                  \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_2 (Activation)        (None, 500)           0           dense_2[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dropout_2 (Dropout)              (None, 500)           0           activation_2[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_3 (Dense)                  (None, 500)           250500      dropout_2[0][0]                  \\n\",\n      
\"____________________________________________________________________________________________________\\n\",\n      \"activation_3 (Activation)        (None, 500)           0           dense_3[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_4 (Dense)                  (None, 2)             1002        activation_3[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_4 (Activation)        (None, 2)             0           dense_4[0][0]                    \\n\",\n      \"====================================================================================================\\n\",\n      \"Total params: 517502\\n\",\n      \"____________________________________________________________________________________________________\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Summarize the model.\\n\",\n    \"model.summary()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer = 'adagrad'\\n\",\n    \"loss = 'categorical_crossentropy'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Model evaluation\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def evaluate(model):\\n\",\n    \"    global test_set\\n\",\n    \"\\n\",\n    \"    metric_name = \\\"f1\\\"\\n\",\n    \"    evaluator = MulticlassClassificationEvaluator(metricName=metric_name, predictionCol=\\\"prediction_index\\\", labelCol=\\\"label_index\\\")\\n\",\n    \"    # Clear the prediction column from the testset.\\n\",\n    \"    test_set = test_set.select(\\\"features_normalized\\\", 
\\\"label\\\", \\\"label_index\\\")\\n\",\n    \"    # Apply a prediction from the specified model.\\n\",\n    \"    predictor = ModelPredictor(keras_model=model, features_col=\\\"features_normalized\\\")\\n\",\n    \"    test_set = predictor.predict(test_set)\\n\",\n    \"    # Transform the prediction vector to an indexed label.\\n\",\n    \"    index_transformer = LabelIndexTransformer(output_dim=nb_classes)\\n\",\n    \"    test_set = index_transformer.transform(test_set)\\n\",\n    \"    # Compute the F1 score on the test set.\\n\",\n    \"    score = evaluator.evaluate(test_set)\\n\",\n    \"    \\n\",\n    \"    return score\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"results = {}\\n\",\n    \"time_spent = {}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Model training and evaluation\\n\",\n    \"\\n\",\n    \"In the next sections we train and evaluate the models trained by different (distributed) optimizers.\\n\",\n    \"\\n\",\n    \"### Single Trainer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = SingleTrainer(keras_model=model, loss=loss, worker_optimizer=optimizer, \\n\",\n    \"                        features_col=\\\"features_normalized\\\", num_epoch=1, batch_size=64)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Time spent (SingleTrainer): 5927.329083919525 seconds.\\n\",\n      \"F1 (SingleTrainer): 0.839630118149035\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    
\"# Fetch the training time.\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"print(\\\"Time spent (SingleTrainer): \\\" + `dt` + \\\" seconds.\\\")\\n\",\n    \"\\n\",\n    \"# Evaluate the model.\\n\",\n    \"score = evaluate(trained_model)\\n\",\n    \"print(\\\"F1 (SingleTrainer): \\\" + `score`)\\n\",\n    \"\\n\",\n    \"# Store the training metrics.\\n\",\n    \"results['single'] = score\\n\",\n    \"time_spent['single'] = dt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Asynchronous EASGD\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = AEASGD(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers, batch_size=64,\\n\",\n    \"                 features_col=\\\"features_normalized\\\", num_epoch=1, communication_window=32, \\n\",\n    \"                 rho=5.0, learning_rate=0.1)\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Time spent (AEASGD): 903.8733949661255 seconds.\\n\",\n      \"F1 (AEASGD): 0.8326362659335457\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Fetch the training time.\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"print(\\\"Time spent (AEASGD): \\\" + `dt` + \\\" seconds.\\\")\\n\",\n    \"\\n\",\n    \"# Evaluate the model.\\n\",\n    \"score = evaluate(trained_model)\\n\",\n    \"print(\\\"F1 (AEASGD): \\\" + `score`)\\n\",\n    \"\\n\",\n    \"# Store the training metrics.\\n\",\n    \"results['aeasgd'] = score\\n\",\n    \"time_spent['aeasgd'] = dt\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### DOWNPOUR\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = DOWNPOUR(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers,\\n\",\n    \"                   batch_size=64, communication_window=5, learning_rate=0.1, num_epoch=1,\\n\",\n    \"                   features_col=\\\"features_normalized\\\")\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Time spent (DOWNPOUR): 774.4893491268158 seconds.\\n\",\n      \"F1 (DOWNPOUR): 0.8345395134754954\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Fetch the training time.\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"print(\\\"Time spent (DOWNPOUR): \\\" + `dt` + \\\" seconds.\\\")\\n\",\n    \"\\n\",\n    \"# Evaluate the model.\\n\",\n    \"score = evaluate(trained_model)\\n\",\n    \"print(\\\"F1 (DOWNPOUR): \\\" + `score`)\\n\",\n    \"\\n\",\n    \"# Store the training metrics.\\n\",\n    \"results['downpour'] = score\\n\",\n    \"time_spent['downpour'] = dt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Results\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can see from the plots below, the distributed optimizers finish a single epoch ~7 times however. However, for this, the distributed optimizers use 16 times the amount of resources. However, a not very descriptive measure since some of jobs are scheduled on the same machines, some machines have a higher load etc. 
Nevertheless, the statistical performance of the optimizers is within 1% error. Which means that the classifiers would have near-identical performance. Furthermore, it is our guess that the statistical performance of the distributed optimizers can be improved by adding adaptive learning rates.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"image/png\": \"iVBORw0KGgoAAAANSUhEUgAAAiMAAAGSCAYAAAAxVMH8AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3X2YH2V97/H3RzAgKAmIJKWCoijG+lASDOADtMaCD9Wi\\neFqWUh+o5WgBMZWq7bFKwdMqVoIoniJaHyqsUtRiBYlAFQEjqQRFa6BFoQExwUgIEeQx3/PHzOov\\nPzfJ7mY3s2Tfr+vai/3d852Ze5LJ8tl77plJVSFJktSVR3TdAUmSNLUZRiRJUqcMI5IkqVOGEUmS\\n1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJP1Skp90uO/fSPLJEda+Nsnfj2Lbz04y\\nv+fzHyR50hi6KWkCGEYk9dpi74dIst7Pn6r6SVW9bhSbGE1ffxt4Uc/nw4CnjGJ9kmQ09ZJGzjAi\\naaOSPCfJfyT5bpIz03ju0ChGkgVJvtN+/ztJzmq/f2mSxUmWJvlI2/aEJNcm+Rzwn337eUKSxe33\\nv5vkunbdKzbQtb2TXJHk+iTH9Gznr5MsSfKdJH/ahp6Tgde123sb8Argw+3nHdtj/EaSbyf5XJLt\\n222tSPKRJNcBjx+/P1VJvQwjkjbln4A/q6pnA48FBoBvA3Pa5c8F7kvyaOB5wJVJHgu8BTi4quYA\\n65K8qq1/OvDuqpo9zL6GRjsWAMe36750A/3aD3gx8BzgrUlmJTkUeFxVzWvb/wx4HPAu4BNVNaeq\\nTgUuAI5tt38/8H7g5VW1H/AfwJ+3+9gN+EJVPauqbhnxn5ikUdm26w5ImrySTAdSVd9pm84BXlxV\\n5yZZk2Q3mv9hXwAcSBNGPtt+/yzgW+3lje2Bm4FrgGVVdf0mdv1N4P1JPgWcB6wdpuaiqrq77ecl\\nwP7AC4DfT3IwEGAn4MnDHVrP9/u0ff1a29dHApe2y+6qqkv7V5Y0vgwjkjZlQ3MlFgN/AtwIXEkz\\nJ+NJVfXDJE8H/rWq3rjehpInAPdsaodV9d4kF9FcTrk6ydyqWt1f1vd5Xfvfv6mqz/btd2PzQwIs\\nqaoXD7Nsk32VtPm8TCOp13rBo6rWAA8meWbbNEATPACuorkUcxXNpY2j+NU8kG8B85P8JkCSXYa+\\n79/HsJ1I9qqq66rqPcBNwB7DlL2kne/xGOCFbR8uA96QZLt2O09tv19LM0oypPfz9cBeSZ7RrrND\\nkqHRFCetSluAYURSr12TLE9yS/vfFwNHA/+U5LvAnTSXYaC5lLI7cGVV/QK4vW2jqn4KHAtc0K63\\niGbuBozsLpi/SPL9dmLs9VV13TA11wAX04SQ06pqRVV9ZaitnXT6EZqfc18D5iW5pr2E81ngb5Ms\\nBaYBRwL/2O7vm/zq0s56fU1yYZJZI+i/pFFI1Ra7k0+SJOnXODIiSZI6ZRiR
JEmdMoxIkqROGUYk\\nSVKnDCOSJKlThhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOG\\nEUmS1CnDiCRJ6pRhRJIkdcowIkmSOtV5GElyU5J1w3x9qKfm5CS3JbknySVJ9u7bxnZJzkyyKsna\\nJOcn2a2vZuck5yRZk2R1ko8l2XFLHackSRpe52EE2A+Y1fP1e0AB5wEkeTtwHHAMMA+4G1iUZFrP\\nNk4HXgYcDhwE7A58vm8/5wKzgflt7UHAWRNyRJIkacRSVV33YT1JTgdeWlVPbT/fBry/qha2n3cC\\nVgKvrarz2s8/BY6oqi+2NfsAy4ADqmpJktnAfwJzq+ratuZQ4ELg8VW1YssepSRJGjIZRkZ+Kckj\\ngT8GPt5+3otmtOSyoZqqugu4GjiwbdoP2Lav5gZgeU/NAcDqoSDSupRmBGb/iTgWSZI0MpMqjACv\\nBKYDn2o/z6IJDCv76la2ywBmAve3IWVDNbOA23sXVtVDwB09NZIkqQPbdt2BPkcDX5ksl02SPBY4\\nFLgZuLfb3kiS9LCyPfBEYFFV/WxjhZMmjCTZE3gRcFhP8wogNKMfvaMjM4Fre2qmJdmpb3RkZrts\\nqKb/7pptgF16aoZzKHDO6I5EkiT1+GOam0g2aNKEEZpRkZXARUMNVXVTkhU0d8BcB7+cwLo/cGZb\\ndg3wYFvTO4F1T2BxW7MYmJFk3555I/Npgs7VG+nTzQCf+cxnmD179mYe3tSyYMECFi5c2HU3NAV4\\nrmlL8VwbnWXLlnHUUUdB+//SjZkUYSRJgNcBn6yqdX2LTwfemeRGmgM6BbgVuACaCa1JPg6clmQ1\\nsBY4A7iqqpa0NdcnWQScneRNwDTgQ8DgJi4J3Qswe/Zs5syZMy7HOlVMnz7dPzNtEZ5r2lI818Zs\\nk9McJkUYobk8swfwif4FVXVqkh1ongkyA7gCeElV3d9TtgB4CDgf2A64GDi2b1NHAh+muYtmXVt7\\nwvgehiRJGq1JEUaq6hJgm40sPwk4aSPL7wOOb782VHMncNSYOylJkibEZLu1V5IkTTGGEU2IgYGB\\nrrugKcJzTVuK59rEMYxoQviPVluK55q2FM+1iWMYkSRJnTKMSJKkThlGJElSpwwjkiSpU4YRSZLU\\nKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQpw4gkSeqUYUSS\\nJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUY\\nkSRJnTKMSJKkThlGJElSpwwjkiSpU4YRSZLUKcOIJEnq1KQII0l2T/LPSVYluSfJd5PM6as5Oclt\\n7fJLkuzdt3y7JGe221ib5Pwku/XV7JzknCRrkqxO8rEkO26JY5QkScPrPIwkmQFcBdwHHArMBt4K\\nrO6peTtwHHAMMA+4G1iUZFrPpk4HXgYcDhwE7A58vm9357bbn9/WHgScNe4HJUmSRmzbrjsAvANY\\nXlVv6Gn7n76aE4BTqurLAEleA6wEDgPOS7ITcDRwRFVd3ta8HliWZF5VLUkymybszK2qa9ua44EL\\nk5xYVSsm8BglSdIGdD4yArwc+HaS85KsTLI0yS+DSZK9gFnAZUNtVXUXcDVwYNu0H02w6q25AVje\\nU3MAsHooiLQuBQrYf9yPSpIkjchkCCNPAt4E3AAcAvw/4Iwkf9Iun0UTGFb2rbeyXQYwE7i/DSkb\\nqpkF3N67sKoeAu7oqZEkSVvYZLhM8whgSVX9Tfv5u0meAbwR+OfuuiVJkraEyRBGfgIs62tbBryq\\n/X4FEJrRj97RkZnAtT0105Ls1Dc6MrNdNlTTf3fNNsAuPTXDWrBgAdOnT1+vbWBggIGBgY2tJknS\\nlDA4OM
jg4OB6bWvWrBnx+pMhjFwF7NPXtg/tJNaquinJCpo7YK4DaCes7g+c2dZfAzzY1nyxrdkH\\n2BNY3NYsBmYk2bdn3sh8mqBz9cY6uHDhQubMmbOxEkmSpqzhfkFfunQpc+fOHdH6kyGMLASuSvJX\\nwHk0IeMNwJ/11JwOvDPJjcDNwCnArcAF0ExoTfJx4LQkq4G1wBnAVVW1pK25Pski4OwkbwKmAR8C\\nBr2TRluD5cuXs2rVqq67oS1k1113Zc899+y6G9K46DyMVNW3k7wSeC/wN8BNwAlV9dmemlOT7EDz\\nTJAZwBXAS6rq/p5NLQAeAs4HtgMuBo7t292RwIdp7qJZ19aeMBHHJW1Jy5cvZ599ZnPvvfd03RVt\\nIdtvvwM33LDMQKKtQudhBKCqLgIu2kTNScBJG1l+H3B8+7WhmjuBo8bUSWkSW7VqVRtEPkPzXD9t\\n3ZZx771HsWrVKsOItgqTIoxIGi+zAec3SXp4mQzPGZEkSVOYYUSSJHXKMCJJkjplGJEkSZ0yjEiS\\npE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJElSpwwj\\nkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQp\\nw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6lTnYSTJ\\nu5Os6/v6QV/NyUluS3JPkkuS7N23fLskZyZZlWRtkvOT7NZXs3OSc5KsSbI6yceS7LgljlGSJG1Y\\n52Gk9X1gJjCr/Xr+0IIkbweOA44B5gF3A4uSTOtZ/3TgZcDhwEHA7sDn+/ZxLjAbmN/WHgScNQHH\\nIkmSRmHbrjvQerCqfrqBZScAp1TVlwGSvAZYCRwGnJdkJ+Bo4IiquryteT2wLMm8qlqSZDZwKDC3\\nqq5ta44HLkxyYlWtmNCjkyRJGzRZRkaekuTHSX6Y5DNJ9gBIshfNSMllQ4VVdRdwNXBg27QfTajq\\nrbkBWN5TcwCweiiItC4FCth/Yg5JkiSNxGQII98CXkczcvFGYC/gG+18jlk0gWFl3zor22XQXN65\\nvw0pG6qZBdzeu7CqHgLu6KmRJEkd6PwyTVUt6vn4/SRLgP8B/hC4vpterW/BggVMnz59vbaBgQEG\\nBgY66pEkSZPH4OAgg4OD67WtWbNmxOt3Hkb6VdWaJP8F7A18HQjN6Efv6MhMYOiSywpgWpKd+kZH\\nZrbLhmr6767ZBtilp2aDFi5cyJw5c0Z/MJIkTQHD/YK+dOlS5s6dO6L1J8NlmvUkeTRNELmtqm6i\\nCQvze5bvRDPP45tt0zXAg301+wB7AovbpsXAjCT79uxqPk3QuXpijkSSJI1E5yMjSd4P/BvNpZnf\\nBP4WeAD4bFtyOvDOJDcCNwOnALcCF0AzoTXJx4HTkqwG1gJnAFdV1ZK25voki4Czk7wJmAZ8CBj0\\nThpJkrrVeRgBHk/zDJDHAj8FrgQOqKqfAVTVqUl2oHkmyAzgCuAlVXV/zzYWAA8B5wPbARcDx/bt\\n50jgwzR30axra0+YoGOSJEkj1HkYqapNzgKtqpOAkzay/D7g+PZrQzV3AkeNvoeSJGkiTbo5I5Ik\\naWoxjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlG\\nJElSpwwjkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlT\\nhhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ\\n6pRhRJIkdcowIkmSOmUYkSRJnZp0YST
JO5KsS3JaX/vJSW5Lck+SS5Ls3bd8uyRnJlmVZG2S85Ps\\n1lezc5JzkqxJsjrJx5LsuCWOS5IkDW9MYSTJo5Ls0PP5CUnekuSQzelMkucAxwDf7Wt/O3Bcu2we\\ncDewKMm0nrLTgZcBhwMHAbsDn+/bxbnAbGB+W3sQcNbm9FmSJG2esY6MXAC8BiDJDOBq4K3ABUne\\nNJYNJnk08BngDcCdfYtPAE6pqi9X1ffbfe8OHNauuxNwNLCgqi6vqmuB1wPPSzKvrZkNHAr8aVV9\\nu6q+CRwPHJFk1lj6LEmSNt9Yw8gc4Ir2+1cDK4En0ISEN49xm2cC/1ZV/97bmGQvYBZw2VBbVd1F\\nE4AObJv2A7btq7kBWN5TcwCwug0qQy4FCth/jH2WJEmbadsxrrcDsLb9/hDgC1W1Lsm3aELJqCQ5\\nAvhtmlDRbxZNYFjZ176yXQYwE7i/DSkbqpkF3N67sKoeSnJHT40kSdrCxhpGbgQOS/JFmksfC9v2\\n3YD+QLBRSR5PM9/jRVX1wBj7M6EWLFjA9OnT12sbGBhgYGCgox5JkjR5DA4OMjg4uF7bmjVrRrz+\\nWMPIyTSTQRcCl1XV4rb9EODaDa41vLnA44ClSdK2bQMclOQ44GlAaEY/ekdHZvbsawUwLclOfaMj\\nM9tlQzX9d9dsA+zSUzOshQsXMmfOnFEeliRJU8Nwv6AvXbqUuXPnjmj9Mc0ZqarzgT1pLqu8uGfR\\nZcCCUW7uUuCZNJdpnt1+fZtmMuuzq+pHNGFh/tAK7YTV/YFvtk3XAA/21ezT9nEoKC0GZiTZt2ff\\n82mCztWj7LMkSRonYx0ZoapW0DeiUFVLxrCdu4Ef9LYluRv4WVUta5tOB96Z5EbgZuAU4Faau3qo\\nqruSfBw4LclqmvksZwBXDfWpqq5Psgg4u73jZxrwIWCwPRZJktSBEYeRJF8YaW1VvWps3fnVJvq2\\nd2r7XJOzgBk0d/K8pKru7ylbADwEnA9sB1wMHNu33SOBD9OMxqxra0/YzL5KkqTNMJqRkd6ZKAFe\\n2bZ9u22bSxMURhxaNqSqXjhM20nASRtZ5z6a54Ycv5GaO4GjNrd/kiRp/Iw4jFTV64e+T/I+4Dzg\\njVX1UNu2DfARRnk3jSRJmtrG+tCzo4F/GAoi0DyzAzitXSZJkjQiYw0j29LcctvvaZuxTUmSNAWN\\n9W6aTwAfT/JkYOgOmv2Bd7TLJEmSRmSsYeREmtt63wr8Rtv2E+D9wAfGoV+SJGmKGFMYqap1wKnA\\nqe0DyBjmvTCSJEmbNOaHng0xhEiSpM0xpsmmSWYm+ecktyV5MMlDvV/j3UlJkrT1GuvIyCdp3vty\\nCs1ckdpotSRJ0gaMNYw8H3hBVX1nPDsjSZKmnrE+E+QWmkfCS5IkbZaxhpG3AO9N8sTx64okSZqK\\nxnqZ5nPADsAPk9wDPNC7sKp22dyOSZKkqWGsYeQt49oLSZI0ZY31oWefGu+OSJKkqWnMDz1Lsg1w\\nGDC7bfpP4Eu9b/KVJEnalDGFkSR7AxcBvwnc0Db/FXBLkpdV1Q/HqX+SJGkrN9a7ac4AfgjsUVVz\\nqmoOzUPQbmqXSZIkjchYL9McDBxQVXcMNVTVz5K8A7hqXHomSZKmhLGOjNwHPGaY9kcD94+9O5Ik\\naaoZaxj5MvDRJPvnVw4A/hH40vh1T5Ikbe3GGkbeTDNnZDFwb/t1FXAjcML4dE2SJE0FY33OyJ3A\\nH7R31Qzd2rusqm4ct55JkqQpYczPGQFow4cBRJIkjdmYLtMk+XySvxym/W1J/mXzuyVJkqaKsc4Z\\nOYjmoWf9vtIukyRJGpGxhpFHAw8O0/4AsNPYuyNJkqaasYaR7wF/NEz7EcAPxt4dSZI01Yx1Ausp\\nwBeSPBn497ZtPjAA/K/x6JgkSZoaxnpr778lOQz4a+DVwC+A64AXVdXl
49g/SZK0lRvzrb1VdSFw\\n4Tj2RZIkTUFjnTNCkhlJ3pDk75Ls0rbNSfKb49c9SZK0tRvTyEiSZwGXAmuAJwIfA+4AXgXsCbxm\\nnPonSZK2cmMdGTkN+GRVPYXmvTRDLmKUzxlJ8sYk302ypv36ZpIX99WcnOS2JPckuaR9DH3v8u2S\\nnJlkVZK1Sc5Psltfzc5Jzmn3sTrJx5LsOLrDliRJ422sYeQ5wFnDtP8YmDXKbd0CvB2YA8yluTvn\\ngiSzAZK8HTgOOAaYB9wNLEoyrWcbpwMvAw6nCUO7A5/v28+5NO/Rmd/WHrSBY5AkSVvQWCew3sfw\\nDzd7KvDT0WyonQjb651J3gQcACyjeQvwKVX1ZYAkrwFWAocB5yXZCTgaOGLoTp4krweWJZlXVUva\\nYHMoMLeqrm1rjgcuTHJiVa0YTZ8lSdL4GevIyJeAdyV5ZPu5kuwJvI9fH5EYsSSPSHIEsAPwzSR7\\n0Yy0XDZUU1V3AVcDB7ZN+9GEqt6aG4DlPTUHAKuHgkjrUqCA/cfaX0mStPnGGkbeSvNI+NuBRwGX\\nAz8Efg78n9FuLMkzkqylGXH5CPDKNlDMogkMK/tWWcmvLgfNBO5vQ8qGama1ff2lqnqIZtLtaC8r\\nSZKkcTTWh56tAX4vyfOBZ9EEk2uq6rKNr7lB1wPPBqbTPETt00l84Z4kSVPAqMJIkgOBxw7N36iq\\nK9tHwr8N2CHJvwLHV9V9o9luVT0I/Kj9eG2SeTRzRU4FQjP60Ts6MhMYuuSyApiWZKe+0ZGZ7bKh\\nmv67a7YBdump2aAFCxYwffr09doGBgYYGBjY9MFJkrSVGxwcZHBwcL22NWvWjHj90Y6MvAv4OjA0\\nmfSZwNnAp2gmm/4lcBtw0ii32+8RwHZVdVOSFTR3wFzX7nMnmnkeZ7a119C8QXg+8MW2Zh+a550s\\nbmsWAzOS7Nszb2Q+TdC5elOdWbhwIXPmzNnMQ5Ikaes03C/oS5cuZe7cuSNaf7Rh5LeBv+n5fASw\\npKr+DCDJLcDfMoowkuTvgK/QTDh9DPDHwMHAIW3J6TR32NwI3Ezzkr5bgQugmdCa5OPAaUlWA2uB\\nM4CrqmpJW3N9kkXA2e2dOtOADwGD3kkjSVK3RhtGdmb9yyUH0wSJIf8B7DHKbe5GM7LyGzRPdL0O\\nOKSq/h2gqk5NsgPNM0FmAFcAL6mq+3u2sQB4CDgf2A64GDi2bz9HAh+muYtmXVt7wij7KkmSxtlo\\nw8hKYC/glvahY3OAd/csfwzwwGg2WFVvGEHNSWxktKWdo3J8+7WhmjuBo0bTN0mSNPFGe2vvRcB7\\nk7wA+HvgHpqRiiHPornFV5IkaURGOzLyN8AXaJ4r8nPgtX2XS44GvjpOfZMkSVPAqMJIVa0CDkoy\\nHfh5++CwXv+LJqRIkiSNyOY89Gy49js2rzuSJGmqGevj4CVJksaFYUSSJHXKMCJJkjplGJEkSZ0y\\njEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJElS\\npwwjkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJ\\nktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE51HkaS/FWSJUnuSrIyyReTPHWYupOT3Jbk\\nniSXJNm7b/l2Sc5MsirJ2iTnJ9mtr2bnJOckWZNkdZKPJdlxoo9RkiRtWOdhBHgB8CFgf+BFwCOB\\nryZ51FBBkrcDxwHHAPOAu4FFSab1bOd04GXA4cBBwO7A5/v2dS4wG5jf1h4EnDX+hyRJkkZq2647\\nUFUv7f2c5HXA7cBc4Mq2+QTglKr6clvzGmAlcBhwXpKdgKOBI6rq8rbm9cCyJPOqakmS2cChwNyq\\nur
atOR64MMmJVbVigg9VkiQNYzKMjPSbARRwB0CSvYBZwGVDBVV1F3A1cGDbtB9NsOqtuQFY3lNz\\nALB6KIi0Lm33tf9EHIgkSdq0SRVGkoTmcsuVVfWDtnkWTWBY2Ve+sl0GMBO4vw0pG6qZRTPi8ktV\\n9RBN6JmFJEnqROeXafp8BHg68LyuOyJJkraMSRNGknwYeCnwgqr6Sc+iFUBoRj96R0dmAtf21ExL\\nslPf6MjMdtlQTf/dNdsAu/TUDGvBggVMnz59vbaBgQEGBgZGcGSSJG3dBgcHGRwcXK9tzZo1I15/\\nUoSRNoj8AXBwVS3vXVZVNyVZQXMHzHVt/U408zzObMuuAR5sa77Y1uwD7AksbmsWAzOS7Nszb2Q+\\nTdC5emP9W7hwIXPmzNmsY5QkaWs13C/oS5cuZe7cuSNav/MwkuQjwADwCuDuJDPbRWuq6t72+9OB\\ndya5EbgZOAW4FbgAmgmtST4OnJZkNbAWOAO4qqqWtDXXJ1kEnJ3kTcA0mluKB72TRpKk7nQeRoA3\\n0kxQ/Xpf++uBTwNU1alJdqB5JsgM4ArgJVV1f0/9AuAh4HxgO+Bi4Ni+bR4JfJjmLpp1be0J43gs\\nkiRplDoPI1U1ojt6quok4KSNLL8POL792lDNncBRo+uhJEmaSJPq1l5JkjT1GEYkSVKnDCOSJKlT\\nhhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ\\n6pRhRJIkdcowIkmSOmUYkSRJnTKMSJKkThlGJElSpwwjkiSpU4YRSZLUKcOIJEnqlGFEkiR1yjAi\\nSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEkSZ0y\\njEiSpE4ZRiRJUqcmRRhJ8oIkX0ry4yTrkrximJqTk9yW5J4klyTZu2/5dknOTLIqydok5yfZra9m\\n5yTnJFmTZHWSjyXZcaKPT5IkbdikCCPAjsB3gD8Hqn9hkrcDxwHHAPOAu4FFSab1lJ0OvAw4HDgI\\n2B34fN+mzgVmA/Pb2oOAs8bzQCRJ0uhs23UHAKrqYuBigCQZpuQE4JSq+nJb8xpgJXAYcF6SnYCj\\ngSOq6vK25vXAsiTzqmpJktnAocDcqrq2rTkeuDDJiVW1YmKPUpIkDWeyjIxsUJK9gFnAZUNtVXUX\\ncDVwYNu0H02w6q25AVjeU3MAsHooiLQupRmJ2X+i+i9JkjZu0ocRmiBSNCMhvVa2ywBmAve3IWVD\\nNbOA23sXVtVDwB09NZIkaQubFJdpJrsFCxYwffr09doGBgYYGBjoqEeSJE0eg4ODDA4Orte2Zs2a\\nEa//cAgjK4DQjH70jo7MBK7tqZmWZKe+0ZGZ7bKhmv67a7YBdumpGdbChQuZM2fOmA9AkqSt2XC/\\noC9dupS5c+eOaP1Jf5mmqm6iCQvzh9raCav7A99sm64BHuyr2QfYE1jcNi0GZiTZt2fz82mCztUT\\n1X9JkrRxk2JkpH3Wx940wQDgSUmeDdxRVbfQ3Lb7ziQ3AjcDpwC3AhdAM6E1yceB05KsBtYCZwBX\\nVdWStub6JIuAs5O8CZgGfAgYnMg7aZYvX86qVasmavOaZHbddVf23HPPrrshSQ8rkyKM0NwN8zWa\\niaoFfKBt/xRwdFWdmmQHmmeCzACuAF5SVff3bGMB8BBwPrAdza3Cx/bt50jgwzR30axra0+YiAOC\\nJojss89s7r33nonahSaZ7bffgRtuWGYgkaRRmBRhpH02yEYvGVXVScBJG1l+H3B8+7WhmjuBo8bU\\nyTFYtWpVG0Q+Q/OsNW3dlnHvvUexatUqw4gkjcKkCCNbv9mAE2AlSRrOpJ/AKkmStm6GEUmS1Ckv\\n00iSRsXBSn9iAAAO5UlEQVS7BKe
WLXGXoGFEkjRi3iU49WyJuwQNI5KkEfMuwalmy9wlaBiRJI2B\\ndwlq/DiBVZIkdcowIkmSOmUYkSRJnTKMSJKkThlGJElSpwwjkiSpU4YRSZLUKcOIJEnqlGFEkiR1\\nyjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQpw4gkSeqUYUSSJHXKMCJJkjplGJEk\\nSZ0yjEiSpE4ZRiRJUqcMI5IkqVOGEUmS1CnDiCRJ6pRhRJIkdcowIkmSOjXlwkiSY5PclOQXSb6V\\n5Dld92nrNNh1BzRleK5pS/FcmyhTKowk+SPgA8C7gX2B7wKLkuzaace2Sv6j1ZbiuaYtxXNtokyp\\nMAIsAM6qqk9X1fXAG4F7gKO77ZYkSVPXlAkjSR4JzAUuG2qrqgIuBQ7sql+SJE11UyaMALsC2wAr\\n+9pXArO2fHckSRLAtl13YJLbHmDZsmVjWvlX610EjG0bD1+3Aud03Ykt7CZg7OfL5vBc81zbUjzX\\nPNdGqmed7TdVm+ZKxdavvUxzD3B4VX2pp/2TwPSqeuUw6xzJ1DvzJEkaT39cVedurGDKjIxU1QNJ\\nrgHmA18CSJL28xkbWG0R8MfAzcC9W6CbkiRtLbYHnkjz/9KNmjIjIwBJ/hD4JM1dNEto7q55NfC0\\nqvpph12TJGnKmjIjIwBVdV77TJGTgZnAd4BDDSKSJHVnSo2MSJKkyWcq3dorSZImIcOIHlaSrEvy\\niq77oYmR5BNJvjDO23xCe948azy3q+4l+VqS07ruhzbflJozImnSezOQCdiu16OlScwwImnSqKq1\\nE7TpiQg4UieSbFtVD3bdj/HkZZopKsmhSa5IsjrJqiT/luRJPcsfn+Rz7fKfJfnXJE/oWb5fkq8m\\n+WmSO5N8Pcm+ffs4Kcn/JLk3ya1JTu9ZNivJhUnuSXJjkj9MclOSN/fU7J3kG0l+keT7SV400X8u\\n2jKSvDrJde3f/6r2XHpU/2Wadhj+g0ne156HP0ny7r5t7ZPkyvY8+V6S39nU5bwkz0hyUZK1SVYk\\n+XSSx07kMWvzJNmh/Xtam+THSf6ib/mMdvkdSe5u/3737ll+e5JX9Xz+TpIf93x+fvuzavv287ok\\nf5rkC+32/ivJy3vqD25rXprku+35tzjJb/X16/D259e97c+4/n7/2rna/tx9Tfv90GXGP2x/zt4D\\nHLlZf5iTkGFk6toR+AAwB3gh8BDwRWhSN81DatYAzwOeC6wFLm6XATyG5pktzwX2B/4LuCjJju02\\nXg28BfgzYG/gMOB7Pfv/Z5p3Ah1E86yXNwGPG1rYPpDuizQPm3sOzbNh3ofD7Q97SWYB5wIfA54G\\nHAx8gQ3/PHoN8HNgHvA24F1J5rfbegRwAc35+RzgfwPvZSPnSZLpNC/MvIbm/D8U2A343GYemibW\\nPwAvAF4OHAL8Ds3f35BPtZ9/HziAZjTsoiTbtMu/0a5Dkhk0596jkjy1XX4QsKSqeh9w+S7gs8Az\\naZ5/f067bq9TaZ5ZtR/wU+BLQ/tMMpfmvDoXeAbwbuCUoaAxSn8PLARmM4KHiD3sVJVffkHzIsF1\\nwNNpnjr7g77l04C7gRdtYP1H0ISXl7afF9C8uGKbYWr3afe1b0/bk9u2N7efDwHuA2b21Bza1ryi\\n6z8vvzbrXNuXJvzuMcyyTwBf6Pn8NeDyvpqrgb9rv39xe548rmf5/N7zBHhC+/lZ7ef/A3ylb5uP\\nb2v27vrPx69hz5kdaX4xeVVP287tz6TTaH7hWQfs37N8l3b54e3n44Dr2u9fAXyTJgQf07Z9FTil\\nZ/11wEk9n3do2w5pPx/cfn71MH16dfv5M8DFfcfyPuB7fft5RV/NauA17fdD5+9xXf89TOSXIyNT\\nVHsJ5NwkP0yyhuZtSAXsCTwbeEo7HLo2yVrgZ8B2NKGBJLslObsd
uryTJojs2K4P8C80/3hvSvLR\\nJIf1/IayD/BAVV071J+q+iHNP8AhTwNuqaretywvHt8/BXXkuzQjE99Pcl6SNwzz22av6/o+/4Rm\\nJAPgqTTnSe+DC5dsYv/PBl7Yd34vozn/nzzio9CW9GTgkfT83VbVauCG9uNs4IG+5Xe0y2e3TZcD\\nT28vxx0MfL39+p12xPe57edevxzNrap7gLv41bkHzTnzrWH6NLTP2cBVfdu8iubn62jnMV0zyvqH\\nFSewTl1fpgkgbwBuA7YBvk8zAvJo4Ns01yX7/8EM/dD/NM1vAccDy2l+O/1Wuz5VdWs7/Pki4PeA\\njwAnJjl44g5JDwdVtQ44JMmBNCNgxwPvSXLABlZ5oH8TbN4l5kfTvJ/qbfz6+f2TzdiuJrGq+l6S\\nO2gu1RwM/DWwEngHzSW+bWlGS3qN97k3bNf49fPwkcPU3T3O+51UHBmZgpLsQvMb5Xuq6mtVdQPN\\nkObQdfalwFOAn1bVj/q+hu52eC5wRlUtqqplNP9od+3dT1XdV1UXVtVbaH4APJfm2usNwLbpmfDa\\nTjTbuWf1ZcAeSWb2tB2Ic0a2GlW1uKr+luayzQM084pG6waa8+RxPW3zNrHOUuC3gP8Z5vz+xRj6\\noIn3Q+BBmvlpACTZmebnGDQ/Lx7Zt/yxNKOwP+jZzpXAH9Bcjr6SZtRtO5q5Rt8ew99/aOan9Pdp\\naJ/LaObd9Xo+8F/VXoOh+QXvN3q28RSaUeVeW/3PPcPI1LSa5rLLMUmenOSFNJNZh5zTLr+gnWH+\\nxPYOhQ8m2b2t+W/gT5I8Lcn+NNdG7xnaQJLXJjk6yW8l2Qv4k3b5/7Th5zLg7CTPaUPJWe3yoX90\\nl7b7+HSSZyV5AfCeifnj0JaUZF6Sv0oyN8kewOE0QXbZGDZ3CfAjmvPkmUmeR3OeFBv+AX4mTfj+\\nbJq7wp6U5u6yfxrD0Lm2gKq6G/g48P4kv5vkGTTzix5ql99IM5H57CTPS/Jsmp9Jt7TtQ74ODADf\\nqap72kDwDZp5cpePsXvvSvLCtk+fpAkXQ/v8ADA/yTuTPCXJa4Fjgff3rP/vwHFJfjvJfsD/A+7v\\n28dWf14aRqag9h/gHwFzaa6JfgA4sWf5L2hmrS8HPk+T8s+m+Q3irrbsaJqRjGtoZrF/ELi9Zzd3\\n0txJcyXNHIEXAr/fXlOFJpysoPkB8Pl2+z+nmaQ21MfDaF5BfTXwUZphVT383UVz58KFNCMbJwN/\\nUVXD3SGw0d8I20s+f0AzX2kJzXnyHpof3r13RVTPOj+h+W31ETR3JVxHMwlydc9vq5p8/hK4guYS\\n21fb73vnUby+/fxvNPMy1gEvq6qHemoup/l7/1pP29fbtq/37W+4c6G/rWgu83wQ+A+aOwJfXu0z\\nQNp5cX9I8/P2e8BJwDur6p97tvFWmtD0DZoA9X56frHbSF+2Kr4oT5NCksfThJ/5VfW1TdVLG9KO\\njnyD5s6Ym7ruj7ZO7fy3fwd2rqq7NlWvjXMCqzqR5HdpJhJ+D9id5l79H9H8T0QasSSH0Yyq/TfN\\nXKfTgSsNItoCtvrLJ1uKYURdeSTwd8BeNA+sugoY6BtSlUbiMTTPbtgDWEUzj+TEja4hjQ8vLYwT\\nL9NIkqROOYFVkiR1yjAiSZI6ZRiRJEmdMoxIkqROGUYkSVKnDCOSJpUk706ydDO38YQk65I8a7z6\\nJWniGEYkjVqSx7fvcvlxkvuS3Jzk9PYljKPZzrokr+hrfj8wfzO7uByYRfMmakmTnGFE0qi0Lz78\\nNvBkmnduPJnmrafzgcVJZmzO9tsXmK3edOVGt1FVdXv77ppxl+QRvlRPGj+GEUmj9RHgPuD3qurK\\nqrq1fcndi4DfBP4vQJKb2reVnpvk50luTfLnQxtJchPNEyz/tR0h+VHbflKSa3vqPpHki+2bflck\
\nWd1ud5skpyb5WZJbkryuZ531LtO021jXfj3U8/1B7fJpSf6h7ePPkyxu3z0ytL3Xtvt9eZL/pHkJ\\n3x7t26yvbtdZneSK9k3EkkbBMCJpxJLsDBwCnFlV673mvKpWAufQjJYMORG4Fvht4L3AB5MMXYJ5\\nDs27PV5Lc0nlOUOb4tcfs/1C4Ddo3ia9gOZNv18G7gDmAf8InJVk994u9Xz/5nYfs9rtfBBYCVzf\\nLj8T2J/mDavPBP4F+EqSJ/dsYwfgbcCfAr8FrAa+SPMG2GcAB9C8NdjHWkuj5LtpJI3GU2gCxPUb\\nWL4M2DnJru3nq6rq/e33H27fqLsAuKyqVrVXOtZU1e2b2O/PqurN7ff/neTtwKOq6r0ASf6e5lXu\\nzwfOa+t+eRmlqtbSvAOJJK8CjqF5Q/Tt7UjG64A9qmpFu8ppSV5C81r6d7Zt2wJvqqrvt9vZGdgJ\\nuLCqbm5rbtjEcUgahmFE0liMdL7E4mE+nzCG/f1n3+eVNG98BqCq1iX5GbDbxjaSZF/g08CxVfWt\\ntvmZwDbAf/XNA5lG8+K9IfcPBZF2n6uTfAr4apJLgEuB83oCjaQR8jKNpNG4keYyxOwNLH86sLqq\\nVm1g+Vg90Pe5NtC2wZ9pSWYBFwAfrapP9ix6NPAgMAd4ds/XbNYPTr/o32ZVHU1zeeYqmstTNySZ\\nt+nDkdTLMCJpxKrqDuAS4M+TbNe7rP2f/ZHAZ3uaD+jbxAE0l3KGPEAzKjERfjl3o+3rvwI/AN7a\\nV3dt24eZVfWjvq9NXT6iqr5bVe+rqufRjOAcOX6HIE0NhhFJo3UcsB2wKMkL2meOvBj4KnALv5pj\\nAfC8JCcmeUqSY4FXA6f3LL8ZmJ9k5ubeEjyM3ksuHwUeTzPSsVu7v5lJHllV/w2cC3w6ySuTPDHJ\\nvCTvaOeNDL/xpu7vkhyQZM8kh9DMqfnBOB+HtNUzjEgalaq6EdgP+BHwOZpLN/8IXAY8t6ru7Cn/\\nQFt7LfDXwIKqurRn+VuB36MJMaN56upwd6z0t/V+PojmLpofALcBP2n/e2C7/HU0c0n+gWZy7hfa\\nfi/fSB/uAZ4GnE8zcfUfgQ9V1UdHcRySgFR5F5qk8dc+R2RhVZ3RdV8kTW6OjEiSpE4ZRiRNFIdd\\nJY2Il2kkSVKnHBmRJEmdMoxIkqROGUYkSVKnDCOSJKlThhFJktQpw4gkSeqUYUSSJHXKMCJJkjpl\\nGJEkSZ36/9Lx/6FV7lc5AAAAAElFTkSuQmCC\\n\",\n      \"text/plain\": [\n       \"<matplotlib.figure.Figure at 0x7fd9fc0a1410>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"# Plot the time.\\n\",\n    \"fig = plt.figure()\\n\",\n    \"st = fig.suptitle(\\\"Lower is better.\\\", fontsize=\\\"x-small\\\")\\n\",\n    \"\\n\",\n    \"plt.bar(range(len(time_spent)), time_spent.values(), align='center')\\n\",\n    \"plt.xticks(range(len(time_spent)), time_spent.keys())\\n\",\n    \"plt.xlabel(\\\"Optimizers\\\")\\n\",\n    \"plt.ylabel(\\\"Seconds\\\")\\n\",\n    \"plt.ylim([0, 7000])\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n  
 \"execution_count\": 28,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"image/png\": \"iVBORw0KGgoAAAANSUhEUgAAAicAAAGSCAYAAAA4v2GGAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3Xu4XVV97vHvy0VTQCkaBZHgBRGDIpoIolC1oqK2Wi9U\\nG6VFLfVQUU/TejxUW1Fra1sUxFM4AsdyKRqxLVUQNYgXQAFRAiI14IVIvACyMSASImh+5485t65s\\n9t7Jvo/A9/M868mac445x1h7z73yrjHGmjNVhSRJUiu2mOsGSJIkDTKcSJKkphhOJElSUwwnkiSp\\nKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxI2qgkN4xYPiXJ8/rnn0qy1TTX97Akp25i\\n2UOTvHcCx947yYEDy3+Q5NGTaKakGWI4kbQpxrzPRVX9flX9cioHT7LBe1FV3VBVr5nAISZyH44n\\nAc8ZWH4JsPsE9idJJlJe0sQYTiRtijH/M06yKsn9+ufvSXJNkvP6xzP69S9MckmSFUlO6Nc9IskV\\nSc4E/nvEMR+R5JL++e8muarf96IxmvGYJBf1db9+4DhvS3JZkiuT/Gkfgt4NvKY/3luBFwP/0i9v\\nm2SfJBcm+XqSM5PM6491Y5ITklwF7DLJn6OkTTCtXbGS7rUenGRF/zzAAmBZv1wASfYBngXsCewI\\nXNOvfzDwF8Azq+quJP+S5GXA5X3ZJVV1zSh1DveGLAXeVFUXJHnAGO17CrAX3Qeuryc5G9gbeEhV\\n7Ztka+Ai4FPAO4A9quptffseB3ysqs7ryx0NvKiqbkvyFuANwDHAQ4GzquoNm/5jkzQZhhNJm2Ko\\nqhYNLyQ5ZZQyTwf+q6rWAzckubBf/zTgicCl/XDIPOD7dOFk5RjBZNDFwNFJTgM+Dtw+SplPV9Ud\\nfds+BzwV+B3g95M8ky5QPRDYbZR9B3uF9ujb+sW+rVsD5/fbflZV54/cWdL0M5xImi4jh34y8O8n\\nqurwDTYmjwDWbuygVfWPST5NN/zy1SSLq2rNyGIjltf3//5tVX1sRL3jzS8JcFlVPX+UbRttq6Tp\\n4ZwTSZtivAmgw9suBl6SZMskDwMO6NdfChyY5OEASR40/Hwjx6Uv/6iquqqq3gOsohtSGukF/XyR\\nBwDPBr4GfB44LMn9++M8tn9+O10vyrDB5WuARyV5Qr/PNkmGe1ucBCvNEsOJpE0xsmeiRj6vqsuA\\nC4CrgVOAK+mGQm4GjgA+meQbwHLgIWMcdzR/meTqJFcC11TVVaOUuRz4LF0oOaaqbqyqzwyv6yex\\nnkD3nvdFYN8kl/dDPh8D3tXPqbkf8CrgQ319F/OboaAN2prk3CQ7bUL7JU1QqibyDTxJGluSbapq\\nbT8J9mJg0fBcEEnaVM45kTSd/jXJHnTvLX9jMJE0GfacSJKkpjjnRJIkNcVwIkmSmmI4kSRJTTGc\\nSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSmGE4kSVJT\\nDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiSpKZsNdcN2JwkeTBwEPB9YN3c\\ntkaSpM3KPOCRwPKqumW8goaTiTkI+MhcN0KSpM3Yq4GPjlfAcDIx3wc444wzWLhw4Rw3ZfOydOlS\\njj322Lluhu4DPNc0WzzXJmblypUccsgh0P9fOh7DycSsA1i4cCGLFi2a67ZsVrbffnt/ZpoVnmua\\nLZ5rk7bRaRFOiJUkSU0xnEiSpKYYTiRJUlMM
J5oVS5Ysmesm6D7Cc02zxXNt5hhONCv8I9Zs8VzT\\nbPFcmzmGE0mS1BTDiSRJaorhRJIkNcVwIkmSmmI4kSRJTTGcSJKkphhOJElSUwwnkiSpKYYTSZLU\\nFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USS\\nJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJUlMMJ5IkqSmGE0mS1BTDiSRJaorhRJIkNcVwIkmSmmI4\\nkSRJTTGcSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSm\\nGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJUlMMJ5Ik\\nqSmGE0mS1BTDiSRJaorhRJIkNcVwIkmSmmI4kSRJTTGcSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJ\\nJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSmGE4kSVJTmgknSY5IsirJnUkuTbLPRsq/OsmVSe5I\\n8uMkH07yoDHK/lGS9UnOGrH+qH794ONb0/m6JEnSxDQRTpK8Eng/cBTwZOAbwPIk88covz9wGnAy\\nsCdwMLAvcNIoZR8JHA1cOEb1VwM7Ajv1jwMm/0okSdJUNRFOgKXAiVV1elVdAxwOrAVeN0b5/YBV\\nVXV8VV1fVRcDJ9IFlF9LsgVwBvAOYNUYx/plVd1cVT/pHz+djhckSZImZ87DSZKtgcXA54fXVVUB\\n5wNPG2O3S4AFSV7QH2NH4A+Bc0eUOwq4qapOGacJuyf5UZLvJTkjyYJJvhRJkjQN5jycAPOBLYGb\\nRqy/iW6Y5R76npJDgDOT3AXcAKwB3jhcJskBwGuBw8ap+1LgNcBBdL01jwIuTLLtZF6IJEmauhbC\\nyYQl2RM4DngnsIguXDyKbmiHJNsBpwN/VlVrxjpOVS2vqv+sqqur6nPAC4EdgFfM7CuQJElj2Wqu\\nGwAMAb+im5Q6aEfgxjH2ORL4SlUd0y9fneQNwEVJ3k7X4/II4Jwk6ctsAdD3tOxRVfeYg1JVtyX5\\nNvCY8Rq8dOlStt9++w3WLVmyhCVLloy3myRJ9wnLli1j2bJlG6y77bbbNnn/OQ8nVXV3ksuBA4Gz\\nAfpAcSDwwTF22wa4a8S69UABAa4B9hqx/e+B7YA3Az8Y7aB9j8tj6HpdxnTssceyaNGi8YpIknSf\\nNdoH9hUrVrB48eJN2n/Ow0nvGODUPqRcRvftnW2AUwGSvBfYuaoO7cufA5yU5HBgObAzcCzw1aoa\\n7m3Z4HolSW6lm2u7cmDd0f2xrgceDrwLuBvYMO5JkqRZ00Q4qaqP99c0eTfdcM6VwEFVdXNfZCdg\\nwUD50/pejiOA9wG30n3b58gJVr0L8FHgwcDNwJeB/arqlim8HEmSNAVNhBOAqjoBOGGMba8dZd3x\\nwPETOP5ox3CSiCRJjdksv60jSZLuvQwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElN\\nMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJ\\nUlMMJ5IkqSmGE0mS1BTDiSRJaorhRJIkNcVwIkmSmmI4kSRJTTGcSJKkphhOJElSUwwnkiSpKYYT\\nSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK\\n4USSJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJUlMMJ5IkqSmGE0mS1BTDiSRJaorhRJIkNcVwIkmS\\nmmI4kSRJTTGcSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiS
ZKaYjiRJElNMZxI\\nkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJUlMM\\nJ5IkqSmGE0mS1JRmwkmSI5KsSnJnkkuT7LOR8q9OcmWSO5L8OMmHkzxojLJ/lGR9krOmWq8kSZpZ\\n0xpOkixI8q+T2O+VwPuBo4AnA98AlieZP0b5/YHTgJOBPYGDgX2Bk0Yp+0jgaODCqdYrSZJm3nT3\\nnDwIOHQS+y0FTqyq06vqGuBwYC3wujHK7wesqqrjq+r6qroYOJEuoPxaki2AM4B3AKumoV5JkjTD\\ntppI4SQv3kiRR0+0AUm2BhYD/zC8rqoqyfnA08bY7RLg75O8oKo+k2RH4A+Bc0eUOwq4qapOSfKM\\naahXkiTNsAmFE+ATQAEZp0xN8JjzgS2Bm0asvwnYY9QKqi5OcghwZpJ5dK/jbOCNw2WSHAC8Fth7\\nuuqVJEkzb6Lh5AbgDVX1ydE2JnkScPmUW7URSfYEjgPeCZwHPAx4H93QzmFJtgNOB/6sqtZMd/1L\\nly5l++2332DdkiVLWLJkyXRXJUnSZmfZsmUsW7Zsg3W33XbbJu+fqk3v6EhyNnBlVb1jjO17A1dU\\n1SbPZemHV9YCL6+qswfWnwpsX1UvHWWf04F5VfWKgXX7AxfRBZWdgBXAr/hNL89wm35F1zPyw0nU\\nuwi4/PLLL2fRokWb+hIlSbrPW7FiBYsXLwZYXFUrxis70QmxRwMXj7P9u8DvTuSAVXU3XW/LgcPr\\nkqRfHquubYBfjli3nt8MOV0D7AU8iW5YZ2+6YZ8v9M9/MMl6JUnSDJvosM6PGP1bLwBU1R3ABZNo\\nxzHAqUkuBy6j+xbNNsCpAEneC+xcVcPfBDoHOCnJ4cByYGfgWOCrVXVjX+ZbgxUkubVrYq3c1Hol\\nSdLsm2g4+Q7dsMlPAJKcCby5qkZOKp2Qqvp4f22RdwM7AlcCB1XVzX2RnYAFA+VP6+eVHEE31+RW\\n4PPAkdNcryRJmmUTnXOyHtipqobDye3A3lV13Qy1rynOOdHmZPXq1QwNDc11MzRL5s+fz6677jrX\\nzZDGNJE5JxPtOZG0GVi9ejV77LGQdevWznVTNEvmzduGa69daUDRvcJEw0lxz+uYTPS6JpJm2NDQ\\nUB9MzgAWznVzNONWsm7dIQwNDRlOdK8w0XASugmkv+iX5wEfSnLHYKGqetl0NE7SVC0EHIKUtHmZ\\naDg5bcTyGdPVEEmSJJhgOKmq185UQyRJkmD670osSZI0JYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKa\\nYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiS\\npKYYTiRJUlMMJ5IkqSmGE0mS1BTDiSRJaorhRJIkNcVwIkmSmmI4kSRJTTGcSJKkphhOJElSUwwn\\nkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQU\\nw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJUlMMJ5IkqSmGE0mS1BTDiSRJaorhRJIk\\nNcVwIkmSmmI4kSRJTTGcSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiR\\nJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkprSTDhJckSSVUnuTHJp\\nkn02Uv7VSa5MckeSHyf5cJIHDWx/aZKvJVmT5OdJrkhyyIhjHJVk/YjHt2bqNUqSpI1rIpwkeSXw\\nfuAo4MnAN4DlSeaPUX5/4DTgZGBP4GBgX+CkgWK3AO8B9gP2Ak4BTkny3BGHuxrYEdipfxwwPa9K\\nkiRNRhP
hBFgKnFhVp1fVNcDhwFrgdWOU3w9YVVXHV9X1VXUxcCJdQAGgqi6sqk9W1bVVtaqqPghc\\nxT3Dxy+r6uaq+kn/+Om0vzpJkrTJ5jycJNkaWAx8fnhdVRVwPvC0MXa7BFiQ5AX9MXYE/hA4d5x6\\nDgQeC1wwYtPuSX6U5HtJzkiyYNIvRpIkTdmchxNgPrAlcNOI9TfRDbPcQ99TcghwZpK7gBuANcAb\\nB8sleWCS2/sy5wBvqqovDBS5FHgNcBBdb82jgAuTbDvVFyVJkianhXAyYUn2BI4D3gksogsXj6Ib\\n2hl0O7A38BTg7cCxSZ4xvLGqllfVf1bV1VX1OeCFwA7AK2b8RUiSpFFtNdcNAIaAX9FNSh20I3Dj\\nGPscCXylqo7pl69O8gbgoiRvr6qb4NfDQ9f1Za7qQ81fAxeOdtCqui3Jt4HHjNfgpUuXsv3222+w\\nbsmSJSxZsmS83SRJuk9YtmwZy5Yt22Ddbbfdtsn7z3k4qaq7k1wOHAicDZAk/fIHx9htG+CuEevW\\nAwVknOq2AO4/1sYk29EFk9PHa/Oxxx7LokWLxisiSdJ91mgf2FesWMHixYs3af85Dye9Y4BT+5By\\nGd23d7YBTgVI8l5g56o6tC9/DnBSksOB5cDOwLHAV6vqxn6fI4GvA9+jCyS/RzdP5fDhSpMc3R/r\\neuDhwLuAu4EN454kSZo1TYSTqvp4f02Td9MN51wJHFRVN/dFdgIWDJQ/re/lOAJ4H3Ar3bd9jhw4\\n7LbA8cAuwJ3ANcCrq+o/BsrsAnwUeDBwM/BlYL+qumXaX6QkSdokTYQTgKo6AThhjG2vHWXd8XTh\\nY6zj/S3wtxup00kikiQ1ZrP8to4kSbr3aqbnRJK0eVq9ejVDQ0Nz3QzNkvnz57PrrrvOaB2GE0nS\\npK1evZo99ljIunVr57opmiXz5m3DtdeunNGAYjiRJE3a0NBQH0zOABbOdXM041aybt0hDA0NGU4k\\nSa1bSHfBbmnqnBArSZKaYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXF\\ncCJJkppiOJEkSU0xnEiSpKYYTiRJUlMMJ5IkqSmGE0mS1BTDiSRJaorhRJIkNcVwIkmSmmI4kSRJ\\nTTGcSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElN2WquG3Bfs3r1\\naoaGhua6GZol8+fPZ9ddd53rZkjSZsVwMotWr17NHnssZN26tXPdFM2SefO24dprVxpQJGkCDCez\\naGhoqA8mZwAL57o5mnErWbfuEIaGhgwnkjQBhpM5sRBYNNeNkCSpSU6IlSRJTTGcSJKkphhOJElS\\nUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKaYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJ\\nktQUw4kkSWqK4USSJDXFcCJJkppiOJEkSU0xnEiSpKYYTiRJUlMMJ5IkqSmGE0mS1BTDiSRJaorh\\nRJIkNcVwIkmSmmI4kSRJTTGcSJKkphhOJElSUwwnkiSpKYYTSZLUFMOJJElqiuFEkiQ1xXAiSZKa\\n0kw4SXJEklVJ7kxyaZJ9NlL+1UmuTHJHkh8n+XCSBw1sf2mSryVZk+TnSa5IcshU69VkLZvrBug+\\nw3NNs8VzbaY0EU6SvBJ4P3AU8GTgG8DyJPPHKL8/cBpwMrAncDCwL3DSQLFbgPcA+wF7AacApyR5\\n7mTr1VT4R6zZ4rmm2eK5NlOaCCfAUuDEqjq9qq4BDgfWAq8bo/x+wKqqOr6qrq+qi4ET6QIKAFV1\\nYVV9sqqurapVVfVB4CrggCnUK0mSZtich5MkWwOLgc8Pr6uqAs4HnjbGbpcAC5K8oD/GjsAfAueO\\nU8+BwGOBC6ZQryRJmmFzHk6A+cCWwE0j
1t8E7DTaDn1PySHAmUnuAm4A1gBvHCyX5IFJbu/LnAO8\\nqaq+MNl6JUnSzNtqrhswGUn2BI4D3gmcBzwMeB/d0M5hA0VvB/YGtgMOBI5Ncl1VXTjJqucBHHbY\\nYTzgAQ/YYMNBBx3E85///HF3Xrly5fCzSVa/ObsNWDHXjZhl3e/5N7/3WazZc22uGzHLPNfmhufa\\nWD772c+yfPnyDdbdfvvtw0/nbayWdCMZc6cfXlkLvLyqzh5YfyqwfVW9dJR9TgfmVdUrBtbtD1wE\\nPKyqRvaGDJc5Gdilql4wyXpfBXxkUi9UkiQBvLqqPjpegTnvOamqu5NcTtezcTZAkvTLHxxjt22A\\nu0asWw8UkHGq2wK4/xTqXQ68Gvg+sG681yVJkjYwD3gk3f+l45rzcNI7Bji1DwuX0X2LZhvgVIAk\\n7wV2rqpD+/LnACclOZzuRe4MHAt8tapu7Pc5Evg68D26QPJ7dPNUDt/UekeqqluAcdOeJEka08Wb\\nUqiJcFJVH++vLfJuYEfgSuCgqrq5L7ITsGCg/GlJtgOOoJtrcivdt26OHDjstsDxwC7AncA1dF1J\\n/zGBeiVJ0iyb8zknkiRJg1r4KrEkSdKvGU60WUuyPsmL57odmhlJTkly1jQf8xH9efPE6Tyu5l6S\\nLyY5Zq7boalrYs6JJI3hzYz/DbzJcjxbapjhRFKzqur2jZealJkIPNKcSLJVVf1yrtsxnRzWEQBJ\\nDkpyUZI1SYaSnJPk0QPbd0lyZr/9liSfSPKIge1PSXJekpuT3JrkS0mePKKOdya5Psm6JD9M8oGB\\nbTslOTfJ2iTfTfKKJKuSvHmgzGOSXJjkziRXJ3nOTP9cNDuSHJzkqv73P9SfS781clin77Y/Lsk/\\n9efhDUmOGnGsPZJ8uT9PvpnkWRsb/kvyhCSf7m93cWOS05M8eCZfs6YmyTb97+n2JD9K8pcjtv92\\nv/2nSe7of7+PGdj+kyQvG1i+MsmPBpYP6N+r5vXL65P8aZKz+uN9O8mLBso/sy/zwiTf6M+/S5I8\\nfkS7Xt6/f63r3+NGtvse52r/vvsn/fPhYclX9O+za4FXTemH2SDDiYZtC7wfWAQ8G/gV8F/QpXK6\\n68ncBuwPPJ3u1gCf7bcBPIDu+jBPB54KfBv4dJJt+2McDPwF8GfAY4CXAN8cqP/f6L4y/gzgYODP\\ngYcMb0ySvj3rgH3orlfzT9g9v9lLshPd9YP+H/A44JnAWYz9/vQnwM/p7kL+VuAd6W7sSZItgE/S\\nnZ/7AP8D+EfGOU+SbE93KYLL6c7/g4CHAmdO8aVpZr0P+B3gRcDzgGfR/f6GndYv/z7dnexD9560\\nZb/9wn4fkvw23bn3W0ke229/BnBZVQ1ecPMdwMeAvYBPAx/p9x30z3TXzHoKcDNw9nCdSRbTnVcf\\nBZ4AHAX83XDwmKD30l3fayGbcFGzzU5V+fBxjwfdjRHXA3vSXRX3WyO23w+4A3jOGPtvQRdmXtgv\\nL6W7KcOWo5Tdo6/ryQPrduvXvblffh7wC2DHgTIH9WVePNc/Lx9TOteeTBeGF4yy7RTgrIHlLwIX\\njCjzVeAf+ufP78+ThwxsP3DwPAEe0S8/sV9+O/CZEcfcpS/zmLn++fgY9ZzZlu6DyssG1u3Qvycd\\nQ/cBaD3w1IHtD+q3v7xffiNwVf/8xXQXBzsLeH2/7jzg7wb2Xw+8c2B5m37d8/rlZ/bLB4/SpoP7\\n5TOAz454Lf8EfHNEPS8eUWYN8Cf98+Hz941z/XuYyYc9JwJ+PWTy0STfS3IbsIru0+audDdP3L3v\\nPr09ye3ALXRX3t2t3/+hSU7uuzpvpQsm2/b7A/w73R/zqiQnJXnJwCeYPYC7q+qK4fZU1ffo/iCH\\nPQ74QW1436RLpvenoDnyDbqei6uTfDzJYaN8Gh101YjlG+h6OgAeS3eeD
F5I8bKN1L838OwR5/dK\\nuvN/t01+FZpNuwFbM/C7rao1wLX94kLg7hHbf9pvX9ivugDYsx++eybwpf7xrL5H+On98qBf9/ZW\\n1VrgZ/zm3IPunLl0lDYN17kQ+MqIY36F7v11ovOgLp9g+c2KE2I17FN0geQw4MfAlsDVdD0k29Hd\\nCuBV3HMi4fB/AqfTfUp4E7Ca7tPrpf3+VNUP++7S5wDPBU4A3pLkmTP3krQ5qKr1wPOSPI2uh+xN\\nwHuS7DfGLnePPARTG6Leju7+Wm/lnuf3DVM4rhpWVd9M8lO6oZ1nAm8DbqK70vg+dP8/jrzU+nSf\\ne6M2jXueh1uPUu6Oaa63KfaciCQPovvE+Z6q+mJVXUvXBTo8Tr8C2B24uaquG/EY/jbF04EPVtXy\\nqlpJ90c8f7CeqvpFVZ1bVX9B94bwdLqx22uBrTIwgbafuLbDwO4rgQVJdhxY9zScc3KvUVWXVNW7\\n6IZ57qablzRR19KdJw8ZWLfvRvZZATweuH6U8/vOSbRBM+97wC/p5rcBkGQHuvcx6N4vth6x/cF0\\nvbTfGjjOl4E/oBu+/jJdr9z96eYqfX0Sv//QzW8Z2abhOlfSzdsbdADw7erHbOg+8D1s4Bi70/U6\\nD7rXv+8ZTgTd8MktwOuT7Jbk2XSTY4d9pN/+yX4G+yP7b0Acl2Tnvsx3gD9O8rgkT6UbW107fIAk\\nhyZ5XZLHJ3kU8Mf99uv7MPR54OQk+/Qh5cR++/Af4fl9HacneWKS3wHeMzM/Ds2mJPsm+eski5Ms\\nAF5OF2xXTuJwnwOuoztP9kqyP915Uoz9hn48XRj/WLpvnT063bfX/nUSXe2aBVV1B/Bh4Ogkv5vk\\nCXTzk37Vb/8u3cTok5Psn2RvuvekH/Trh30JWAJcWVVr+4BwId08uwsm2bx3JHl236ZT6cLGcJ3v\\nBw5M8jdJdk9yKN094o4e2P8LwBuTPCnJU4D/C9w1oo57/XlpOBH9H+QrgcV0Y6rvB94ysP1Oulnx\\nq4H/pPsUcDLdJ4yf9cVeR9fTcTndLPnjgJ8MVHMr3Td1vkw3x+DZwO/3Y7LQhZUb6d4Q/rM//s/p\\nJr0Nt/EldLfc/ipwEl03rDZ/P6P7ZsS5dD0f7wb+sqpG+wbCuJ8Y+yGiP6Cb73QZ3XnyHro388Fv\\nXdTAPjfQfZrdgu5bD1fRTapcM/BpVu35X8BFdENy5/XPB+dhvLZfPoduXsd64Peq6lcDZS6g+71/\\ncWDdl/p1XxpR32jnwsh1RTcsdBzwNbpvHL6o+muQ9PPqXkH3fvtN4J3A31TVvw0c46/oQtSFdIHq\\naAY+6I3TlnsVb/ynJiXZhS4MHVhVX9xYeWksfe/JhXTfvFk11+3RvVM/f+4LwA5V9bONldf4nBCr\\nJiT5XbqJid8Edqa7VsB1dP+pSJssyUvoet2+QzdX6gPAlw0mmgX3+uGW2WI4USu2Bv4BeBTdBbS+\\nAiwZ0QUrbYoH0F07YgEwRDcP5S3j7iFND4ciponDOpIkqSlOiJUkSU0xnEiSpKYYTiRJUlMMJ5Ik\\nqSmGE0ktcQgFAAAEYElEQVSS1BTDiaSmJTkqyYopHuMRSdYneeJ0tUvSzDGcSJqyJLv096L5UZJf\\nJPl+kg/0N5WcyHHWJ3nxiNVHAwdOsYmrgZ3o7rQtqXGGE0lT0t/I8evAbnT3DNmN7q6uBwKXJPnt\\nqRy/vyHbmo2XHPcYVVU/6e+9M+2SbOFNAqXpYziRNFUnAL8AnltVX66qH/Y37XsO8HDg7wGSrOrv\\nxvrRJD9P8sMkbxg+SJJVdFfY/ETfg3Jdv/6dSa4YKHdKkv/q72R8Y5I1/XG3TPLPSW5J8oMkrxnY\\nZ4Nhnf4Y6/vHrwaeP6Pffr8k7+vb+PMkl/T3Thk+3qF9vS9K8t90NxVc0N+t+6v9PmuSXNTfaVnS\\nBBh
OJE1akh2A5wHHV9UGt3WvqpuAj9D1pgx7C3AF8CTgH4HjkgwP2exDd2+SQ+mGYPYZPhT3vCz4\\ns4GH0d0teyndnYw/BfwU2Bf4EHBikp0HmzTw/M19HTv1xzkOuAm4pt9+PPBUujvI7gX8O/CZJLsN\\nHGMb4K3AnwKPB9YA/0V3h9snAPvR3RXZy3BLE+S9dSRNxe50geKaMbavBHZIMr9f/kpVHd0//5f+\\njsFLgc9X1VA/MnJbVf1kI/XeUlVv7p9/J8n/Bn6rqv4RIMl76W5dfwDw8b7cr4ddqup2uns4keRl\\nwOvp7oD9k76n4zXAgqq6sd/lmCQvAF4L/E2/bivgz6vq6v44OwAPBM6tqu/3Za7dyOuQNArDiaTp\\nsKnzLS4ZZfl/TqK+/x6xfBPdHa0BqKr1SW4BHjreQZI8GTgdOKKqLu1X7wVsCXx7xDyS+9HdSHDY\\nXcPBpK9zTZLTgPOSfA44H/j4QMCRtIkc1pE0Fd+lG7ZYOMb2PYE1VTU0xvbJunvEco2xbsz3uCQ7\\nAZ8ETqqqUwc2bQf8ElgE7D3wWMiGQerOkcesqtfRDed8hW4469ok+2785UgaZDiRNGlV9VPgc8Ab\\nktx/cFv/n/+rgI8NrN5vxCH2oxv6GXY3Xa/FTPj13I++rZ8AvgX81YhyV/Rt2LGqrhvx2NhwE1X1\\njar6p6ran66H51XT9xKk+wbDiaSpeiNwf2B5kt/pr3nyfOA84Af8Zo4GwP5J3pJk9yRHAAcDHxjY\\n/n3gwCQ7TvUryKMYHKI5CdiFrifkoX19OybZuqq+A3wUOD3JS5M8Msm+SY7s552MfvCu3D8k2S/J\\nrkmeRzcn51vT/Dqkez3DiaQpqarvAk8BrgPOpBvq+RDweeDpVXXrQPH392WvAN4GLK2q8we2/xXw\\nXLpQM5Grwo72jZiR6waXn0H3LZ1vAT8Gbuj/fVq//TV0c1HeRzfZ96y+3avHacNa4HHAf9BNhP0Q\\n8H+q6qQJvA5JQKr8lpukmddfx+TYqvrgXLdFUtvsOZEkSU0xnEiaLXbTStokDutIkqSm2HMiSZKa\\nYjiRJElNMZxIkqSmGE4kSVJTDCeSJKkphhNJktQUw4kkSWqK4USSJDXFcCJJkpry/wGmW9ItYQME\\nSAAAAABJRU5ErkJggg==\\n\",\n      \"text/plain\": [\n       \"<matplotlib.figure.Figure at 0x7fd9d419ba50>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"# Plot the statistical performanc of the optimizers.\\n\",\n    \"fig = plt.figure()\\n\",\n    \"st = fig.suptitle(\\\"Higer is better.\\\", fontsize=\\\"x-small\\\")\\n\",\n    \"\\n\",\n    \"plt.bar(range(len(results)), results.values(), align='center')\\n\",\n    \"plt.xticks(range(len(results)), results.keys())\\n\",\n    \"plt.xlabel(\\\"Optimizers\\\")\\n\",\n    \"plt.ylabel(\\\"F1\\\")\\n\",\n    \"plt.ylim([0.83,0.85])\\n\",\n    \"plt.show()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [conda 
root]\",\n   \"language\": \"python\",\n   \"name\": \"conda-root-py\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "examples/kafka_producer.py",
    "content": "\"\"\"\nThis example will be used as a Kafka producer to generate dummy\ndata for our Spark Streaming example.\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nfrom kafka import *\n\nimport sys\n\nimport pandas\n\nimport time\n\nimport json\n\n## END Imports. ################################################################\n\ndef usage():\n    print(\"Distributed Keras Example: Kafka Producer\")\n    print(\"\")\n    print(\"Usage:\")\n    print(\"python kafka_producer.py [bootstrap_server]\")\n    exit(0)\n\ndef allocate_producer(bootstrap_server):\n    producer = KafkaProducer(bootstrap_servers=[bootstrap_server])\n\n    return producer\n\ndef read_data():\n    path = 'data/atlas_higgs.csv'\n    data = []\n    # Use Pandas to infer the types.\n    data = pandas.read_csv(path)\n    # Remove the unneeded columns.\n    del data['Label']\n    del data['Weight']\n    # Convert the data to a list of dictionaries.\n    data = data.transpose().to_dict().values()\n\n    return data\n\ndef produce(producer, topic, data):\n    for row in data:\n        producer.send(topic, json.dumps(row))\n\ndef main():\n    # Check if the required number of arguments has been specified.\n    if len(sys.argv) != 2:\n        usage()\n    # Fetch the bootstrap server from the arguments.\n    bootstrap_server = sys.argv[1]\n    # Allocate the producer.\n    producer = allocate_producer(bootstrap_server)\n    # Read the data from the CSV file.\n    data = read_data()\n    iteration = 1\n    # Transmit the data in a continous loop while waiting for 5 seconds after every iteration.\n    while True:\n        print(\"Iteration \" + str(iteration) + \".\")\n        produce(producer, 'Machine_Learning', data)\n        iteration += 1\n        time.sleep(5)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "examples/kafka_spark_high_throughput_ml_pipeline.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Kafka and Spark High Throughput Deep Learning Production Pipeline\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Departement of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"15 November 2016\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!(date +%d\\\\ %B\\\\ %G)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this notebook we will inform the reader how to set up a production ready machine learning pipeline using [Apache Kafka](https://kafka.apache.org) and [Apache Spark](https://spark.apache.org), together with our Distributed Deep Learning framework [Distributed Keras](https://github.com/JoeriHermans/dist-keras) which is built using [Keras](https://keras.io).\\n\",\n    \"\\n\",\n    \"***Note before starting this notebook: *** Do not forget to run the Kafka producer (as explained in this notebook).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Contents\\n\",\n    \"\\n\",\n    \"- [Introduction and problem statement](#Problem-statement)\\n\",\n    \"- [Preliminaries](#Preliminaries)\\n\",\n    \"   - [Installation and requirements](#Installation-and-requirements)\\n\",\n    \"   - [Pretrained model](#Pretrained-model)\\n\",\n    \"   - [Kafka producer](#Kafka-producer)\\n\",\n    \"- [Usage](#Distributed-Keras:-a-practicle-example)\\n\",\n    \"- [Experiments](#Experiments)\\n\",\n    \"- 
[Conclusion](#Conclusion)\\n\",\n    \"- [Acknowledgments](#Acknowledgments)\\n\",\n    \"- [References](#References)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import json\\n\",\n    \"\\n\",\n    \"from keras.models import Sequential\\n\",\n    \"from keras.layers.core import Dense, Dropout, Activation\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"from pyspark.streaming import StreamingContext\\n\",\n    \"from pyspark.streaming.kafka import KafkaUtils\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import Normalizer\\n\",\n    \"\\n\",\n    \"from distkeras.trainers import *\\n\",\n    \"from distkeras.predictors import *\\n\",\n    \"from distkeras.transformers import *\\n\",\n    \"from distkeras.evaluators import *\\n\",\n    \"from distkeras.utils import *\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Problem statement\\n\",\n    \"\\n\",\n    \"The problem of building an efficient machine learning production pipeline is quite similar to building an efficient training procedure. However, in contrast to the training procedure, in production (model serving) most of this data will arrive in a streaming fashion. Usually, one just reads from a particular source using Spark Streaming. However, intergration with Apache Kafka is also possible. Kafka allows us to scale our streaming application if a bottleneck would occur. 
At CERN we employ Apache Kafka with different use-cases in [IT](https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-10-benchmarking-apache-kafka-openstack-vms) (IT Group), [BE](https://indico.cern.ch/event/533714/contributions/2173938/attachments/1292041/1924841/CALS2-Hadoop-IT.pdf) (Beams Group), and ATLAS.\\n\",\n    \"\\n\",\n    \"However, building a distributed streaming application has some practical considerations, as mentioned in [1]. This includes specifying the *retention* of the data in your buffer (i.e., how long the data is allowed to stay in the buffer, or how large the buffer may grow before older data is discarded), the usage of *compression*, the number of *brokers* and *partitions*, and how to throttle incoming data. Of course, these settings always depend on the application and infrastructure. But since this is a general-purpose framework, we will show in the following sections how to build a scalable deep learning production (model serving) pipeline using the technologies mentioned above.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Preliminaries\\n\",\n    \"\\n\",\n    \"### Installation and requirements\\n\",\n    \"\\n\",\n    \"#### Cluster requirements\\n\",\n    \"\\n\",\n    \"We assume that you already have a running Kafka and Spark cluster. Furthermore, in order to run this example, we require that the topic **\\\"Machine_Learning\\\"** is available on this Kafka cluster.\\n\",\n    \"\\n\",\n    \"#### Kafka Python\\n\",\n    \"\\n\",\n    \"In order to manage your Python dependencies, it is recommended to install a Python distribution like [Anaconda](https://www.continuum.io/downloads). In the following sections, we assume that Spark is already added to your PATH variable. In order to run our Kafka producer (located in the *examples* directory), we first need [Kafka Python](https://github.com/dpkp/kafka-python). 
This is done by simply running Pip in your shell:\\n\",\n    \"\\n\",\n    \"```pip install kafka-python```\\n\",\n    \"\\n\",\n    \"### Pretrained model\\n\",\n    \"\\n\",\n    \"In order to run a production classification pipeline, you should have access to a trained model. Keras provides an API to load and store trained models. The same procedures can be used with Distributed Keras and Spark to load a pretrained model for production use-cases. However, in this example, we will construct a Neural Network with randomly initialized weights (which will simulate such a pretrained model). The structure of the model (input and output data) will be equivalent to the neural network in the *workflow notebook*. So if you want to use the distributed training methods described in the workflow notebook to train a model, and afterwards load the saved model in this notebook, you should not experience any problems. Just make sure the model variable is set to your trained Keras model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As defined in the *workflow* notebook, our neural network will use 30 features and will be trained to classify two classes (signal and background).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"nb_features = 30\\n\",\n    \"nb_classes = 2\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As described above, we construct a randomly initialized neural network to simulate a pretrained network.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      
\"____________________________________________________________________________________________________\\n\",\n      \"Layer (type)                     Output Shape          Param #     Connected to                     \\n\",\n      \"====================================================================================================\\n\",\n      \"dense_1 (Dense)                  (None, 500)           15500       dense_input_1[0][0]              \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_1 (Activation)        (None, 500)           0           dense_1[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dropout_1 (Dropout)              (None, 500)           0           activation_1[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_2 (Dense)                  (None, 1000)          501000      dropout_1[0][0]                  \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_2 (Activation)        (None, 1000)          0           dense_2[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dropout_2 (Dropout)              (None, 1000)          0           activation_2[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_3 (Dense)                  (None, 500)           500500      dropout_2[0][0]                  \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_3 (Activation) 
       (None, 500)           0           dense_3[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_4 (Dense)                  (None, 2)             1002        activation_3[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_4 (Activation)        (None, 2)             0           dense_4[0][0]                    \\n\",\n      \"====================================================================================================\\n\",\n      \"Total params: 1018002\\n\",\n      \"____________________________________________________________________________________________________\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"model = Sequential()\\n\",\n    \"model.add(Dense(500, input_shape=(nb_features,)))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dropout(0.4))\\n\",\n    \"model.add(Dense(1000))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dropout(0.4))\\n\",\n    \"model.add(Dense(500))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dense(nb_classes))\\n\",\n    \"model.add(Activation('softmax'))\\n\",\n    \"\\n\",\n    \"model.summary()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Kafka producer\\n\",\n    \"\\n\",\n    \"In order to run the Kafka producer, change the directory to the examples directory. Next, fetch the address of a bootstrap server. 
Once you have this address, run the following command in a separate shell to start the Kafka producer:\\n\",\n    \"\\n\",\n    \"```python kafka_producer.py [bootstrap_server]```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Usage\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the following cell, please modify the required parameters according to your requirements.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Keras Kafka Pipeline\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_cores = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 8\\n\",\n    \"    num_cores = 2\\n\",\n    \"# Define Kafka-specific parameters.\\n\",\n    \"zk = \\\"zookeeper_host:2181\\\"              # ZooKeeper address\\n\",\n    \"topic = \\\"Machine_Learning\\\"              # Topic name\\n\",\n    \"consumer_name = \\\"dist-keras-consumer\\\"   # Consumer identifier\\n\",\n    \"# Define Spark Streaming specific parameters.\\n\",\n    \"batch_duration = 10 # In seconds.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We will allocate a Spark Context (sc) with a Spark Streaming Context (ssc) using the parameters you provided above.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_cores`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\")\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    # Add the missing import.\\n\",\n    \"    from pyspark.sql import SparkSession\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\\n\",\n    \"# Allocate the streaming context with a batch duration of 10 seconds.\\n\",\n    \"ssc = StreamingContext(sc, batch_duration)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we allocate a Kafka Stream using the previously defined parameters. 
However, the final parameter, which is passed as a dictionary, will tell the consumer group to read from (in this case) 3 different partitions at once.\\n\",\n    \"\\n\",\n    \"For additional and more detailed information on Spark's Kafka API, we refer to the documentation: [http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html](http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Allocate a Kafka stream.\\n\",\n    \"kafkaStream = KafkaUtils.createStream(ssc, zk, consumer_name, {topic: 3})\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def predict(df):\\n\",\n    \"    \\\"\\\"\\\"Adds a prediction column to the specified DataFrame using the pretrained model.\\\"\\\"\\\"\\n\",\n    \"    predictor = ModelPredictor(keras_model=model, features_col=\\\"features_normalized\\\", output_col=\\\"prediction\\\")\\n\",\n    \"    \\n\",\n    \"    return predictor.predict(df)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def post_process(df):\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    Adds a column to the specified DataFrame by converting the raw\\n\",\n    \"    model prediction (which is an array) to a predicted class (identified by an index).\\n\",\n    \"    Since we only have two classes, the output dimension is 2. 
This will cause the\\n\",\n    \"    LabelIndexTransformer to output a 0 or a 1 given the raw neural network classification.\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    transformer = LabelIndexTransformer(output_dim=2, input_col=\\\"prediction\\\", output_col=\\\"predicted_index\\\")\\n\",\n    \"    \\n\",\n    \"    return transformer.transform(df)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def prepare_dataframe(df):\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    Takes the specified DataFrame and adds two columns:\\n\",\n    \"    \\n\",\n    \"    1. features\\n\",\n    \"       Every row will hold a vector of the specified features.\\n\",\n    \"    2. features_normalized\\n\",\n    \"       Every row will hold a normalized vector of features based\\n\",\n    \"       on the features vector created before.\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    features = df.columns\\n\",\n    \"    features.remove('EventId')\\n\",\n    \"    vector_assembler = VectorAssembler(inputCols=features, outputCol=\\\"features\\\")\\n\",\n    \"    df = vector_assembler.transform(df)\\n\",\n    \"    normalizer = Normalizer(inputCol=\\\"features\\\", outputCol=\\\"features_normalized\\\", p=2.0)\\n\",\n    \"    df = normalizer.transform(df)\\n\",\n    \"    \\n\",\n    \"    return df\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In *process_instances* we will process the incoming RDDs into predictions. 
Since this notebook is for demonstration purposes only, we just print the number of instances which were classified as \\\"signal\\\" by the pretrained model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def process_instances(rdd):\\n\",\n    \"    # Check if there is new data available.\\n\",\n    \"    if not rdd.isEmpty():\\n\",\n    \"        df = rdd.toDF()             # Convert the RDD to a Spark DataFrame.\\n\",\n    \"        df = prepare_dataframe(df)  # Create a feature column and normalize the batch.\\n\",\n    \"        df = predict(df)            # Add the raw Neural Network predictions.\\n\",\n    \"        df = post_process(df)       # Convert the raw Neural Network predictions to a class (index).\\n\",\n    \"        # Extract the instances which are interesting (signal).\\n\",\n    \"        df = df.filter(df['predicted_index'] == 0)\\n\",\n    \"        # TODO: Do something with your DataFrame (e.g., storing to HDFS).\\n\",\n    \"        print(df.count())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Fetch the raw instances from the Kafka stream.\\n\",\n    \"raw_instances = kafkaStream.map(lambda x: x[1])\\n\",\n    \"# Convert the raw instances (which are JSON strings) to Spark rows.\\n\",\n    \"instances = raw_instances.map(json_to_dataframe_row)\\n\",\n    \"# Process every RDD in the DStream.\\n\",\n    \"instances.foreachRDD(process_instances)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"33023\\n\",\n      \"46801\\n\",\n      \"45446\\n\",\n      
\"48116\\n\",\n      \"22459\\n\",\n      \"45999\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"ssc.start()\\n\",\n    \"ssc.awaitTermination()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Experiments\\n\",\n    \"\\n\",\n    \"TODO\\n\",\n    \"\\n\",\n    \"## Conclusion\\n\",\n    \"\\n\",\n    \"In this notebook we demonstrated how to construct a high throughput model serving pipeline using Apache Spark, Apache Kafka and Distributed Keras. Furthermore, we also showed that this infrastructure provides an easily scalable approach for production use-cases. However, since Distributed Keras is still being developed, some bugs might still show up. So please notify us when any of these occur on your system.\\n\",\n    \"\\n\",\n    \"**Contact**: [joeri.hermans@cern.ch](mailto:joeri.hermans@cern.ch)\\n\",\n    \"             [luca.canali@cern.ch](mailto:luca.canali@cern.ch)\\n\",\n    \"             [zbigniew.baranowski@cern.ch](mailto:zbigniew.baranowski@cern.ch)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Acknowledgements\\n\",\n    \"\\n\",\n    \"Many thanks to Zbigniew Baranowski and Luca Canali of the IT-DB group for their collaboration on this work.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python [Root]\",\n   \"language\": \"python\",\n   \"name\": \"Python [Root]\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "examples/mnist.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST using Distributed Keras\\n\",\n    \"\\n\",\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Departement of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"!(date +%d\\\\ %B\\\\ %G)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this notebook we will show you how to process the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using Distributed Keras. As in the [workflow](https://github.com/JoeriHermans/dist-keras/blob/master/examples/workflow.ipynb) notebook, we will guide you through the complete machine learning pipeline.\\n\",\n    \"\\n\",\n    \"## Preparation\\n\",\n    \"\\n\",\n    \"To get started, we first load all the required imports. Please make sure you installed `dist-keras`, and `seaborn`. Furthermore, we assume that you have access to an installation which provides Apache Spark.\\n\",\n    \"\\n\",\n    \"Before you start this notebook, place the MNIST dataset (which is provided in this repository) on HDFS. Or in the case HDFS is not available, place it on the local filesystem. 
But make sure the path to the file is identical for all computing nodes.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"import seaborn as sns\\n\",\n    \"\\n\",\n    \"from keras.optimizers import *\\n\",\n    \"from keras.models import Sequential\\n\",\n    \"from keras.layers.core import *\\n\",\n    \"from keras.layers.convolutional import *\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from matplotlib import pyplot as plt\\n\",\n    \"import matplotlib.patches as mpatches\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import StandardScaler\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import OneHotEncoder\\n\",\n    \"from pyspark.ml.feature import MinMaxScaler\\n\",\n    \"from pyspark.ml.feature import StringIndexer\\n\",\n    \"from pyspark.ml.evaluation import MulticlassClassificationEvaluator\\n\",\n    \"\\n\",\n    \"from distkeras.trainers import *\\n\",\n    \"from distkeras.predictors import *\\n\",\n    \"from distkeras.transformers import *\\n\",\n    \"from distkeras.evaluators import *\\n\",\n    \"from distkeras.utils import *\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the following cell, adapt the parameters to fit your personal requirements.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Keras MNIST Notebook\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"path_train = 
\\\"data/mnist_train.csv\\\"\\n\",\n    \"path_test = \\\"data/mnist_test.csv\\\"\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_processes = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 20\\n\",\n    \"    num_processes = 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\\n\",\n    \"num_workers = num_executors * num_processes\\n\",\n    \"\\n\",\n    \"print(\\\"Number of desired executors: \\\" + `num_executors`)\\n\",\n    \"print(\\\"Number of desired processes / executor: \\\" + `num_processes`)\\n\",\n    \"print(\\\"Total number of workers: \\\" + `num_workers`)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"# Use the DataBricks CSV reader, this has some nice functionality regarding invalid values.\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_processes`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\", 
\\\"4g\\\")\\n\",\n    \"conf.set(\\\"spark.locality.wait\\\", \\\"0\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\");\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"# Read the training dataset.\\n\",\n    \"raw_dataset_train = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                          .options(header='true', inferSchema='true') \\\\\\n\",\n    \"                          .load(path_train)\\n\",\n    \"# Read the testing dataset.\\n\",\n    \"raw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                         .options(header='true', inferSchema='true') \\\\\\n\",\n    \"                         .load(path_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As shown in the output of the cell above, we see that every pixel is associated with a seperate column. In order to ensure compatibility with Apache Spark, we vectorize the columns, and add the resulting vectors as a seperate column. However, in order to achieve this, we first need a list of the required columns. 
This is shown in the cell below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# First, we would like to extract the desired features from the raw dataset.\\n\",\n    \"# We do this by constructing a list with all desired columns.\\n\",\n    \"# This is identical for the test set.\\n\",\n    \"features = raw_dataset_train.columns\\n\",\n    \"features.remove('label')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once we have a list of column names, we can pass this to Spark's [VectorAssembler](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler). This VectorAssembler will take a list of features, vectorize them, and place them in a column defined in `outputCol`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Next, we use Spark's VectorAssembler to \\\"assemble\\\" (create) a vector of all desired features.\\n\",\n    \"# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\\n\",\n    \"vector_assembler = VectorAssembler(inputCols=features, outputCol=\\\"features\\\")\\n\",\n    \"# This transformer will take all columns specified in features, and create an additional column \\\"features\\\" which will contain all the desired features aggregated into a single vector.\\n\",\n    \"dataset_train = vector_assembler.transform(raw_dataset_train)\\n\",\n    \"dataset_test = vector_assembler.transform(raw_dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once we have the inputs for our Neural Network (features column) after applying the VectorAssembler, we should also define the outputs. 
Since we are dealing with a classification task, the output of our Neural Network should be a one-hot encoded vector with 10 elements. For this, we provide a `OneHotTransformer` which accomplishes exactly this task.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Define the number of output classes.\\n\",\n    \"nb_classes = 10\\n\",\n    \"encoder = OneHotTransformer(nb_classes, input_col=\\\"label\\\", output_col=\\\"label_encoded\\\")\\n\",\n    \"dataset_train = encoder.transform(dataset_train)\\n\",\n    \"dataset_test = encoder.transform(dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## MNIST\\n\",\n    \"\\n\",\n    \"[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits. Every image is a 28 by 28 pixel grayscale image. This means that every pixel has a value between 0 and 255. 
Some examples of instances within this dataset are shown in the cells below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def show_instances(column):\\n\",\n    \"    global dataset_train\\n\",\n    \"\\n\",\n    \"    num_instances = 6 # Number of instances you would like to draw.\\n\",\n    \"    x_dimension   = 3 # Number of images to draw on the x-axis.\\n\",\n    \"    y_dimension   = 2 # Number of images to draw on the y-axis.\\n\",\n    \"\\n\",\n    \"    # Fetch the desired number of instances from the dataset.\\n\",\n    \"    instances = dataset_train.select(column).take(num_instances)\\n\",\n    \"    # Process the instances.\\n\",\n    \"    for i in range(0, num_instances):\\n\",\n    \"        instance = instances[i]\\n\",\n    \"        instance = instance[column].toArray().reshape((28, 28))\\n\",\n    \"        instances[i] = instance\\n\",\n    \"\\n\",\n    \"    # Draw the sampled instances.\\n\",\n    \"    fig, axn = plt.subplots(y_dimension, x_dimension, sharex=True, sharey=True)\\n\",\n    \"    num_axn = len(axn.flat)\\n\",\n    \"    for i in range(0, num_axn):\\n\",\n    \"        ax = axn.flat[i]\\n\",\n    \"        h = sns.heatmap(instances[i], ax=ax)\\n\",\n    \"        h.set_yticks([])\\n\",\n    \"        h.set_xticks([])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"show_instances(\\\"features\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Normalization\\n\",\n    \"\\n\",\n    \"In this section, we will normalize the feature vectors to the [0, 1] range.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n 
   \"# Clear the dataset in the case you ran this cell before.\\n\",\n    \"dataset_train = dataset_train.select(\\\"features\\\", \\\"label\\\", \\\"label_encoded\\\")\\n\",\n    \"dataset_test = dataset_test.select(\\\"features\\\", \\\"label\\\", \\\"label_encoded\\\")\\n\",\n    \"# Allocate a MinMaxTransformer using Distributed Keras.\\n\",\n    \"# o_min -> original_minimum\\n\",\n    \"# n_min -> new_minimum\\n\",\n    \"transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\\\\n\",\n    \"                                o_min=0.0, o_max=250.0, \\\\\\n\",\n    \"                                input_col=\\\"features\\\", \\\\\\n\",\n    \"                                output_col=\\\"features_normalized\\\")\\n\",\n    \"# Transform the dataset.\\n\",\n    \"dataset_train = transformer.transform(dataset_train)\\n\",\n    \"dataset_test = transformer.transform(dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"show_instances(\\\"features_normalized\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Convolutions\\n\",\n    \"\\n\",\n    \"In order to make the dense vectors compatible with convolution operations in Keras, we add another column which contains the matrix form of these images. 
We provide a utility class (ReshapeTransformer), which helps you with this.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"reshape_transformer = ReshapeTransformer(\\\"features_normalized\\\", \\\"matrix\\\", (28, 28, 1))\\n\",\n    \"dataset_train = reshape_transformer.transform(dataset_train)\\n\",\n    \"dataset_test = reshape_transformer.transform(dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Model Development\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Multilayer Perceptron\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"mlp = Sequential()\\n\",\n    \"mlp.add(Dense(1000, input_shape=(784,)))\\n\",\n    \"mlp.add(Activation('relu'))\\n\",\n    \"mlp.add(Dense(250))\\n\",\n    \"mlp.add(Activation('relu'))\\n\",\n    \"mlp.add(Dense(10))\\n\",\n    \"mlp.add(Activation('softmax'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"mlp.summary()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer_mlp = 'adam'\\n\",\n    \"loss_mlp = 'categorical_crossentropy'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Convolutional network\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Taken from Keras MNIST example.\\n\",\n    \"\\n\",\n    \"# 
Declare model parameters.\\n\",\n    \"img_rows, img_cols = 28, 28\\n\",\n    \"# number of convolutional filters to use\\n\",\n    \"nb_filters = 32\\n\",\n    \"# size of pooling area for max pooling\\n\",\n    \"pool_size = (2, 2)\\n\",\n    \"# convolution kernel size\\n\",\n    \"kernel_size = (3, 3)\\n\",\n    \"input_shape = (img_rows, img_cols, 1)\\n\",\n    \"\\n\",\n    \"# Construct the model.\\n\",\n    \"convnet = Sequential()\\n\",\n    \"convnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],\\n\",\n    \"                          border_mode='valid',\\n\",\n    \"                          input_shape=input_shape))\\n\",\n    \"convnet.add(Activation('relu'))\\n\",\n    \"convnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))\\n\",\n    \"convnet.add(Activation('relu'))\\n\",\n    \"convnet.add(MaxPooling2D(pool_size=pool_size))\\n\",\n    \"\\n\",\n    \"convnet.add(Flatten())\\n\",\n    \"convnet.add(Dense(225))\\n\",\n    \"convnet.add(Activation('relu'))\\n\",\n    \"convnet.add(Dense(nb_classes))\\n\",\n    \"convnet.add(Activation('softmax'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"convnet.summary()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer_convnet = 'adam'\\n\",\n    \"loss_convnet = 'categorical_crossentropy'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Evaluation\\n\",\n    \"\\n\",\n    \"We define a utility function which will compute the accuracy for us.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def evaluate_accuracy(model, test_set, 
features=\\\"features_normalized_dense\\\"):\\n\",\n    \"    evaluator = AccuracyEvaluator(prediction_col=\\\"prediction_index\\\", label_col=\\\"label\\\")\\n\",\n    \"    predictor = ModelPredictor(keras_model=model, features_col=features)\\n\",\n    \"    transformer = LabelIndexTransformer(output_dim=nb_classes)\\n\",\n    \"    test_set = test_set.select(features, \\\"label\\\")\\n\",\n    \"    test_set = predictor.predict(test_set)\\n\",\n    \"    test_set = transformer.transform(test_set)\\n\",\n    \"    score = evaluator.evaluate(test_set)\\n\",\n    \"    \\n\",\n    \"    return score\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dataset_train.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dataset_train = dataset_train.select(\\\"features_normalized\\\", \\\"matrix\\\",\\\"label\\\", \\\"label_encoded\\\")\\n\",\n    \"dataset_test = dataset_test.select(\\\"features_normalized\\\", \\\"matrix\\\",\\\"label\\\", \\\"label_encoded\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dense_transformer = DenseTransformer(input_col=\\\"features_normalized\\\", output_col=\\\"features_normalized_dense\\\")\\n\",\n    \"dataset_train = dense_transformer.transform(dataset_train)\\n\",\n    \"dataset_test = dense_transformer.transform(dataset_test)\\n\",\n    \"dataset_train.repartition(num_workers)\\n\",\n    \"dataset_test.repartition(num_workers)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    
\"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Assing the training and test set.\\n\",\n    \"training_set = dataset_train.repartition(num_workers)\\n\",\n    \"test_set = dataset_test.repartition(num_workers)\\n\",\n    \"# Cache them.\\n\",\n    \"training_set.cache()\\n\",\n    \"test_set.cache()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(training_set.count())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### DOWNPOUR (Multilayer Perceptron)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = DOWNPOUR(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\\n\",\n    \"                   batch_size=4, communication_window=5, num_epoch=1,\\n\",\n    \"                   features_col=\\\"features_normalized_dense\\\", label_col=\\\"label_encoded\\\")\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer.parameter_server.num_updates\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### ADAG (MultiLayer Perceptron)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    
\"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\\n\",\n    \"               batch_size=4, communication_window=15, num_epoch=1,\\n\",\n    \"               features_col=\\\"features_normalized_dense\\\", label_col=\\\"label_encoded\\\")\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer.parameter_server.num_updates\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### EASGD (MultiLayer Perceptron)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = AEASGD(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\\n\",\n    \"                 batch_size=4, communication_window=35, num_epoch=1, features_col=\\\"features_normalized_dense\\\",\\n\",\n    \"                 label_col=\\\"label_encoded\\\")\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set)))\"\n 
  ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer.parameter_server.num_updates\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### DOWNPOUR (Convolutional network)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = DOWNPOUR(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet,\\n\",\n    \"                   num_workers=num_workers, batch_size=4, communication_window=5,\\n\",\n    \"                   num_epoch=1, features_col=\\\"matrix\\\", label_col=\\\"label_encoded\\\")\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set, \\\"matrix\\\")))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer.parameter_server.num_updates\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### ADAG (Convolutional network)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = ADAG(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet,\\n\",\n    \"               num_workers=num_workers, batch_size=15, communication_window=5, 
num_epoch=1,\\n\",\n    \"               features_col=\\\"matrix\\\", label_col=\\\"label_encoded\\\")\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set, \\\"matrix\\\")))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer.parameter_server.num_updates\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### EASGD (Convolutional network)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = AEASGD(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet, \\n\",\n    \"                 num_workers=num_workers, batch_size=35, communication_window=32, num_epoch=1,\\n\",\n    \"                 features_col=\\\"matrix\\\", label_col=\\\"label_encoded\\\")\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set, \\\"matrix\\\")))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    
\"trainer.parameter_server.num_updates\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python 2\",\n   \"language\": \"python\",\n   \"name\": \"python2\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.13\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "examples/mnist.py",
    "content": "\"\"\"MNIST classification using Distributed Keras.\n\nATTENTION:\nBefore running this example, make sure you put the MNIST dataset\non HDFS.\n1. unzip mnist.zip\n2. hdfs dfs -mkdir data\n3. hdfs dfs -copyFromLocal mnist_train.csv data/mnist_train.csv\n4. hdfs dfs -copyFromLocal mnist_test.csv data/mnist_test.csv\n\"\"\"\n\nfrom distkeras.evaluators import *\nfrom distkeras.predictors import *\nfrom distkeras.trainers import *\nfrom distkeras.transformers import *\nfrom distkeras.utils import *\n\nfrom keras.layers.convolutional import *\nfrom keras.layers.core import *\nfrom keras.models import Sequential\nfrom keras.optimizers import *\n\nfrom pyspark import SparkConf\nfrom pyspark import SparkContext\n\nfrom pyspark.ml.evaluation import MulticlassClassificationEvaluator\nfrom pyspark.ml.feature import OneHotEncoder\nfrom pyspark.ml.feature import StandardScaler\nfrom pyspark.ml.feature import StringIndexer\nfrom pyspark.ml.feature import VectorAssembler\n\nimport pwd\nimport os\n\n\n# First, setup the Spark variables. 
You can modify them to your needs.\napplication_name = \"Distributed Keras MNIST Notebook\"\nusing_spark_2 = False\nlocal = False\npath_train = \"data/mnist_train.csv\"\npath_test = \"data/mnist_test.csv\"\nif local:\n    # Tell master to use local resources.\n    master = \"local[*]\"\n    num_processes = 3\n    num_executors = 1\nelse:\n    # Tell master to use YARN.\n    master = \"yarn-client\"\n    num_executors = 20\n    num_processes = 1\n\n# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\nnum_workers = num_executors * num_processes\n\nprint(\"Number of desired executors: \" + `num_executors`)\nprint(\"Number of desired processes / executor: \" + `num_processes`)\nprint(\"Total number of workers: \" + `num_workers`)\n\n# Use the Databricks CSV reader; it has some nice functionality for handling invalid values.\nos.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'\n\nconf = SparkConf()\nconf.set(\"spark.app.name\", application_name)\nconf.set(\"spark.master\", master)\nconf.set(\"spark.executor.cores\", `num_processes`)\nconf.set(\"spark.executor.instances\", `num_executors`)\nconf.set(\"spark.executor.memory\", \"4g\")\nconf.set(\"spark.locality.wait\", \"0\")\nconf.set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\");\nconf.set(\"spark.local.dir\", \"/tmp/\" + get_os_username() + \"/dist-keras\");\n\n# Check if the user is running Spark 2.0 +\nif using_spark_2:\n    sc = SparkSession.builder.config(conf=conf) \\\n            .appName(application_name) \\\n            .getOrCreate()\nelse:\n    # Create the Spark context.\n    sc = SparkContext(conf=conf)\n    # Add the missing imports\n    from pyspark import SQLContext\n    sqlContext = SQLContext(sc)\n\n# Check if we are using Spark 2.0\nif using_spark_2:\n    reader = sc\nelse:\n    reader = sqlContext\n# Read the training dataset.\nraw_dataset_train = 
reader.read.format('com.databricks.spark.csv') \\\n                          .options(header='true', inferSchema='true') \\\n                          .load(path_train)\n# Read the testing dataset.\nraw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\n                         .options(header='true', inferSchema='true') \\\n                         .load(path_test)\n\n# First, we would like to extract the desired features from the raw dataset.\n# We do this by constructing a list with all desired columns.\n# This is identical for the test set.\nfeatures = raw_dataset_train.columns\nfeatures.remove('label')\n\n# Next, we use Spark's VectorAssembler to \"assemble\" (create) a vector of all desired features.\n# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\nvector_assembler = VectorAssembler(inputCols=features, outputCol=\"features\")\n# This transformer will take all columns specified in features, and create an additional column\n# \"features\" which will contain all the desired features aggregated into a single vector.\ndataset_train = vector_assembler.transform(raw_dataset_train)\ndataset_test = vector_assembler.transform(raw_dataset_test)\n\n# Define the number of output classes.\nnb_classes = 10\nencoder = OneHotTransformer(nb_classes, input_col=\"label\", output_col=\"label_encoded\")\ndataset_train = encoder.transform(dataset_train)\ndataset_test = encoder.transform(dataset_test)\n\n# Allocate a MinMaxTransformer from Distributed Keras to normalize the features.\n# o_min -> original_minimum\n# n_min -> new_minimum\ntransformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\n                                o_min=0.0, o_max=255.0, \\\n                                input_col=\"features\", \\\n                                output_col=\"features_normalized\")\n# Transform the dataset.\ndataset_train = transformer.transform(dataset_train)\ndataset_test = transformer.transform(dataset_test)\n\n# Keras expects the vectors to be 
in a particular shape; we can reshape the\n# vectors using Spark.\nreshape_transformer = ReshapeTransformer(\"features_normalized\", \"matrix\", (28, 28, 1))\ndataset_train = reshape_transformer.transform(dataset_train)\ndataset_test = reshape_transformer.transform(dataset_test)\n\n# Now, create a Keras model.\n# Taken from Keras MNIST example.\n\n# Declare model parameters.\nimg_rows, img_cols = 28, 28\n# number of convolutional filters to use\nnb_filters = 32\n# size of pooling area for max pooling\npool_size = (2, 2)\n# convolution kernel size\nkernel_size = (3, 3)\ninput_shape = (img_rows, img_cols, 1)\n\n# Construct the model.\nconvnet = Sequential()\nconvnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],\n                          border_mode='valid',\n                          input_shape=input_shape))\nconvnet.add(Activation('relu'))\nconvnet.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))\nconvnet.add(Activation('relu'))\nconvnet.add(MaxPooling2D(pool_size=pool_size))\nconvnet.add(Flatten())\nconvnet.add(Dense(225))\nconvnet.add(Activation('relu'))\nconvnet.add(Dense(nb_classes))\nconvnet.add(Activation('softmax'))\n\n# Define the optimizer and the loss.\noptimizer_convnet = 'adam'\nloss_convnet = 'categorical_crossentropy'\n\n# Print the summary.\nconvnet.summary()\n\n# We can also evaluate the dataset in a distributed manner.\n# However, for this we need to specify a procedure for doing so.\ndef evaluate_accuracy(model, test_set, features=\"matrix\"):\n    evaluator = AccuracyEvaluator(prediction_col=\"prediction_index\", label_col=\"label\")\n    predictor = ModelPredictor(keras_model=model, features_col=features)\n    transformer = LabelIndexTransformer(output_dim=nb_classes)\n    test_set = test_set.select(features, \"label\")\n    test_set = predictor.predict(test_set)\n    test_set = transformer.transform(test_set)\n    score = evaluator.evaluate(test_set)\n\n    return score\n\n# Select the desired columns; this will 
reduce network usage.\ndataset_train = dataset_train.select(\"features_normalized\", \"matrix\",\"label\", \"label_encoded\")\ndataset_test = dataset_test.select(\"features_normalized\", \"matrix\",\"label\", \"label_encoded\")\n# Keras expects DenseVectors.\ndense_transformer = DenseTransformer(input_col=\"features_normalized\", output_col=\"features_normalized_dense\")\ndataset_train = dense_transformer.transform(dataset_train)\ndataset_test = dense_transformer.transform(dataset_test)\ndataset_train.repartition(num_workers)\ndataset_test.repartition(num_workers)\n# Assign the training and test sets.\ntraining_set = dataset_train.repartition(num_workers)\ntest_set = dataset_test.repartition(num_workers)\n# Cache them.\ntraining_set.cache()\ntest_set.cache()\n\n# Precache the training set on the nodes using a simple count.\nprint(training_set.count())\n\n# Use the ADAG trainer. You can also use a SingleWorker for testing purposes -> traditional\n# non-distributed gradient descent.\ntrainer = ADAG(keras_model=convnet, worker_optimizer=optimizer_convnet, loss=loss_convnet,\n               num_workers=num_workers, batch_size=16, communication_window=5, num_epoch=5,\n               features_col=\"matrix\", label_col=\"label_encoded\")\ntrained_model = trainer.train(training_set)\n\nprint(\"Training time: \" + str(trainer.get_training_time()))\nprint(\"Accuracy: \" + str(evaluate_accuracy(trained_model, test_set)))\nprint(\"Number of parameter server updates: \" + str(trainer.parameter_server.num_updates))\n"
  },
  {
    "path": "examples/mnist_analysis.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST Analysis with Distributed Keras\\n\",\n    \"\\n\",\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Department of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"18 January 2017\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!(date +%d\\\\ %B\\\\ %G)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this notebook we will show you how to process the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using Distributed Keras. As in the [workflow](https://github.com/JoeriHermans/dist-keras/blob/master/examples/workflow.ipynb) notebook, we will guide you through the complete machine learning pipeline.\\n\",\n    \"\\n\",\n    \"## Preparation\\n\",\n    \"\\n\",\n    \"To get started, we first load all the required imports. Please make sure you installed `dist-keras` and `seaborn`. 
Furthermore, we assume that you have access to an installation which provides Apache Spark.\\n\",\n    \"\\n\",\n    \"Before you start this notebook, please make sure you ran the \\\"MNIST preprocessing\\\" notebook first, since we will be evaluating a manually \\\"enlarged dataset\\\".\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"from keras.optimizers import *\\n\",\n    \"from keras.models import Sequential\\n\",\n    \"from keras.layers.core import *\\n\",\n    \"from keras.layers.convolutional import *\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from matplotlib import pyplot as plt\\n\",\n    \"\\n\",\n    \"from pyspark import StorageLevel\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import StandardScaler\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import OneHotEncoder\\n\",\n    \"from pyspark.ml.feature import MinMaxScaler\\n\",\n    \"from pyspark.ml.feature import StringIndexer\\n\",\n    \"from pyspark.ml.evaluation import MulticlassClassificationEvaluator\\n\",\n    \"\\n\",\n    \"from distkeras.trainers import *\\n\",\n    \"from distkeras.predictors import *\\n\",\n    \"from distkeras.transformers import *\\n\",\n    \"from distkeras.evaluators import *\\n\",\n    \"from distkeras.utils import *\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the following cell, adapt the parameters to fit your personal requirements.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 
3,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Keras MNIST Analysis\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"path = \\\"mnist.parquet\\\"\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_processes = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 30\\n\",\n    \"    num_processes = 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of desired executors: 30\\n\",\n      \"Number of desired processes / executor: 1\\n\",\n      \"Total number of workers: 30\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\\n\",\n    \"num_workers = num_executors * num_processes\\n\",\n    \"\\n\",\n    \"print(\\\"Number of desired executors: \\\" + str(num_executors))\\n\",\n    \"print(\\\"Number of desired processes / executor: \\\" + str(num_processes))\\n\",\n    \"print(\\\"Total number of workers: \\\" + str(num_workers))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", str(num_processes))\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", 
str(num_executors))\\n\",\n    \"conf.set(\\\"spark.locality.wait\\\", \\\"0\\\")\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\", \\\"5g\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\")\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    # Import the Spark 2.0 entry point.\\n\",\n    \"    from pyspark.sql import SparkSession\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"# Read the training and test set.\\n\",\n    \"training_set = reader.read.parquet('data/mnist_train_big.parquet') \\\\\\n\",\n    \"                     .select(\\\"features_normalized_dense\\\", \\\"label_encoded\\\", \\\"label\\\")\\n\",\n    \"test_set = reader.read.parquet('data/mnist_test_preprocessed.parquet') \\\\\\n\",\n    \"                 .select(\\\"features_normalized_dense\\\", \\\"label_encoded\\\", \\\"label\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- features_normalized_dense: vector (nullable = true)\\n\",\n      \" |-- label_encoded: vector (nullable = true)\\n\",\n      \" |-- label: long (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   
],\n   \"source\": [\n    \"# Print the schema of the dataset.\\n\",\n    \"training_set.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Model Development\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Multilayer Perceptron\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"mlp = Sequential()\\n\",\n    \"mlp.add(Dense(1000, input_shape=(784,)))\\n\",\n    \"mlp.add(Activation('relu'))\\n\",\n    \"mlp.add(Dropout(0.2))\\n\",\n    \"mlp.add(Dense(200))\\n\",\n    \"mlp.add(Activation('relu'))\\n\",\n    \"mlp.add(Dropout(0.2))\\n\",\n    \"mlp.add(Dense(10))\\n\",\n    \"mlp.add(Activation('softmax'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"____________________________________________________________________________________________________\\n\",\n      \"Layer (type)                     Output Shape          Param #     Connected to                     \\n\",\n      \"====================================================================================================\\n\",\n      \"dense_1 (Dense)                  (None, 1000)          785000      dense_input_1[0][0]              \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_1 (Activation)        (None, 1000)          0           dense_1[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dropout_1 (Dropout)              (None, 1000)          0           activation_1[0][0]  
             \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_2 (Dense)                  (None, 200)           200200      dropout_1[0][0]                  \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_2 (Activation)        (None, 200)           0           dense_2[0][0]                    \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dropout_2 (Dropout)              (None, 200)           0           activation_2[0][0]               \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"dense_3 (Dense)                  (None, 10)            2010        dropout_2[0][0]                  \\n\",\n      \"____________________________________________________________________________________________________\\n\",\n      \"activation_3 (Activation)        (None, 10)            0           dense_3[0][0]                    \\n\",\n      \"====================================================================================================\\n\",\n      \"Total params: 987210\\n\",\n      \"____________________________________________________________________________________________________\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"mlp.summary()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer_mlp = 'adam'\\n\",\n    \"loss_mlp = 'categorical_crossentropy'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Training\\n\",\n    \"\\n\",\n    \"Prepare the training and test set for evaluation and training.\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of training instances: 6060000\\n\",\n      \"Number of testing instances: 10000\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"training_set = training_set.repartition(num_workers)\\n\",\n    \"test_set = test_set.repartition(num_workers)\\n\",\n    \"training_set.cache()\\n\",\n    \"test_set.cache()\\n\",\n    \"print(\\\"Number of training instances: \\\" + str(training_set.count()))\\n\",\n    \"print(\\\"Number of testing instances: \\\" + str(test_set.count()))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Evaluation\\n\",\n    \"\\n\",\n    \"We define a utility function which will compute the accuracy for us.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def evaluate_accuracy(model, test_set, features=\\\"features_normalized_dense\\\"):\\n\",\n    \"    evaluator = AccuracyEvaluator(prediction_col=\\\"prediction_index\\\", label_col=\\\"label\\\")\\n\",\n    \"    predictor = ModelPredictor(keras_model=model, features_col=features)\\n\",\n    \"    transformer = LabelIndexTransformer(output_dim=10)\\n\",\n    \"    test_set = test_set.select(features, \\\"label\\\")\\n\",\n    \"    test_set = predictor.predict(test_set)\\n\",\n    \"    test_set = transformer.transform(test_set)\\n\",\n    \"    score = evaluator.evaluate(test_set)\\n\",\n    \"    \\n\",\n    \"    return score\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### ADAG\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    
\"trainer = ADAG(keras_model=mlp, worker_optimizer=optimizer_mlp, loss=loss_mlp, num_workers=num_workers,\\n\",\n    \"               batch_size=4, communication_window=5, num_epoch=1,\\n\",\n    \"               features_col=\\\"features_normalized_dense\\\", label_col=\\\"label_encoded\\\")\\n\",\n    \"# Modify the default parallelism factor.\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[array([[-0.02490237, -0.01861665,  0.03102627, ...,  0.01722135,\\n\",\n       \"          0.02223415, -0.04933412],\\n\",\n       \"        [-0.02634868,  0.03564246, -0.05392314, ..., -0.02999102,\\n\",\n       \"         -0.01270337, -0.03888189],\\n\",\n       \"        [ 0.00727941,  0.04553502, -0.01856072, ...,  0.0319587 ,\\n\",\n       \"         -0.00354035, -0.03581727],\\n\",\n       \"        ..., \\n\",\n       \"        [-0.03245988, -0.01220334,  0.019447  , ...,  0.05723321,\\n\",\n       \"         -0.05618715, -0.0248918 ],\\n\",\n       \"        [-0.02532675, -0.01772211,  0.05514754, ...,  0.03839124,\\n\",\n       \"         -0.05036234, -0.03766601],\\n\",\n       \"        [ 0.04610632,  0.01409597,  0.03790993, ..., -0.02038677,\\n\",\n       \"         -0.03649681,  0.04742099]], dtype=float32),\\n\",\n       \" array([ -1.29682487e-02,   1.38744503e-01,  -3.10007334e-01,\\n\",\n       \"         -3.04996595e-02,  -1.39434069e-01,  -4.05185074e-02,\\n\",\n       \"         -2.09797233e-01,  -4.62490469e-01,  -6.72216356e-01,\\n\",\n       \"         -1.83647368e-02,  -2.93090612e-01,   5.11649624e-02,\\n\",\n       \"         -2.74094105e-01,  -9.03906003e-02,  -7.21242726e-01,\\n\",\n       \"         -2.51375604e-02,  -1.40052319e-01,  -1.31754786e-01,\\n\",\n       \"         -1.88921779e-01,  -3.18406552e-01,  
-3.45931239e-02,\\n\",\n       \"         -1.89292878e-01,   3.80539931e-02,   3.54425013e-02,\\n\",\n       \"         -6.34538352e-01,  -2.27093436e-02,  -5.49978614e-01,\\n\",\n       \"         -2.85222325e-02,  -4.87636119e-01,  -2.94719964e-01,\\n\",\n       \"         -4.62469608e-01,  -4.31859016e-01,  -4.95594800e-01,\\n\",\n       \"         -7.55963206e-01,  -7.07836151e-01,   5.50588481e-02,\\n\",\n       \"          1.01570776e-02,  -3.62383217e-01,  -2.37895608e-01,\\n\",\n       \"         -3.48139226e-01,  -5.14193960e-02,  -4.49353665e-01,\\n\",\n       \"         -2.04702299e-02,  -1.28980473e-01,  -6.01515993e-02,\\n\",\n       \"         -4.11046803e-01,  -2.73511171e-01,  -4.22501177e-01,\\n\",\n       \"          6.57678917e-02,  -3.77899945e-01,  -3.68858546e-01,\\n\",\n       \"         -3.45079124e-01,  -1.21501423e-01,  -2.59954304e-01,\\n\",\n       \"         -2.77339309e-01,   7.24700987e-02,  -1.75704360e-01,\\n\",\n       \"         -1.79602101e-01,  -3.49472016e-01,  -4.22441006e-01,\\n\",\n       \"         -3.98772031e-01,   4.78056073e-02,   1.63912345e-02,\\n\",\n       \"         -1.73481293e-02,   2.03711018e-01,  -1.66458517e-01,\\n\",\n       \"         -2.50248574e-02,  -4.33256328e-01,  -1.77355483e-02,\\n\",\n       \"         -6.68845698e-02,  -6.33655787e-02,  -2.07219645e-01,\\n\",\n       \"         -2.81381667e-01,  -2.10354477e-01,   9.65033993e-02,\\n\",\n       \"          1.45252123e-01,  -1.62108362e-01,  -4.10078391e-02,\\n\",\n       \"         -5.01093924e-01,   6.61657602e-02,  -3.54006797e-01,\\n\",\n       \"         -2.72664815e-01,  -4.63590562e-01,  -2.76888013e-01,\\n\",\n       \"          5.67168836e-03,  -1.63264722e-02,  -5.64372167e-02,\\n\",\n       \"         -3.27719487e-02,  -1.25738844e-01,  -3.16582769e-02,\\n\",\n       \"         -3.16652000e-01,   2.20678657e-01,  -4.90398854e-01,\\n\",\n       \"         -3.87180448e-01,   4.62217331e-02,  -3.87124509e-01,\\n\",\n       \"          
3.44271868e-01,  -6.47646427e-01,  -4.47504744e-02,\\n\",\n       \"         -3.12687427e-01,  -3.64519686e-01,  -1.19691178e-01,\\n\",\n       \"         -1.22579239e-01,  -1.74031451e-01,  -3.50467891e-01,\\n\",\n       \"         -3.85930926e-01,  -1.01258140e-02,   1.65355578e-01,\\n\",\n       \"          2.38174275e-02,  -3.86843532e-01,  -2.11541757e-01,\\n\",\n       \"         -1.60455573e-02,  -3.41660500e-01,  -2.41097137e-01,\\n\",\n       \"         -3.58184397e-01,  -3.74646991e-01,  -5.68306029e-01,\\n\",\n       \"          6.03663735e-02,  -2.25287676e-01,  -3.33954960e-01,\\n\",\n       \"         -3.21863830e-01,  -5.74063025e-02,  -9.54797715e-02,\\n\",\n       \"         -1.69863552e-01,   5.25663458e-02,  -1.78944767e-01,\\n\",\n       \"         -4.96068239e-01,  -9.37457308e-02,  -4.91037033e-02,\\n\",\n       \"         -5.45800686e-01,  -4.19147074e-01,  -3.63402218e-01,\\n\",\n       \"         -9.55256671e-02,  -6.56951070e-02,  -4.74279895e-02,\\n\",\n       \"          3.94136347e-02,  -6.89108312e-01,  -6.40569270e-01,\\n\",\n       \"         -2.92730868e-01,  -4.21674043e-01,  -9.05798003e-02,\\n\",\n       \"         -9.85799953e-02,  -3.34262311e-01,  -2.91352630e-01,\\n\",\n       \"         -1.20481804e-01,  -1.30824670e-01,  -3.15101117e-01,\\n\",\n       \"         -3.82897407e-01,  -3.67818296e-01,  -2.51174152e-01,\\n\",\n       \"         -4.45220284e-02,  -3.63316804e-01,  -5.95236719e-01,\\n\",\n       \"         -3.27549487e-01,  -5.18906057e-01,  -1.80942759e-01,\\n\",\n       \"         -1.93147764e-01,  -1.63675278e-01,   5.25709763e-02,\\n\",\n       \"         -1.69222236e-01,  -1.66612849e-01,  -1.89764783e-01,\\n\",\n       \"          9.59388837e-02,  -1.79865390e-01,  -2.87416220e-01,\\n\",\n       \"         -1.37040511e-01,  -3.68917108e-01,  -1.97503880e-01,\\n\",\n       \"         -4.80307907e-01,  -9.74704884e-03,  -1.62035048e-01,\\n\",\n       \"         -4.33685966e-02,  -3.75206321e-01,  
-2.71574229e-01,\\n\",\n       \"         -2.51338482e-01,  -1.91602707e-01,  -4.66123730e-01,\\n\",\n       \"         -3.09535444e-01,  -3.18885483e-02,  -3.23637798e-02,\\n\",\n       \"         -3.71796012e-01,  -2.26407617e-01,  -4.69909385e-02,\\n\",\n       \"         -3.70391518e-01,  -5.37406743e-01,  -5.00004053e-01,\\n\",\n       \"         -4.49130647e-02,   1.55784473e-01,  -3.39550585e-01,\\n\",\n       \"         -5.15295863e-01,  -5.79936266e-01,   4.80024889e-03,\\n\",\n       \"         -1.23718642e-01,  -6.55675307e-02,  -2.74233013e-01,\\n\",\n       \"         -2.67147571e-01,  -4.20176655e-01,  -2.30046362e-02,\\n\",\n       \"         -2.80579627e-01,  -6.52074635e-01,  -2.07271874e-01,\\n\",\n       \"         -3.34823787e-01,  -5.11079669e-01,  -4.89039391e-01,\\n\",\n       \"         -1.69896662e-01,  -6.09769404e-01,   1.67333558e-01,\\n\",\n       \"         -1.52619872e-02,  -1.82103708e-01,  -1.59035064e-02,\\n\",\n       \"         -2.82586038e-01,  -4.48576622e-02,  -2.77401984e-01,\\n\",\n       \"         -1.18868940e-01,  -3.09958905e-01,  -4.54939663e-01,\\n\",\n       \"         -6.84868218e-03,  -1.78479820e-01,  -4.12694991e-01,\\n\",\n       \"         -4.86943096e-01,  -4.83419180e-01,  -2.92061418e-01,\\n\",\n       \"         -3.56696308e-01,  -2.38492072e-01,  -1.99521467e-01,\\n\",\n       \"         -6.62643433e-01,  -6.58789635e-01,  -3.13386142e-01,\\n\",\n       \"         -2.39210613e-02,   3.81695509e-01,   3.89514342e-02,\\n\",\n       \"         -4.21914130e-01,  -1.78643346e-01,  -3.58139843e-01,\\n\",\n       \"         -2.31155585e-02,  -5.25866091e-01,  -2.01350115e-02,\\n\",\n       \"          1.34515122e-01,  -4.72941786e-01,   1.28511051e-02,\\n\",\n       \"         -1.92628369e-01,  -2.94919074e-01,  -1.21810228e-01,\\n\",\n       \"         -2.63900816e-01,  -1.77175865e-01,  -3.85966711e-02,\\n\",\n       \"         -3.91167760e-01,  -3.54940116e-01,  -4.08377945e-02,\\n\",\n       \"         
-2.46946454e-01,  -1.70614153e-01,   9.64559093e-02,\\n\",\n       \"         -1.58487067e-01,  -1.40857771e-01,  -2.60191988e-02,\\n\",\n       \"         -2.16996279e-02,  -2.01046526e-01,   1.07773796e-01,\\n\",\n       \"         -7.25519285e-02,  -4.59324010e-02,  -3.97602469e-01,\\n\",\n       \"         -2.86683738e-01,  -2.06594560e-02,  -2.32254282e-01,\\n\",\n       \"         -1.47455707e-01,  -2.11738929e-01,  -3.97648931e-01,\\n\",\n       \"         -1.92232862e-01,  -4.22664315e-01,  -2.10082695e-01,\\n\",\n       \"         -3.69767874e-01,  -3.35989922e-01,  -2.50372291e-02,\\n\",\n       \"         -2.56772131e-01,  -7.55918026e-01,  -1.45749766e-02,\\n\",\n       \"         -5.94904542e-01,  -1.83992922e-01,  -1.98239967e-01,\\n\",\n       \"          2.28624657e-01,  -3.67346585e-01,  -2.17467710e-01,\\n\",\n       \"         -8.19451883e-02,  -5.01424968e-02,  -3.00576668e-02,\\n\",\n       \"          2.42029456e-03,  -6.11475348e-01,  -2.48637870e-01,\\n\",\n       \"         -1.25368005e-02,  -1.07831452e-02,   3.56794626e-01,\\n\",\n       \"         -2.73973256e-01,  -5.00894673e-02,  -3.93987626e-01,\\n\",\n       \"         -6.70151055e-01,   5.03201634e-02,  -3.47819924e-01,\\n\",\n       \"          2.21592330e-04,  -9.35477093e-02,  -4.01370734e-01,\\n\",\n       \"         -5.17268419e-01,  -2.08003540e-02,  -1.58300679e-02,\\n\",\n       \"          1.09454863e-01,   4.86627640e-03,  -4.40006703e-01,\\n\",\n       \"          1.10145152e-01,  -3.08435559e-01,  -2.27646939e-02,\\n\",\n       \"         -6.15591705e-02,  -6.83150813e-02,   1.51192188e-01,\\n\",\n       \"         -2.93954074e-01,   1.76271528e-01,  -5.47897398e-01,\\n\",\n       \"         -2.94454783e-01,  -4.87583935e-01,  -2.25682836e-02,\\n\",\n       \"         -2.61891991e-01,  -2.05876276e-01,  -2.91871820e-02,\\n\",\n       \"         -4.65158612e-01,  -1.10427953e-01,   2.59957045e-01,\\n\",\n       \"         -6.44603491e-01,  -5.89241982e-01,  
-2.40099952e-01,\\n\",\n       \"         -2.48620026e-02,   2.60877088e-02,  -3.69062722e-01,\\n\",\n       \"         -5.85998118e-01,   6.35902397e-04,   1.52950898e-01,\\n\",\n       \"         -1.31705374e-01,  -6.95600629e-01,  -6.93177283e-02,\\n\",\n       \"         -3.34524751e-01,  -2.05166377e-02,  -4.04433101e-01,\\n\",\n       \"         -3.34488690e-01,   4.12484966e-02,  -1.07743412e-01,\\n\",\n       \"         -2.31767640e-01,  -5.87181449e-01,  -1.24916852e-01,\\n\",\n       \"         -2.45317779e-02,  -4.82061923e-01,   4.29915352e-04,\\n\",\n       \"         -2.29062542e-01,  -1.53157920e-01,  -8.75511765e-02,\\n\",\n       \"         -1.93034634e-01,  -2.39149824e-01,  -2.81021118e-01,\\n\",\n       \"         -1.92091212e-01,   4.84096706e-02,  -3.15482467e-01,\\n\",\n       \"         -9.38970945e-04,  -7.32823536e-02,   1.46180347e-01,\\n\",\n       \"         -7.48398662e-01,  -2.95927972e-01,  -1.01935327e-01,\\n\",\n       \"         -2.25223079e-02,  -3.76603395e-01,  -3.72446418e-01,\\n\",\n       \"         -5.44973463e-02,  -3.04856654e-02,  -8.12882781e-01,\\n\",\n       \"         -6.35300994e-01,   1.01717256e-01,   1.15769980e-02,\\n\",\n       \"          1.94745436e-01,  -4.62203443e-01,  -1.94413647e-01,\\n\",\n       \"         -1.19787067e-01,   5.01835823e-01,  -1.22532628e-01,\\n\",\n       \"         -4.83275265e-01,  -5.72950900e-01,  -1.68230399e-01,\\n\",\n       \"         -2.53478941e-02,  -8.93718377e-02,  -2.09907755e-01,\\n\",\n       \"          1.15736432e-01,   7.35889524e-02,  -2.25963101e-01,\\n\",\n       \"         -1.25411734e-01,  -1.58686683e-01,   3.05348307e-01,\\n\",\n       \"         -4.07805927e-02,  -6.87129676e-01,  -1.78614125e-01,\\n\",\n       \"         -6.12517297e-02,  -1.26590893e-01,  -5.44444025e-01,\\n\",\n       \"         -2.87909880e-02,  -1.61622658e-01,  -6.28022432e-01,\\n\",\n       \"         -3.93144011e-01,  -4.14166540e-01,  -3.36472809e-01,\\n\",\n       \"         
-2.14290902e-01,  -1.57012552e-01,  -6.99233487e-02,\\n\",\n       \"         -1.79140717e-01,  -3.44865173e-01,  -4.32067961e-01,\\n\",\n       \"         -4.17658724e-02,  -1.92612112e-01,  -4.07513529e-01,\\n\",\n       \"         -2.00688168e-01,  -3.12940218e-02,  -5.83245270e-02,\\n\",\n       \"         -3.02525491e-01,  -6.36755228e-01,  -2.01398991e-02,\\n\",\n       \"         -1.94140598e-01,  -5.85560381e-01,  -2.78204322e-01,\\n\",\n       \"         -4.92228866e-01,   2.85394281e-01,  -5.29185772e-01,\\n\",\n       \"         -5.80944479e-01,  -4.82267290e-01,  -3.02456468e-01,\\n\",\n       \"         -2.17350312e-02,  -2.27617443e-01,  -8.41379631e-03,\\n\",\n       \"         -5.19459188e-01,  -1.92483932e-01,  -6.69973344e-02,\\n\",\n       \"         -3.18294495e-01,  -4.43626344e-01,   1.03083804e-01,\\n\",\n       \"         -1.43494621e-01,  -3.98965865e-01,  -2.91880131e-01,\\n\",\n       \"         -1.15407094e-01,  -2.33865350e-01,  -3.48333865e-01,\\n\",\n       \"         -3.13846886e-01,  -2.00329088e-02,  -2.08419889e-01,\\n\",\n       \"         -6.56257868e-02,  -3.15933287e-01,  -2.66032100e-01,\\n\",\n       \"         -2.17209011e-01,  -2.57886738e-01,  -3.74219060e-01,\\n\",\n       \"         -3.42252910e-01,  -3.02372843e-01,  -2.70351022e-01,\\n\",\n       \"         -4.19028729e-01,  -2.16944158e-01,   1.65465083e-02,\\n\",\n       \"         -1.38239786e-01,   8.82068649e-03,  -5.47306299e-01,\\n\",\n       \"         -6.58184737e-02,  -1.07372276e-01,  -1.99595578e-02,\\n\",\n       \"         -3.04633468e-01,  -2.42436364e-01,  -9.85036939e-02,\\n\",\n       \"          8.13045427e-02,  -6.01692021e-01,  -7.83374131e-01,\\n\",\n       \"         -3.54873002e-01,  -1.54401422e-01,  -1.99920405e-02,\\n\",\n       \"         -6.02073036e-03,  -7.46182263e-01,  -5.17743170e-01,\\n\",\n       \"         -1.43411651e-01,   1.35698587e-01,  -4.32992607e-01,\\n\",\n       \"         -3.22256982e-01,   2.01625749e-01,  
-1.68692529e-01,\\n\",\n       \"          9.03868079e-02,  -7.36883581e-02,  -2.26779003e-02,\\n\",\n       \"          7.53887817e-02,  -3.51618379e-01,  -6.96502507e-01,\\n\",\n       \"         -1.97232455e-01,  -2.19720408e-01,  -1.76197141e-01,\\n\",\n       \"         -3.31067145e-01,   2.52920628e-01,  -5.32557011e-01,\\n\",\n       \"         -9.84433852e-03,  -2.28284430e-02,  -2.18466327e-01,\\n\",\n       \"         -2.50813589e-02,  -1.22822799e-01,  -6.21357895e-02,\\n\",\n       \"         -1.85140949e-02,   1.55188337e-01,  -2.91802138e-01,\\n\",\n       \"         -1.76329892e-02,  -3.60844210e-02,  -5.81378281e-01,\\n\",\n       \"         -6.11039221e-01,  -3.28095675e-01,  -2.83731908e-01,\\n\",\n       \"         -1.66193381e-01,   5.52292354e-02,   6.29878119e-02,\\n\",\n       \"         -3.41305107e-01,  -1.39835373e-01,   1.71938047e-01,\\n\",\n       \"         -1.84613727e-02,   7.50863180e-02,  -3.44148017e-02,\\n\",\n       \"         -3.53854299e-01,  -5.12476027e-01,   1.22042328e-01,\\n\",\n       \"         -5.39535470e-02,   3.05281021e-03,  -1.19409911e-01,\\n\",\n       \"         -2.89323032e-01,  -6.71940520e-02,  -2.19452642e-02,\\n\",\n       \"         -2.90004104e-01,  -1.76387712e-01,  -4.56134796e-01,\\n\",\n       \"         -8.09880495e-01,  -1.83778346e-01,  -2.31890544e-01,\\n\",\n       \"         -4.52327728e-01,  -2.06816241e-01,  -1.38748497e-01,\\n\",\n       \"         -4.18441355e-01,  -5.38856745e-01,  -5.05130768e-01,\\n\",\n       \"         -1.75971299e-01,  -1.19080685e-01,  -9.46213081e-02,\\n\",\n       \"         -3.64823714e-02,  -3.22997957e-01,  -1.34447142e-01,\\n\",\n       \"         -1.27073288e-01,   1.64654911e-01,  -9.78678912e-02,\\n\",\n       \"         -4.47389364e-01,  -2.54144296e-02,   1.73969138e-02,\\n\",\n       \"         -2.04480872e-01,  -4.30503398e-01,  -1.67036086e-01,\\n\",\n       \"         -2.49711365e-01,  -3.37412119e-01,  -6.02359474e-01,\\n\",\n       \"         
-6.62094355e-01,  -1.16948448e-01,   9.77696292e-03,\\n\",\n       \"         -5.21902740e-01,  -2.33485606e-02,  -6.64649755e-02,\\n\",\n       \"         -6.00027978e-01,  -5.42070754e-02,  -2.38561943e-01,\\n\",\n       \"         -4.47000265e-01,   1.17274612e-01,  -1.11540303e-01,\\n\",\n       \"         -1.02203742e-01,  -6.74192980e-02,  -1.72974497e-01,\\n\",\n       \"         -2.43933983e-02,  -2.18470603e-01,  -1.02555685e-01,\\n\",\n       \"         -5.01730680e-01,  -1.63745075e-01,  -2.48166338e-01,\\n\",\n       \"          4.25796956e-02,  -8.81046131e-02,  -4.94634926e-01,\\n\",\n       \"         -2.48743445e-01,   8.22583865e-03,  -2.14855313e-01,\\n\",\n       \"         -5.94667614e-01,   1.23224966e-01,  -2.28983104e-01,\\n\",\n       \"         -4.89580818e-02,  -3.53976309e-01,  -1.02518976e-01,\\n\",\n       \"         -2.80924350e-01,   2.18932718e-01,  -9.42684943e-04,\\n\",\n       \"         -2.78814733e-01,  -2.43697301e-01,  -4.07780051e-01,\\n\",\n       \"         -1.57622676e-02,  -4.32732075e-01,   2.76384447e-02,\\n\",\n       \"         -2.56971091e-01,  -1.39276221e-01,  -2.89412320e-01,\\n\",\n       \"         -7.84103293e-03,  -5.75612962e-01,  -2.65779234e-02,\\n\",\n       \"         -2.83633530e-01,  -2.42152084e-02,  -3.54716778e-01,\\n\",\n       \"         -5.25303543e-01,  -6.30853772e-02,  -2.22892091e-01,\\n\",\n       \"         -3.32897723e-01,  -8.58137235e-02,  -1.35768950e-01,\\n\",\n       \"         -4.00102228e-01,  -6.81776628e-02,  -1.11637965e-01,\\n\",\n       \"          8.71941745e-02,   7.97185600e-02,  -4.74733919e-01,\\n\",\n       \"         -5.36120776e-03,  -2.00053956e-02,   2.74125468e-02,\\n\",\n       \"         -5.23373425e-01,  -3.52810740e-01,  -5.75067937e-01,\\n\",\n       \"         -1.27765425e-02,  -2.41196215e-01,   1.35370884e-02,\\n\",\n       \"         -3.42776716e-01,  -2.61937886e-01,  -1.73471346e-01,\\n\",\n       \"         -7.74265826e-01,  -3.25414896e-01,  
-6.52070194e-02,\\n\",\n       \"         -1.75177939e-02,  -2.78512776e-01,  -1.26804650e-01,\\n\",\n       \"         -1.54330492e-01,  -2.43354395e-01,  -5.10048628e-01,\\n\",\n       \"         -5.22104055e-02,  -4.48061913e-01,  -2.54915148e-01,\\n\",\n       \"         -3.71145964e-01,  -2.34785691e-01,  -5.76828778e-01,\\n\",\n       \"         -5.20584345e-01,  -2.01370478e-01,  -3.43574703e-01,\\n\",\n       \"         -3.95394504e-01,  -7.02085435e-01,   3.80159239e-03,\\n\",\n       \"         -5.05006194e-01,  -6.66690245e-02,  -2.13820174e-01,\\n\",\n       \"         -1.86356172e-01,  -1.98591515e-01,  -2.26664558e-01,\\n\",\n       \"         -9.84562710e-02,   9.10461769e-02,  -1.63858235e-01,\\n\",\n       \"         -6.71461642e-01,  -2.07045935e-02,  -1.84064224e-01,\\n\",\n       \"         -1.52253630e-02,  -6.44623414e-02,  -1.90693051e-01,\\n\",\n       \"         -3.26317549e-01,  -3.90465967e-02,  -4.31612767e-02,\\n\",\n       \"         -2.69320831e-02,  -2.61054486e-01,  -5.56032240e-01,\\n\",\n       \"         -1.39396250e-01,  -3.04626554e-01,  -4.00418974e-02,\\n\",\n       \"         -5.22964954e-01,  -2.74515212e-01,  -2.05182180e-01,\\n\",\n       \"         -4.55017984e-01,  -4.10655349e-01,  -3.91681463e-01,\\n\",\n       \"         -2.95707285e-01,  -1.75162852e-02,  -1.80232033e-01,\\n\",\n       \"         -9.38054398e-02,  -4.48614866e-01,  -1.20916396e-01,\\n\",\n       \"         -1.26026660e-01,  -6.13098264e-01,  -9.16779786e-02,\\n\",\n       \"         -1.24931745e-01,  -1.14639051e-01,  -5.89349389e-01,\\n\",\n       \"         -2.86892831e-01,  -4.32475626e-01,  -4.53839451e-01,\\n\",\n       \"         -5.40873766e-01,  -3.22011739e-01,  -1.04171380e-01,\\n\",\n       \"         -2.03116417e-01,  -7.34383706e-03,  -2.95767933e-01,\\n\",\n       \"          3.77100818e-02,  -3.95163864e-01,  -9.11748350e-01,\\n\",\n       \"         -2.14269429e-01,  -4.47106093e-01,  -1.02919694e-02,\\n\",\n       \"         
-1.46425188e-01,   1.30215868e-01,   3.46448004e-01,\\n\",\n       \"         -7.53604919e-02,  -3.68188143e-01,  -1.75004661e-01,\\n\",\n       \"         -3.42096955e-01,  -1.19322361e-02,   9.38493479e-03,\\n\",\n       \"         -5.18787801e-01,  -1.09108455e-01,   6.15557991e-02,\\n\",\n       \"         -8.33496079e-03,  -6.41730651e-02,  -1.36719868e-02,\\n\",\n       \"         -3.73748362e-01,  -3.73859495e-01,   2.80248914e-02,\\n\",\n       \"         -3.09117913e-01,  -2.88713902e-01,  -4.28494245e-01,\\n\",\n       \"         -5.13740003e-01,  -1.57594740e-01,  -4.70732421e-01,\\n\",\n       \"         -1.38654308e-02,  -6.85215056e-01,  -3.66586596e-01,\\n\",\n       \"         -1.41351402e-01,  -1.13854766e-01,  -5.36643863e-01,\\n\",\n       \"         -4.75565642e-01,  -5.00832915e-01,  -4.08477843e-01,\\n\",\n       \"         -3.66504490e-01,  -1.15367234e-01,  -2.48915218e-02,\\n\",\n       \"         -4.96757418e-01,   1.17366053e-01,  -2.26039514e-01,\\n\",\n       \"         -5.49678802e-01,  -2.75789142e-01,  -5.08426309e-01,\\n\",\n       \"          1.07284091e-01,  -2.54364550e-01,  -3.72139484e-01,\\n\",\n       \"         -3.34391892e-01,   2.10764147e-02,  -1.33560911e-01,\\n\",\n       \"         -9.50245783e-02,  -3.13357562e-01,  -2.62188077e-01,\\n\",\n       \"         -5.32095313e-01,  -5.31459413e-03,  -3.21489833e-02,\\n\",\n       \"         -7.84164011e-01,  -1.10715240e-01,  -2.87352562e-01,\\n\",\n       \"         -5.71807444e-01,  -2.04134420e-01,   7.85130933e-02,\\n\",\n       \"         -3.69185776e-01,  -1.98006928e-02,   6.63151639e-03,\\n\",\n       \"         -2.87224799e-01,   5.36596589e-02,  -7.96930939e-02,\\n\",\n       \"         -2.82612413e-01,  -1.87133670e-01,  -6.54792845e-01,\\n\",\n       \"         -8.59472081e-02,  -1.13062121e-01,  -1.83315545e-01,\\n\",\n       \"         -2.58277714e-01,  -5.51701725e-01,  -5.59242129e-01,\\n\",\n       \"         -1.50169775e-01,   4.73141856e-02,  
-1.68764800e-01,\\n\",\n       \"         -2.75284111e-01,  -4.43699747e-01,  -2.76820183e-01,\\n\",\n       \"         -3.51191200e-02,  -1.07176892e-01,  -4.73967902e-02,\\n\",\n       \"         -4.53751475e-01,  -2.84370124e-01,  -4.89342690e-01,\\n\",\n       \"         -3.81000303e-02,  -5.29655755e-01,  -1.50656566e-01,\\n\",\n       \"         -4.64593619e-01,  -1.58045471e-01,  -7.06188157e-02,\\n\",\n       \"         -4.04648870e-01,  -3.15317452e-01,  -2.87708908e-01,\\n\",\n       \"         -1.71832666e-01,  -2.27938369e-01,  -2.11054739e-02,\\n\",\n       \"         -3.29687774e-01,  -1.82581544e-01,  -2.17228252e-02,\\n\",\n       \"          2.08218992e-02,  -1.46109968e-01,  -7.96382129e-02,\\n\",\n       \"         -3.17795098e-01,  -5.75634658e-01,  -3.44916396e-02,\\n\",\n       \"         -4.36014533e-01,  -2.85244137e-02,  -5.68732560e-01,\\n\",\n       \"         -5.59068859e-01,  -1.22407533e-01,  -2.56792486e-01,\\n\",\n       \"         -2.97368616e-01,  -3.03129584e-01,  -1.62084669e-01,\\n\",\n       \"         -2.64727145e-01,  -4.05563980e-01,   3.00995618e-01,\\n\",\n       \"         -1.86940640e-01,  -9.05097499e-02,  -1.19438395e-01,\\n\",\n       \"         -1.88409179e-01,  -3.68620992e-01,   3.19603570e-02,\\n\",\n       \"         -5.20787895e-01,  -2.95364499e-01,  -1.96136490e-01,\\n\",\n       \"          1.30156171e+00,  -3.09764799e-02,  -1.63758829e-01,\\n\",\n       \"         -1.63395420e-01,  -1.06308326e-01,  -3.37606370e-01,\\n\",\n       \"         -4.02779371e-01,  -1.04163669e-01,  -3.29879135e-01,\\n\",\n       \"         -6.24738149e-02,   7.57394284e-02,  -6.51596487e-01,\\n\",\n       \"         -2.37611696e-01,  -5.25772333e-01,   1.44061729e-01,\\n\",\n       \"         -2.59940475e-01,  -2.72920489e-01,  -3.10522407e-01,\\n\",\n       \"         -8.48866284e-01,  -5.29746771e-01,  -1.75354518e-02,\\n\",\n       \"         -8.73476788e-02,  -4.62230533e-01,  -3.12623024e-01,\\n\",\n       \"         
-4.66565102e-01,  -2.35941991e-01,  -4.72842991e-01,\\n\",\n       \"         -8.59152302e-02,  -3.31128508e-01,  -1.34016275e-01,\\n\",\n       \"         -6.82140663e-02,  -1.31053597e-01,   3.27668451e-02,\\n\",\n       \"         -4.59252357e-01,  -7.40645081e-02,  -2.32884094e-01,\\n\",\n       \"         -2.48913141e-03,  -5.38118541e-01,  -6.48121983e-02,\\n\",\n       \"         -2.82097995e-01,  -4.83397216e-01,  -3.75957131e-01,\\n\",\n       \"         -1.20243065e-01,  -2.91992631e-02,  -2.34807402e-01,\\n\",\n       \"         -8.57004896e-02,  -1.76332936e-01,  -4.79596853e-01,\\n\",\n       \"         -3.59954983e-01,  -3.86393666e-01,  -1.49604112e-01,\\n\",\n       \"          9.89474952e-02,  -1.43513409e-02,  -5.00253379e-01,\\n\",\n       \"         -2.31766224e-01,  -2.78296471e-01,  -1.47517323e-01,\\n\",\n       \"         -2.70760179e-01,   5.62180728e-02,   1.26814142e-01,\\n\",\n       \"         -2.58570649e-02,  -3.02321255e-01,  -5.06240189e-01,\\n\",\n       \"         -3.60810488e-01,  -1.61365643e-01,  -1.28059566e-01,\\n\",\n       \"         -2.62734950e-01,  -1.67697724e-02,   9.22571719e-02,\\n\",\n       \"         -7.30941415e-01,  -3.17986846e-01,  -3.49215209e-01,\\n\",\n       \"         -4.75899428e-01,  -5.54573357e-01,  -2.22814456e-01,\\n\",\n       \"         -9.33618564e-03,  -4.88777943e-02,  -2.79946309e-02,\\n\",\n       \"         -2.43498668e-01,   1.63741887e-01,  -8.86490270e-02,\\n\",\n       \"         -1.80582032e-02,   5.81286959e-02,  -5.06547272e-01,\\n\",\n       \"         -2.36781448e-01,  -2.82066971e-01,   3.62231545e-02,\\n\",\n       \"          5.59952706e-02,  -5.27004182e-01,  -5.63789010e-02,\\n\",\n       \"         -6.33812070e-01,  -7.20118701e-01,  -3.27905029e-01,\\n\",\n       \"         -1.09615184e-01,  -1.97968498e-01,  -3.48774903e-02,\\n\",\n       \"         -4.36178327e-01,  -1.90760285e-01,  -2.00712010e-01,\\n\",\n       \"         -4.05785292e-02,  -7.98018798e-02,  
-6.48312092e-01,\\n\",\n       \"         -5.16030610e-01,  -1.82418972e-02,  -3.22774321e-01,\\n\",\n       \"         -1.91510841e-01,  -1.31354675e-01,  -5.67911983e-01,\\n\",\n       \"         -4.27046567e-01,  -2.61492878e-01,  -7.63690919e-02,\\n\",\n       \"         -3.53502780e-01,  -2.86672637e-02,   6.57036155e-02,\\n\",\n       \"         -2.32697666e-01,  -2.25740999e-01,  -2.21521795e-01,\\n\",\n       \"          3.64017077e-02,  -4.65820670e-01,  -1.67809874e-01,\\n\",\n       \"         -2.34040041e-02,  -3.40095460e-01,   5.10562137e-02,\\n\",\n       \"         -2.80955017e-01,   2.17410009e-02,  -2.25610495e-01,\\n\",\n       \"         -2.61850543e-02,  -1.18860357e-01,   9.67218876e-02,\\n\",\n       \"         -6.98161423e-01,  -4.03901875e-01,  -2.49750782e-02,\\n\",\n       \"         -1.49894670e-01,  -1.55417640e-02,  -2.35045440e-02,\\n\",\n       \"         -1.22158304e-02,  -3.60701740e-01,  -5.72664201e-01,\\n\",\n       \"         -4.56410229e-01,  -9.86423045e-02,  -5.59065938e-01,\\n\",\n       \"         -2.43323550e-01,   1.14932351e-01,  -1.32146357e-02,\\n\",\n       \"         -1.13701306e-01,  -2.43878905e-02,   3.04878563e-01,\\n\",\n       \"         -2.93137670e-01,  -4.26690668e-01,  -1.90759376e-01,\\n\",\n       \"         -5.80423713e-01,   1.61198322e-02,  -3.25486124e-01,\\n\",\n       \"         -3.21475148e-01,  -2.53617167e-01,  -1.20874017e-01,\\n\",\n       \"         -4.76823658e-01,  -3.47528964e-01,  -2.89901286e-01,\\n\",\n       \"          2.24457998e-02,  -4.97344643e-01,   1.08718812e+00,\\n\",\n       \"         -2.79220223e-01], dtype=float32),\\n\",\n       \" array([[ 0.03900816,  0.00785677, -0.06511776, ...,  0.00776991,\\n\",\n       \"         -0.05963232, -0.05985177],\\n\",\n       \"        [-0.20750827,  0.08817152,  0.40323174, ...,  0.20854132,\\n\",\n       \"         -0.11089708,  0.14705186],\\n\",\n       \"        [-0.24851227,  0.36102909,  0.07329425, ...,  0.12305254,\\n\",\n       
\"          0.02824712,  0.2746895 ],\\n\",\n       \"        ..., \\n\",\n       \"        [-0.27076459,  0.04397521,  0.10150083, ..., -0.02952144,\\n\",\n       \"          0.35495111,  0.01788467],\\n\",\n       \"        [-0.22880824, -0.14765862, -0.01148497, ..., -0.04802479,\\n\",\n       \"         -0.11898327,  0.16021334],\\n\",\n       \"        [-0.01458607,  0.51388001,  0.25630933, ...,  0.10885861,\\n\",\n       \"         -0.15997633,  0.01113635]], dtype=float32),\\n\",\n       \" array([-0.36252829, -0.41307127, -0.37561458, -0.790694  , -0.7867986 ,\\n\",\n       \"        -0.39656818, -0.49989551, -0.56961799, -0.67535901, -0.78190619,\\n\",\n       \"        -0.64679927, -0.62336636, -0.73334086, -0.51707494, -0.80007225,\\n\",\n       \"        -0.57039291, -0.43117863, -0.57423478, -1.01204598, -0.99576569,\\n\",\n       \"        -0.45388478, -0.9715423 , -0.57562113, -0.85434681, -0.4783178 ,\\n\",\n       \"        -0.65333492, -0.56394655, -0.51519966, -0.87941819, -0.9431147 ,\\n\",\n       \"        -0.52889907, -0.51141596, -1.04037309, -0.87605566, -0.5586676 ,\\n\",\n       \"        -0.67145008, -0.62178028, -0.74712718, -0.47700772, -0.81794   ,\\n\",\n       \"        -0.94796181, -1.03332078, -0.99911004, -0.35762793, -0.41830212,\\n\",\n       \"        -0.44990394, -0.54796964, -0.64622766, -0.36980084, -0.62949306,\\n\",\n       \"        -0.73081511, -0.92071664, -0.96040893, -0.17141432, -0.50711352,\\n\",\n       \"        -0.68742466, -0.58205402, -0.60873783, -0.51237881, -0.42307621,\\n\",\n       \"        -0.59278268, -0.77905166, -0.70859444, -0.99470675, -0.68357819,\\n\",\n       \"        -0.45728955, -0.98573047, -0.7740072 , -0.76561183, -0.38337517,\\n\",\n       \"        -0.78785807, -0.9682638 , -0.41092423, -0.81709141, -0.4595961 ,\\n\",\n       \"        -0.45476505, -0.89052409, -0.95178139, -0.920165  , -0.83498871,\\n\",\n       \"        -0.54309958, -0.62142682, -0.10648966, -0.55824465, 
-0.51698029,\\n\",\n       \"        -0.65391433, -0.73073816, -0.63968295, -0.73563075, -0.37823838,\\n\",\n       \"        -0.83874625, -0.35336301, -0.72945499, -0.61786187, -1.04557991,\\n\",\n       \"        -0.58565521, -0.35223064, -0.30662736, -0.66361117, -0.74605358,\\n\",\n       \"        -0.79575521, -1.12011874, -0.65195775, -0.66316205, -0.30292839,\\n\",\n       \"        -0.97478765, -0.30300212, -0.98781288, -0.88087404, -0.56088251,\\n\",\n       \"        -0.82704026, -0.57432526, -0.44808209, -0.65736598, -0.7800023 ,\\n\",\n       \"        -0.43863136, -0.71997589, -0.79668957, -0.58597511, -0.79392022,\\n\",\n       \"        -0.91689253, -0.17079359, -0.70273119, -0.31935337, -0.99297088,\\n\",\n       \"        -1.21429086, -0.54536754, -0.66847122, -1.0803057 , -0.02116329,\\n\",\n       \"        -0.36946481, -0.78094089, -0.67028719, -0.63478422, -0.56762469,\\n\",\n       \"        -0.59048861, -0.40834036, -0.76510531, -0.86944491, -0.26183733,\\n\",\n       \"        -0.64363545, -0.21043499, -0.80520427, -0.98543239, -1.02239132,\\n\",\n       \"        -0.87130302, -1.06532812, -0.47601402, -0.55352145, -0.75008106,\\n\",\n       \"        -0.57477021, -0.73686802, -0.44472244, -0.64302158, -0.61648601,\\n\",\n       \"        -1.09791934, -0.83204991, -0.40939972, -0.82405424, -0.57132626,\\n\",\n       \"        -0.85813493, -0.84275389, -0.53043413, -1.03980398, -0.41696942,\\n\",\n       \"        -0.99465734, -0.70751721, -0.94126099, -0.70646006, -0.85644752,\\n\",\n       \"        -0.75323451, -0.62099051, -0.99225199, -0.81427616, -0.72105873,\\n\",\n       \"        -0.3865678 , -0.71929121, -0.85359961, -0.47467613, -0.49992275,\\n\",\n       \"        -0.78395241, -0.66783226, -0.85084015, -0.37230313, -0.74241304,\\n\",\n       \"        -0.52368313, -0.57518154, -0.88761586, -0.78079957, -0.84552658,\\n\",\n       \"        -0.60064358, -0.58771318, -0.68866116, -0.7030834 , -0.8059988 ,\\n\",\n       \"        
-0.71570534, -0.56441271, -0.89694452, -0.83912975, -0.46641162], dtype=float32),\\n\",\n       \" array([[-0.78751951,  0.02826324, -0.07172652, ..., -0.27620244,\\n\",\n       \"         -0.47863257, -0.49731782],\\n\",\n       \"        [-0.49682441,  0.04474993, -0.77598727, ..., -0.54524791,\\n\",\n       \"         -0.21792939, -0.47720003],\\n\",\n       \"        [-0.2323969 , -0.88028777, -0.2349651 , ..., -0.14491257,\\n\",\n       \"         -0.17279406, -0.64144588],\\n\",\n       \"        ..., \\n\",\n       \"        [-0.7111882 , -0.30641097, -0.66904122, ..., -0.0798426 ,\\n\",\n       \"         -0.57756215, -0.08725328],\\n\",\n       \"        [ 0.11830693,  0.07352046,  0.08562858, ...,  0.09446803,\\n\",\n       \"         -0.41451645, -0.35526502],\\n\",\n       \"        [-0.92134595,  0.0993112 , -0.0636774 , ..., -0.0216356 ,\\n\",\n       \"         -0.54615569, -0.05519475]], dtype=float32),\\n\",\n       \" array([-0.28950188, -0.33981469, -0.49054769, -0.24692491, -0.54108179,\\n\",\n       \"        -0.53850734, -0.51629019, -0.45034203,  0.94987106,  0.34385717], dtype=float32)]\"\n      ]\n     },\n     \"execution_count\": 19,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# View the weights of the trained model.\\n\",\n    \"trained_model.get_weights()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Training time: 22619.2383449\\n\",\n      \"Accuracy: 0.9859\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"Training time: \\\" + str(trainer.get_training_time()))\\n\",\n    \"print(\\\"Accuracy: \\\" + str(evaluate_accuracy(trained_model, test_set)))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python 
[conda root]\",\n   \"language\": \"python\",\n   \"name\": \"conda-root-py\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.12\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}\n"
  },
  {
    "path": "examples/mnist_preprocessing.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# MNIST Preprocessing\\n\",\n    \"\\n\",\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Departement of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"07 February 2017\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!(date +%d\\\\ %B\\\\ %G)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Preparation\\n\",\n    \"\\n\",\n    \"To get started, we first load all the required imports. Please make sure you installed dist-keras, and seaborn. Furthermore, we assume that you have access to an installation which provides Apache Spark.\\n\",\n    \"\\n\",\n    \"Before you start this notebook, place the MNIST dataset (which is provided in a zip in examples/data within this repository) on HDFS. Or in the case HDFS is not available, place it on the local filesystem. 
But make sure the path to the file is identical for all computing nodes.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"import seaborn as sns\\n\",\n    \"\\n\",\n    \"import time\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from matplotlib import pyplot as plt\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import StandardScaler\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import OneHotEncoder\\n\",\n    \"from pyspark.ml.feature import MinMaxScaler\\n\",\n    \"from pyspark.ml.feature import StringIndexer\\n\",\n    \"\\n\",\n    \"from distkeras.transformers import *\\n\",\n    \"from distkeras.utils import *\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the following cell, adapt the parameters to fit your personal requirements.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"MNIST Preprocessing\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"path_train = \\\"data/mnist_train.csv\\\"\\n\",\n    \"path_test = \\\"data/mnist_test.csv\\\"\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_processes = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell 
master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 20\\n\",\n    \"    num_processes = 1\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of desired executors: 20\\n\",\n      \"Number of desired processes / executor: 1\\n\",\n      \"Total number of workers: 20\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\\n\",\n    \"num_workers = num_executors * num_processes\\n\",\n    \"\\n\",\n    \"print(\\\"Number of desired executors: \\\" + `num_executors`)\\n\",\n    \"print(\\\"Number of desired processes / executor: \\\" + `num_processes`)\\n\",\n    \"print(\\\"Total number of workers: \\\" + `num_workers`)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"# Use the DataBricks CSV reader, this has some nice functionality regarding invalid values.\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", `num_processes`)\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", `num_executors`)\\n\",\n    \"conf.set(\\\"spark.executor.memory\\\", \\\"20g\\\")\\n\",\n    \"conf.set(\\\"spark.yarn.executor.memoryOverhead\\\", 
\\\"2\\\")\\n\",\n    \"conf.set(\\\"spark.locality.wait\\\", \\\"0\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\");\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Record time of starting point.\\n\",\n    \"time_start = time.time()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"# Read the training set.\\n\",\n    \"raw_dataset_train = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                          .options(header='true', inferSchema='true') \\\\\\n\",\n    \"                          .load(path_train)\\n\",\n    \"# Read the test set.\\n\",\n    \"raw_dataset_test = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                         .options(header='true', inferSchema='true') \\\\\\n\",\n    \"                         .load(path_test)\\n\",\n    \"# Repartition the datasets.\\n\",\n    \"raw_dataset_train = raw_dataset_train.repartition(num_workers)\\n\",\n    \"raw_dataset_test = raw_dataset_test.repartition(num_workers)\"\n   ]\n  },\n  {\n   
\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As shown in the output of the cell above, we see that every pixel is associated with a seperate column. In order to ensure compatibility with Apache Spark, we vectorize the columns, and add the resulting vectors as a seperate column. However, in order to achieve this, we first need a list of the required columns. This is shown in the cell below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# First, we would like to extract the desired features from the raw dataset.\\n\",\n    \"# We do this by constructing a list with all desired columns.\\n\",\n    \"features = raw_dataset_train.columns\\n\",\n    \"features.remove('label')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once we have a list of columns names, we can pass this to Spark's [VectorAssembler](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler). 
This VectorAssembler will take a list of features, vectorize them, and place them in a column defined in `outputCol`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Next, we use Spark's VectorAssembler to \\\"assemble\\\" (create) a vector of all desired features.\\n\",\n    \"# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\\n\",\n    \"vector_assembler = VectorAssembler(inputCols=features, outputCol=\\\"features\\\")\\n\",\n    \"# This transformer will take all columns specified in features, and create an additional column \\\"features\\\" which will contain all the desired features aggregated into a single vector.\\n\",\n    \"training_set = vector_assembler.transform(raw_dataset_train)\\n\",\n    \"test_set = vector_assembler.transform(raw_dataset_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once we have the inputs for our Neural Network (features column) after applying the VectorAssembler, we should also define the outputs. Since we are dealing with a classification task, the output of our Neural Network should be a one-hot encoded vector with 10 elements. 
For this, we provide a `OneHotTransformer` which accomplishes exactly this task.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Define the number of output classes.\\n\",\n    \"nb_classes = 10\\n\",\n    \"encoder = OneHotTransformer(nb_classes, input_col=\\\"label\\\", output_col=\\\"label_encoded\\\")\\n\",\n    \"training_set = encoder.transform(training_set)\\n\",\n    \"test_set = encoder.transform(test_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## MNIST\\n\",\n    \"\\n\",\n    \"[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits. Every image is a 28 by 28 pixel grayscale image. This means that every pixel has a value between 0 and 255. Some examples of instances within this dataset are shown in the cells below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Normalization\\n\",\n    \"\\n\",\n    \"In this section, we normalize the feature vectors to the [0, 1] range.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Clear the datasets in case you ran this cell before.\\n\",\n    \"training_set = training_set.select(\\\"features\\\", \\\"label\\\", \\\"label_encoded\\\")\\n\",\n    \"test_set = test_set.select(\\\"features\\\", \\\"label\\\", \\\"label_encoded\\\")\\n\",\n    \"# Allocate a MinMaxTransformer using Distributed Keras.\\n\",\n    \"# o_min -> original_minimum\\n\",\n    \"# n_min -> new_minimum\\n\",\n    \"transformer = MinMaxTransformer(n_min=0.0, n_max=1.0, \\\\\\n\",\n    \"                                o_min=0.0, o_max=255.0, \\\\\\n\",\n    \"                                input_col=\\\"features\\\", \\\\\\n\",\n    \"              
                  output_col=\\\"features_normalized\\\")\\n\",\n    \"# Transform the datasets.\\n\",\n    \"training_set = transformer.transform(training_set)\\n\",\n    \"test_set = transformer.transform(test_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Convolutions\\n\",\n    \"\\n\",\n    \"In order to make the dense vectors compatible with convolution operations in Keras, we add another column which contains the matrix form of these images. We provide a utility class (MatrixTransformer), which helps you with this.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"reshape_transformer = ReshapeTransformer(\\\"features_normalized\\\", \\\"matrix\\\", (28, 28, 1))\\n\",\n    \"training_set = reshape_transformer.transform(training_set)\\n\",\n    \"test_set = reshape_transformer.transform(test_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Dense Transformation\\n\",\n    \"\\n\",\n    \"At the moment, dist-keras does not support SparseVectors due to the numpy dependency. As a result, we have to convert the SparseVector to a DenseVector. 
We added a simple utility transformer which does this for you.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dense_transformer = DenseTransformer(input_col=\\\"features_normalized\\\", output_col=\\\"features_normalized_dense\\\")\\n\",\n    \"training_set = dense_transformer.transform(training_set)\\n\",\n    \"test_set = dense_transformer.transform(test_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Artificial Enlargement\\n\",\n    \"\\n\",\n    \"We artificially enlarge the dataset by repeatedly appending it to itself (the cell below enlarges it roughly tenfold) to simulate larger datasets, and to evaluate optimizer performance.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"DataFrame[features: vector, label: bigint, label_encoded: vector, features_normalized: vector, matrix: array<array<array<double>>>, features_normalized_dense: vector]\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df = training_set\\n\",\n    \"expansion = 10\\n\",\n    \"for i in range(0, expansion):\\n\",\n    \"    df = df.unionAll(training_set)\\n\",\n    \"training_set = df\\n\",\n    \"training_set.cache()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Writing to HDFS\\n\",\n    \"\\n\",\n    \"To avoid repeating this preprocessing on every run, and to ensure good optimizer performance, we write the data to HDFS in Parquet format.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    
\"training_set.write.parquet(\\\"data/mnist_train.parquet\\\")\\n\",\n    \"test_set.write.parquet(\\\"data/mnist_test.parquet\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Record end of transformation.\\n\",\n    \"time_end = time.time()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dt = time_end - time_start\\n\",\n    \"print(\\\"Took \\\" + str(dt) + \\\" seconds.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"!hdfs dfs -rm -r data/mnist_test.parquet\\n\",\n    \"!hdfs dfs -rm -r data/mnist_train.parquet\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python 2\",\n   \"language\": \"python\",\n   \"name\": \"python2\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.13\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}\n"
  },
  {
    "path": "examples/workflow.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Distributed Deep Learning with Apache Spark and Keras\\n\",\n    \"\\n\",\n    \"**Joeri Hermans** (Technical Student, IT-DB-SAS, CERN)             \\n\",\n    \"*Departement of Knowledge Engineering*         \\n\",\n    \"*Maastricht University, The Netherlands*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"06 April 2017\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!(date +%d\\\\ %B\\\\ %G)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This presentation will give the reader an introduction to the topic of distributed deep learning (DDL) and to the issues which need to be taken into consideration when applying this technique. We will also introduce a DDL framework based on a **fast and general engine for large-scale data processing** called [Apache Spark](https://spark.apache.org/) and the **neural network library** [Keras](https://keras.io). \\n\",\n    \"\\n\",\n    \"The project was initially initiated by the CMS experiment. CMS is exploring the possibility to use a deep learning model for the high level trigger in order to be able to handle the data rates for LHC run 3 and up. Furthermore, they would like to be able to train their models faster using distributed algorithms which allows them to tune their models with an increased frequency. An other requirement was those models should be trained on their complete dataset, which is in the order of a TB. At this point, production use-cases for ATLAS are also being evaluated. 
These focus more on the serving of models to classify instances.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Contents\\n\",\n    \"\\n\",\n    \"- [Introduction and problem statement](#Distributed-Deep-Learning,-an-introduction.)\\n\",\n    \"  - [Model parallelism](#Model-parallelism2)\\n\",\n    \"  - [Data parallelism](#Data-parallelism)\\n\",\n    \"- [Usage](#Distributed-Keras:-a-practicle-example)\\n\",\n    \"- [Acknowledgments](#Acknowledgments)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Distributed Deep Learning, an introduction.\\n\",\n    \"\\n\",\n    \"Unsupervised feature learning and deep learning have shown that being able to train large models can dramatically improve performance. However, consider the problem of training a deep network with billions of parameters. How do we achieve this without waiting for days, or even weeks, thus leaving more time to tune the model? Dean et al. [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf) proposed a training paradigm which allows us to train a model on multiple physical machines. The authors describe two methods to achieve this, i.e., **data parallelism** and **model parallelism**<sup>1</sup>.\\n\",\n    \"\\n\",\n    \"### Model parallelism<sup><span style=\\\"font-size: 0.75em\\\">2</span></sup>\\n\",\n    \"\\n\",\n    \"In model parallelism a *single* model is distributed over multiple machines. The performance benefits of distributing a deep network across multiple machines depend mainly on the structure of the model. 
Models with a large number of parameters typically benefit from access to more CPUs and memory, up to the point where communication costs, i.e., propagation of the weight updates and synchronization mechanisms, dominate [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).\\n\",\n    \"\\n\",\n    \"<img src=\\\"https://github.com/JoeriHermans/dist-keras/blob/master/resources/model_parallelism.png?raw=true\\\" alt=\\\"Model Parallelism\\\" width=\\\"300px\\\" />\\n\",\n    \"\\n\",\n    \"### Data parallelism\\n\",\n    \"\\n\",\n    \"As stated in the introduction, in order to train a large network in a reasonable amount of time, we need to parallelize the optimization process (which is the learning of the model). In this setting, we take *several* model replicas, and distribute them over multiple machines. Of course, it would also be possible to combine this with the model parallelism approach. However, for the sake of simplicity, let us assume that a model (or several models) can be contained on a single machine. In order to parallelize the training, and to improve the usage of the resources of the cluster, we distribute the models over several machines.\\n\",\n    \"\\n\",\n    \"In order to build a distributed learning scheme using data parallelism, you would in the simplest case need at least one **parameter server**. A parameter server is basically a thread (or a collection of threads) which aggregates the incoming gradient updates of the workers into a so-called center variable, which acts as a *global consensus* variable. Finally, the weights which are stored in the center variable will eventually be used by the produced model.\\n\",\n    \"\\n\",\n    \"<img src=\\\"https://github.com/JoeriHermans/dist-keras/blob/master/resources/data_parallelism.png?raw=true\\\" alt=\\\"Data Parallelism\\\" width=\\\"400px\\\" />\\n\",\n    \"\\n\",\n    \"There are two general approaches towards solving data parallelism. 
The most straightforward is a **synchronous** method. In short, a synchronous data parallel method will wait for all workers to finish the current mini-batch or stochastic sample before continuing to the next iteration. Synchronous methods have the advantage that all workers will use the most recent center variable, i.e., a worker knows that all other workers will use the same center variable. However, the main disadvantage of this method is the synchronization itself. A synchronous method will never be truly synchronous, due to the many, and possibly heterogeneous, machines involved. Furthermore, every machine could have a different workload, which would influence the training speed of a worker. As a result, synchronous methods need additional waiting mechanisms to synchronize all workers. These locking mechanisms make sure that all workers compute the next gradient based on the same center variable. However, locking mechanisms induce a significant wait, which slows down training. For example, imagine a cluster node with an unusually high load. This high load will, due to CPU sharing, cause the training procedure to slow down, which, in turn, will cause the other workers to wait for this single node. Of course, this is a simple and possibly extreme example, but it shows how a single worker can significantly influence the training time of all workers.\\n\",\n    \"\\n\",\n    \"<img src=\\\"https://github.com/JoeriHermans/dist-keras/blob/master/resources/synchronous_method.png?raw=true\\\" alt=\\\"Data Parallelism\\\" width=\\\"500px\\\" />\\n\",\n    \"\\n\",\n    \"A very simple, but radical \\\"solution\\\" to this synchronization problem is to not synchronize the workers :) Workers simply fetch the center variable and update the parameter server with the computed gradient whenever a worker is ready. This approach is called an **asynchronous** data parallel method. 
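\\n\",\n    \"\\n\",\n    \"To make the update cycle concrete, here is a minimal single-process sketch of an asynchronous worker talking to a parameter server (the `ParameterServer` class and the toy gradient below are illustrative assumptions, not the Distributed Keras API):\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# Illustrative sketch only; not the Distributed Keras API.\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"class ParameterServer(object):\\n\",\n    \"    # Toy in-memory parameter server holding the center variable.\\n\",\n    \"    def __init__(self, num_parameters):\\n\",\n    \"        self.center_variable = np.zeros(num_parameters)\\n\",\n    \"\\n\",\n    \"    def pull(self):\\n\",\n    \"        # A worker fetches the current center variable.\\n\",\n    \"        return self.center_variable.copy()\\n\",\n    \"\\n\",\n    \"    def push(self, gradient, learning_rate=0.1):\\n\",\n    \"        # Incoming gradients are applied immediately, without waiting for other workers.\\n\",\n    \"        self.center_variable -= learning_rate * gradient\\n\",\n    \"\\n\",\n    \"def asynchronous_worker(server, mini_batches):\\n\",\n    \"    # Every worker runs this loop independently; no locking is involved.\\n\",\n    \"    for batch in mini_batches:\\n\",\n    \"        weights = server.pull()             # possibly stale by the time we push\\n\",\n    \"        gradient = 2.0 * (weights - batch)  # toy gradient pulling the weights towards the batch\\n\",\n    \"        server.push(gradient)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"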
Asynchronous methods, compared to synchronous methods, will have a different set of problems. One of these is the so-called **stale gradient**. This is a gradient computed from an older version of the center variable, while the center variable itself has already been updated by other workers. One approach to mitigate this is to apply an exponential decay factor to stale gradient updates. Alternatively, one could simply discard the stale gradient, fetch the most recent weights from the parameter server, and start again, though this wastes computational resources. However, as we will show later, it is actually stale gradients (a result of asynchrony) that induce *implicit momentum* in the learning process [[2]](https://arxiv.org/pdf/1606.04487v4.pdf).\\n\",\n    \"\\n\",\n    \"<img src=\\\"https://github.com/JoeriHermans/dist-keras/blob/master/resources/asynchronous_method.png?raw=true\\\" alt=\\\"Data Parallelism\\\" width=\\\"500px\\\" />\\n\",\n    \"\\n\",\n    \"At this point you probably ask the question: **why does this actually work?** A lot of people suggest this is due to the sparsity of the gradients. Intuitively, imagine having multiple workers processing different data (since every worker has its own data partition); chances are the weight updates will be quite dissimilar, since we are training a large network with a lot of tunable parameters. Furthermore, techniques such as dropout (if they are applied differently among the replicas) only increase the sparsity of the updates<sup>3</sup>.\\n\",\n    \"\\n\",\n    \"### Formalization\\n\",\n    \"\\n\",\n    \"The general problem to be solved here is the so-called **global consensus optimization** problem. A popular approach towards solving this is the Alternating Direction Method of Multipliers (ADMM) [[3]](http://www.jmlr.org/proceedings/papers/v32/zhange14.pdf) [[4]](http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf). 
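In its general form, with $n$ workers holding local variables $x_i$ and a shared center variable $z$, the consensus problem can be sketched as\\n\",\n    \"\\n\",\n    \"$$\\\\min_{x_1, \\\\ldots, x_n, z} \\\\sum_{i=1}^{n} f_i(x_i) \\\\quad \\\\text{subject to} \\\\quad x_i = z, \\\\quad i = 1, \\\\ldots, n.$$\\n\",\n    \"\\n\",\n    \"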
However, since this is outside the scope of this notebook, we will not review it in depth. But we would like to note that the *Elastic Averaging* methods [[5]](https://arxiv.org/pdf/1412.6651.pdf) by Zhang et al., which we included in *Distributed Keras*, are based on ADMM.\\n\",\n    \"\\n\",\n    \"<sub>**1:** Hybrids are possible as well.</sub>      \\n\",\n    \"<sub>**2:** This is mainly used for the computation of the network outputs [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf).</sub>           \\n\",\n    \"<sub>**3:** A way to check the sparsity between 2 gradients is to put all the weights into a one-dimensional vector, and then compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).</sub>\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Distributed Keras\\n\",\n    \"\\n\",\n    \"Distributed Keras is a framework which uses Apache Spark and Keras. We chose Spark because of its distributed environment. This allows us to preprocess the data in a distributed manner, and train our deep learning models on the same infrastructure, while still having the modeling simplicity of Keras.\\n\",\n    \"\\n\",\n    \"### Architecture\\n\",\n    \"\\n\",\n    \"Our architecture is very similar to the architecture discussed in [[1]](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf). However, we employ Apache Spark for data parallel reading and handling of larger-than-memory datasets. The parameter server will always be created in the **Spark Driver**. This **is the program which creates the Spark Context**. For example, if the Jupyter installation of this notebook is running on the Spark cluster, then a cluster node will host the parameter server. However, if you run a Python script, which connects to a remote Spark cluster, then your computer will run the Spark Driver, and as a result will run the parameter server. 
In that case, be sure your network connection is able to handle the load; otherwise your computer will be the bottleneck in the learning process.\\n\",\n    \"\\n\",\n    \"<img src=\\\"https://github.com/JoeriHermans/dist-keras/blob/master/resources/distkeras_architecture.png?raw=true\\\" alt=\\\"Model Parallelism\\\" width=\\\"500px\\\" />\\n\",\n    \"\\n\",\n    \"### Implementation of a custom distributed optimizer\\n\",\n    \"\\n\",\n    \"In order to implement your own optimizer you need two classes. First, define your optimizer using the *Trainer* interface. We already supply an *AsynchronousDistributedTrainer* and a *SynchronousDistributedTrainer*. However, if you require a different procedure, feel free to implement your own. Finally, you need a worker class. This class must have a *train* method with the required arguments, as specified by Apache Spark.\\n\",\n    \"\\n\",\n    \"### Usage\\n\",\n    \"\\n\",\n    \"In the following sections, we will give an example of what a complete workflow looks like. This includes setting up a Spark context, and reading, preprocessing, and normalizing the data. Finally, we create a relatively simple model (feel free to adjust the parameters) with Keras and optimize it using the different distributed optimizers which are included by default.\\n\",\n    \"\\n\",\n    \"#### Dataset\\n\",\n    \"\\n\",\n    \"We are using the ATLAS Higgs dataset constructed for the Kaggle machine learning challenge. This dataset is quite limited: it contains only **250000** instances, 40% of which we will use as a test set. For future experiments, it would be useful to integrate well-understood datasets such as CIFAR or MNIST to evaluate against other optimizers. 
However, it would be nice to have a \\\"well understood\\\" HEP (High Energy Physics) dataset for this task :) \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"import time\\n\",\n    \"\\n\",\n    \"import requests\\n\",\n    \"\\n\",\n    \"from keras.optimizers import *\\n\",\n    \"from keras.models import Sequential\\n\",\n    \"from keras.layers.core import Dense, Dropout, Activation\\n\",\n    \"\\n\",\n    \"from pyspark import SparkContext\\n\",\n    \"from pyspark import SparkConf\\n\",\n    \"\\n\",\n    \"from pyspark.ml.feature import StandardScaler\\n\",\n    \"from pyspark.ml.feature import VectorAssembler\\n\",\n    \"from pyspark.ml.feature import StringIndexer\\n\",\n    \"from pyspark.ml.evaluation import MulticlassClassificationEvaluator\\n\",\n    \"from pyspark.mllib.evaluation import BinaryClassificationMetrics\\n\",\n    \"\\n\",\n    \"from distkeras.trainers import *\\n\",\n    \"from distkeras.predictors import *\\n\",\n    \"from distkeras.transformers import *\\n\",\n    \"from distkeras.evaluators import *\\n\",\n    \"from distkeras.utils import *\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Modify these variables according to your needs.\\n\",\n    \"application_name = \\\"Distributed Keras Notebook\\\"\\n\",\n    \"using_spark_2 = False\\n\",\n    \"local = False\\n\",\n    \"if local:\\n\",\n    \"    # Tell master to use local resources.\\n\",\n    \"    master = \\\"local[*]\\\"\\n\",\n    \"    num_cores = 3\\n\",\n    \"    num_executors = 1\\n\",\n    \"else:\\n\",\n    \"    # Tell 
master to use YARN.\\n\",\n    \"    master = \\\"yarn-client\\\"\\n\",\n    \"    num_executors = 6\\n\",\n    \"    num_cores = 2\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Number of desired executors: 6\\n\",\n      \"Number of desired cores / executor: 2\\n\",\n      \"Total number of workers: 12\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# This variable is derived from the number of cores and executors, and will be used to assign the number of model trainers.\\n\",\n    \"num_workers = num_executors * num_cores\\n\",\n    \"\\n\",\n    \"print(\\\"Number of desired executors: \\\" + str(num_executors))\\n\",\n    \"print(\\\"Number of desired cores / executor: \\\" + str(num_cores))\\n\",\n    \"print(\\\"Total number of workers: \\\" + str(num_workers))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"# Use the DataBricks CSV reader; it has some nice functionality for handling invalid values.\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Preparing a Spark Context\\n\",\n    \"\\n\",\n    \"In order to read our (big) dataset into our Spark cluster, we first need a Spark Context. However, since Spark 2.0 there are some changes regarding the initialization of a Spark Context. 
For example, SQLContext and HiveContext do not have to be initialized separately anymore.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"conf = SparkConf()\\n\",\n    \"conf.set(\\\"spark.app.name\\\", application_name)\\n\",\n    \"conf.set(\\\"spark.master\\\", master)\\n\",\n    \"conf.set(\\\"spark.executor.cores\\\", str(num_cores))\\n\",\n    \"conf.set(\\\"spark.executor.instances\\\", str(num_executors))\\n\",\n    \"conf.set(\\\"spark.locality.wait\\\", \\\"0\\\")\\n\",\n    \"conf.set(\\\"spark.serializer\\\", \\\"org.apache.spark.serializer.KryoSerializer\\\")\\n\",\n    \"\\n\",\n    \"# Check if the user is running Spark 2.0 +\\n\",\n    \"if using_spark_2:\\n\",\n    \"    # SparkSession is only available in Spark 2.0 and up.\\n\",\n    \"    from pyspark.sql import SparkSession\\n\",\n    \"    sc = SparkSession.builder.config(conf=conf) \\\\\\n\",\n    \"            .appName(application_name) \\\\\\n\",\n    \"            .getOrCreate()\\n\",\n    \"else:\\n\",\n    \"    # Create the Spark context.\\n\",\n    \"    sc = SparkContext(conf=conf)\\n\",\n    \"    # Add the missing imports\\n\",\n    \"    from pyspark import SQLContext\\n\",\n    \"    sqlContext = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Check if we are using Spark 2.0\\n\",\n    \"if using_spark_2:\\n\",\n    \"    reader = sc\\n\",\n    \"else:\\n\",\n    \"    reader = sqlContext\\n\",\n    \"# Read the dataset.\\n\",\n    \"raw_dataset = reader.read.format('com.databricks.spark.csv') \\\\\\n\",\n    \"                    .options(header='true', inferSchema='true').load(\\\"data/atlas_higgs.csv\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n    
  \"root\\n\",\n      \" |-- EventId: integer (nullable = true)\\n\",\n      \" |-- DER_mass_MMC: double (nullable = true)\\n\",\n      \" |-- DER_mass_transverse_met_lep: double (nullable = true)\\n\",\n      \" |-- DER_mass_vis: double (nullable = true)\\n\",\n      \" |-- DER_pt_h: double (nullable = true)\\n\",\n      \" |-- DER_deltaeta_jet_jet: double (nullable = true)\\n\",\n      \" |-- DER_mass_jet_jet: double (nullable = true)\\n\",\n      \" |-- DER_prodeta_jet_jet: double (nullable = true)\\n\",\n      \" |-- DER_deltar_tau_lep: double (nullable = true)\\n\",\n      \" |-- DER_pt_tot: double (nullable = true)\\n\",\n      \" |-- DER_sum_pt: double (nullable = true)\\n\",\n      \" |-- DER_pt_ratio_lep_tau: double (nullable = true)\\n\",\n      \" |-- DER_met_phi_centrality: double (nullable = true)\\n\",\n      \" |-- DER_lep_eta_centrality: double (nullable = true)\\n\",\n      \" |-- PRI_tau_pt: double (nullable = true)\\n\",\n      \" |-- PRI_tau_eta: double (nullable = true)\\n\",\n      \" |-- PRI_tau_phi: double (nullable = true)\\n\",\n      \" |-- PRI_lep_pt: double (nullable = true)\\n\",\n      \" |-- PRI_lep_eta: double (nullable = true)\\n\",\n      \" |-- PRI_lep_phi: double (nullable = true)\\n\",\n      \" |-- PRI_met: double (nullable = true)\\n\",\n      \" |-- PRI_met_phi: double (nullable = true)\\n\",\n      \" |-- PRI_met_sumet: double (nullable = true)\\n\",\n      \" |-- PRI_jet_num: integer (nullable = true)\\n\",\n      \" |-- PRI_jet_leading_pt: double (nullable = true)\\n\",\n      \" |-- PRI_jet_leading_eta: double (nullable = true)\\n\",\n      \" |-- PRI_jet_leading_phi: double (nullable = true)\\n\",\n      \" |-- PRI_jet_subleading_pt: double (nullable = true)\\n\",\n      \" |-- PRI_jet_subleading_eta: double (nullable = true)\\n\",\n      \" |-- PRI_jet_subleading_phi: double (nullable = true)\\n\",\n      \" |-- PRI_jet_all_pt: double (nullable = true)\\n\",\n      \" |-- Weight: double (nullable = true)\\n\",\n      
\" |-- Label: string (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# Double-check the inferred schema, and get fetch a row to show how the dataset looks like.\\n\",\n    \"raw_dataset.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Dataset preprocessing and normalization\\n\",\n    \"\\n\",\n    \"Since Spark's MLlib has some nice features for distributed preprocessing, we made sure we comply to the DataFrame API in order to ensure compatibility. What it basically boils down to, is that all the features (which can have different type) will be aggregated into a single column. More information on Spark MLlib (and other APIs) can be found here: [http://spark.apache.org/docs/latest/ml-guide.html](http://spark.apache.org/docs/latest/ml-guide.html)\\n\",\n    \"\\n\",\n    \"In the following steps we will show you how to extract the desired columns from the dataset and prepare the for further processing.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[Row(features=DenseVector([138.47, 51.655, 97.827, 27.98, 0.91, 124.711, 2.666, 3.064, 41.928, 197.76, 1.582, 1.396, 0.2, 32.638, 1.017, 0.381, 51.626, 2.273, -2.414, 16.824, -0.277, 258.733, 2.0, 67.435, 2.15, 0.444, 46.062, 1.24, -2.475, 113.497]))]\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# First, we would like to extract the desired features from the raw dataset.\\n\",\n    \"# We do this by constructing a list with all desired columns.\\n\",\n    \"features = raw_dataset.columns\\n\",\n    \"features.remove('EventId')\\n\",\n    \"features.remove('Weight')\\n\",\n    \"features.remove('Label')\\n\",\n    \"# Next, we use Spark's 
VectorAssembler to \\\"assemble\\\" (create) a vector of all desired features.\\n\",\n    \"# http://spark.apache.org/docs/latest/ml-features.html#vectorassembler\\n\",\n    \"vector_assembler = VectorAssembler(inputCols=features, outputCol=\\\"features\\\")\\n\",\n    \"# This transformer will take all columns specified in features, and create an additional column \\\"features\\\" which will contain all the desired features aggregated into a single vector.\\n\",\n    \"dataset = vector_assembler.transform(raw_dataset)\\n\",\n    \"\\n\",\n    \"# Show what happened after applying the vector assembler.\\n\",\n    \"# Note: \\\"features\\\" column got appended to the end.\\n\",\n    \"dataset.select(\\\"features\\\").take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Apply feature normalization with standard scaling. This will transform a feature to have mean 0, and std 1.\\n\",\n    \"# http://spark.apache.org/docs/latest/ml-features.html#standardscaler\\n\",\n    \"standard_scaler = StandardScaler(inputCol=\\\"features\\\", outputCol=\\\"features_normalized\\\", withStd=True, withMean=True)\\n\",\n    \"standard_scaler_model = standard_scaler.fit(dataset)\\n\",\n    \"dataset = standard_scaler_model.transform(dataset)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[Row(Label=u's', label_index=1.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0),\\n\",\n       \" Row(Label=u'b', label_index=0.0)]\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# If we look at the 
dataset, the Label column consists of 2 entries, i.e., b (background), and s (signal).\\n\",\n    \"# Our neural network will not be able to handle these characters, so instead, we convert them to an index so we can indicate that the output neuron with index 0 is background, and 1 is signal.\\n\",\n    \"# http://spark.apache.org/docs/latest/ml-features.html#stringindexer\\n\",\n    \"label_indexer = StringIndexer(inputCol=\\\"Label\\\", outputCol=\\\"label_index\\\").fit(dataset)\\n\",\n    \"dataset = label_indexer.transform(dataset)\\n\",\n    \"\\n\",\n    \"# Show the result of the label transformation.\\n\",\n    \"dataset.select(\\\"Label\\\", \\\"label_index\\\").take(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Define some properties of the neural network for later use.\\n\",\n    \"nb_classes = 2 # Number of output classes (signal and background)\\n\",\n    \"nb_features = len(features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# We observe that Keras is not able to work with these indexes.\\n\",\n    \"# What it actually expects is a vector with a size identical to that of the output layer.\\n\",\n    \"# Our framework provides functionality to do this with ease.\\n\",\n    \"# What it basically does, given an expected vector dimension, \\n\",\n    \"# is prepare a zero vector with the specified dimensionality, and set the neuron\\n\",\n    \"# with the specified label index to one (one-hot encoding).\\n\",\n    \"\\n\",\n    \"# For example:\\n\",\n    \"# 1. Assume we have a label index: 3\\n\",\n    \"# 2. 
Output dimensionality: 5\\n\",\n    \"# With these parameters, we obtain the following vector in the DataFrame column: [0,0,0,1,0]\\n\",\n    \"\\n\",\n    \"transformer = OneHotTransformer(output_dim=nb_classes, input_col=\\\"label_index\\\", output_col=\\\"label\\\")\\n\",\n    \"dataset = transformer.transform(dataset)\\n\",\n    \"# Only select the columns we need (less data shuffling) while training.\\n\",\n    \"dataset = dataset.select(\\\"features_normalized\\\", \\\"label_index\\\", \\\"label\\\")\\n\",\n    \"\\n\",\n    \"# Show the expected output vectors of the neural network.\\n\",\n    \"dataset.select(\\\"label_index\\\", \\\"label\\\").take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Warning**: shuffling on a large dataset will take some time.\\n\",\n    \"\\n\",\n    \"We recommend that users first preprocess and shuffle their data, as described in the data preprocessing notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Shuffle the dataset.\\n\",\n    \"dataset = shuffle(dataset)\\n\",\n    \"\\n\",\n    \"# Note: we also support shuffling in the trainers by default.\\n\",\n    \"# However, since this would require a shuffle for every training run, we will only do it once here.\\n\",\n    \"# If you want, you can enable the training shuffling by specifying shuffle=True in the train() function.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Finally, we create a training set and a test set.\\n\",\n    \"(training_set, test_set) = dataset.randomSplit([0.6, 0.4])\\n\",\n    \"training_set.cache()\\n\",\n    \"test_set.cache()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 
Model construction\\n\",\n    \"\\n\",\n    \"We will now construct a relatively simple Keras model (without any modifications) which, hopefully, will be able to classify the dataset.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"model = Sequential()\\n\",\n    \"model.add(Dense(500, input_shape=(nb_features,)))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dropout(0.4))\\n\",\n    \"model.add(Dense(500))\\n\",\n    \"model.add(Activation('relu'))\\n\",\n    \"model.add(Dense(nb_classes))\\n\",\n    \"model.add(Activation('softmax'))\\n\",\n    \"\\n\",\n    \"model.summary()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### Worker Optimizer and Loss\\n\",\n    \"\\n\",\n    \"In order to evaluate the gradient on the model replicas, we have to specify an optimizer and a loss method. For this, we just follow the Keras API as defined in the documentation: [https://keras.io/optimizers/](https://keras.io/optimizers/) and [https://keras.io/objectives/](https://keras.io/objectives/).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"optimizer = 'adagrad'\\n\",\n    \"loss = 'categorical_crossentropy'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Training\\n\",\n    \"\\n\",\n    \"In the following cells we will train and evaluate the model using different distributed trainers. However, we will also provide a baseline metric using a **SingleTrainer**, which is basically an instance of the Adagrad optimizer running on Spark.\\n\",\n    \"\\n\",\n    \"Furthermore, we will also evaluate every training run using Spark's MulticlassClassificationEvaluator 
[https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### Evaluation\\n\",\n    \"\\n\",\n    \"We will evaluate all algorithms using the F1 score [https://en.wikipedia.org/wiki/F1_score](https://en.wikipedia.org/wiki/F1_score) and accuracy metrics.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def evaluate_accuracy(model):\\n\",\n    \"    global test_set\\n\",\n    \"    \\n\",\n    \"    # Allocate a Distributed Keras accuracy evaluator.\\n\",\n    \"    evaluator = AccuracyEvaluator(prediction_col=\\\"prediction_index\\\", label_col=\\\"label_index\\\")\\n\",\n    \"    # Clear the prediction column from the test set.\\n\",\n    \"    test_set = test_set.select(\\\"features_normalized\\\", \\\"label_index\\\", \\\"label\\\")\\n\",\n    \"    # Apply a prediction from the trained model that was passed in.\\n\",\n    \"    predictor = ModelPredictor(keras_model=model, features_col=\\\"features_normalized\\\")\\n\",\n    \"    test_set = predictor.predict(test_set)\\n\",\n    \"    # Allocate an index transformer.\\n\",\n    \"    index_transformer = LabelIndexTransformer(output_dim=nb_classes)\\n\",\n    \"    # Transform the prediction vector to an indexed label.\\n\",\n    \"    test_set = index_transformer.transform(test_set)\\n\",\n    \"    # Fetch the score.\\n\",\n    \"    score = evaluator.evaluate(test_set)\\n\",\n    \"    \\n\",\n    \"    return score\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def add_result(trainer, 
accuracy, dt):\\n\",\n    \"    global results\\n\",\n    \"    \\n\",\n    \"    # Store the metrics.\\n\",\n    \"    results[trainer] = {}\\n\",\n    \"    results[trainer]['accuracy'] = accuracy\\n\",\n    \"    results[trainer]['time_spent'] = dt\\n\",\n    \"    # Display the metrics.\\n\",\n    \"    print(\\\"Trainer: \\\" + str(trainer))\\n\",\n    \"    print(\\\" - Accuracy: \\\" + str(accuracy))\\n\",\n    \"    print(\\\" - Training time: \\\" + str(dt))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"But first, we will allocate a simple data structure which will hold the results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"results = {}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### SingleTrainer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"A ***SingleTrainer*** is used as a benchmark to compare the distributed trainers against. 
However, one could also use this trainer if the dataset is too big to fit in memory.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = SingleTrainer(keras_model=model, worker_optimizer=optimizer,\\n\",\n    \"                        loss=loss, features_col=\\\"features_normalized\\\",\\n\",\n    \"                        label_col=\\\"label\\\", num_epoch=1, batch_size=32)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Fetch the evaluation metrics.\\n\",\n    \"accuracy = evaluate_accuracy(trained_model)\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"# Add the metrics to the results.\\n\",\n    \"add_result('single', accuracy, dt)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### Asynchronous EASGD\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"EASGD based methods, proposed by Zhang et al., transmit the complete parametrization instead of the gradient. These methods will then \\\"average\\\" the difference of the center variable and the backpropagated worker variable. 
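Schematically, for worker $i$ with parameters $x_i$, center variable $\\\\tilde{x}$, learning rate $\\\\eta$, elasticity strength $\\\\rho$, and local gradient $g_i$ (writing $\\\\alpha = \\\\eta\\\\rho$), a single asynchronous elastic averaging step can be sketched as:\\n\",\n    \"\\n\",\n    \"$$x_i \\\\leftarrow x_i - \\\\eta \\\\, g_i(x_i) - \\\\alpha (x_i - \\\\tilde{x}), \\\\qquad \\\\tilde{x} \\\\leftarrow \\\\tilde{x} + \\\\alpha (x_i - \\\\tilde{x}).$$\\n\",\n    \"\\n\",\n    \"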
This is used to compute a new master variable, on which the worker nodes will base their backpropagation in the next iteration.\\n\",\n    \"\\n\",\n    \"Asynchronous EASGD will do this in an asynchronous fashion, meaning that whenever a worker node has processed its mini-batches for a certain number of iterations (the communication window), the computed parameters will be communicated to the parameter server, which will update the center (master) variable immediately, without waiting for other workers.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = AEASGD(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers, \\n\",\n    \"                 batch_size=32, features_col=\\\"features_normalized\\\", label_col=\\\"label\\\", num_epoch=1,\\n\",\n    \"                 communication_window=32, rho=5.0, learning_rate=0.1)\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Fetch the evaluation metrics.\\n\",\n    \"accuracy = evaluate_accuracy(trained_model)\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"# Add the metrics to the results.\\n\",\n    \"add_result('aeasgd', accuracy, dt)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### Asynchronous EAMSGD\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The only difference between asynchronous EAMSGD and asynchronous EASGD is the possibility of specifying an explicit momentum term.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   
},\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = EAMSGD(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers,\\n\",\n    \"                 batch_size=32, features_col=\\\"features_normalized\\\", label_col=\\\"label\\\", num_epoch=1,\\n\",\n    \"                 communication_window=32, rho=5.0, learning_rate=0.1, momentum=0.6)\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Fetch the evaluation metrics.\\n\",\n    \"accuracy = evaluate_accuracy(trained_model)\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"# Add the metrics to the results.\\n\",\n    \"add_result('eamsgd', accuracy, dt)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### DOWNPOUR SGD\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"trainer = DOWNPOUR(keras_model=model, worker_optimizer=optimizer, loss=loss, num_workers=num_workers,\\n\",\n    \"                   batch_size=32, communication_window=5, learning_rate=0.05, num_epoch=1,\\n\",\n    \"                   features_col=\\\"features_normalized\\\", label_col=\\\"label\\\")\\n\",\n    \"trainer.set_parallelism_factor(1)\\n\",\n    \"trained_model = trainer.train(training_set)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Fetch the evaluation metrics.\\n\",\n    \"accuracy = evaluate_accuracy(trained_model)\\n\",\n    \"dt = trainer.get_training_time()\\n\",\n    \"# Add the metrics to the results.\\n\",\n    \"add_result('downpour', accuracy, 
dt)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Experimental observations\\n\",\n    \"\\n\",\n    \"- DOWNPOUR converges well when a small communication window ($< 5$) is used.\\n\",\n    \"- EASGD-based methods, on the other hand, thrive with large communication windows ($> 25$).\\n\",\n    \"- Asynchronous methods induce implicit momentum.\\n\",\n    \"\\n\",\n    \"## Summary\\n\",\n    \"\\n\",\n    \"Distributed Deep Learning can **significantly speed up** the learning process. We provide such a framework built on top of Keras and Apache Spark. The latter provides a nice framework for distributed data processing and model evaluation. We can easily integrate our workflows with Apache Spark, thus speeding up **data preprocessing** and our **model optimization procedure** while retaining the same **modelling simplicity**.\\n\",\n    \"\\n\",\n    \"Our group is always open to further collaboration on this work, and would like to assist the physics community in their machine learning efforts.\\n\",\n    \"\\n\",\n    \"**Contact**: [joeri.hermans@cern.ch](mailto:joeri.hermans@cern.ch)\\n\",\n    \"             [luca.canali@cern.ch](mailto:luca.canali@cern.ch)\\n\",\n    \"             [zbigniew.baranowski@cern.ch](mailto:zbigniew.baranowski@cern.ch)\\n\",\n    \"\\n\",\n    \"## Future work\\n\",\n    \"\\n\",\n    \"- Understand the \\\"theoretical\\\" meaning of the communication window.\\n\",\n    \"- Apply compression to large weight updates when sending updates to the parameter server.\\n\",\n    \"- Keep track of a gradient residual (this will reduce the bandwidth due to sparsity).\\n\",\n    \"- Evaluate the algorithms on GPUs.\\n\",\n    \"- Optimize parameter sharing (e.g., sockets instead of a REST API).\\n\",\n    \"- Use \\\"famous\\\" ConvNet architectures with well-known datasets in order to have a more sound evaluation.\\n\",\n    \"- Add a threaded queue to process asynchronous 
updates.\\n\",\n    \"- Track training accuracy while training.\\n\",\n    \"- Stop training on reaching a target loss.\\n\",\n    \"\\n\",\n    \"## Acknowledgments\\n\",\n    \"\\n\",\n    \"Many thanks to Zbigniew Baranowski and Luca Canali of the IT-DB group, and to Jean-Roch Vlimant, Maurizio Pierini, and Federico Presutti of the EP-UCM group for their collaboration on this work.\\n\",\n    \"\\n\",\n    \"## GitHub repository\\n\",\n    \"\\n\",\n    \"[https://github.com/JoeriHermans/dist-keras](https://github.com/JoeriHermans/dist-keras)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python 2\",\n   \"language\": \"python\",\n   \"name\": \"python2\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 2\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython2\",\n   \"version\": \"2.7.13\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "mkdocs.yml",
    "content": "# Project information\nsite_name: Distributed Keras\nsite_description: Distributed Deep Learning with Apache Spark and Keras.\nsite_author: Joeri Hermans\nsite_url: 'http://dist-keras.joerihermans.com'\n\n# Page definitions.\npages:\n- Home: index.md\n- Optimizers: optimizers.md\n- License: license.md\n# Documentation and theme configuration\ntheme: readthedocs\ndocs_dir: 'docs'\nmarkdown_extensions:\n- admonition\nextra:\n    version: '0.1.0'\n    palette:\n        primary: 'grey'\n        accent: 'light blue'\n    author:\n        github: 'JoeriHermans'\n        twitter: 'joeri_hermans'\n\n# Copyright\ncopyright: 'Copyright (c) 2016 Joeri Hermans'\n\n# Repository\nrepo_name: 'GitHub'\nrepo_url: 'https://github.com/JoeriHermans/dist-keras'\n"
  },
  {
    "path": "resources/blog-posts/css/main.css",
    "content": "/**\n * joerihermans.com main stylesheet.\n *\n * @author  Joeri Hermans\n * @version 0.1\n * @since   28 June 2016\n */\n\n/** BEGIN Imports. ************************************************************/\n\n@import url('https://fonts.googleapis.com/css?family=Roboto:300');\n\n/** END Imports. **************************************************************/\n\n/** BEGIN General. ************************************************************/\n\nhtml, body {\n  background: white;\n  color: #1b1b1b;\n  font-family: 'Roboto', sans-serif;\n  height: 100%;\n  margin: 0;\n  font-size: 0.95em;\n}\n\nbody {\n  padding: 50px;\n}\n\na {\n  color: #4078c0;\n  cursor: pointer;\n  text-decoration: none;\n}\n\na:hover {\n  text-decoration: none;\n}\n\na img {\n  border: 0;\n}\n\n.left {\n  float: left !important;\n}\n\n.right {\n  float: right !important;\n}\n\n.inlined-code {\n  border-radius: 2px;\n  background: #f0f0f0;\n  padding: 2px 4px;\n}\n\n.center-text {\n  text-align: center;\n}\n\n.image-container {\n  text-align: center;\n}\n\n/** END General. **************************************************************/\n\n/** BEGIN Math. ***************************************************************/\n\n.equation-math {\n  text-align: center;\n  margin: 40px 0;\n  position: relative;\n}\n\n.equation-math span {\n  text-align: left;\n}\n\n.equation-math-number {\n  position: absolute;\n  top: 0;\n  bottom: 0;\n  margin: 0;\n  height: 100%;\n  line-height: 40px;\n  right: 20px;\n  font-size: 0.9em;\n  color: #1b1b1b;\n  font-weight: bold;\n}\n\n/** END Math. *****************************************************************/\n\n/** BEGIN Blog. ***************************************************************/\n\n.blog-figure-container {\n  text-align: center;\n}\n\n.blog-figure-container img {\n  max-width: 100%;\n}\n\n.blog-figure-container p {\n  text-align: left;\n}\n\n/** END Blog. 
****************************************************************/\n\n/** BEGIN Highlight. *********************************************************/\n\n.hljs {\n  display: block;\n  overflow-x: auto;\n  padding: 10px 30px;\n  background: #f0f0f0;\n}\n\n.hljs,\n.hljs-subst {\n  color: #444;\n}\n\n.hljs-comment {\n  color: #888888;\n}\n\n.hljs-keyword,\n.hljs-attribute,\n.hljs-selector-tag,\n.hljs-meta-keyword,\n.hljs-doctag,\n.hljs-name {\n  font-weight: bold;\n}\n\n.hljs-type,\n.hljs-string,\n.hljs-number,\n.hljs-selector-id,\n.hljs-selector-class,\n.hljs-quote,\n.hljs-template-tag,\n.hljs-deletion {\n  color: #d4645c;\n}\n\n.hljs-title,\n.hljs-section {\n  color: #d4645c;\n  font-weight: bold;\n}\n\n.hljs-literal {\n  color: #78A960;\n}\n\n.hljs-built_in,\n.hljs-bullet,\n.hljs-code,\n.hljs-addition {\n  color: #397300;\n}\n\n.hljs-meta {\n  color: #1f7199;\n}\n\n.hljs-meta-string {\n  color: #4d99bf;\n}\n\n.hljs-emphasis {\n  font-style: italic;\n}\n\n.hljs-strong {\n  font-weight: bold;\n}\n\n/** END Highlight. ***********************************************************/\n"
  },
  {
    "path": "resources/blog-posts/js/highlight.pack.js",
    "content": "/*! highlight.js v9.5.0 | BSD3 License | git.io/hljslicense */\n!function(e){var n=\"object\"==typeof window&&window||\"object\"==typeof self&&self;\"undefined\"!=typeof exports?e(exports):n&&(n.hljs=e({}),\"function\"==typeof define&&define.amd&&define([],function(){return n.hljs}))}(function(e){function n(e){return e.replace(/[&<>]/gm,function(e){return I[e]})}function t(e){return e.nodeName.toLowerCase()}function r(e,n){var t=e&&e.exec(n);return t&&0===t.index}function a(e){return k.test(e)}function i(e){var n,t,r,i,o=e.className+\" \";if(o+=e.parentNode?e.parentNode.className:\"\",t=B.exec(o))return R(t[1])?t[1]:\"no-highlight\";for(o=o.split(/\\s+/),n=0,r=o.length;r>n;n++)if(i=o[n],a(i)||R(i))return i}function o(e,n){var t,r={};for(t in e)r[t]=e[t];if(n)for(t in n)r[t]=n[t];return r}function u(e){var n=[];return function r(e,a){for(var i=e.firstChild;i;i=i.nextSibling)3===i.nodeType?a+=i.nodeValue.length:1===i.nodeType&&(n.push({event:\"start\",offset:a,node:i}),a=r(i,a),t(i).match(/br|hr|img|input/)||n.push({event:\"stop\",offset:a,node:i}));return a}(e,0),n}function c(e,r,a){function i(){return e.length&&r.length?e[0].offset!==r[0].offset?e[0].offset<r[0].offset?e:r:\"start\"===r[0].event?e:r:e.length?e:r}function o(e){function r(e){return\" \"+e.nodeName+'=\"'+n(e.value)+'\"'}l+=\"<\"+t(e)+w.map.call(e.attributes,r).join(\"\")+\">\"}function u(e){l+=\"</\"+t(e)+\">\"}function c(e){(\"start\"===e.event?o:u)(e.node)}for(var s=0,l=\"\",f=[];e.length||r.length;){var g=i();if(l+=n(a.substr(s,g[0].offset-s)),s=g[0].offset,g===e){f.reverse().forEach(u);do c(g.splice(0,1)[0]),g=i();while(g===e&&g.length&&g[0].offset===s);f.reverse().forEach(o)}else\"start\"===g[0].event?f.push(g[0].node):f.pop(),c(g.splice(0,1)[0])}return l+n(a.substr(s))}function s(e){function n(e){return e&&e.source||e}function t(t,r){return new RegExp(n(t),\"m\"+(e.cI?\"i\":\"\")+(r?\"g\":\"\"))}function r(a,i){if(!a.compiled){if(a.compiled=!0,a.k=a.k||a.bK,a.k){var 
u={},c=function(n,t){e.cI&&(t=t.toLowerCase()),t.split(\" \").forEach(function(e){var t=e.split(\"|\");u[t[0]]=[n,t[1]?Number(t[1]):1]})};\"string\"==typeof a.k?c(\"keyword\",a.k):E(a.k).forEach(function(e){c(e,a.k[e])}),a.k=u}a.lR=t(a.l||/\\w+/,!0),i&&(a.bK&&(a.b=\"\\\\b(\"+a.bK.split(\" \").join(\"|\")+\")\\\\b\"),a.b||(a.b=/\\B|\\b/),a.bR=t(a.b),a.e||a.eW||(a.e=/\\B|\\b/),a.e&&(a.eR=t(a.e)),a.tE=n(a.e)||\"\",a.eW&&i.tE&&(a.tE+=(a.e?\"|\":\"\")+i.tE)),a.i&&(a.iR=t(a.i)),null==a.r&&(a.r=1),a.c||(a.c=[]);var s=[];a.c.forEach(function(e){e.v?e.v.forEach(function(n){s.push(o(e,n))}):s.push(\"self\"===e?a:e)}),a.c=s,a.c.forEach(function(e){r(e,a)}),a.starts&&r(a.starts,i);var l=a.c.map(function(e){return e.bK?\"\\\\.?(\"+e.b+\")\\\\.?\":e.b}).concat([a.tE,a.i]).map(n).filter(Boolean);a.t=l.length?t(l.join(\"|\"),!0):{exec:function(){return null}}}}r(e)}function l(e,t,a,i){function o(e,n){for(var t=0;t<n.c.length;t++)if(r(n.c[t].bR,e))return n.c[t]}function u(e,n){if(r(e.eR,n)){for(;e.endsParent&&e.parent;)e=e.parent;return e}return e.eW?u(e.parent,n):void 0}function c(e,n){return!a&&r(n.iR,e)}function g(e,n){var t=N.cI?n[0].toLowerCase():n[0];return e.k.hasOwnProperty(t)&&e.k[t]}function h(e,n,t,r){var a=r?\"\":y.classPrefix,i='<span class=\"'+a,o=t?\"\":C;return i+=e+'\">',i+n+o}function p(){var e,t,r,a;if(!E.k)return n(B);for(a=\"\",t=0,E.lR.lastIndex=0,r=E.lR.exec(B);r;)a+=n(B.substr(t,r.index-t)),e=g(E,r),e?(M+=e[1],a+=h(e[0],n(r[0]))):a+=n(r[0]),t=E.lR.lastIndex,r=E.lR.exec(B);return a+n(B.substr(t))}function d(){var e=\"string\"==typeof E.sL;if(e&&!x[E.sL])return n(B);var t=e?l(E.sL,B,!0,L[E.sL]):f(B,E.sL.length?E.sL:void 0);return E.r>0&&(M+=t.r),e&&(L[E.sL]=t.top),h(t.language,t.value,!1,!0)}function b(){k+=null!=E.sL?d():p(),B=\"\"}function v(e){k+=e.cN?h(e.cN,\"\",!0):\"\",E=Object.create(e,{parent:{value:E}})}function m(e,n){if(B+=e,null==n)return b(),0;var t=o(n,E);if(t)return t.skip?B+=n:(t.eB&&(B+=n),b(),t.rB||t.eB||(B=n)),v(t,n),t.rB?0:n.length;var 
r=u(E,n);if(r){var a=E;a.skip?B+=n:(a.rE||a.eE||(B+=n),b(),a.eE&&(B=n));do E.cN&&(k+=C),E.skip||(M+=E.r),E=E.parent;while(E!==r.parent);return r.starts&&v(r.starts,\"\"),a.rE?0:n.length}if(c(n,E))throw new Error('Illegal lexeme \"'+n+'\" for mode \"'+(E.cN||\"<unnamed>\")+'\"');return B+=n,n.length||1}var N=R(e);if(!N)throw new Error('Unknown language: \"'+e+'\"');s(N);var w,E=i||N,L={},k=\"\";for(w=E;w!==N;w=w.parent)w.cN&&(k=h(w.cN,\"\",!0)+k);var B=\"\",M=0;try{for(var I,j,O=0;;){if(E.t.lastIndex=O,I=E.t.exec(t),!I)break;j=m(t.substr(O,I.index-O),I[0]),O=I.index+j}for(m(t.substr(O)),w=E;w.parent;w=w.parent)w.cN&&(k+=C);return{r:M,value:k,language:e,top:E}}catch(T){if(T.message&&-1!==T.message.indexOf(\"Illegal\"))return{r:0,value:n(t)};throw T}}function f(e,t){t=t||y.languages||E(x);var r={r:0,value:n(e)},a=r;return t.filter(R).forEach(function(n){var t=l(n,e,!1);t.language=n,t.r>a.r&&(a=t),t.r>r.r&&(a=r,r=t)}),a.language&&(r.second_best=a),r}function g(e){return y.tabReplace||y.useBR?e.replace(M,function(e,n){return y.useBR&&\"\\n\"===e?\"<br>\":y.tabReplace?n.replace(/\\t/g,y.tabReplace):void 0}):e}function h(e,n,t){var r=n?L[n]:t,a=[e.trim()];return e.match(/\\bhljs\\b/)||a.push(\"hljs\"),-1===e.indexOf(r)&&a.push(r),a.join(\" \").trim()}function p(e){var n,t,r,o,s,p=i(e);a(p)||(y.useBR?(n=document.createElementNS(\"http://www.w3.org/1999/xhtml\",\"div\"),n.innerHTML=e.innerHTML.replace(/\\n/g,\"\").replace(/<br[ \\/]*>/g,\"\\n\")):n=e,s=n.textContent,r=p?l(p,s,!0):f(s),t=u(n),t.length&&(o=document.createElementNS(\"http://www.w3.org/1999/xhtml\",\"div\"),o.innerHTML=r.value,r.value=c(t,u(o),s)),r.value=g(r.value),e.innerHTML=r.value,e.className=h(e.className,p,r.language),e.result={language:r.language,re:r.r},r.second_best&&(e.second_best={language:r.second_best.language,re:r.second_best.r}))}function d(e){y=o(y,e)}function b(){if(!b.called){b.called=!0;var e=document.querySelectorAll(\"pre code\");w.forEach.call(e,p)}}function 
v(){addEventListener(\"DOMContentLoaded\",b,!1),addEventListener(\"load\",b,!1)}function m(n,t){var r=x[n]=t(e);r.aliases&&r.aliases.forEach(function(e){L[e]=n})}function N(){return E(x)}function R(e){return e=(e||\"\").toLowerCase(),x[e]||x[L[e]]}var w=[],E=Object.keys,x={},L={},k=/^(no-?highlight|plain|text)$/i,B=/\\blang(?:uage)?-([\\w-]+)\\b/i,M=/((^(<[^>]+>|\\t|)+|(?:\\n)))/gm,C=\"</span>\",y={classPrefix:\"hljs-\",tabReplace:null,useBR:!1,languages:void 0},I={\"&\":\"&amp;\",\"<\":\"&lt;\",\">\":\"&gt;\"};return e.highlight=l,e.highlightAuto=f,e.fixMarkup=g,e.highlightBlock=p,e.configure=d,e.initHighlighting=b,e.initHighlightingOnLoad=v,e.registerLanguage=m,e.listLanguages=N,e.getLanguage=R,e.inherit=o,e.IR=\"[a-zA-Z]\\\\w*\",e.UIR=\"[a-zA-Z_]\\\\w*\",e.NR=\"\\\\b\\\\d+(\\\\.\\\\d+)?\",e.CNR=\"(-?)(\\\\b0[xX][a-fA-F0-9]+|(\\\\b\\\\d+(\\\\.\\\\d*)?|\\\\.\\\\d+)([eE][-+]?\\\\d+)?)\",e.BNR=\"\\\\b(0b[01]+)\",e.RSR=\"!|!=|!==|%|%=|&|&&|&=|\\\\*|\\\\*=|\\\\+|\\\\+=|,|-|-=|/=|/|:|;|<<|<<=|<=|<|===|==|=|>>>=|>>=|>=|>>>|>>|>|\\\\?|\\\\[|\\\\{|\\\\(|\\\\^|\\\\^=|\\\\||\\\\|=|\\\\|\\\\||~\",e.BE={b:\"\\\\\\\\[\\\\s\\\\S]\",r:0},e.ASM={cN:\"string\",b:\"'\",e:\"'\",i:\"\\\\n\",c:[e.BE]},e.QSM={cN:\"string\",b:'\"',e:'\"',i:\"\\\\n\",c:[e.BE]},e.PWM={b:/\\b(a|an|the|are|I'm|isn't|don't|doesn't|won't|but|just|should|pretty|simply|enough|gonna|going|wtf|so|such|will|you|your|like)\\b/},e.C=function(n,t,r){var a=e.inherit({cN:\"comment\",b:n,e:t,c:[]},r||{});return 
a.c.push(e.PWM),a.c.push({cN:\"doctag\",b:\"(?:TODO|FIXME|NOTE|BUG|XXX):\",r:0}),a},e.CLCM=e.C(\"//\",\"$\"),e.CBCM=e.C(\"/\\\\*\",\"\\\\*/\"),e.HCM=e.C(\"#\",\"$\"),e.NM={cN:\"number\",b:e.NR,r:0},e.CNM={cN:\"number\",b:e.CNR,r:0},e.BNM={cN:\"number\",b:e.BNR,r:0},e.CSSNM={cN:\"number\",b:e.NR+\"(%|em|ex|ch|rem|vw|vh|vmin|vmax|cm|mm|in|pt|pc|px|deg|grad|rad|turn|s|ms|Hz|kHz|dpi|dpcm|dppx)?\",r:0},e.RM={cN:\"regexp\",b:/\\//,e:/\\/[gimuy]*/,i:/\\n/,c:[e.BE,{b:/\\[/,e:/\\]/,r:0,c:[e.BE]}]},e.TM={cN:\"title\",b:e.IR,r:0},e.UTM={cN:\"title\",b:e.UIR,r:0},e.METHOD_GUARD={b:\"\\\\.\\\\s*\"+e.UIR,r:0},e});hljs.registerLanguage(\"python\",function(e){var r={cN:\"meta\",b:/^(>>>|\\.\\.\\.) /},b={cN:\"string\",c:[e.BE],v:[{b:/(u|b)?r?'''/,e:/'''/,c:[r],r:10},{b:/(u|b)?r?\"\"\"/,e:/\"\"\"/,c:[r],r:10},{b:/(u|r|ur)'/,e:/'/,r:10},{b:/(u|r|ur)\"/,e:/\"/,r:10},{b:/(b|br)'/,e:/'/},{b:/(b|br)\"/,e:/\"/},e.ASM,e.QSM]},a={cN:\"number\",r:0,v:[{b:e.BNR+\"[lLjJ]?\"},{b:\"\\\\b(0o[0-7]+)[lLjJ]?\"},{b:e.CNR+\"[lLjJ]?\"}]},l={cN:\"params\",b:/\\(/,e:/\\)/,c:[\"self\",r,a,b]};return{aliases:[\"py\",\"gyp\"],k:{keyword:\"and elif is global as in if from raise for except finally print import pass return exec else break not with class assert yield try while continue del or def lambda async await nonlocal|10 None True False\",built_in:\"Ellipsis NotImplemented\"},i:/(<\\/|->|\\?)/,c:[r,a,b,e.HCM,{v:[{cN:\"function\",bK:\"def\",r:10},{cN:\"class\",bK:\"class\"}],e:/:/,i:/[${=;\\n,]/,c:[e.UTM,l,{b:/->/,eW:!0,k:\"None\"}]},{cN:\"meta\",b:/^[\\t ]*@/,e:/$/},{b:/\\b(print|exec)\\(/}]}});hljs.registerLanguage(\"bash\",function(e){var t={cN:\"variable\",v:[{b:/\\$[\\w\\d#@][\\w\\d_]*/},{b:/\\$\\{(.*?)}/}]},s={cN:\"string\",b:/\"/,e:/\"/,c:[e.BE,t,{cN:\"variable\",b:/\\$\\(/,e:/\\)/,c:[e.BE]}]},a={cN:\"string\",b:/'/,e:/'/};return{aliases:[\"sh\",\"zsh\"],l:/-?[a-z\\.]+/,k:{keyword:\"if then else elif fi for while in do done case esac function\",literal:\"true false\",built_in:\"break cd continue 
eval exec exit export getopts hash pwd readonly return shift test times trap umask unset alias bind builtin caller command declare echo enable help let local logout mapfile printf read readarray source type typeset ulimit unalias set shopt autoload bg bindkey bye cap chdir clone comparguments compcall compctl compdescribe compfiles compgroups compquote comptags comptry compvalues dirs disable disown echotc echoti emulate fc fg float functions getcap getln history integer jobs kill limit log noglob popd print pushd pushln rehash sched setcap setopt stat suspend ttyctl unfunction unhash unlimit unsetopt vared wait whence where which zcompile zformat zftp zle zmodload zparseopts zprof zpty zregexparse zsocket zstyle ztcp\",_:\"-ne -eq -lt -gt -f -d -e -s -l -a\"},c:[{cN:\"meta\",b:/^#![^\\n]+sh\\s*$/,r:10},{cN:\"function\",b:/\\w[\\w\\d_]*\\s*\\(\\s*\\)\\s*\\{/,rB:!0,c:[e.inherit(e.TM,{b:/\\w[\\w\\d_]*/})],r:0},e.HCM,s,a,t]}});hljs.registerLanguage(\"json\",function(e){var i={literal:\"true false null\"},n=[e.QSM,e.CNM],r={e:\",\",eW:!0,eE:!0,c:n,k:i},t={b:\"{\",e:\"}\",c:[{cN:\"attr\",b:/\"/,e:/\"/,c:[e.BE],i:\"\\\\n\"},e.inherit(r,{b:/:/})],i:\"\\\\S\"},c={b:\"\\\\[\",e:\"\\\\]\",c:[e.inherit(r)],i:\"\\\\S\"};return n.splice(n.length,0,t,c),{c:n,k:i,i:\"\\\\S\"}});hljs.registerLanguage(\"scala\",function(e){var t={cN:\"meta\",b:\"@[A-Za-z]+\"},a={cN:\"subst\",v:[{b:\"\\\\$[A-Za-z0-9_]+\"},{b:\"\\\\${\",e:\"}\"}]},r={cN:\"string\",v:[{b:'\"',e:'\"',i:\"\\\\n\",c:[e.BE]},{b:'\"\"\"',e:'\"\"\"',r:10},{b:'[a-z]+\"',e:'\"',i:\"\\\\n\",c:[e.BE,a]},{cN:\"string\",b:'[a-z]+\"\"\"',e:'\"\"\"',c:[a],r:10}]},c={cN:\"symbol\",b:\"'\\\\w[\\\\w\\\\d_]*(?!')\"},i={cN:\"type\",b:\"\\\\b[A-Z][A-Za-z0-9_]*\",r:0},s={cN:\"title\",b:/[^0-9\\n\\t \"'(),.`{}\\[\\]:;][^\\n\\t \"'(),.`{}\\[\\]:;]+|[^0-9\\n\\t \"'(),.`{}\\[\\]:;=]/,r:0},n={cN:\"class\",bK:\"class object trait type\",e:/[:={\\[\\n;]/,eE:!0,c:[{bK:\"extends 
with\",r:10},{b:/\\[/,e:/\\]/,eB:!0,eE:!0,r:0,c:[i]},{cN:\"params\",b:/\\(/,e:/\\)/,eB:!0,eE:!0,r:0,c:[i]},s]},l={cN:\"function\",bK:\"def\",e:/[:={\\[(\\n;]/,eE:!0,c:[s]};return{k:{literal:\"true false null\",keyword:\"type yield lazy override def with val var sealed abstract private trait object if forSome for while throw finally protected extends import final return else break new catch super class case package default try this match continue throws implicit\"},c:[e.CLCM,e.CBCM,r,c,i,l,n,e.CNM,t]}});hljs.registerLanguage(\"javascript\",function(e){return{aliases:[\"js\",\"jsx\"],k:{keyword:\"in of if for while finally var new function do return void else break catch instanceof with throw case default try this switch continue typeof delete let yield const export super debugger as async await static import from as\",literal:\"true false null undefined NaN Infinity\",built_in:\"eval isFinite isNaN parseFloat parseInt decodeURI decodeURIComponent encodeURI encodeURIComponent escape unescape Object Function Boolean Error EvalError InternalError RangeError ReferenceError StopIteration SyntaxError TypeError URIError Number Math Date String RegExp Array Float32Array Float64Array Int16Array Int32Array Int8Array Uint16Array Uint32Array Uint8Array Uint8ClampedArray ArrayBuffer DataView JSON Intl arguments require module console window document Symbol Set Map WeakSet WeakMap Proxy Reflect Promise\"},c:[{cN:\"meta\",r:10,b:/^\\s*['\"]use (strict|asm)['\"]/},{cN:\"meta\",b:/^#!/,e:/$/},e.ASM,e.QSM,{cN:\"string\",b:\"`\",e:\"`\",c:[e.BE,{cN:\"subst\",b:\"\\\\$\\\\{\",e:\"\\\\}\"}]},e.CLCM,e.CBCM,{cN:\"number\",v:[{b:\"\\\\b(0[bB][01]+)\"},{b:\"\\\\b(0[oO][0-7]+)\"},{b:e.CNR}],r:0},{b:\"(\"+e.RSR+\"|\\\\b(case|return|throw)\\\\b)\\\\s*\",k:\"return throw 
case\",c:[e.CLCM,e.CBCM,e.RM,{b:/</,e:/(\\/\\w+|\\w+\\/)>/,sL:\"xml\",c:[{b:/<\\w+\\s*\\/>/,skip:!0},{b:/<\\w+/,e:/(\\/\\w+|\\w+\\/)>/,skip:!0,c:[\"self\"]}]}],r:0},{cN:\"function\",bK:\"function\",e:/\\{/,eE:!0,c:[e.inherit(e.TM,{b:/[A-Za-z$_][0-9A-Za-z$_]*/}),{cN:\"params\",b:/\\(/,e:/\\)/,eB:!0,eE:!0,c:[e.CLCM,e.CBCM]}],i:/\\[|%/},{b:/\\$[(.]/},e.METHOD_GUARD,{cN:\"class\",bK:\"class\",e:/[{;=]/,eE:!0,i:/[:\"\\[\\]]/,c:[{bK:\"extends\"},e.UTM]},{bK:\"constructor\",e:/\\{/,eE:!0}],i:/#(?!!)/}});hljs.registerLanguage(\"cpp\",function(t){var e={cN:\"keyword\",b:\"\\\\b[a-z\\\\d_]*_t\\\\b\"},r={cN:\"string\",v:[{b:'(u8?|U)?L?\"',e:'\"',i:\"\\\\n\",c:[t.BE]},{b:'(u8?|U)?R\"',e:'\"',c:[t.BE]},{b:\"'\\\\\\\\?.\",e:\"'\",i:\".\"}]},s={cN:\"number\",v:[{b:\"\\\\b(0b[01'_]+)\"},{b:\"\\\\b([\\\\d'_]+(\\\\.[\\\\d'_]*)?|\\\\.[\\\\d'_]+)(u|U|l|L|ul|UL|f|F|b|B)\"},{b:\"(-?)(\\\\b0[xX][a-fA-F0-9'_]+|(\\\\b[\\\\d'_]+(\\\\.[\\\\d'_]*)?|\\\\.[\\\\d'_]+)([eE][-+]?[\\\\d'_]+)?)\"}],r:0},i={cN:\"meta\",b:/#\\s*[a-z]+\\b/,e:/$/,k:{\"meta-keyword\":\"if else elif endif define undef warning error line pragma ifdef ifndef include\"},c:[{b:/\\\\\\n/,r:0},t.inherit(r,{cN:\"meta-string\"}),{cN:\"meta-string\",b:\"<\",e:\">\",i:\"\\\\n\"},t.CLCM,t.CBCM]},a=t.IR+\"\\\\s*\\\\(\",c={keyword:\"int float while private char catch export virtual operator sizeof dynamic_cast|10 typedef const_cast|10 const struct for static_cast|10 union namespace unsigned long volatile static protected bool template mutable if public friend do goto auto void enum else break extern using class asm case typeid short reinterpret_cast|10 default double register explicit signed typename try this switch continue inline delete alignof constexpr decltype noexcept static_assert thread_local restrict _Bool complex _Complex _Imaginary atomic_bool atomic_char atomic_schar atomic_uchar atomic_short atomic_ushort atomic_int atomic_uint atomic_long atomic_ulong atomic_llong atomic_ullong new throw return\",built_in:\"std string 
cin cout cerr clog stdin stdout stderr stringstream istringstream ostringstream auto_ptr deque list queue stack vector map set bitset multiset multimap unordered_set unordered_map unordered_multiset unordered_multimap array shared_ptr abort abs acos asin atan2 atan calloc ceil cosh cos exit exp fabs floor fmod fprintf fputs free frexp fscanf isalnum isalpha iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower toupper labs ldexp log10 log malloc realloc memchr memcmp memcpy memset modf pow printf putchar puts scanf sinh sin snprintf sprintf sqrt sscanf strcat strchr strcmp strcpy strcspn strlen strncat strncmp strncpy strpbrk strrchr strspn strstr tanh tan vfprintf vprintf vsprintf endl initializer_list unique_ptr\",literal:\"true false nullptr NULL\"},n=[e,t.CLCM,t.CBCM,s,r];return{aliases:[\"c\",\"cc\",\"h\",\"c++\",\"h++\",\"hpp\"],k:c,i:\"</\",c:n.concat([i,{b:\"\\\\b(deque|list|queue|stack|vector|map|set|bitset|multiset|multimap|unordered_map|unordered_set|unordered_multiset|unordered_multimap|array)\\\\s*<\",e:\">\",k:c,c:[\"self\",e]},{b:t.IR+\"::\",k:c},{v:[{b:/=/,e:/;/},{b:/\\(/,e:/\\)/},{bK:\"new throw return else\",e:/;/}],k:c,c:n.concat([{b:/\\(/,e:/\\)/,k:c,c:n.concat([\"self\"]),r:0}]),r:0},{cN:\"function\",b:\"(\"+t.IR+\"[\\\\*&\\\\s]+)+\"+a,rB:!0,e:/[{;=]/,eE:!0,k:c,i:/[^\\w\\s\\*&]/,c:[{b:a,rB:!0,c:[t.TM],r:0},{cN:\"params\",b:/\\(/,e:/\\)/,k:c,r:0,c:[t.CLCM,t.CBCM,r,s,e]},t.CLCM,t.CBCM,i]}]),exports:{preprocessor:i,strings:r,k:c}}});hljs.registerLanguage(\"sql\",function(e){var t=e.C(\"--\",\"$\");return{cI:!0,i:/[<>{}*#]/,c:[{bK:\"begin end start commit rollback savepoint lock alter create drop rename call delete do handler insert load replace select truncate update set show pragma grant merge describe use explain help declare prepare execute deallocate release unlock purge reset change stop analyze cache flush optimize repair kill install uninstall checksum restore check backup 
revoke\",e:/;/,eW:!0,l:/[\\w\\.]+/,k:{keyword:\"abort abs absolute acc acce accep accept access accessed accessible account acos action activate add addtime admin administer advanced advise aes_decrypt aes_encrypt after agent aggregate ali alia alias allocate allow alter always analyze ancillary and any anydata anydataset anyschema anytype apply archive archived archivelog are as asc ascii asin assembly assertion associate asynchronous at atan atn2 attr attri attrib attribu attribut attribute attributes audit authenticated authentication authid authors auto autoallocate autodblink autoextend automatic availability avg backup badfile basicfile before begin beginning benchmark between bfile bfile_base big bigfile bin binary_double binary_float binlog bit_and bit_count bit_length bit_or bit_xor bitmap blob_base block blocksize body both bound buffer_cache buffer_pool build bulk by byte byteordermark bytes cache caching call calling cancel capacity cascade cascaded case cast catalog category ceil ceiling chain change changed char_base char_length character_length characters characterset charindex charset charsetform charsetid check checksum checksum_agg child choose chr chunk class cleanup clear client clob clob_base clone close cluster_id cluster_probability cluster_set clustering coalesce coercibility col collate collation collect colu colum column column_value columns columns_updated comment commit compact compatibility compiled complete composite_limit compound compress compute concat concat_ws concurrent confirm conn connec connect connect_by_iscycle connect_by_isleaf connect_by_root connect_time connection consider consistent constant constraint constraints constructor container content contents context contributors controlfile conv convert convert_tz corr corr_k corr_s corresponding corruption cos cost count count_big counted covar_pop covar_samp cpu_per_call cpu_per_session crc32 create creation critical cross cube cume_dist curdate current current_date 
current_time current_timestamp current_user cursor curtime customdatum cycle data database databases datafile datafiles datalength date_add date_cache date_format date_sub dateadd datediff datefromparts datename datepart datetime2fromparts day day_to_second dayname dayofmonth dayofweek dayofyear days db_role_change dbtimezone ddl deallocate declare decode decompose decrement decrypt deduplicate def defa defau defaul default defaults deferred defi defin define degrees delayed delegate delete delete_all delimited demand dense_rank depth dequeue des_decrypt des_encrypt des_key_file desc descr descri describ describe descriptor deterministic diagnostics difference dimension direct_load directory disable disable_all disallow disassociate discardfile disconnect diskgroup distinct distinctrow distribute distributed div do document domain dotnet double downgrade drop dumpfile duplicate duration each edition editionable editions element ellipsis else elsif elt empty enable enable_all enclosed encode encoding encrypt end end-exec endian enforced engine engines enqueue enterprise entityescaping eomonth error errors escaped evalname evaluate event eventdata events except exception exceptions exchange exclude excluding execu execut execute exempt exists exit exp expire explain export export_set extended extent external external_1 external_2 externally extract failed failed_login_attempts failover failure far fast feature_set feature_value fetch field fields file file_name_convert filesystem_like_logging final finish first first_value fixed flash_cache flashback floor flush following follows for forall force form forma format found found_rows freelist freelists freepools fresh from from_base64 from_days ftp full function general generated get get_format get_lock getdate getutcdate global global_name globally go goto grant grants greatest group group_concat group_id grouping grouping_id groups gtid_subtract guarantee guard handler hash hashkeys having hea head headi headin 
heading heap help hex hierarchy high high_priority hosts hour http id ident_current ident_incr ident_seed identified identity idle_time if ifnull ignore iif ilike ilm immediate import in include including increment index indexes indexing indextype indicator indices inet6_aton inet6_ntoa inet_aton inet_ntoa infile initial initialized initially initrans inmemory inner innodb input insert install instance instantiable instr interface interleaved intersect into invalidate invisible is is_free_lock is_ipv4 is_ipv4_compat is_not is_not_null is_used_lock isdate isnull isolation iterate java join json json_exists keep keep_duplicates key keys kill language large last last_day last_insert_id last_value lax lcase lead leading least leaves left len lenght length less level levels library like like2 like4 likec limit lines link list listagg little ln load load_file lob lobs local localtime localtimestamp locate locator lock locked log log10 log2 logfile logfiles logging logical logical_reads_per_call logoff logon logs long loop low low_priority lower lpad lrtrim ltrim main make_set makedate maketime managed management manual map mapping mask master master_pos_wait match matched materialized max maxextents maximize maxinstances maxlen maxlogfiles maxloghistory maxlogmembers maxsize maxtrans md5 measures median medium member memcompress memory merge microsecond mid migration min minextents minimum mining minus minute minvalue missing mod mode model modification modify module monitoring month months mount move movement multiset mutex name name_const names nan national native natural nav nchar nclob nested never new newline next nextval no no_write_to_binlog noarchivelog noaudit nobadfile nocheck nocompress nocopy nocycle nodelay nodiscardfile noentityescaping noguarantee nokeep nologfile nomapping nomaxvalue nominimize nominvalue nomonitoring none noneditionable nonschema noorder nopr nopro noprom nopromp noprompt norely noresetlogs noreverse normal norowdependencies 
noschemacheck noswitch not nothing notice notrim novalidate now nowait nth_value nullif nulls num numb numbe nvarchar nvarchar2 object ocicoll ocidate ocidatetime ociduration ociinterval ociloblocator ocinumber ociref ocirefcursor ocirowid ocistring ocitype oct octet_length of off offline offset oid oidindex old on online only opaque open operations operator optimal optimize option optionally or oracle oracle_date oradata ord ordaudio orddicom orddoc order ordimage ordinality ordvideo organization orlany orlvary out outer outfile outline output over overflow overriding package pad parallel parallel_enable parameters parent parse partial partition partitions pascal passing password password_grace_time password_lock_time password_reuse_max password_reuse_time password_verify_function patch path patindex pctincrease pctthreshold pctused pctversion percent percent_rank percentile_cont percentile_disc performance period period_add period_diff permanent physical pi pipe pipelined pivot pluggable plugin policy position post_transaction pow power pragma prebuilt precedes preceding precision prediction prediction_cost prediction_details prediction_probability prediction_set prepare present preserve prior priority private private_sga privileges procedural procedure procedure_analyze processlist profiles project prompt protection public publishingservername purge quarter query quick quiesce quota quotename radians raise rand range rank raw read reads readsize rebuild record records recover recovery recursive recycle redo reduced ref reference referenced references referencing refresh regexp_like register regr_avgx regr_avgy regr_count regr_intercept regr_r2 regr_slope regr_sxx regr_sxy reject rekey relational relative relaylog release release_lock relies_on relocate rely rem remainder rename repair repeat replace replicate replication required reset resetlogs resize resource respect restore restricted result result_cache resumable resume retention return returning returns 
reuse reverse revoke right rlike role roles rollback rolling rollup round row row_count rowdependencies rowid rownum rows rtrim rules safe salt sample save savepoint sb1 sb2 sb4 scan schema schemacheck scn scope scroll sdo_georaster sdo_topo_geometry search sec_to_time second section securefile security seed segment select self sequence sequential serializable server servererror session session_user sessions_per_user set sets settings sha sha1 sha2 share shared shared_pool short show shrink shutdown si_averagecolor si_colorhistogram si_featurelist si_positionalcolor si_stillimage si_texture siblings sid sign sin size size_t sizes skip slave sleep smalldatetimefromparts smallfile snapshot some soname sort soundex source space sparse spfile split sql sql_big_result sql_buffer_result sql_cache sql_calc_found_rows sql_small_result sql_variant_property sqlcode sqldata sqlerror sqlname sqlstate sqrt square standalone standby start starting startup statement static statistics stats_binomial_test stats_crosstab stats_ks_test stats_mode stats_mw_test stats_one_way_anova stats_t_test_ stats_t_test_indep stats_t_test_one stats_t_test_paired stats_wsr_test status std stddev stddev_pop stddev_samp stdev stop storage store stored str str_to_date straight_join strcmp strict string struct stuff style subdate subpartition subpartitions substitutable substr substring subtime subtring_index subtype success sum suspend switch switchoffset switchover sync synchronous synonym sys sys_xmlagg sysasm sysaux sysdate sysdatetimeoffset sysdba sysoper system system_user sysutcdatetime table tables tablespace tan tdo template temporary terminated tertiary_weights test than then thread through tier ties time time_format time_zone timediff timefromparts timeout timestamp timestampadd timestampdiff timezone_abbr timezone_minute timezone_region to to_base64 to_date to_days to_seconds todatetimeoffset trace tracking transaction transactional translate translation treat trigger trigger_nestlevel 
triggers trim truncate try_cast try_convert try_parse type ub1 ub2 ub4 ucase unarchived unbounded uncompress under undo unhex unicode uniform uninstall union unique unix_timestamp unknown unlimited unlock unpivot unrecoverable unsafe unsigned until untrusted unusable unused update updated upgrade upped upper upsert url urowid usable usage use use_stored_outlines user user_data user_resources users using utc_date utc_timestamp uuid uuid_short validate validate_password_strength validation valist value values var var_samp varcharc vari varia variab variabl variable variables variance varp varraw varrawc varray verify version versions view virtual visible void wait wallet warning warnings week weekday weekofyear wellformed when whene whenev wheneve whenever where while whitespace with within without work wrapped xdb xml xmlagg xmlattributes xmlcast xmlcolattval xmlelement xmlexists xmlforest xmlindex xmlnamespaces xmlpi xmlquery xmlroot xmlschema xmlserialize xmltable xmltype xor year year_to_month years yearweek\",literal:\"true false null\",built_in:\"array bigint binary bit blob boolean char character date dec decimal float int int8 integer interval number numeric real record serial serial8 smallint text varchar varying void\"},c:[{cN:\"string\",b:\"'\",e:\"'\",c:[e.BE,{b:\"''\"}]},{cN:\"string\",b:'\"',e:'\"',c:[e.BE,{b:'\"\"'}]},{cN:\"string\",b:\"`\",e:\"`\",c:[e.BE]},e.CNM,e.CBCM,t]},e.CBCM,t]}});hljs.registerLanguage(\"php\",function(e){var c={b:\"\\\\$+[a-zA-Z_-ÿ][a-zA-Z0-9_-ÿ]*\"},i={cN:\"meta\",b:/<\\?(php)?|\\?>/},t={cN:\"string\",c:[e.BE,i],v:[{b:'b\"',e:'\"'},{b:\"b'\",e:\"'\"},e.inherit(e.ASM,{i:null}),e.inherit(e.QSM,{i:null})]},a={v:[e.BNM,e.CNM]};return{aliases:[\"php3\",\"php4\",\"php5\",\"php6\"],cI:!0,k:\"and include_once list abstract global private echo interface as static endswitch array null if endwhile or const for endforeach self var while isset public protected exit foreach throw elseif include __FILE__ empty require_once do xor return 
parent clone use __CLASS__ __LINE__ else break print eval new catch __METHOD__ case exception default die require __FUNCTION__ enddeclare final try switch continue endfor endif declare unset true false trait goto instanceof insteadof __DIR__ __NAMESPACE__ yield finally\",c:[e.HCM,e.C(\"//\",\"$\",{c:[i]}),e.C(\"/\\\\*\",\"\\\\*/\",{c:[{cN:\"doctag\",b:\"@[A-Za-z]+\"}]}),e.C(\"__halt_compiler.+?;\",!1,{eW:!0,k:\"__halt_compiler\",l:e.UIR}),{cN:\"string\",b:/<<<['\"]?\\w+['\"]?$/,e:/^\\w+;?$/,c:[e.BE,{cN:\"subst\",v:[{b:/\\$\\w+/},{b:/\\{\\$/,e:/\\}/}]}]},i,{cN:\"keyword\",b:/\\$this\\b/},c,{b:/(::|->)+[a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*/},{cN:\"function\",bK:\"function\",e:/[;{]/,eE:!0,i:\"\\\\$|\\\\[|%\",c:[e.UTM,{cN:\"params\",b:\"\\\\(\",e:\"\\\\)\",c:[\"self\",c,e.CBCM,t,a]}]},{cN:\"class\",bK:\"class interface\",e:\"{\",eE:!0,i:/[:\\(\\$\"]/,c:[{bK:\"extends implements\"},e.UTM]},{bK:\"namespace\",e:\";\",i:/[\\.']/,c:[e.UTM]},{bK:\"use\",e:\";\",c:[e.UTM]},{b:\"=>\"},t,a]}});hljs.registerLanguage(\"java\",function(e){var t=e.UIR+\"(<\"+e.UIR+\"(\\\\s*,\\\\s*\"+e.UIR+\")*>)?\",a=\"false synchronized int abstract float private char boolean static null if const for true while long strictfp finally protected import native final void enum else break transient catch instanceof byte super volatile case assert short package default double public try this switch continue throws protected public private module requires exports\",r=\"\\\\b(0[bB]([01]+[01_]+[01]+|[01]+)|0[xX]([a-fA-F0-9]+[a-fA-F0-9_]+[a-fA-F0-9]+|[a-fA-F0-9]+)|(([\\\\d]+[\\\\d_]+[\\\\d]+|[\\\\d]+)(\\\\.([\\\\d]+[\\\\d_]+[\\\\d]+|[\\\\d]+))?|\\\\.([\\\\d]+[\\\\d_]+[\\\\d]+|[\\\\d]+))([eE][-+]?\\\\d+)?)[lLfF]?\",s={cN:\"number\",b:r,r:0};return{aliases:[\"jsp\"],k:a,i:/<\\/|#/,c:[e.C(\"/\\\\*\\\\*\",\"\\\\*/\",{r:0,c:[{b:/\\w+@/,r:0},{cN:\"doctag\",b:\"@[A-Za-z]+\"}]}),e.CLCM,e.CBCM,e.ASM,e.QSM,{cN:\"class\",bK:\"class interface\",e:/[{;=]/,eE:!0,k:\"class 
interface\",i:/[:\"\\[\\]]/,c:[{bK:\"extends implements\"},e.UTM]},{bK:\"new throw return else\",r:0},{cN:\"function\",b:\"(\"+t+\"\\\\s+)+\"+e.UIR+\"\\\\s*\\\\(\",rB:!0,e:/[{;=]/,eE:!0,k:a,c:[{b:e.UIR+\"\\\\s*\\\\(\",rB:!0,r:0,c:[e.UTM]},{cN:\"params\",b:/\\(/,e:/\\)/,k:a,r:0,c:[e.ASM,e.QSM,e.CNM,e.CBCM]},e.CLCM,e.CBCM]},s,{cN:\"meta\",b:\"@[A-Za-z]+\"}]}});hljs.registerLanguage(\"css\",function(e){var c=\"[a-zA-Z-][a-zA-Z0-9_-]*\",t={b:/[A-Z\\_\\.\\-]+\\s*:/,rB:!0,e:\";\",eW:!0,c:[{cN:\"attribute\",b:/\\S/,e:\":\",eE:!0,starts:{eW:!0,eE:!0,c:[{b:/[\\w-]+\\(/,rB:!0,c:[{cN:\"built_in\",b:/[\\w-]+/},{b:/\\(/,e:/\\)/,c:[e.ASM,e.QSM]}]},e.CSSNM,e.QSM,e.ASM,e.CBCM,{cN:\"number\",b:\"#[0-9A-Fa-f]+\"},{cN:\"meta\",b:\"!important\"}]}}]};return{cI:!0,i:/[=\\/|'\\$]/,c:[e.CBCM,{cN:\"selector-id\",b:/#[A-Za-z0-9_-]+/},{cN:\"selector-class\",b:/\\.[A-Za-z0-9_-]+/},{cN:\"selector-attr\",b:/\\[/,e:/\\]/,i:\"$\"},{cN:\"selector-pseudo\",b:/:(:)?[a-zA-Z0-9\\_\\-\\+\\(\\)\"'.]+/},{b:\"@(font-face|page)\",l:\"[a-z-]+\",k:\"font-face page\"},{b:\"@\",e:\"[{;]\",i:/:/,c:[{cN:\"keyword\",b:/\\w+/},{b:/\\s/,eW:!0,eE:!0,r:0,c:[e.ASM,e.QSM,e.CSSNM]}]},{cN:\"selector-tag\",b:c,r:0},{b:\"{\",e:\"}\",i:/\\S/,c:[e.CBCM,t]}]}});hljs.registerLanguage(\"xml\",function(s){var 
e=\"[A-Za-z0-9\\\\._:-]+\",t={eW:!0,i:/</,r:0,c:[{cN:\"attr\",b:e,r:0},{b:/=\\s*/,r:0,c:[{cN:\"string\",endsParent:!0,v:[{b:/\"/,e:/\"/},{b:/'/,e:/'/},{b:/[^\\s\"'=<>`]+/}]}]}]};return{aliases:[\"html\",\"xhtml\",\"rss\",\"atom\",\"xjb\",\"xsd\",\"xsl\",\"plist\"],cI:!0,c:[{cN:\"meta\",b:\"<!DOCTYPE\",e:\">\",r:10,c:[{b:\"\\\\[\",e:\"\\\\]\"}]},s.C(\"<!--\",\"-->\",{r:10}),{b:\"<\\\\!\\\\[CDATA\\\\[\",e:\"\\\\]\\\\]>\",r:10},{b:/<\\?(php)?/,e:/\\?>/,sL:\"php\",c:[{b:\"/\\\\*\",e:\"\\\\*/\",skip:!0}]},{cN:\"tag\",b:\"<style(?=\\\\s|>|$)\",e:\">\",k:{name:\"style\"},c:[t],starts:{e:\"</style>\",rE:!0,sL:[\"css\",\"xml\"]}},{cN:\"tag\",b:\"<script(?=\\\\s|>|$)\",e:\">\",k:{name:\"script\"},c:[t],starts:{e:\"</script>\",rE:!0,sL:[\"actionscript\",\"javascript\",\"handlebars\",\"xml\"]}},{cN:\"meta\",v:[{b:/<\\?xml/,e:/\\?>/,r:10},{b:/<\\?\\w+/,e:/\\?>/}]},{cN:\"tag\",b:\"</?\",e:\"/?>\",c:[{cN:\"name\",b:/[^\\/><\\s]+/,r:0},t]}]}});hljs.initHighlightingOnLoad();\n"
  },
  {
    "path": "resources/blog-posts/js/main.js",
    "content": "/**\n * Main JavaScript file for additional main site functionality.\n *\n * @author  Joeri Hermans\n * @version 0.1\n * @since   8 July 2016\n */\n\nvar rippleEffect = function(e) {\n  var target = e.target;\n  var rectangle = target.getBoundingClientRect();\n  var ripple = target.querySelector('.ripple');\n  if( !ripple ) {\n    ripple = document.createElement('span');\n    ripple.className = 'ripple';\n    ripple.style.height = ripple.style.width = Math.max(rectangle.width, rectangle.height) + 'px';\n    // Check if the target has a first child.\n    if( target.firstChild )\n      target.insertBefore(ripple, target.firstChild);\n    else\n      target.appendChild(ripple);\n  }\n  // Check if we need to add a red ripple.\n  if( target.classList.contains('red') )\n    ripple.classList.add('ripple-red');\n  ripple.classList.remove('show');\n  var top = e.pageY - rectangle.top - ripple.offsetHeight / 2 - document.body.scrollTop;\n  var left = e.pageX - rectangle.left - ripple.offsetWidth / 2 - document.body.scrollLeft;\n  ripple.style.top = top + 'px';\n  ripple.style.left = left + 'px';\n  ripple.classList.add('show');\n\n  return false;\n};\n\nfunction addRippleEffects() {\n  // Add the ripple effect to all buttons in the page.\n  var elements = document.getElementsByClassName(\"ripple-button\");\n  for(var i = 0; i < elements.length; ++i) {\n    elements[i].addEventListener('click', rippleEffect, false);\n  }\n};\n\nfunction renderMath() {\n  var currentEquation = 1;\n\n  // Render all the math-elements.\n  var elements = document.getElementsByClassName(\"math\");\n  for(var i = 0; i < elements.length; ++i) {\n    var e = elements[i];\n    var tex = e.innerHTML;\n    katex.render(tex, e);\n    // Check if the element is an equation.\n    if( e.classList.contains(\"equation-math\") ) {\n      // Set the unique id of the equation.\n      e.id = \"equation-\" + currentEquation;\n      // Add the equation number.\n      e.innerHTML += '<span 
class=\"equation-math-number\">(' + currentEquation + ')</span>';\n      ++currentEquation;\n    }\n  }\n};\n\naddRippleEffects();\nrenderMath();\n"
  },
  {
    "path": "resources/blog-posts/part-1-an-introduction.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n  <title>Distributed Deep Learning with Apache Spark and Keras - Part 1 - An introduction</title>\n  <meta name=\"author\" content=\"Joeri R. Hermans\">\n  <meta name=\"description\" content=\"In the following blog posts we study the topic of Distributed Deep Learning, or rather, how to parallelize gradient descent using data parallel methods. We start by laying out the theory, while supplying you with some intuition into the techniques we applied. At the end of this blog post, we conduct some experiments to evaluate how different optimization schemes perform in identical situations. We also introduce dist-keras, which is our distributed deep learning framework built on top of Apache Spark and Keras. For this, we provide several notebooks and examples. This framework is mainly used to test our distributed optimization schemes, however, it also has several practical applications at CERN, not only because of the distributed learning, but also for model serving purposes. For example, we provide several examples which show you how to integrate this framework with Spark Streaming and Apache Kafka. Finally, these series will contain parts of my master-thesis research. As a result, they will mainly show my research progress. However, some might find some of the approaches I present here useful to apply in their own work.\">\n  <link rel=\"stylesheet\" media=\"screen\" href=\"css/main.css\">\n  <link rel=\"stylesheet\" media=\"screen\" href=\"css/katex.min.css\">\n  <body>\n    <section>\n      <p>In the following blog posts we study the topic of Distributed Deep Learning, or rather, how to parallelize gradient descent using data parallel methods. We start by laying out the theory, while supplying you with some intuition into the techniques we applied. At the end of this blog post, we conduct some experiments to evaluate how different optimization schemes perform in identical situations. 
We also introduce <a href=\"https://github.com/cerndb/dist-keras/\" alt=\"Distributed Keras\">dist-keras</a>, which is our distributed deep learning framework built on top of <a href=\"https://spark.apache.org\" alt=\"Apache Spark\">Apache Spark</a> and <a href=\"https://keras.io\" alt=\"Keras\">Keras</a>. For this, we provide several <a href=\"https://github.com/cerndb/dist-keras/blob/master/examples/mnist.ipynb\" alt=\"Distributed Keras examples\">notebooks and examples</a>. This framework is mainly used to test our distributed optimization schemes; however, it also has several practical applications at CERN, not only because of the distributed learning, but also for model serving purposes. For example, we provide several <a href=\"https://github.com/cerndb/dist-keras/blob/master/examples/kafka_spark_high_throughput_ml_pipeline.ipynb\" alt=\"Distributed Keras Kafka\">examples</a> which show you how to integrate this framework with Spark Streaming and Apache Kafka. Finally, this series will contain parts of my master-thesis research. As a result, it will mainly show my research progress. However, some might find some of the approaches I present here useful to apply in their own work.</p>\n    </section>\n    <section>\n      <h2>Introduction</h2>\n      <p>Unsupervised feature learning and deep learning have shown that being able to train large models on vast amounts of data can drastically improve model performance. However, consider the problem of training a deep network with millions, or even billions, of parameters. How do we achieve this without waiting for days, or even multiple weeks? Dean et al. <a href=\"https://www.researchgate.net/profile/Gs_Corrado/publication/266225209_Large_Scale_Distributed_Deep_Networks/links/544672590cf2f14fb80f3c77.pdf\">[2]</a> propose a different training paradigm which allows us to train and serve a model on multiple physical machines. 
The authors propose two novel methodologies to accomplish this, namely, <i>model parallelism</i> and <i>data parallelism</i>. In the following blog post, we briefly mention model parallelism since we will mainly focus on data parallel approaches.</p>\n      <p><strong>Sidenote:</strong> In order to simplify the figures, and make them more intuitive, we negate the gradient <span class=\"inlined-math math\">\\nabla f</span> without adding a <span class=\"inlined-math math\">-</span> sign in front. Thus, all gradient symbols in the following figures will be negated by default, unless stated otherwise. I actually forgot to negate the gradients in the figures, so mentioning it here is the easier fix. However, this will be corrected in the final version of the master thesis.</p>\n      <section>\n        <h3>Model parallelism</h3>\n        <p>In <i>model parallelism</i>, a single model is distributed over multiple machines. The performance benefits of distributing a deep network across multiple machines mainly depend on the structure of the model. Models with a large number of parameters typically benefit from access to more CPU cores and memory; thus, parallelizing a large model produces a significant performance increase, thereby reducing the training time.</p>\n        <p>Let us start with a simple example in order to illustrate this concept more clearly. Imagine having a perceptron, as depicted in Figure 1. In order to parallelize this efficiently, we can view a neural network as a dependency graph, where the goal is to minimize the number of synchronization mechanisms, assuming we have unlimited resources. Furthermore, a synchronization mechanism is only required when a node has more than one <i>variable</i> dependency. A variable dependency is a dependency which can change over time. For example, a bias would be a <i>static</i> dependency, because the value of a bias remains constant over time. 
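To make the dependency-graph view concrete, here is a minimal sketch (plain Python threads; all names are mine and purely illustrative, not taken from dist-keras) of a model parallel perceptron: every input node computes its product independently, and the output neuron is the single synchronization point.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def model_parallel_perceptron(weights, inputs):
    # Every "input node" i computes its partial product w_i * x_i
    # independently; there is no variable dependency between them.
    with ThreadPoolExecutor(max_workers=len(weights)) as pool:
        partial_products = pool.map(lambda wx: wx[0] * wx[1], zip(weights, inputs))
        # The output neuron is the only synchronization point: summing
        # blocks until every partial product y depends on is available.
        z = sum(partial_products)
    # Sigmoid activation of the output neuron.
    return 1.0 / (1.0 + math.exp(-z))

y = model_parallel_perceptron([0.5, -1.0, 2.0], [1.0, 2.0, 0.5])
```

In a real framework the partitions would live on different machines, but the structure is the same: independent partial computations, one synchronization barrier at the output.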
In the case of the perceptron shown in Figure 1, the parallelization is quite straightforward. The only synchronization mechanism which should be implemented resides in the output neuron, since <span class=\"inlined-math math\">y \\triangleq \\sigma(\\sum_i w_ix_i)</span> where <span class=\"inlined-math math\">\\sigma</span> is the activation function of the output neuron.</p>\n        <div class=\"blog-figure-container\">\n          <img src=\"img/model-parallelism.png\" alt=\"Model Parallelism\">\n          <p><strong>Figure 1:</strong> A perceptron partitioned using the <i>model parallelism</i> paradigm. In this approach every input node is responsible for accepting the input <span class=\"inlined-math math\">x_i</span> from some source, and multiplying the input with the associated weight <span class=\"inlined-math math\">w_i</span>. After the multiplication, the result is sent to the node which is responsible for computing <span class=\"inlined-math math\">y</span>. Of course, this node requires a synchronization mechanism to ensure that the result is consistent. The synchronization mechanism does this by waiting for the results <span class=\"inlined-math math\">y</span> depends on.</p>\n        </div>\n      </section>\n      <section>\n        <h3>Data parallelism</h3>\n        <p> Data parallelism is an inherently different methodology of optimizing parameters. The general idea is to <i>reduce the training time</i> by having <span class=\"inlined-math math\">n</span> workers optimizing a central model by processing <span class=\"inlined-math math\">n</span> different shards (partitions) of the dataset in parallel. In this setting we distribute <span class=\"inlined-math math\">n</span> model replicas over <span class=\"inlined-math math\">n</span> processing nodes, i.e., every node (or process) holds one model replica. Then, the workers train their local replica using the assigned data shard. 
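A minimal sketch of this setup in plain numpy (in dist-keras the partitioning is handled by Spark, so all names here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1000 instances with 4 features, plus regression targets.
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.25])

n_workers = 4

# Partition the dataset into n shards: one shard per worker.
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

# Every worker holds its own replica of the model parameters.
replicas = [np.zeros(4) for _ in range(n_workers)]

def minibatches(X_shard, y_shard, batch_size=32):
    # A worker samples mini-batches from its assigned shard only.
    indices = rng.permutation(len(X_shard))
    for start in range(0, len(X_shard), batch_size):
        batch = indices[start:start + batch_size]
        yield X_shard[batch], y_shard[batch]
```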
However, it is possible to coordinate the workers in such a way that, together, they will optimize a single objective. There are several approaches to achieve this, and these will be discussed in greater detail in the coming sections and blog posts.</p>\n        <p>Nevertheless, a popular approach to optimize this objective is to employ a centralized <i>parameter server</i>. A parameter server is responsible for aggregating model updates, and for handling parameter requests coming from different workers. The distributed learning process starts by partitioning a dataset into <span class=\"inlined-math math\">n</span> <i>shards</i>. Every individual shard will be assigned to a particular worker. Next, a worker will sample mini-batches from its shard in order to train the local model replica. After every mini-batch (or multiple mini-batches), the workers will communicate a variable with the parameter server. In most implementations, this variable is the gradient <span class=\"inlined-math math\">\\nabla f_i(x)</span>. Finally, the parameter server will integrate this variable by applying an update procedure specific to the algorithm. This process repeats itself until all workers have sampled all mini-batches from their shard. This high-level description is summarized in Figure 2.</p>\n        <div class=\"blog-figure-container\">\n          <img src=\"img/data-parallelism.png\" alt=\"Data Parallelism\">\n          <p><strong>Figure 2:</strong> Schematic representation of a data parallel approach. In this methodology we spawn <span class=\"inlined-math math\">n</span> workers (not necessarily on different machines), and assign a data shard (partition) of the dataset to every worker. Using this data shard, a worker <span class=\"inlined-math math\">i</span> will iterate through all mini-batches to produce a gradient <span class=\"inlined-math math\">\\nabla f_i(x)</span> for every mini-batch <span class=\"inlined-math math\">x</span>. 
Next, <span class=\"inlined-math math\">\\nabla f_i(x)</span> is sent to the parameter server, which will incorporate the gradient using an update mechanism.</p>\n        </div>\n      </section>\n    </section>\n    <section>\n      <h2>Approaches</h2>\n      <p>In this section we discuss several approaches towards parallelizing gradient descent (GD). This is not a trivial task, since gradient descent is an inherently sequential algorithm where every data point (instance) provides a direction to a minimum. However, training a model with a lot of parameters while using a very large dataset will result in a long training time. If one would like to reduce the training time, the obvious choice would be to buy better, or rather, more suitable hardware (e.g., a GPU). However, this is not always possible. For this reason, several attempts have been made to parallelize gradient descent. In the following subsections, we will examine some of the popular approaches to parallelize gradient descent, and provide some intuition into how these techniques work, and how they should be used.</p>\n      <section>\n        <h3>Synchronous Data Parallel Methods</h3>\n        <p>There are two distinct approaches to data parallelism. To me, the conceptually most straightforward one is <i>synchronous data parallelism</i>. In synchronous data parallelism, as depicted in Figure 3, all workers compute their gradients based on the same center variable. This means that whenever a worker is done computing a gradient for the current batch, it will <i>commit</i> a parameter (i.e., the gradient or the parameterization of the model) to the parameter server. However, before incorporating this information into the center variable, the parameter server stores all the information until all workers have committed their work. After this, the parameter server will apply a specific update mechanism (depending on the algorithm) to incorporate the commits into the center variable. 
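A minimal numpy sketch of one synchronous round (a linear least-squares model stands in for the deep network; all names are illustrative): every worker computes its gradient from the same center variable, and the server only updates once all commits are in.

```python
import numpy as np

def gradient(theta, X_batch, y_batch):
    # Gradient of the mean squared error of a linear model.
    return 2.0 * X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

def synchronous_round(theta_center, shards, eta=0.1):
    # All workers compute their gradient from the SAME center variable.
    commits = [gradient(theta_center, X_s, y_s) for X_s, y_s in shards]
    # The server stores the commits until every worker is done,
    # then incorporates them (here: a plain average) at once.
    return theta_center - eta * np.mean(commits, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
true_theta = np.array([0.5, -1.0, 2.0, 0.25])
y = X @ true_theta

n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

theta = np.zeros(4)
for _ in range(200):
    theta = synchronous_round(theta, shards)
```

Note that averaging the per-shard gradients here is exactly the full-batch gradient: this is the sense in which synchronous data parallelism parallelizes the computation of a mini-batch.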
<i>In essence, one can see synchronous data parallelism as a way to parallelize the computation of a mini-batch.</i></p>\n        <div class=\"blog-figure-container\">\n          <img src=\"img/synchronous-data-parallelism.png\" alt=\"Synchronous Data Parallelism\">\n          <p><strong>Figure 3:</strong> In a synchronous data parallel setting, one has <span class=\"inlined-math math\">n</span> workers (not necessarily on different machines). At the start of the training procedure, every worker fetches the most recent center variable. Next, every worker will start their training procedure. After the computation of the gradient, a worker commits the computed information (gradient or parametrization, depending on the algorithm) to the parameter server. However, due to unmodeled system behaviour, some workers might induce a significant delay, which leaves other workers idle while still consuming the same memory resources.</p>\n        </div>\n        <p>However, due to unmodeled system behaviour of the workers, workers might commit their results with a certain delay. Depending on the system load, this delay can be quite significant. As a result, this data parallel method is a case of the age-old saying <i>\"a synchronous data parallel method is only as strong as the weakest worker in the cluster\"</i> <strong>:-)</strong>.</p>\n        <section>\n          <h4>Model Averaging</h4>\n          <p>In essence, this is a data parallel approach as mentioned in the introduction. However, in contrast to more conventional data parallel approaches, there is no parameter server. In model averaging, every worker will get a copy of the model at the start of the training period. However, one can have different weight initialization techniques for the workers to cover more of the parameter space after several iterations, as shown in Figure 4. That said, doing so is not recommended, since it will result in very different solutions for every worker. 
The workers would thus waste the initial iterations converging to a \"good solution\" on which they all \"agree\". This problem affects most of the distributed optimization algorithms discussed here, and will be discussed in more detail in the following blog posts.</p>\n          <p>After every worker is initialized with a copy of the model, all workers start the training procedure independently of each other. This means that during the training procedure, no communication between the workers occurs. This eliminates the communication overhead that is present in approaches with parameter servers. After the end of an epoch, i.e., a full pass over the dataset, the models are aggregated and averaged on a single worker. The resulting averaged model will then be distributed to all workers, where the training process repeats until the averaged model converges.</p>\n          <div class=\"blog-figure-container\">\n            <img src=\"img/model-averaging.png\" alt=\"Model Averaging\">\n            <p><strong>Figure 4:</strong> In this setting we have 4 independent workers, each having a randomly initialized model. In order to simplify the situation, let us assume we can obtain the gradient directly from <span class=\"inlined-math math\">E(\\theta)</span>, which is our loss function. In model averaging, every worker only applies gradient descent to its own model without communicating with other workers. After the end of an epoch, as shown in the center plot, the models are averaged in order to produce a <i>central model</i>. In the following epoch, the central model will be used as a starting point for all workers.</p>\n          </div>\n        </section>\n        <section>\n          <h4>EASGD</h4>\n          <p>EASGD, or Elastic Averaging SGD, introduced by Zhang et al. 
<a href=\"http://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd\" alt=\"EASGD\">[3]</a>, is a distributed optimization scheme designed to reduce communication overhead with the parameter server. This is in contrast to approaches such as DOWNPOUR, which most of the time require a small communication window in order to converge properly. The issue with a small communication window is that the learning process needs to be stopped in order to synchronize the model with the parameter server, thereby limiting the throughput of the training process. Of course, the number of parameters in a model is also an important factor. For example, one can imagine that having a model with 100 MB worth of parameters could severely influence the training performance if a synchronization with the parameter server occurred every 5 mini-batches. Furthermore, the authors state that due to the distributed nature, exploration of the nearby parameter space by the workers actually improves the statistical performance of a model with respect to sequential gradient descent. However, at the moment, we do not have any evidence to support this claim, nor to deny it. What we do observe is that the statistical performance of a model after a single epoch is usually (significantly) worse than after a single epoch of Adam (sequential training) or ADAG (distributed training). However, even if we let EASGD run for the same amount of wall-clock training time, we still obtain identical or slightly worse model performance. So there is evidence to suggest that this claim is not completely true, at least in the case of EASGD. This, however, requires more investigation.</p>\n          <p>The authors solve the communication constraint by applying an \"elastic force\" between the parameters of the workers and the center variable. 
Furthermore, due to the elasticity and the reduction in communication with the parameter server, the workers are allowed to explore the surrounding parameter space. As stated above, the authors claim that allowing for more exploration can be beneficial for the statistical performance of the model. However, we argue that, as in model averaging, this will only work well if the workers are in the neighbourhood of the center variable; we will show this empirically in the Experiments section. However, in contrast to model averaging, the workers are not synchronized with the center variable. This raises the question: how does EASGD ensure that the workers remain in the \"neighbourhood\" of the center variable? As in model averaging, too much exploration of the parameter space actually deteriorates the performance of the center variable, and may prevent convergence, because the workers cover inherently different regions of the parameter space, as shown in Figure 4. However, if the elasticity parameter is too high, exploration will not take place at all.</p>\n          <div class=\"equation-math math\">\\theta^i_{t+1} = \\theta^i_t - \\eta\\nabla f(\\theta^i_t) - \\eta\\rho(\\theta^i_t - \\theta^c_t)</div>\n          <div class=\"equation-math math\">\\theta^c_{t+1} = (1 - n\\eta\\rho) \\theta^c_t + n\\eta\\rho(\\frac{1}{n} \\sum_{i = 1}^{n} \\theta^i_t)</div>\n          <p>To fully understand the implications of the EASGD equations, as shown in Equation 1 and Equation 2, we refer the reader to Figure 5, which shows the intuition behind the elastic force. 
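The EASGD updates translate almost directly into code. A minimal numpy sketch of one round, following the formulation in Zhang et al.'s paper (a toy quadratic loss stands in for f; all names are illustrative):

```python
import numpy as np

def easgd_round(thetas, theta_c, eta, rho, grad):
    n = len(thetas)
    # Equation 1: every worker takes a gradient step and is pulled
    # towards the center variable by the elastic force.
    new_thetas = [t - eta * grad(t) - eta * rho * (t - theta_c) for t in thetas]
    # Equation 2: the center variable moves towards the worker average.
    new_theta_c = (1 - n * eta * rho) * theta_c + n * eta * rho * np.mean(thetas, axis=0)
    return new_thetas, new_theta_c

# A toy quadratic loss f(theta) = 0.5 * ||theta - target||^2, so the
# gradient is simply (theta - target).
target = np.array([1.0, -2.0])
grad = lambda t: t - target

rng = np.random.default_rng(1)
# Workers start from different random points; the center starts at zero.
thetas = [rng.normal(size=2) for _ in range(4)]
theta_c = np.zeros(2)

for _ in range(500):
    thetas, theta_c = easgd_round(thetas, theta_c, eta=0.1, rho=0.5, grad=grad)
```

Lowering `rho` in this sketch lets the workers wander further from `theta_c` before the elastic term pulls them back, which is exactly the exploration trade-off discussed here.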
Having two vectors, the gradient <span class=\"inlined-math math\">\\nabla f</span>, and the elastic difference <span class=\"inlined-math math\">\\eta\\rho(\\theta^i_t - \\theta^c_t)</span> where <span class=\"inlined-math math\">\\eta</span> is the learning rate and <span class=\"inlined-math math\">\\rho</span> is the elasticity parameter, the authors argue that when <span class=\"inlined-math math\">\\rho</span> is small, you allow for more exploration of the parameter space. This can be observed from Figure 5. When <span class=\"inlined-math math\">\\rho</span> is small, the vector <span class=\"inlined-math math\">\\eta\\rho(\\theta^i_t - \\theta^c_t)</span> will be small as well (unless the distance between the worker and the center variable is large). As a result, the attraction between the center variable and the worker is small, thus allowing for more exploration of the parameter space.</p>\n          <p>Analogously, imagine that you are walking with your dog, and the dog is responsible for getting you home (guiding you to a minimum). If you let your dog drift too far away from you (because your leash is very flexible, i.e., a small <span class=\"inlined-math math\">\\rho</span>), then in the most extreme case the dog will get home without you, because your leash was simply too flexible. As a result, the dog could not pull you home. At this point you think: maybe I should buy more dogs, thinking that together they will help you. However, due to the nature of these creatures, you soon realize that instead of going home, they all go to different places (multiple workers in the parameter space having different inputs, e.g., one dog sees a particular tree, while another dog sees a bush, etc.). From this experience, you notice that the problem is the leash: it is way too flexible, which is why the dogs are all over the place. 
As a result, you buy a less flexible leash, with the effect that the dogs stay closer to you, and eventually \"pull\" together to bring you home faster.</p>\n          <div class=\"blog-figure-container\">\n            <img src=\"img/easgd.png\" alt=\"EASGD\">\n            <p><strong>Figure 5:</strong> The worker variable <span class=\"inlined-math math\">w</span> is exploring the parameter space in order to optimize <span class=\"inlined-math math\">C</span>. However, the amount of exploration is proportional to the elasticity parameter <span class=\"inlined-math math\">\\rho</span>, and the difference <span class=\"inlined-math math\">(\\theta^w_t - \\theta^C_t)</span>. In general, when <span class=\"inlined-math math\">\\rho</span> is small, you allow for more exploration to occur. Note that, as in model averaging, too much exploration will actually deteriorate the statistical performance of a model (as shown in the first subfigure of Figure 4), because the workers do not agree on a good solution. This is especially true when you take into account that the center variable is updated using an average of the worker variables, as shown in Equation 2.</p>\n          </div>\n          <p>Now, from Equation 1 and the intuition above, we can expect that for some worker update <span class=\"inlined-math math\">i</span> within a communication window, the accumulated gradient is larger than or equal to the elastic force. As a result, this prevents the workers from further exploration (as expected). However, a significant side-effect is that the following gradient computations are <i>wasted</i> since they are countered by the elastic difference, as shown in Figure 5. Using the analogy from above, this is equivalent to a situation where no matter how hard a dog is trying to pull, you just don't let it go any further. Thus, the efforts of the dog are wasted. 
This condition is described by Equation 3.</p>\n          <div class=\"equation-math math\">\n            \\left\\|\\sum_i - \\eta\\nabla f(x_{t + i};\\theta_{t + i})\\right\\| \\geq \\left\\|\\eta\\rho(\\theta_{t + i + 1} - \\theta_t^C)\\right\\|\n          </div>\n          <p>A straightforward technique to prevent the squandering of computations once the condition described by Equation 3 holds, is to simply check for this condition after the computation of every mini-batch. When this condition is met, the term <span class=\"inlined-math math\">\\sum_i - \\eta\\nabla f(x_{t + i};\\theta_{t + i})</span> is communicated to the parameter server. As a result, we do not waste any computations, and furthermore, we lose a hyperparameter since the communication window is now controlled (indirectly) by the hyperparameter <span class=\"inlined-math math\">\\rho</span>, which controls the elastic force. In essence, the core idea of ADAG (which will be discussed later in this blog post) can also be applied to this scheme to further improve the quality of the gradient updates, and to make the optimization scheme less sensitive to other hyperparameters, e.g., the number of parallel workers.</p>\n        </section>\n      </section>\n      <section>\n        <h3>Asynchronous Data Parallel Methods</h3>\n        <p>In order to overcome the significant delays induced by loaded workers in synchronous data parallelism, and thereby decrease the training time even further, let us simply remove the synchronization constraint. However, this introduces several other effects, and some of them are not very obvious. The conceptually simplest is <i>parameter staleness</i>. Parameter staleness is simply the number of commits <i>other workers</i> performed between the last pull (center variable synchronization) and the last commit (parameter update) of the current worker. Intuitively, this implies that a worker is updating a \"newer\" model using gradients based on a previous parametrization of that model. 
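</p>
          <p>To make this notion concrete, the toy sketch below counts how many commits <i>other</i> workers performed between a worker's pull and its own commit. The class and method names are purely illustrative, not dist-keras code:</p>

```python
# Toy illustration of parameter staleness: a global commit counter acts
# as a logical clock on the parameter server.
class ToyParameterServer:
    def __init__(self):
        self.num_commits = 0  # Total number of commits so far.

    def pull(self):
        # A worker records the clock value at pull time.
        return self.num_commits

    def commit(self, clock_at_pull):
        # Staleness: commits by other workers since this worker pulled.
        staleness = self.num_commits - clock_at_pull
        self.num_commits += 1
        return staleness

ps = ToyParameterServer()
clock_a = ps.pull()        # Worker A pulls the center variable.
clock_b = ps.pull()        # Worker B pulls as well.
print(ps.commit(clock_b))  # B commits first: staleness 0.
print(ps.commit(clock_a))  # A commits next; B committed in between: staleness 1.
```

          <p>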
This is shown in Figure 6.</p>\n        <div class=\"blog-figure-container\">\n          <img src=\"img/asynchronous-data-parallelism.png\" alt=\"Asynchronous Data Parallelism\">\n          <p><strong>Figure 6:</strong> In asynchronous data parallelism, training time is even further reduced (on average) due to the removal of the synchronization mechanism in synchronous data parallelism. However, this induces several effects such as <i>parameter staleness</i>, and <i>asynchrony induced momentum</i>.</p>\n        </div>\n        <p><strong>Note:</strong> It is not required to read the paragraphs below, unless you really want to. However, the take-away point is: <i>increasing the number of parallel workers behaves like adding more momentum</i>.</p>\n        <p>The other, less intuitive side-effect is <i>asynchrony induced momentum</i> <a href=\"https://arxiv.org/abs/1605.09774\">[1]</a>. Roughly stated, this means that adding more workers to the problem also adds more <i>implicit momentum</i> to the optimization process. This <i>implicit momentum</i> is the result of the queuing model required by asynchrony. Note that some approaches, such as <i><a href=\"https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf\" alt=\"Hogwild!\">Hogwild!</a></i> do not require locking mechanisms, since they assume sparse gradient updates. However, distributed SGD works with dense gradient updates as well. We also confirm the statements of the authors that adding more asynchronous workers to the problem actually deteriorates the statistical performance of the model when using algorithms which do not take staleness and asynchrony into account. Furthermore, they state that the behaviour of an asynchronous algorithm is roughly described by Equation 4. 
This implies that the implicit momentum produced by asynchrony is <span class=\"inlined-math math\">(1 - \\frac{1}{n})</span>.</p>\n        <div class=\"equation-math math\">E[\\theta_{t+1} - \\theta_t] = (1 - \\frac{1}{n})E[\\theta_t - \\theta_{t-1}] - \\frac{\\eta}{n} E[\\nabla f_i(\\theta_t)]</div>\n        <p>But personally, I think this is not the complete story. I agree with the nicely formalized queueing model, and that in general, an increase in the number of asynchronous workers decreases the statistical performance of a model (we also observe this in our experiments). However, I would say that the effect rather <i>behaves like</i> momentum, but cannot necessarily be defined as such (with ADAG, we do not observe this effect, at least for 30 parallel processes). We will go more in-depth into this topic in the following blog posts, since this is still a topic that requires some more research on my part.</p>\n        <section>\n          <h4>Asynchronous EASGD</h4>\n          <p>The update scheme of asynchronous EASGD is quite similar; however, there are some important differences. In the following paragraphs we will call the vector <span class=\"inlined-math math\">- \\eta\\rho(\\theta^i_t - \\theta^c_t)</span> the <i>elastic difference</i>, thereby following the notation of the paper. Remember that in the synchronous version this vector is actually used to enforce the exploration policy. Meaning, in Equation 1 this vector has the task of not letting a worker drift too \"far\" from the center variable. Repeating the analogy with the dogs, imagine having a dog with an elastic leash. The further the dog walks away from you (the center variable), the stronger the force will be to pull it back. As a result, at some point the force the dog exerts will be equal to the force the elastic leash exerts in the opposite direction. At this point, the dog cannot move any further. 
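</p>
          <p>This force balance can be made concrete with a scalar sketch of the worker update in Equation 1. The function and variable names below are my own, and real parameters are vectors rather than scalars:</p>

```python
# One EASGD worker step: follow the gradient, while the elastic
# difference pulls the worker back towards the center variable.
def easgd_worker_update(theta_worker, theta_center, gradient, eta, rho):
    elastic_difference = eta * rho * (theta_worker - theta_center)
    return theta_worker - eta * gradient - elastic_difference

# With a small rho the pull towards the center is weak, so the worker
# drifts further away (more exploration) than with a large rho.
print(easgd_worker_update(2.0, 0.0, -1.0, 0.1, 0.1))  # 2.08
print(easgd_worker_update(2.0, 0.0, -1.0, 0.1, 5.0))  # 1.1
```

          <p>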
This is exactly what happens when the elastic difference is applied to a worker, as shown in Figure 5.</p>\n          <p>In the asynchronous version, the elastic difference has the same function. However, it will also be used to update the center variable. As stated in the paragraph above, the elastic difference is actually used to limit exploration. However, if we negate the elastic difference, which gives <span class=\"inlined-math math\">+ \\eta\\rho(\\theta^i_t - \\theta^c_t)</span>, then the elastic difference can be used to optimize the center variable (reverse the arrow in Figure 5), while still holding true to the communication constraints EASGD is trying to solve.</p>\n        </section>\n        <section>\n          <h4>DOWNPOUR</h4>\n          <p>In DOWNPOUR, whenever a worker computes a gradient (or a sequence of gradients), the gradient is communicated to the parameter server. When the parameter server receives the gradient update from a worker, it will incorporate the update into the center variable, as shown in Figure 7. Contrary to EASGD, DOWNPOUR does not assume any communication constraints. What is more, if frequent communication with the parameter server does not take place (which keeps the variance of the workers small), DOWNPOUR will not converge (this is also related to the <i>asynchrony induced momentum</i> issue, see Figure 8). This is because of the same issues discussed in the sections above. If we allow the workers to explore \"too much\" of the parameter space, then the workers will not work together on finding a good solution for the center variable. Furthermore, DOWNPOUR does not have any intrinsic mechanisms in place to remain in the neighbourhood of the center variable. 
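</p>
          <p>The communication pattern can be sketched as follows, with scalars instead of parameter vectors; the function below is illustrative and not taken from the dist-keras implementation:</p>

```python
# DOWNPOUR-style window: the worker accumulates the updates of several
# mini-batches, then commits the accumulated update to the server.
def downpour_window(theta_center, gradients, eta):
    theta_worker = theta_center  # Freshly pulled center variable.
    accumulated = 0.0
    for g in gradients:          # One gradient per mini-batch.
        theta_worker -= eta * g
        accumulated += -eta * g
    # The parameter server applies the accumulated update as-is.
    return theta_center + accumulated

# A window of 3 mini-batch gradients moves the center by their full
# sum, so a longer window produces a proportionally longer update.
print(downpour_window(1.0, [0.5, 0.25, 0.25], 0.1))  # 0.9
```

          <p>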
As a result, if you increase the communication window, you <i>proportionally</i> increase the length of the gradient which is sent to the parameter server; thus, the center variable is updated more aggressively in order to keep the variance of the workers in the parameter space \"small\".</p>\n          <div class=\"blog-figure-container\">\n            <img src=\"img/downpour.gif\" alt=\"DOWNPOUR animated\">\n            <p><strong>Figure 7:</strong> Animation of DOWNPOUR with 20 parallel workers (blue) with identical learning rates which are trying to optimize a single objective (center variable, red) compared to regular <i>sequential</i> gradient descent (green). From this animation we can observe the momentum induced by the asynchrony of the parallel workers, as discussed above.</p>\n          </div>\n          <div class=\"blog-figure-container\">\n            <img src=\"img/downpour_momentum.gif\" alt=\"DOWNPOUR with too much implicit momentum\">\n            <p><strong>Figure 8:</strong> Animation of DOWNPOUR with 40 parallel workers. In this case, the implicit momentum produced by the number of workers causes the algorithm to diverge.</p>\n          </div>\n        </section>\n        <section>\n          <h4>ADAG</h4>\n          <p>We noticed that a large communication window is correlated with a decline in model performance. Using some simulations (like DOWNPOUR, as shown above), we noticed that this effect can be mitigated when you normalize the accumulated gradient with the communication window. This has several positive effects: for one, you are not normalizing with respect to the number of parallel workers, and as a result you are not losing the (convergence speed) benefit of parallelizing gradient descent. As a side-effect, the variance of the workers with respect to the center variable will also remain small, thus contributing positively to the central objective! 
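</p>
          <p>A minimal sketch of this normalization, mirroring the DOWNPOUR-style accumulation but dividing by the window size (scalar toy code with illustrative names):</p>

```python
# ADAG-style window: accumulate the mini-batch updates of one
# communication window, then normalize by the window size before
# committing to the parameter server.
def adag_window(theta_center, gradients, eta):
    accumulated = sum(-eta * g for g in gradients)
    tau = len(gradients)  # The communication window.
    return theta_center + accumulated / tau

# The committed update no longer grows with the window length, which
# keeps the variance of the workers w.r.t. the center variable small.
print(adag_window(1.0, [0.5, 0.25, 0.25], 0.1))
```

          <p>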
Furthermore, because of the normalization, you are less sensitive to hyperparametrization, especially regarding the communication window. However, it must be said that a large communication window typically also degrades the performance of the model, because you allow the workers to explore more of the parameter space using the samples from their data shard. In our first prototype, we adapted DOWNPOUR to fit this idea. We observed the following results. First, we observe a significant increase in model performance, even compared to a sequential optimization scheme such as Adam. Second, compared to DOWNPOUR, we can increase the communication window by a factor of 3, thus allowing us to utilize the CPU resources more efficiently and to decrease the total training time even further. Finally, normalizing the accumulated gradient allows us to increase the communication window. As a result, we are able to match the training time of EASGD, and achieve roughly the same model performance (sometimes better, sometimes worse).</p>\n          <div class=\"equation-math math\">\n            \\Large \\frac{\\sum_{i=0}^{\\tau}-\\eta\\nabla f(x_{t + i};\\theta_{t + i})}{\\tau}\n          </div>\n          <p>To conclude, the core idea of ADAG, or <i>asynchronous distributed adaptive gradients</i>, can be applied to any distributed optimization scheme. Using our observations and intuition (especially with respect to <i>implicit momentum due to asynchrony</i>), we can make a calculated guess that normalized accumulated gradients will benefit other distributed optimization schemes as well. However, we need to conduct several experiments in order to verify this claim. 
ADAG will be discussed in detail in the following blog posts.</p>\n        </section>\n      </section>\n    </section>\n    <section>\n      <h2>Distributed Keras</h2>\n      <p><a href=\"https://github.com/cerndb/dist-keras/\" alt=\"GitHub Distributed Keras\">Distributed Keras</a> is a distributed deep learning framework built on top of <a href=\"https://spark.apache.org\" alt=\"Apache Spark\">Apache Spark</a> and <a href=\"https://keras.io/\" alt=\"Keras\">Keras</a> with the goal of significantly reducing training time using distributed machine learning algorithms, and allowing for larger-than-memory datasets. This project initially started as a prototype with the CMS collaboration. However, the project has seen several iterations already since its start in August 2016.</p>\n      <h3>Architecture</h3>\n      <p>In essence, a training procedure is passed on to the Spark workers as a lambda function. However, in order to pass on multiple parameters, e.g., the port number of the parameter server, we wrap everything in an object, and define a <i>train</i> function which accepts the parameters required by Spark. To give a complete overview, let us assume that a user just called the train method shown in the code block below. The trainer object will first allocate and start a parameter server on the Spark driver. Next, it allocates the worker procedure, which holds all parameters and procedures to train the Keras model. Furthermore, in order to comply with the required number of parallel workers, we partition the dataset into this specific number of partitions. However, when processing big datasets it is recommended to increase the parallelism factor; this prevents some workers from remaining idle while other (slower) workers are still processing their old batch (in literature, this is known as the <i>straggler</i> problem). 
In such cases, we recommend a parallelism factor of 3, as suggested by the <a href=\"https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism\" alt=\"Tuning parallelism\">Spark documentation</a>.</p>\n      <p>One does, however, need to consider the implications of a large parallelism factor. Basically, the parallelism factor is proportional to the number of partitions you will create. So let's say you assign 20 workers to a specific task, with a parallelism factor of 3. Spark will then repartition the dataset into 60 shards. Now, before a worker starts processing a partition, it first has to load all the Python libraries which are required to process the task; next, it also has to deserialize and compile the Keras model. This induces a significant overhead. So this technique is only effective on heterogeneous systems (meaning, systems with different hardware or a variable load per worker), and with large datasets, due to the large \"warmup\" overhead.</p>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/distkeras-architecture.png\" alt=\"Distributed Keras architecture\">\n        <p><strong>Figure 9:</strong> Imagine we have a Spark Context with <span class=\"inlined-math math\">k</span> executors, and <span class=\"inlined-math math\">l</span> cores per executor. Using these parameters, there will be <span class=\"inlined-math math\">n = k \\cdot l</span> workers allocated by dist-keras. 
However, if you would like to use a smaller amount of parallel workers, you can simply parameterize the training algorithm without having to reinitialize the Spark Context.</p>\n      </div>\n      <pre>\n        <code class=\"python hljs\">\n        # Allocate the parameter server.\n        self.parameter_server = self.allocate_parameter_server()\n        # Start the communication service.\n        self.start_service()\n        # Allocate a worker.\n        worker = self.allocate_worker()\n        # Repartition in order to fit the number of workers.\n        num_partitions = dataframe.rdd.getNumPartitions()\n        # Assign the dataset.\n        dataset = dataframe\n        if shuffle:\n            dataset = shuffle(dataset)\n        if num_partitions > self.num_workers:\n            dataset = dataset.coalesce(self.num_workers)\n        else:\n            dataset = dataset.repartition(self.num_workers)\n        dataset.cache()\n        # Iterate through the epochs (some trainers require a result).\n        dataset.rdd.mapPartitionsWithIndex(worker.train).collect()\n        # Stop the communication service.\n        self.stop_service()\n\n        return self.parameter_server.get_model()\n        </code>\n      </pre>\n    </section>\n    <section>\n      <h2>Experiments</h2>\n      <p>In the following experiments we set up the different optimization schemes against each other, i.e. (sequential) Adam, (distributed) DOWNPOUR, Asynchronous EASGD, and ADAG, and evaluate them using the MNIST dataset (samples are shown in Figure 10).  
We will use the following parameters during our experiments:</p>\n      <ul>\n        <li>Multilayer perceptron with 1 000 000 trainable parameters (~4 MB model; the complete model is summarized below)</li>\n        <li>Mini-batches of 4 samples</li>\n        <li>1 epoch</li>\n        <li>Parallelism factor: 1</li>\n        <li><a href=\"https://arxiv.org/abs/1412.6980\" alt=\"Adam optimizer\">Adam</a> as worker optimizer</li>\n        <li>Communication windows:\n          <ul>\n            <li>DOWNPOUR: 5</li>\n            <li>ADAG: 5</li>\n            <li>Asynchronous EASGD: 32</li>\n          </ul>\n        </li>\n        <li>20 parallel workers:\n          <ul>\n            <li>10 compute nodes with 10 Gbps network cards</li>\n            <li>2 processes per compute node (32 cores)</li>\n          </ul>\n        </li>\n      </ul>\n      <pre>\n        <code class=\"python hljs\">\nmlp = Sequential()\nmlp.add(Dense(1000, input_shape=(784,)))\nmlp.add(Activation('relu'))\nmlp.add(Dropout(0.2))\nmlp.add(Dense(200))\nmlp.add(Activation('relu'))\nmlp.add(Dropout(0.2))\nmlp.add(Dense(10))\nmlp.add(Activation('softmax'))\n        </code>\n      </pre>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/mnist.png\" alt=\"MNIST normalized\">\n        <p><strong>Figure 10:</strong> The MNIST dataset is a collection of handwritten digits. This dataset is usually used as a \"unit test\" for optimization algorithms. Every sample consists of 784 pixels, with values ranging between 0 and 255. We normalize these using our framework <a href=\"https://github.com/cerndb/dist-keras\" alt=\"Distributed Keras\">dist-keras</a>, which is built on top of Apache Spark, thus profiting from the parallelization.</p>\n      </div>\n      <p>In the following experiments we evaluate the accuracy of the center variable, and the training time (wallclock) compared to the number of parallel workers. 
Although this is a relatively small dataset, it gives us some indication of the scaling abilities of the optimization schemes. In the following blog posts we will also focus on large scale deep learning, meaning, we will handle very large datasets and train models in a data parallel setting.</p>\n      <h3>DOWNPOUR</h3>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/experiment_downpour.png\" alt=\"DOWNPOUR experiment MNIST\">\n        <p><strong>Figure 11:</strong> A key observation in this experiment is that DOWNPOUR actually <i>diverges</i> when it reaches a critical amount of <i>implicit momentum</i>, as shown in Figure 8. We made this observation on several other datasets as well. However, the steadily declining performance described by the authors in <a href=\"https://arxiv.org/abs/1605.09774\">[1]</a> is not observed. Rather, we see a sudden decline in model performance. This is rather contradictory to the claims made in <a href=\"https://arxiv.org/abs/1605.09774\">[1]</a>. According to their theory, we should not see a sudden decline in model performance, but rather a steady decline. As a result, we think that their statement <i>\"there exists a limit to asynchrony\"</i> is false as well. Though, their intuition is correct! Furthermore, on the left, we see the scaling of the algorithm. We actually expected the scaling to be better; however, this could be due to the unbalanced partitions (we are doing experiments with other partitioners to correct for this) and the relatively small dataset.</p>\n      </div>\n      <h3>Asynchronous EASGD</h3>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/experiment_easgd.png\" alt=\"Asynchronous EASGD experiment MNIST\">\n        <p><strong>Figure 12:</strong> As stated above, EASGD is an algorithm designed with communication constraints in mind, which is a realistic constraint. 
As a result, the authors incorporate an <i>elastic force</i> which allows the worker to explore a certain area of the neighbouring parameter space w.r.t. the center variable. Consequently, it will not show an immediate decline in model performance, as observed in DOWNPOUR, but rather a steady decline. This decline (with respect to the number of workers) is due to the increased amount of staleness (since the center variable will have covered more distance because of the queuing model) compared to the worker. As a result, the positive information a worker can contribute is proportional to the elastic difference, and this elastic difference will be smaller when the number of parallel workers is higher (due to parameter staleness). However, since EASGD scales very well with the number of workers, we can simply match the training time of ADAG or DOWNPOUR. However, even if we match the training time, EASGD usually results in a lower accuracy compared to ADAG. This phenomenon is subject to further study, as it is not yet completely understood why this happens. Furthermore, EASGD also consumes more CPU compared to ADAG if we match the model performance of ADAG (ADAG spends a lot of time waiting for network operations).</p>\n      </div>\n      <h3>ADAG</h3>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/experiment_adag.png\" alt=\"ADAG experiment MNIST\">\n        <p><strong>Figure 13:</strong> If we assume no communication constraints, then how would we solve the problem DOWNPOUR has? Averaging the gradients would work, but it is not very desirable since the gradients would act as if they were produced by a sequential optimization algorithm. So what if we normalize with respect to the communication window, since this really is the parameter which induces parameter staleness, as can be observed from Figure 12 (declining model performance)? 
An interesting observation we can make here is the absence of any decline in model performance (compared to DOWNPOUR and EASGD). We think this is due to one of the following reasons: for one, we keep the variance of the workers small (limited exploration), and we normalize the accumulated gradient on the workers with the communication window (which is a prime factor in implicit momentum).</p>\n      </div>\n      <h3>Influence of the communication window on accuracy and training time</h3>\n      <p>In the following set of experiments we will investigate the influence of the communication window on accuracy and training time. The communication window is a hyperparameter which defines the frequency of communication with the parameter server. A communication window of 35 means that a worker will accumulate 35 mini-batch updates, and only then synchronize with the parameter server. In the experiments, all optimization schemes use identical hyperparameters, where the only variable between tests is the communication window. As before, we will use MNIST as a dataset, a mini-batch of size 4, and Adam as the worker optimizer.</p>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/experiment_communication_window_accuracy.png\" alt=\"Influence of communication window on accuracy\">\n        <p><strong>Figure 14:</strong> As expected, DOWNPOUR is not able to handle large communication windows. EASGD on the other hand is not able to handle small communication windows! As stated above, this is because the elastic force (due to the number of workers) is stronger than the exploration of the parameter space, thus causing EASGD not to converge. ADAG on the other hand is able to handle the varying communication window; however, a slight decline in model performance is observed. 
This is expected due to the increase in exploration of the parameter space by the workers.</p>\n      </div>\n      <div class=\"blog-figure-container\">\n        <img src=\"img/experiment_communication_window_training_time.png\" alt=\"Influence of communication window on training time\">\n        <p><strong>Figure 15:</strong> Again, the training time of all optimization schemes decreases significantly when the communication window is increased. However, we think we can further decrease the training time by allocating a thread in every worker whose sole responsibility is to send the parameters to the parameter server. However, this is an idea that has yet to be explored. To conclude, we suggest making a trade-off between training time and accuracy. In the case of ADAG, we recommend a communication window of 10-15, since this hyperparametrization achieves similar model performance. However, when applying this to a different dataset, we recommend that you test these settings for yourself, since they can differ.</p>\n      </div>\n    </section>\n    <section>\n      <h2>Summary</h2>\n      <p>In this work we gave the reader an introduction to the problem of distributed deep learning, and some of the aspects which one needs to consider when applying it, such as, for example, <i>implicit momentum</i>. We also suggested some techniques which are able to significantly improve existing distributed optimization schemes. Furthermore, we introduced our framework, <a href=\"https://github.com/cerndb/dist-keras\" alt=\"Distributed Keras\">dist-keras</a>, and applied different distributed optimization schemes to the MNIST dataset. Finally, we also provided several production-ready <a href=\"https://github.com/cerndb/dist-keras/tree/master/examples\" alt=\"Distributed Keras examples\">examples and notebooks</a>.</p>\n    </section>\n    <section>\n      <h2>Acknowledgements</h2>\n      <p>This work was done as part of my Technical Student contract at CERN IT. 
I would like to thank Zbigniew Baranowski and Luca Canali of the IT-DB group, Volodimir Begy of the University of Vienna, and Jean-Roch Vlimant, Maurizio Pierini, and Federico Presutti (CalTech) of the EP-UCM group for their collaboration on this work.</p>\n    </section>\n    <section>\n      <h2>References</h2>\n      <ol>\n        <li><a href=\"https://arxiv.org/abs/1605.09774\">Mitliagkas, I., Zhang, C., Hadjis, S., & Ré, C. (2016). Asynchrony begets Momentum, with an Application to Deep Learning. arXiv preprint arXiv:1605.09774.</a></li>\n        <li><a href=\"https://www.researchgate.net/profile/Gs_Corrado/publication/266225209_Large_Scale_Distributed_Deep_Networks/links/544672590cf2f14fb80f3c77.pdf\">Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., ... & Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223-1231).</a></li>\n        <li><a href=\"http://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd\">Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (pp. 685-693).</a></li>\n        <li><a href=\"http://yann.lecun.com/exdb/mnist/\">The MNIST database of handwritten digits.</a></li>\n      </ol>\n    </section>\n  </body>\n  <script type=\"text/javascript\" src=\"js/katex.min.js\"></script>\n  <script type=\"text/javascript\" src=\"js/highlight.pack.js\"></script>\n  <script type=\"text/javascript\" src=\"js/main.js\"></script>\n</html>\n"
  },
  {
    "path": "scripts/generate_secret.py",
    "content": "\"\"\"Generates a JSON structure that needs to be added to the\nsecrets file.\n\nAuthor: Joeri Hermans\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nimport json\n\nimport optparse\n\nimport random\n\nimport string\n\n## END Imports. ################################################################\n\ndef generate_secret(identity):\n    secret = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(64))\n    d = {}\n    d['secret'] = secret\n    d['identity'] = identity\n    print(json.dumps(d))\n\ndef parse_arguments():\n    parser = optparse.OptionParser()\n    parser.set_defaults(identity=None)\n    parser.add_option('--identity', action='store', dest='identity', type='string')\n    (options, args) = parser.parse_args()\n\n    return options\n\ndef main():\n    # Parse the options.\n    options = parse_arguments()\n    # Check if an identity has been provided.\n    if options.identity is not None:\n        generate_secret(options.identity)\n    else:\n        print(\"Please specify an identity (--identity).\")\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "scripts/punchcard.py",
    "content": "\"\"\"Script which starts the Punchcard daemon. Punchcard will accept remote job\nrequests and execute them on the local cluster.\n\nAuthor: Joeri Hermans\n\"\"\"\n\n## BEGIN Imports. ##############################################################\n\nfrom distkeras.job_deployment import Job\nfrom distkeras.job_deployment import Punchcard\n\nimport os\n\nimport sys\n\nimport optparse\n\n## END Imports. ################################################################\n\ndef parse_arguments():\n    parser = optparse.OptionParser()\n    parser.set_defaults(port=8000, secrets_path='secrets.json')\n    parser.add_option('--port', action='store', dest='port', type='int')\n    parser.add_option('--secrets', action='store', dest='secrets_path', type='string')\n    (options, args) = parser.parse_args()\n\n    return options\n\ndef start_punchcard(port, secrets):\n    punchcard = Punchcard(secrets, port)\n    punchcard.run()\n\ndef main():\n    # Parse the program arguments.\n    options = parse_arguments()\n    port = options.port\n    secrets_path = options.secrets_path\n    # Start the Punchcard instance.\n    start_punchcard(port, secrets_path)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "setup.py",
    "content": "\"\"\"Setup-module for DistKeras.\n\nThis software enables distributed Machine Learning on Apache Spark using Keras.\n\nSee:\nhttps://github.com/JoeriHermans/dist-keras/\nhttp://joerihermans.com/\n\"\"\"\n\nfrom setuptools import setup\nfrom setuptools import find_packages\n\nsetup(name='dist-keras',\n      description='Distributed Deep Learning with Apache Spark and Keras.',\n      url='https://github.com/JoeriHermans/dist-keras',\n      author='Joeri Hermans',\n      version='0.2.1',\n      author_email='joeri@joerihermans.com',\n      license='GPLv3',\n      install_requires=['theano', 'tensorflow', 'keras', 'flask'],\n      packages=['distkeras'],\n      package_data={'distkeras': ['distkeras/*.py']},\n      # Keywords related to the project.\n      keywords=['Keras', 'Deep Learning', 'Machine Learning', 'Theano', 'Tensorflow', 'Distributed', 'Apache Spark'],\n)\n"
  }
]