[
  {
    "path": ".gitignore",
    "content": "*~\n*.dSYM\n.DS_Store\n*-debug\n*-s\n*-l\ncentrifuge.xcodeproj/project.xcworkspace\ncentrifuge.xcodeproj/xcuserdata\ncentrifuge.xcodeproj/xcshareddata\n\ncentrifuge-build-bin\ncentrifuge-buildc\ncentrifuge-class\ncentrifuge-inspect-bin\n"
  },
  {
    "path": "AUTHORS",
    "content": "Ben Langmead <langmea@cs.jhu.edu> wrote Bowtie 2, which is based partially on\nBowtie.  Bowtie was written by Ben Langmead and Cole Trapnell.\n\n  Bowtie & Bowtie 2:  http://bowtie-bio.sf.net\n\nA DLL from the pthreads for Win32 library is distributed with the Win32 version\nof Bowtie 2.  The pthreads for Win32 library and the GnuWin32 package have many\ncontributors (see their respective web sites).\n\n  pthreads for Win32: http://sourceware.org/pthreads-win32\n  GnuWin32:           http://gnuwin32.sf.net\n\nThe ForkManager.pm Perl module is used in Bowtie 2's random testing framework,\nand is included as scripts/sim/contrib/ForkManager.pm.  ForkManager.pm is\nwritten by dLux (Szabo, Balazs), with contributions by others.  See the perldoc\nin ForkManager.pm for the complete list.\n\nThe file ls.h includes an implementation of the Larsson-Sadakane suffix sorting\nalgorithm.  The implementation is by N. Jesper Larsson and was adapted somewhat\nfor use in Bowtie 2.\n\nTinyThreads is a portable thread implementation with a fairly compatible subset\nof C++11 thread management classes written by Marcus Geelnard.  For more info,\ncheck http://tinythreadpp.bitsnbites.eu/\n\nVarious users have kindly supplied patches, bug reports and feature requests\nover the years.  Many, many thanks go to them.\n\nSeptember 2011\n"
  },
  {
    "path": "LICENSE",
    "content": "                    GNU GENERAL PUBLIC LICENSE\n                       Version 3, 29 June 2007\n\n Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>\n Everyone is permitted to copy and distribute verbatim copies\n of this license document, but changing it is not allowed.\n\n                            Preamble\n\n  The GNU General Public License is a free, copyleft license for\nsoftware and other kinds of works.\n\n  The licenses for most software and other practical works are designed\nto take away your freedom to share and change the works.  By contrast,\nthe GNU General Public License is intended to guarantee your freedom to\nshare and change all versions of a program--to make sure it remains free\nsoftware for all its users.  We, the Free Software Foundation, use the\nGNU General Public License for most of our software; it applies also to\nany other work released this way by its authors.  You can apply it to\nyour programs, too.\n\n  When we speak of free software, we are referring to freedom, not\nprice.  Our General Public Licenses are designed to make sure that you\nhave the freedom to distribute copies of free software (and charge for\nthem if you wish), that you receive source code or can get it if you\nwant it, that you can change the software or use pieces of it in new\nfree programs, and that you know you can do these things.\n\n  To protect your rights, we need to prevent others from denying you\nthese rights or asking you to surrender the rights.  Therefore, you have\ncertain responsibilities if you distribute copies of the software, or if\nyou modify it: responsibilities to respect the freedom of others.\n\n  For example, if you distribute copies of such a program, whether\ngratis or for a fee, you must pass on to the recipients the same\nfreedoms that you received.  You must make sure that they, too, receive\nor can get the source code.  
And you must show them these terms so they\nknow their rights.\n\n  Developers that use the GNU GPL protect your rights with two steps:\n(1) assert copyright on the software, and (2) offer you this License\ngiving you legal permission to copy, distribute and/or modify it.\n\n  For the developers' and authors' protection, the GPL clearly explains\nthat there is no warranty for this free software.  For both users' and\nauthors' sake, the GPL requires that modified versions be marked as\nchanged, so that their problems will not be attributed erroneously to\nauthors of previous versions.\n\n  Some devices are designed to deny users access to install or run\nmodified versions of the software inside them, although the manufacturer\ncan do so.  This is fundamentally incompatible with the aim of\nprotecting users' freedom to change the software.  The systematic\npattern of such abuse occurs in the area of products for individuals to\nuse, which is precisely where it is most unacceptable.  Therefore, we\nhave designed this version of the GPL to prohibit the practice for those\nproducts.  If such problems arise substantially in other domains, we\nstand ready to extend this provision to those domains in future versions\nof the GPL, as needed to protect the freedom of users.\n\n  Finally, every program is threatened constantly by software patents.\nStates should not allow patents to restrict development and use of\nsoftware on general-purpose computers, but in those that do, we wish to\navoid the special danger that patents applied to a free program could\nmake it effectively proprietary.  To prevent this, the GPL assures that\npatents cannot be used to render the program non-free.\n\n  The precise terms and conditions for copying, distribution and\nmodification follow.\n\n                       TERMS AND CONDITIONS\n\n  0. 
Definitions.\n\n  \"This License\" refers to version 3 of the GNU General Public License.\n\n  \"Copyright\" also means copyright-like laws that apply to other kinds of\nworks, such as semiconductor masks.\n\n  \"The Program\" refers to any copyrightable work licensed under this\nLicense.  Each licensee is addressed as \"you\".  \"Licensees\" and\n\"recipients\" may be individuals or organizations.\n\n  To \"modify\" a work means to copy from or adapt all or part of the work\nin a fashion requiring copyright permission, other than the making of an\nexact copy.  The resulting work is called a \"modified version\" of the\nearlier work or a work \"based on\" the earlier work.\n\n  A \"covered work\" means either the unmodified Program or a work based\non the Program.\n\n  To \"propagate\" a work means to do anything with it that, without\npermission, would make you directly or secondarily liable for\ninfringement under applicable copyright law, except executing it on a\ncomputer or modifying a private copy.  Propagation includes copying,\ndistribution (with or without modification), making available to the\npublic, and in some countries other activities as well.\n\n  To \"convey\" a work means any kind of propagation that enables other\nparties to make or receive copies.  Mere interaction with a user through\na computer network, with no transfer of a copy, is not conveying.\n\n  An interactive user interface displays \"Appropriate Legal Notices\"\nto the extent that it includes a convenient and prominently visible\nfeature that (1) displays an appropriate copyright notice, and (2)\ntells the user that there is no warranty for the work (except to the\nextent that warranties are provided), that licensees may convey the\nwork under this License, and how to view a copy of this License.  If\nthe interface presents a list of user commands or options, such as a\nmenu, a prominent item in the list meets this criterion.\n\n  1. 
Source Code.\n\n  The \"source code\" for a work means the preferred form of the work\nfor making modifications to it.  \"Object code\" means any non-source\nform of a work.\n\n  A \"Standard Interface\" means an interface that either is an official\nstandard defined by a recognized standards body, or, in the case of\ninterfaces specified for a particular programming language, one that\nis widely used among developers working in that language.\n\n  The \"System Libraries\" of an executable work include anything, other\nthan the work as a whole, that (a) is included in the normal form of\npackaging a Major Component, but which is not part of that Major\nComponent, and (b) serves only to enable use of the work with that\nMajor Component, or to implement a Standard Interface for which an\nimplementation is available to the public in source code form.  A\n\"Major Component\", in this context, means a major essential component\n(kernel, window system, and so on) of the specific operating system\n(if any) on which the executable work runs, or a compiler used to\nproduce the work, or an object code interpreter used to run it.\n\n  The \"Corresponding Source\" for a work in object code form means all\nthe source code needed to generate, install, and (for an executable\nwork) run the object code and to modify the work, including scripts to\ncontrol those activities.  However, it does not include the work's\nSystem Libraries, or general-purpose tools or generally available free\nprograms which are used unmodified in performing those activities but\nwhich are not part of the work.  
For example, Corresponding Source\nincludes interface definition files associated with source files for\nthe work, and the source code for shared libraries and dynamically\nlinked subprograms that the work is specifically designed to require,\nsuch as by intimate data communication or control flow between those\nsubprograms and other parts of the work.\n\n  The Corresponding Source need not include anything that users\ncan regenerate automatically from other parts of the Corresponding\nSource.\n\n  The Corresponding Source for a work in source code form is that\nsame work.\n\n  2. Basic Permissions.\n\n  All rights granted under this License are granted for the term of\ncopyright on the Program, and are irrevocable provided the stated\nconditions are met.  This License explicitly affirms your unlimited\npermission to run the unmodified Program.  The output from running a\ncovered work is covered by this License only if the output, given its\ncontent, constitutes a covered work.  This License acknowledges your\nrights of fair use or other equivalent, as provided by copyright law.\n\n  You may make, run and propagate covered works that you do not\nconvey, without conditions so long as your license otherwise remains\nin force.  You may convey covered works to others for the sole purpose\nof having them make modifications exclusively for you, or provide you\nwith facilities for running those works, provided that you comply with\nthe terms of this License in conveying all material for which you do\nnot control copyright.  Those thus making or running the covered works\nfor you must do so exclusively on your behalf, under your direction\nand control, on terms that prohibit them from making any copies of\nyour copyrighted material outside their relationship with you.\n\n  Conveying under any other circumstances is permitted solely under\nthe conditions stated below.  Sublicensing is not allowed; section 10\nmakes it unnecessary.\n\n  3. 
Protecting Users' Legal Rights From Anti-Circumvention Law.\n\n  No covered work shall be deemed part of an effective technological\nmeasure under any applicable law fulfilling obligations under article\n11 of the WIPO copyright treaty adopted on 20 December 1996, or\nsimilar laws prohibiting or restricting circumvention of such\nmeasures.\n\n  When you convey a covered work, you waive any legal power to forbid\ncircumvention of technological measures to the extent such circumvention\nis effected by exercising rights under this License with respect to\nthe covered work, and you disclaim any intention to limit operation or\nmodification of the work as a means of enforcing, against the work's\nusers, your or third parties' legal rights to forbid circumvention of\ntechnological measures.\n\n  4. Conveying Verbatim Copies.\n\n  You may convey verbatim copies of the Program's source code as you\nreceive it, in any medium, provided that you conspicuously and\nappropriately publish on each copy an appropriate copyright notice;\nkeep intact all notices stating that this License and any\nnon-permissive terms added in accord with section 7 apply to the code;\nkeep intact all notices of the absence of any warranty; and give all\nrecipients a copy of this License along with the Program.\n\n  You may charge any price or no price for each copy that you convey,\nand you may offer support or warranty protection for a fee.\n\n  5. Conveying Modified Source Versions.\n\n  You may convey a work based on the Program, or the modifications to\nproduce it from the Program, in the form of source code under the\nterms of section 4, provided that you also meet all of these conditions:\n\n    a) The work must carry prominent notices stating that you modified\n    it, and giving a relevant date.\n\n    b) The work must carry prominent notices stating that it is\n    released under this License and any conditions added under section\n    7.  
This requirement modifies the requirement in section 4 to\n    \"keep intact all notices\".\n\n    c) You must license the entire work, as a whole, under this\n    License to anyone who comes into possession of a copy.  This\n    License will therefore apply, along with any applicable section 7\n    additional terms, to the whole of the work, and all its parts,\n    regardless of how they are packaged.  This License gives no\n    permission to license the work in any other way, but it does not\n    invalidate such permission if you have separately received it.\n\n    d) If the work has interactive user interfaces, each must display\n    Appropriate Legal Notices; however, if the Program has interactive\n    interfaces that do not display Appropriate Legal Notices, your\n    work need not make them do so.\n\n  A compilation of a covered work with other separate and independent\nworks, which are not by their nature extensions of the covered work,\nand which are not combined with it such as to form a larger program,\nin or on a volume of a storage or distribution medium, is called an\n\"aggregate\" if the compilation and its resulting copyright are not\nused to limit the access or legal rights of the compilation's users\nbeyond what the individual works permit.  Inclusion of a covered work\nin an aggregate does not cause this License to apply to the other\nparts of the aggregate.\n\n  6. 
Conveying Non-Source Forms.\n\n  You may convey a covered work in object code form under the terms\nof sections 4 and 5, provided that you also convey the\nmachine-readable Corresponding Source under the terms of this License,\nin one of these ways:\n\n    a) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by the\n    Corresponding Source fixed on a durable physical medium\n    customarily used for software interchange.\n\n    b) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by a\n    written offer, valid for at least three years and valid for as\n    long as you offer spare parts or customer support for that product\n    model, to give anyone who possesses the object code either (1) a\n    copy of the Corresponding Source for all the software in the\n    product that is covered by this License, on a durable physical\n    medium customarily used for software interchange, for a price no\n    more than your reasonable cost of physically performing this\n    conveying of source, or (2) access to copy the\n    Corresponding Source from a network server at no charge.\n\n    c) Convey individual copies of the object code with a copy of the\n    written offer to provide the Corresponding Source.  This\n    alternative is allowed only occasionally and noncommercially, and\n    only if you received the object code with such an offer, in accord\n    with subsection 6b.\n\n    d) Convey the object code by offering access from a designated\n    place (gratis or for a charge), and offer equivalent access to the\n    Corresponding Source in the same way through the same place at no\n    further charge.  You need not require recipients to copy the\n    Corresponding Source along with the object code.  
If the place to\n    copy the object code is a network server, the Corresponding Source\n    may be on a different server (operated by you or a third party)\n    that supports equivalent copying facilities, provided you maintain\n    clear directions next to the object code saying where to find the\n    Corresponding Source.  Regardless of what server hosts the\n    Corresponding Source, you remain obligated to ensure that it is\n    available for as long as needed to satisfy these requirements.\n\n    e) Convey the object code using peer-to-peer transmission, provided\n    you inform other peers where the object code and Corresponding\n    Source of the work are being offered to the general public at no\n    charge under subsection 6d.\n\n  A separable portion of the object code, whose source code is excluded\nfrom the Corresponding Source as a System Library, need not be\nincluded in conveying the object code work.\n\n  A \"User Product\" is either (1) a \"consumer product\", which means any\ntangible personal property which is normally used for personal, family,\nor household purposes, or (2) anything designed or sold for incorporation\ninto a dwelling.  In determining whether a product is a consumer product,\ndoubtful cases shall be resolved in favor of coverage.  For a particular\nproduct received by a particular user, \"normally used\" refers to a\ntypical or common use of that class of product, regardless of the status\nof the particular user or of the way in which the particular user\nactually uses, or expects or is expected to use, the product.  
A product\nis a consumer product regardless of whether the product has substantial\ncommercial, industrial or non-consumer uses, unless such uses represent\nthe only significant mode of use of the product.\n\n  \"Installation Information\" for a User Product means any methods,\nprocedures, authorization keys, or other information required to install\nand execute modified versions of a covered work in that User Product from\na modified version of its Corresponding Source.  The information must\nsuffice to ensure that the continued functioning of the modified object\ncode is in no case prevented or interfered with solely because\nmodification has been made.\n\n  If you convey an object code work under this section in, or with, or\nspecifically for use in, a User Product, and the conveying occurs as\npart of a transaction in which the right of possession and use of the\nUser Product is transferred to the recipient in perpetuity or for a\nfixed term (regardless of how the transaction is characterized), the\nCorresponding Source conveyed under this section must be accompanied\nby the Installation Information.  But this requirement does not apply\nif neither you nor any third party retains the ability to install\nmodified object code on the User Product (for example, the work has\nbeen installed in ROM).\n\n  The requirement to provide Installation Information does not include a\nrequirement to continue to provide support service, warranty, or updates\nfor a work that has been modified or installed by the recipient, or for\nthe User Product in which it has been modified or installed.  
Access to a\nnetwork may be denied when the modification itself materially and\nadversely affects the operation of the network or violates the rules and\nprotocols for communication across the network.\n\n  Corresponding Source conveyed, and Installation Information provided,\nin accord with this section must be in a format that is publicly\ndocumented (and with an implementation available to the public in\nsource code form), and must require no special password or key for\nunpacking, reading or copying.\n\n  7. Additional Terms.\n\n  \"Additional permissions\" are terms that supplement the terms of this\nLicense by making exceptions from one or more of its conditions.\nAdditional permissions that are applicable to the entire Program shall\nbe treated as though they were included in this License, to the extent\nthat they are valid under applicable law.  If additional permissions\napply only to part of the Program, that part may be used separately\nunder those permissions, but the entire Program remains governed by\nthis License without regard to the additional permissions.\n\n  When you convey a copy of a covered work, you may at your option\nremove any additional permissions from that copy, or from any part of\nit.  (Additional permissions may be written to require their own\nremoval in certain cases when you modify the work.)  
You may place\nadditional permissions on material, added by you to a covered work,\nfor which you have or can give appropriate copyright permission.\n\n  Notwithstanding any other provision of this License, for material you\nadd to a covered work, you may (if authorized by the copyright holders of\nthat material) supplement the terms of this License with terms:\n\n    a) Disclaiming warranty or limiting liability differently from the\n    terms of sections 15 and 16 of this License; or\n\n    b) Requiring preservation of specified reasonable legal notices or\n    author attributions in that material or in the Appropriate Legal\n    Notices displayed by works containing it; or\n\n    c) Prohibiting misrepresentation of the origin of that material, or\n    requiring that modified versions of such material be marked in\n    reasonable ways as different from the original version; or\n\n    d) Limiting the use for publicity purposes of names of licensors or\n    authors of the material; or\n\n    e) Declining to grant rights under trademark law for use of some\n    trade names, trademarks, or service marks; or\n\n    f) Requiring indemnification of licensors and authors of that\n    material by anyone who conveys the material (or modified versions of\n    it) with contractual assumptions of liability to the recipient, for\n    any liability that these contractual assumptions directly impose on\n    those licensors and authors.\n\n  All other non-permissive additional terms are considered \"further\nrestrictions\" within the meaning of section 10.  If the Program as you\nreceived it, or any part of it, contains a notice stating that it is\ngoverned by this License along with a term that is a further\nrestriction, you may remove that term.  
If a license document contains\na further restriction but permits relicensing or conveying under this\nLicense, you may add to a covered work material governed by the terms\nof that license document, provided that the further restriction does\nnot survive such relicensing or conveying.\n\n  If you add terms to a covered work in accord with this section, you\nmust place, in the relevant source files, a statement of the\nadditional terms that apply to those files, or a notice indicating\nwhere to find the applicable terms.\n\n  Additional terms, permissive or non-permissive, may be stated in the\nform of a separately written license, or stated as exceptions;\nthe above requirements apply either way.\n\n  8. Termination.\n\n  You may not propagate or modify a covered work except as expressly\nprovided under this License.  Any attempt otherwise to propagate or\nmodify it is void, and will automatically terminate your rights under\nthis License (including any patent licenses granted under the third\nparagraph of section 11).\n\n  However, if you cease all violation of this License, then your\nlicense from a particular copyright holder is reinstated (a)\nprovisionally, unless and until the copyright holder explicitly and\nfinally terminates your license, and (b) permanently, if the copyright\nholder fails to notify you of the violation by some reasonable means\nprior to 60 days after the cessation.\n\n  Moreover, your license from a particular copyright holder is\nreinstated permanently if the copyright holder notifies you of the\nviolation by some reasonable means, this is the first time you have\nreceived notice of violation of this License (for any work) from that\ncopyright holder, and you cure the violation prior to 30 days after\nyour receipt of the notice.\n\n  Termination of your rights under this section does not terminate the\nlicenses of parties who have received copies or rights from you under\nthis License.  
If your rights have been terminated and not permanently\nreinstated, you do not qualify to receive new licenses for the same\nmaterial under section 10.\n\n  9. Acceptance Not Required for Having Copies.\n\n  You are not required to accept this License in order to receive or\nrun a copy of the Program.  Ancillary propagation of a covered work\noccurring solely as a consequence of using peer-to-peer transmission\nto receive a copy likewise does not require acceptance.  However,\nnothing other than this License grants you permission to propagate or\nmodify any covered work.  These actions infringe copyright if you do\nnot accept this License.  Therefore, by modifying or propagating a\ncovered work, you indicate your acceptance of this License to do so.\n\n  10. Automatic Licensing of Downstream Recipients.\n\n  Each time you convey a covered work, the recipient automatically\nreceives a license from the original licensors, to run, modify and\npropagate that work, subject to this License.  You are not responsible\nfor enforcing compliance by third parties with this License.\n\n  An \"entity transaction\" is a transaction transferring control of an\norganization, or substantially all assets of one, or subdividing an\norganization, or merging organizations.  If propagation of a covered\nwork results from an entity transaction, each party to that\ntransaction who receives a copy of the work also receives whatever\nlicenses to the work the party's predecessor in interest had or could\ngive under the previous paragraph, plus a right to possession of the\nCorresponding Source of the work from the predecessor in interest, if\nthe predecessor has it or can get it with reasonable efforts.\n\n  You may not impose any further restrictions on the exercise of the\nrights granted or affirmed under this License.  
For example, you may\nnot impose a license fee, royalty, or other charge for exercise of\nrights granted under this License, and you may not initiate litigation\n(including a cross-claim or counterclaim in a lawsuit) alleging that\nany patent claim is infringed by making, using, selling, offering for\nsale, or importing the Program or any portion of it.\n\n  11. Patents.\n\n  A \"contributor\" is a copyright holder who authorizes use under this\nLicense of the Program or a work on which the Program is based.  The\nwork thus licensed is called the contributor's \"contributor version\".\n\n  A contributor's \"essential patent claims\" are all patent claims\nowned or controlled by the contributor, whether already acquired or\nhereafter acquired, that would be infringed by some manner, permitted\nby this License, of making, using, or selling its contributor version,\nbut do not include claims that would be infringed only as a\nconsequence of further modification of the contributor version.  For\npurposes of this definition, \"control\" includes the right to grant\npatent sublicenses in a manner consistent with the requirements of\nthis License.\n\n  Each contributor grants you a non-exclusive, worldwide, royalty-free\npatent license under the contributor's essential patent claims, to\nmake, use, sell, offer for sale, import and otherwise run, modify and\npropagate the contents of its contributor version.\n\n  In the following three paragraphs, a \"patent license\" is any express\nagreement or commitment, however denominated, not to enforce a patent\n(such as an express permission to practice a patent or covenant not to\nsue for patent infringement).  
To \"grant\" such a patent license to a\nparty means to make such an agreement or commitment not to enforce a\npatent against the party.\n\n  If you convey a covered work, knowingly relying on a patent license,\nand the Corresponding Source of the work is not available for anyone\nto copy, free of charge and under the terms of this License, through a\npublicly available network server or other readily accessible means,\nthen you must either (1) cause the Corresponding Source to be so\navailable, or (2) arrange to deprive yourself of the benefit of the\npatent license for this particular work, or (3) arrange, in a manner\nconsistent with the requirements of this License, to extend the patent\nlicense to downstream recipients.  \"Knowingly relying\" means you have\nactual knowledge that, but for the patent license, your conveying the\ncovered work in a country, or your recipient's use of the covered work\nin a country, would infringe one or more identifiable patents in that\ncountry that you have reason to believe are valid.\n\n  If, pursuant to or in connection with a single transaction or\narrangement, you convey, or propagate by procuring conveyance of, a\ncovered work, and grant a patent license to some of the parties\nreceiving the covered work authorizing them to use, propagate, modify\nor convey a specific copy of the covered work, then the patent license\nyou grant is automatically extended to all recipients of the covered\nwork and works based on it.\n\n  A patent license is \"discriminatory\" if it does not include within\nthe scope of its coverage, prohibits the exercise of, or is\nconditioned on the non-exercise of one or more of the rights that are\nspecifically granted under this License.  
You may not convey a covered\nwork if you are a party to an arrangement with a third party that is\nin the business of distributing software, under which you make payment\nto the third party based on the extent of your activity of conveying\nthe work, and under which the third party grants, to any of the\nparties who would receive the covered work from you, a discriminatory\npatent license (a) in connection with copies of the covered work\nconveyed by you (or copies made from those copies), or (b) primarily\nfor and in connection with specific products or compilations that\ncontain the covered work, unless you entered into that arrangement,\nor that patent license was granted, prior to 28 March 2007.\n\n  Nothing in this License shall be construed as excluding or limiting\nany implied license or other defenses to infringement that may\notherwise be available to you under applicable patent law.\n\n  12. No Surrender of Others' Freedom.\n\n  If conditions are imposed on you (whether by court order, agreement or\notherwise) that contradict the conditions of this License, they do not\nexcuse you from the conditions of this License.  If you cannot convey a\ncovered work so as to satisfy simultaneously your obligations under this\nLicense and any other pertinent obligations, then as a consequence you may\nnot convey it at all.  For example, if you agree to terms that obligate you\nto collect a royalty for further conveying from those to whom you convey\nthe Program, the only way you could satisfy both those terms and this\nLicense would be to refrain entirely from conveying the Program.\n\n  13. Use with the GNU Affero General Public License.\n\n  Notwithstanding any other provision of this License, you have\npermission to link or combine any covered work with a work licensed\nunder version 3 of the GNU Affero General Public License into a single\ncombined work, and to convey the resulting work.  
The terms of this\nLicense will continue to apply to the part which is the covered work,\nbut the special requirements of the GNU Affero General Public License,\nsection 13, concerning interaction through a network will apply to the\ncombination as such.\n\n  14. Revised Versions of this License.\n\n  The Free Software Foundation may publish revised and/or new versions of\nthe GNU General Public License from time to time.  Such new versions will\nbe similar in spirit to the present version, but may differ in detail to\naddress new problems or concerns.\n\n  Each version is given a distinguishing version number.  If the\nProgram specifies that a certain numbered version of the GNU General\nPublic License \"or any later version\" applies to it, you have the\noption of following the terms and conditions either of that numbered\nversion or of any later version published by the Free Software\nFoundation.  If the Program does not specify a version number of the\nGNU General Public License, you may choose any version ever published\nby the Free Software Foundation.\n\n  If the Program specifies that a proxy can decide which future\nversions of the GNU General Public License can be used, that proxy's\npublic statement of acceptance of a version permanently authorizes you\nto choose that version for the Program.\n\n  Later license versions may give you additional or different\npermissions.  However, no additional obligations are imposed on any\nauthor or copyright holder as a result of your choosing to follow a\nlater version.\n\n  15. Disclaimer of Warranty.\n\n  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY\nAPPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT\nHOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \"AS IS\" WITHOUT WARRANTY\nOF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,\nTHE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\nPURPOSE.  
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM\nIS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF\nALL NECESSARY SERVICING, REPAIR OR CORRECTION.\n\n  16. Limitation of Liability.\n\n  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING\nWILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS\nTHE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY\nGENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE\nUSE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF\nDATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD\nPARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),\nEVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF\nSUCH DAMAGES.\n\n  17. Interpretation of Sections 15 and 16.\n\n  If the disclaimer of warranty and limitation of liability provided\nabove cannot be given local legal effect according to their terms,\nreviewing courts shall apply local law that most closely approximates\nan absolute waiver of all civil liability in connection with the\nProgram, unless a warranty or assumption of liability accompanies a\ncopy of the Program in return for a fee.\n\n                     END OF TERMS AND CONDITIONS\n\n            How to Apply These Terms to Your New Programs\n\n  If you develop a new program, and you want it to be of the greatest\npossible use to the public, the best way to achieve this is to make it\nfree software which everyone can redistribute and change under these terms.\n\n  To do so, attach the following notices to the program.  
It is safest\nto attach them to the start of each source file to most effectively\nstate the exclusion of warranty; and each file should have at least\nthe \"copyright\" line and a pointer to where the full notice is found.\n\n    <one line to give the program's name and a brief idea of what it does.>\n    Copyright (C) <year>  <name of author>\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see <http://www.gnu.org/licenses/>.\n\nAlso add information on how to contact you by electronic and paper mail.\n\n  If the program does terminal interaction, make it output a short\nnotice like this when it starts in an interactive mode:\n\n    <program>  Copyright (C) <year>  <name of author>\n    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.\n    This is free software, and you are welcome to redistribute it\n    under certain conditions; type `show c' for details.\n\nThe hypothetical commands `show w' and `show c' should show the appropriate\nparts of the General Public License.  
Of course, your program's commands\nmight be different; for a GUI interface, you would use an \"about box\".\n\n  You should also get your employer (if you work as a programmer) or school,\nif any, to sign a \"copyright disclaimer\" for the program, if necessary.\nFor more information on this, and how to apply and follow the GNU GPL, see\n<http://www.gnu.org/licenses/>.\n\n  The GNU General Public License does not permit incorporating your program\ninto proprietary programs.  If your program is a subroutine library, you\nmay consider it more useful to permit linking proprietary applications with\nthe library.  If this is what you want to do, use the GNU Lesser General\nPublic License instead of this License.  But first, please read\n<http://www.gnu.org/philosophy/why-not-lgpl.html>.\n"
  },
  {
    "path": "MANUAL",
    "content": "\nIntroduction\n============\n\nWhat is Centrifuge?\n-----------------\n\n[Centrifuge] is a novel microbial classification engine that enables\nrapid, accurate, and sensitive labeling of reads and quantification of\nspecies on desktop computers.  The system uses a novel indexing scheme\nbased on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini\n(FM) index, optimized specifically for the metagenomic classification\nproblem. Centrifuge requires a relatively small index (5.8 GB for all\ncomplete bacterial and viral genomes plus the human genome) and\nclassifies sequences at a very high speed, allowing it to process the\nmillions of reads from a typical high-throughput DNA sequencing run\nwithin a few minutes.  Together these advances enable timely and\naccurate analysis of large metagenomics data sets on conventional\ndesktop computers.\n\n[Centrifuge]:     http://www.ccb.jhu.edu/software/centrifuge\n\n[Burrows-Wheeler Transform]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\n[FM Index]:        http://en.wikipedia.org/wiki/FM-index\n\n[GPLv3 license]:   http://www.gnu.org/licenses/gpl-3.0.html\n\nObtaining Centrifuge\n==================\n\nDownload Centrifuge and binaries from the Releases sections on the right side.\nBinaries are available for Intel architectures (`x86_64`) running Linux, and Mac OS X.\n\nBuilding from source\n--------------------\n\nBuilding Centrifuge from source requires a GNU-like environment with GCC, GNU Make\nand other basics.  It should be possible to build Centrifuge on most vanilla Linux\ninstallations or on a Mac installation with [Xcode] installed.  Centrifuge can\nalso be built on Windows using [Cygwin] or [MinGW] (MinGW recommended). For a \nMinGW build the choice of what compiler is to be used is important since this\nwill determine if a 32 or 64 bit code can be successfully compiled using it. 
If \nthere is a need to generate both 32 and 64 bit on the same machine then a multilib \nMinGW has to be properly installed. [MSYS], the [zlib] library, and depending on \narchitecture [pthreads] library are also required. We recommend a 64 bit\nbuild since it has clear advantages for real-life research problems. To \nsimplify the MinGW setup it might be worth investigating popular MinGW personal \nbuilds, since these come already prepared with most of the toolchains needed.\n\nFirst, download the [source package] from the Releases section on the right side.\nUnzip the file, change to the unzipped directory, and build the\nCentrifuge tools by running GNU `make` (usually with the command `make`, but\nsometimes with `gmake`) with no arguments.  If building with MinGW, run `make`\nfrom the MSYS environment.\n\nCentrifuge uses multithreading to speed up \nexecution on SMP architectures where this is possible. On POSIX \nplatforms (such as Linux and Mac OS) it needs the pthread library. Although\nit is possible to use the pthread library on a non-POSIX platform like Windows, for\nperformance reasons Centrifuge will try to use Windows native multithreading\nif possible.\n\nFor SRA data access in Centrifuge, please download and install the [NCBI-NGS] toolkit.\nWhen running `make`, specify additional variables as follows:\n`make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory`,\nwhere `NCBI_NGS_DIR` and `NCBI_VDB_DIR` will be used in the Makefile for the -I and -L compilation options.\nFor example, $(NCBI_NGS_DIR)/include and $(NCBI_NGS_DIR)/lib64 will be used.  
 \n\n[Cygwin]:   http://www.cygwin.com/\n[MinGW]:    http://www.mingw.org/\n[MSYS]:     http://www.mingw.org/wiki/msys\n[zlib]:     http://cygwin.com/packages/mingw-zlib/\n[pthreads]: http://sourceware.org/pthreads-win32/\n[GnuWin32]: http://gnuwin32.sf.net/packages/coreutils.htm\n[Xcode]:    http://developer.apple.com/xcode/\n[Github site]: https://github.com/infphilo/centrifuge\n[NCBI-NGS]: https://github.com/ncbi/ngs/wiki/Downloads\n\nRunning Centrifuge\n=============\n\nAdding to PATH\n--------------\n\nBy adding your new Centrifuge directory to your [PATH environment variable], you\nensure that whenever you run `centrifuge`, `centrifuge-build`, `centrifuge-download` or `centrifuge-inspect`\nfrom the command line, you will get the version you just installed without\nhaving to specify the entire path.  This is recommended for most users.  To do\nthis, follow your operating system's instructions for adding the directory to\nyour [PATH].\n\nIf you would like to install Centrifuge by copying the Centrifuge executable files\nto an existing directory in your [PATH], make sure that you copy all the\nexecutables, including `centrifuge`, `centrifuge-class`, `centrifuge-build`, `centrifuge-build-bin`, `centrifuge-download`, `centrifuge-inspect`\nand `centrifuge-inspect-bin`. Furthermore, you need the programs\nin the scripts/ folder if you opt for genome compression in the database construction.\n\n[PATH environment variable]: http://en.wikipedia.org/wiki/PATH_(variable)\n[PATH]: http://en.wikipedia.org/wiki/PATH_(variable)\n\nBefore running Centrifuge\n-----------------\n\nClassification is considerably different from alignment in that classification is performed on a large set of genomes as opposed to on just one reference genome as in alignment.  Currently, an enormous number of complete genomes are available in GenBank (e.g. >4,000 bacterial genomes, >10,000 viral genomes, …).  
These genomes are organized in a taxonomic tree where each genome is located at the bottom of the tree, at the strain or subspecies level.  On the taxonomic tree, genomes have ancestors usually situated at the species level, and those ancestors also have ancestors at the genus level and so on up through the family, order, class, phylum and kingdom levels, and finally to the root level.\n\nGiven the gigantic number of genomes available, which continues to expand at a rapid rate, and the development of the taxonomic tree, which continues to evolve with new advancements in research, we have designed Centrifuge to be flexible and general enough to reflect this huge database.  We provide several standard indexes that will meet most users’ needs (see the side panel - Indexes).  In our approach our indexes not only include raw genome sequences, but also genome names/sizes and taxonomic trees.  This enables users to perform additional analyses on Centrifuge’s classification output without the need to download extra database sources.  This also eliminates the potential issue of discrepancy between the indexes we provide and the databases users may otherwise download.  We plan to provide a couple of additional standard indexes in the near future, and update the indexes on a regular basis.\n\nWe encourage first-time users to take a look at and follow a `small example` that illustrates how to build an index, how to run Centrifuge using the index, how to interpret the classification results, and how to extract additional genomic information from the index.  For those who choose to build customized indexes, please take a close look at the following description.\n\nDatabase download and index building\n-----------------\n\nCentrifuge indexes can be built with arbitrary sequences. Standard choices are\nall of the complete bacterial and viral genomes, or the sequences that\nare part of the BLAST nt database. 
Centrifuge always needs the\nnodes.dmp file from the NCBI taxonomy dump to build the taxonomy tree,\nas well as a sequence ID to taxonomy ID map. The map is a tab-separated\nfile with one sequence ID and its taxonomy ID per line.\n\nCentrifuge indices can be built on arbitrary sequences. Usually an ensemble of\ngenomes is used - such as all complete microbial genomes in the RefSeq database,\nor all sequences in the BLAST nt database. \n\nTo map sequence identifiers to taxonomy IDs, and taxonomy IDs to names and \ntheir parents, three files are necessary in addition to the sequence files:\n\n - taxonomy tree: typically nodes.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their parents\n - names file: typically names.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their scientific name\n - a tab-separated sequence ID to taxonomy ID mapping\n\nWhen using the provided scripts to download the genomes, these files are automatically downloaded or generated. \nWhen using a custom taxonomy or sequence files, please refer to the section `Custom database` to learn more about their format.\n\n### Building index on all complete bacterial and viral genomes\n\nUse `centrifuge-download` to download genomes from NCBI. The following two commands download\nthe NCBI taxonomy to `taxonomy/` in the current directory, and all complete archaeal,\nbacterial and viral genomes to `library/`. Low-complexity regions in the genomes are masked after\ndownload (parameter `-m`) using blast+'s `dustmasker`. 
`centrifuge-download` outputs tab-separated \nsequence ID to taxonomy ID mappings to standard out, which are required by `centrifuge-build`.\n\n    centrifuge-download -o taxonomy taxonomy\n    centrifuge-download -o library -m -d \"archaea,bacteria,viral\" refseq > seqid2taxid.map\n\nTo build the index, first concatenate all downloaded sequences into a single file, and then\nrun `centrifuge-build`:\n    \n    cat library/*/*.fna > input-sequences.fna\n\n    ## build centrifuge index with 4 threads\n    centrifuge-build -p 4 --conversion-table seqid2taxid.map \\\n                     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n                     input-sequences.fna abv\n\nAfter building the index, all files except the *.[123].cf index files may be removed; i.e., the files in\nthe `library/` and `taxonomy/` directories are no longer needed.\n\nIf you also want to include the human and/or the mouse genome, add their sequences to \nthe library folder before building the index, using one of the commands in the following section.\n\n### Adding human or mouse genome to the index\n\nThe human and mouse genomes can also be downloaded using `centrifuge-download`. They are in the\ndomain \"vertebrate_mammalian\" (argument `-d`), are assembled at the chromosome level (argument `-a`)\nand categorized as reference genomes by RefSeq (`-c`). The argument `-t` takes a comma-separated\nlist of taxonomy IDs - e.g. 
`9606` for human and `10090` for mouse:\n\n    # download mouse and human reference genomes\n    centrifuge-download -o library -d \"vertebrate_mammalian\" -a \"Chromosome\" -t 9606,10090 -c 'reference genome' refseq >> seqid2taxid.map\n    # only human\n    centrifuge-download -o library -d \"vertebrate_mammalian\" -a \"Chromosome\" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map\n    # only mouse\n    centrifuge-download -o library -d \"vertebrate_mammalian\" -a \"Chromosome\" -t 10090 -c 'reference genome' refseq >> seqid2taxid.map\n\n### nt database\n\nNCBI BLAST's nt database contains all spliced non-redundant coding\nsequences from multiple databases, inferred from genomic\nsequences. Traditionally used with BLAST, a download of the FASTA is\nprovided on the NCBI homepage. Building an index with any database \nrequires the user to create a sequence ID to taxonomy ID map that \ncan be generated from a GI taxid dump:\n\n    wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz\n    gunzip nt.gz && mv -v nt nt.fa\n\n    # Get mapping file\n    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz\n    gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map\n\n    # build index using 16 cores and a small bucket size, which will require less memory\n    centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map \\\n                     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n                     nt.fa nt\n\n### Custom database\n\nTo build a custom database, you need to provide the following four files to `centrifuge-build`:\n\n  - `--conversion-table`: tab-separated file mapping sequence IDs to taxonomy IDs. Sequence IDs are the header up to the first space or second pipe (`|`).  \n  - `--taxonomy-tree`: `\\t|\\t`-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree. 
When using NCBI taxonomy IDs, this will be the `nodes.dmp` from `ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz`.\n  - `--name-table`: `\\t|\\t`-separated file mapping taxonomy IDs to a name. A further column (typically column 4) must specify `scientific name`. When using NCBI taxonomy IDs, `names.dmp` is the appropriate file.\n  - reference sequences: Sequence IDs are the header up to the first space or second pipe (`|`).\n\nWhen using custom taxonomy IDs, use only positive integers (greater than or equal to `1`) and use `1` for the root of the tree.\n\n#### More info on `--taxonomy-tree` and `--name-table`\n\nThe format of these files is based on `nodes.dmp` and `names.dmp` from the NCBI taxonomy database dump. \n\n- Field terminator is `\\t|\\t`\n- Row terminator is `\\t|\\n`\n\nThe `taxonomy-tree` / nodes.dmp file consists of taxonomy nodes. The description for each node includes the following\nfields:\n\n    tax_id                  -- node id in GenBank taxonomy database\n    parent tax_id           -- parent node id in GenBank taxonomy database\n    rank                    -- rank of this node (superkingdom, kingdom, ..., no rank)\n\nFurther fields are ignored.\n\nThe `name-table` / names.dmp is the taxonomy names file:\n\n    tax_id                  -- the id of node associated with this name\n    name_txt                -- name itself\n    unique name             -- the unique variant of this name if name not unique\n    name class              -- (scientific name, synonym, common name, ...)\n\n`name class` **has** to be `scientific name` to be included in the build. 
All other lines are ignored.\n\n#### Example\n\n*Conversion table `ex.conv`*: \n    \n    Seq1\t11\n    Seq2\t12\n    Seq3\t13\n    Seq4\t11\n\n*Taxonomy tree `ex.tree`*: \n\n    1\t|\t1\t|\troot\n    10\t|\t1\t|\tkingdom\n    11\t|\t10\t|\tspecies\n    12\t|\t10\t|\tspecies\n    13\t|\t1\t|\tspecies\n\n*Name table `ex.name`*:\n\n    1\t|\troot\t|\t\t|\tscientific name\t|\n    10\t|\tBacteria\t|\t\t|\tscientific name\t|\n    11\t|\tBacterium A\t|\t\t|\tscientific name\t|\n    12\t|\tBacterium B\t|\t\t|\tscientific name\t|\n    13\t|\tSome other species\t|\t\t|\tscientific name\t|\n\n*Reference sequences `ex.fa`*:\n\n    >Seq1\n    AAAACGTACGA.....\n    >Seq2\n    AAAACGTACGA.....\n    >Seq3\n    AAAACGTACGA.....\n    >Seq4\n    AAAACGTACGA.....\n\nTo build the database, call\n\n    centrifuge-build --conversion-table ex.conv \\\n                     --taxonomy-tree ex.tree --name-table ex.name \\\n                     ex.fa ex\n\nwhich results in three index files named `ex.1.cf`, `ex.2.cf` and `ex.3.cf`.\n\n### Centrifuge classification output\n\nThe following example shows classification assignments for a read.  
The assignment output has 8 columns.\n\n    readID    seqID   taxID score\t   2ndBestScore\t   hitLength\tqueryLength\tnumMatches\n    1_1\t      gi|4    9646  4225\t   0\t\t       80\t80\t\t1\n\n    The first column is the read ID from a raw sequencing read (e.g., 1_1 in the example).\n    The second column is the sequence ID of the genomic sequence to which the read is classified (e.g., gi|4).\n    The third column is the taxonomic ID of the genomic sequence in the second column (e.g., 9646).\n    The fourth column is the score for the classification, which is the weighted sum of hits (e.g., 4225).\n    The fifth column is the score for the next best classification (e.g., 0).\n    The sixth column is an approximate number of base pairs of the read that match the genomic sequence (e.g., 80).\n    The seventh column is the length of a read or the combined length of mate pairs (e.g., 80).\n    The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).\n\n### Centrifuge summary output (the default filename is centrifuge_report.tsv)\n\nThe following example shows a classification summary for each genome or taxonomic unit.  
The summary output has 7 columns.\n\n    name      \t      \t      \t\t     \t     \t      \t     \ttaxID\ttaxRank\t   genomeSize \tnumReads   numUniqueReads   abundance\n    Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis\t36870\tleaf\t   703004\t\t5981\t   5964\t            0.0152317\n\n    The first column is the name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).\n    The second column is the taxonomic ID (e.g., 36870).\n    The third column is the taxonomic rank (e.g., leaf).\n    The fourth column is the length of the genome sequence (e.g., 703004).\n    The fifth column is the number of reads classified to this genomic sequence including multi-classified reads (e.g., 5981).\n    The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).\n    The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).\n\nAs the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments.  In order to perform more conservative analyses, users may want to discard assignments for reads having a matching length (6th column in the classification output of Centrifuge) of 40% or lower of the read length (7th column).  It may also be helpful to use the score (4th column) for filtering out some assignments.   Our future research plans include developing methods that estimate confidence scores for assignments.\n\n### Kraken-style report\n\n`centrifuge-kreport` can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:\n\n`centrifuge-kreport -x <centrifuge index> <centrifuge out file>`\n\nInspecting the Centrifuge index\n-----------------------\n\nThe index can be inspected with `centrifuge-inspect`.  
To extract raw sequences:\n\n    centrifuge-inspect <centrifuge index>\n\nExtract the sequence ID to taxonomy ID conversion table from the index:\n\n    centrifuge-inspect --conversion-table <centrifuge index>\n\nExtract the taxonomy tree from the index:\n\n    centrifuge-inspect --taxonomy-tree <centrifuge index>\n\nExtract the lengths of the sequences from the index (each row has two columns: taxonomic ID and length):\n\n    centrifuge-inspect --size-table <centrifuge index>\n\nExtract the names from the index (each row has two columns: taxonomic ID and name):\n\n    centrifuge-inspect --name-table <centrifuge index>\n    \nWrapper\n-------\n\nThe `centrifuge`, `centrifuge-build` and `centrifuge-inspect` executables are actually \nwrapper scripts that call binary programs as appropriate. Also, the `centrifuge` wrapper\nprovides some key functionality, like the ability to handle compressed inputs,\nand the functionality for `--un`, `--al` and related options.\n\nIt is recommended that you always run the centrifuge wrappers and not run the\nbinaries directly.\n\nPerformance tuning\n------------------\n\n1.  If your computer has multiple processors/cores, use `-p NTHREADS`\n\n    The `-p` option causes Centrifuge to launch a specified number of parallel\n    search threads.  Each thread runs on a different processor/core and all\n    threads find alignments in parallel, increasing alignment throughput by\n    approximately a multiple of the number of threads (though in practice,\n    speedup is somewhat worse than linear).\n\nCommand Line\n------------\n\n### Usage\n\n    centrifuge [options]* -x <centrifuge-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [--report-file <report file name> -S <classification output file name>]\n\n### Main arguments\n\n    -x <centrifuge-idx>\n\nThe basename of the index for the reference genomes.  The basename is the name of\nany of the index files up to but not including the final `.1.cf` / etc.  
 \n`centrifuge` looks for the specified index first in the current directory,\nthen in the directory specified in the `CENTRIFUGE_INDEXES` environment variable.\n\n    -1 <m1>\n\nComma-separated list of files containing mate 1s (filename usually includes\n`_1`), e.g. `-1 flyA_1.fq,flyB_1.fq`.  Sequences specified with this option must\ncorrespond file-for-file and read-for-read with those specified in `<m2>`. Reads\nmay be a mix of different lengths. If `-` is specified, `centrifuge` will read the\nmate 1s from the \"standard in\" or \"stdin\" filehandle.\n\n    -2 <m2>\n\nComma-separated list of files containing mate 2s (filename usually includes\n`_2`), e.g. `-2 flyA_2.fq,flyB_2.fq`.  Sequences specified with this option must\ncorrespond file-for-file and read-for-read with those specified in `<m1>`. Reads\nmay be a mix of different lengths. If `-` is specified, `centrifuge` will read the\nmate 2s from the \"standard in\" or \"stdin\" filehandle.\n\n    -U <r>\n\nComma-separated list of files containing unpaired reads to be aligned, e.g.\n`lane1.fq,lane2.fq,lane3.fq,lane4.fq`.  Reads may be a mix of different lengths.\nIf `-` is specified, `centrifuge` gets the reads from the \"standard in\" or \"stdin\"\nfilehandle.\n\n    --sra-acc <SRA accession number>\n\nComma-separated list of SRA accession numbers, e.g. `--sra-acc SRR353653,SRR353654`.\nInformation about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=<b>sra-acc</b>&retmode=xml,\nwhere <b>sra-acc</b> is the SRA accession number.  If users run Centrifuge on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at [SRA-MANUAL]).\n\n[SRA-MANUAL]:\t     https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration\n\n    -S <filename>\n\nFile to write classification results to.  By default, assignments are written to the\n\"standard out\" or \"stdout\" filehandle (i.e. 
the console).\n\n    --report-file <filename>\n\nFile to write a classification summary to (default: centrifuge_report.tsv).\n\n### Options\n\n#### Input options\n\n    -q\n\nReads (specified with `<m1>`, `<m2>`, `<r>`) are FASTQ files.  FASTQ files\nusually have extension `.fq` or `.fastq`.  FASTQ is the default format.  See\nalso: `--solexa-quals` and `--int-quals`.\n\n    --qseq\n\nReads (specified with `<m1>`, `<m2>`, `<r>`) are QSEQ files.  QSEQ files usually\nend in `_qseq.txt`.  See also: `--solexa-quals` and `--int-quals`.\n\n    -f\n\nReads (specified with `<m1>`, `<m2>`, `<r>`) are FASTA files.  FASTA files\nusually have extension `.fa`, `.fasta`, `.mfa`, `.fna` or similar.  FASTA files\ndo not have a way of specifying quality values, so when `-f` is set, the result\nis as if `--ignore-quals` is also set.\n\n    -r\n\nReads (specified with `<m1>`, `<m2>`, `<r>`) are files with one input sequence\nper line, without any other information (no read names, no qualities).  When\n`-r` is set, the result is as if `--ignore-quals` is also set.\n\n    -c\n\nThe read sequences are given on the command line.  I.e. `<m1>`, `<m2>` and\n`<r>` are comma-separated lists of reads rather than lists of read files.\nThere is no way to specify read names or qualities, so `-c` also implies\n`--ignore-quals`.\n\n    -s/--skip <int>\n\nSkip (i.e. do not align) the first `<int>` reads or pairs in the input.\n\n    -u/--qupto <int>\n\nAlign the first `<int>` reads or read pairs from the input (after the\n`-s`/`--skip` reads or pairs have been skipped), then stop.  Default: no limit.\n\n    -5/--trim5 <int>\n\nTrim `<int>` bases from 5' (left) end of each read before alignment (default: 0).\n\n    -3/--trim3 <int>\n\nTrim `<int>` bases from 3' (right) end of each read before alignment (default:\n0).\n\n    --phred33\n\nInput qualities are ASCII chars equal to the [Phred quality] plus 33.  
This is\nalso called the \"Phred+33\" encoding, which is used by the very latest Illumina\npipelines.\n\n[Phred quality]: http://en.wikipedia.org/wiki/Phred_quality_score\n\n    --phred64\n\nInput qualities are ASCII chars equal to the [Phred quality] plus 64.  This is\nalso called the \"Phred+64\" encoding.\n\n    --solexa-quals\n\nConvert input qualities from [Solexa][Phred quality] (which can be negative) to\n[Phred][Phred quality] (which can't).  This scheme was used in older Illumina GA\nPipeline versions (prior to 1.3).  Default: off.\n\n    --int-quals\n\nQuality values are represented in the read input file as space-separated ASCII\nintegers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....\n Integers are treated as being on the [Phred quality] scale unless\n`--solexa-quals` is also specified. Default: off.\n\n#### Classification\n\n    --min-hitlen <int>\n\nMinimum length of partial hits, which must be greater than 15 (default: 22).\n\n    -k <int>\n\nSearches for at most `<int>` distinct, primary assignments for each read or pair.  \nPrimary assignments are assignments whose score is equal to or higher than that of any other assignment.\nIf there are more primary assignments than this value, \nthe search will merge some of the assignments into a higher taxonomic rank.\nThe assignment score for a paired-end assignment equals the sum of the assignment scores of the individual mates. \nDefault: 5\n\n    --host-taxids\n\nA comma-separated list of taxonomic IDs that will be preferred in the classification procedure.\nThe descendants of these IDs will also be preferred.  If some of a read's assignments correspond to\nthese taxonomic IDs, only those corresponding assignments will be reported.\n\n    --exclude-taxids\n\nA comma-separated list of taxonomic IDs that will be excluded in the classification procedure.\nThe descendants of these IDs will also be excluded. 
\n\n#### Alignment options\n\n    --n-ceil <func>\n\nSets a function governing the maximum number of ambiguous characters (usually\n`N`s and/or `.`s) allowed in a read as a function of read length.  For instance,\nspecifying `L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`,\nwhere x is the read length.  See also: [setting function options].  Reads\nexceeding this ceiling are [filtered out].  Default: `L,0,0.15`.\n\n    --ignore-quals\n\nWhen calculating a mismatch penalty, always consider the quality value at the\nmismatched position to be the highest possible, regardless of the actual value. \nI.e. input is treated as though all quality values are high.  This is also the\ndefault behavior when the input doesn't specify quality values (e.g. in `-f`,\n`-r`, or `-c` modes).\n\n    --nofw/--norc\n\nIf `--nofw` is specified, `centrifuge` will not attempt to align unpaired reads to\nthe forward (Watson) reference strand.  If `--norc` is specified, `centrifuge` will\nnot attempt to align unpaired reads against the reverse-complement (Crick)\nreference strand. In paired-end mode, `--nofw` and `--norc` pertain to the\nfragments; i.e. specifying `--nofw` causes `centrifuge` to explore only those\npaired-end configurations corresponding to fragments from the reverse-complement\n(Crick) strand.  Default: both strands enabled. \n\n#### Paired-end options\n\n    --fr/--rf/--ff\n\nThe upstream/downstream mate orientations for a valid paired-end alignment\nagainst the forward reference strand.  E.g., if `--fr` is specified and there is\na candidate paired-end alignment where mate 1 appears upstream of the reverse\ncomplement of mate 2 and the fragment length constraints (`-I` and `-X`) are\nmet, that alignment is valid.  Also, if mate 2 appears upstream of the reverse\ncomplement of mate 1 and all other constraints are met, that too is valid.\n`--rf` likewise requires that an upstream mate 1 be reverse-complemented and a\ndownstream mate 2 be forward-oriented. 
` --ff` requires both an upstream mate 1\nand a downstream mate 2 to be forward-oriented.  Default: `--fr` (appropriate\nfor Illumina's Paired-end Sequencing Assay).\n\n#### Output options\n\n    -t/--time\n\nPrint the wall-clock time required to load the index files and align the reads. \nThis is printed to the \"standard error\" (\"stderr\") filehandle.  Default: off.\n\n    --un <path>\n    --un-gz <path>\n    --un-bz2 <path>\n\nWrite unpaired reads that fail to align to file at `<path>`.  These reads\ncorrespond to the SAM records with the FLAGS `0x4` bit set and neither the\n`0x40` nor `0x80` bits set.  If `--un-gz` is specified, output will be gzip\ncompressed. If `--un-bz2` is specified, output will be bzip2 compressed.  Reads\nwritten in this way will appear exactly as they did in the input file, without\nany modification (same sequence, same name, same quality string, same quality\nencoding).  Reads will not necessarily appear in the same order as they did in\nthe input.\n\n    --al <path>\n    --al-gz <path>\n    --al-bz2 <path>\n\nWrite unpaired reads that align at least once to file at `<path>`.  These reads\ncorrespond to the SAM records with the FLAGS `0x4`, `0x40`, and `0x80` bits\nunset.  If `--al-gz` is specified, output will be gzip compressed. If `--al-bz2`\nis specified, output will be bzip2 compressed.  Reads written in this way will\nappear exactly as they did in the input file, without any modification (same\nsequence, same name, same quality string, same quality encoding).  
Reads will\nnot necessarily appear in the same order as they did in the input.\n\n    --un-conc <path>\n    --un-conc-gz <path>\n    --un-conc-bz2 <path>\n\nWrite paired-end reads that fail to align concordantly to file(s) at `<path>`.\nThese reads correspond to the SAM records with the FLAGS `0x4` bit set and\neither the `0x40` or `0x80` bit set (depending on whether it's mate #1 or #2).\n`.1` and `.2` strings are added to the filename to distinguish which file\ncontains mate #1 and mate #2.  If a percent symbol, `%`, is used in `<path>`,\nthe percent symbol is replaced with `1` or `2` to make the per-mate filenames.\nOtherwise, `.1` or `.2` are added before the final dot in `<path>` to make the\nper-mate filenames.  Reads written in this way will appear exactly as they did\nin the input files, without any modification (same sequence, same name, same\nquality string, same quality encoding).  Reads will not necessarily appear in\nthe same order as they did in the inputs.\n\n    --al-conc <path>\n    --al-conc-gz <path>\n    --al-conc-bz2 <path>\n\nWrite paired-end reads that align concordantly at least once to file(s) at\n`<path>`. These reads correspond to the SAM records with the FLAGS `0x4` bit\nunset and either the `0x40` or `0x80` bit set (depending on whether it's mate #1\nor #2). `.1` and `.2` strings are added to the filename to distinguish which\nfile contains mate #1 and mate #2.  If a percent symbol, `%`, is used in\n`<path>`, the percent symbol is replaced with `1` or `2` to make the per-mate\nfilenames. Otherwise, `.1` or `.2` are added before the final dot in `<path>` to\nmake the per-mate filenames.  Reads written in this way will appear exactly as\nthey did in the input files, without any modification (same sequence, same name,\nsame quality string, same quality encoding).  
Reads will not necessarily appear\nin the same order as they did in the inputs.\n\n    --quiet\n\nPrint nothing besides alignments and serious errors.\n\n    --met-file <path>\n\nWrite `centrifuge` metrics to file `<path>`.  Having alignment metrics can be useful\nfor debugging certain problems, especially performance issues.  See also:\n`--met`.  Default: metrics disabled.\n\n    --met-stderr\n\nWrite `centrifuge` metrics to the \"standard error\" (\"stderr\") filehandle.  This is\nnot mutually exclusive with `--met-file`.  Having alignment metrics can be\nuseful for debugging certain problems, especially performance issues.  See also:\n`--met`.  Default: metrics disabled.\n\n    --met <int>\n\nWrite a new `centrifuge` metrics record every `<int>` seconds.  Only matters if\neither `--met-stderr` or `--met-file` is specified.  Default: 1.\n\n#### Performance options\n\n    -o/--offrate <int>\n\nOverride the offrate of the index with `<int>`.  If `<int>` is greater\nthan the offrate used to build the index, then some row markings are\ndiscarded when the index is read into memory.  This reduces the memory\nfootprint of the aligner but requires more time to calculate text\noffsets.  `<int>` must be greater than the value used to build the\nindex.\n\n    -p/--threads NTHREADS\n\nLaunch `NTHREADS` parallel search threads (default: 1).  Threads will run on\nseparate processors/cores and synchronize when parsing reads and outputting\nalignments.  Searching for alignments is highly parallel, and speedup is close\nto linear.  Increasing `-p` increases Centrifuge's memory footprint. E.g. when\naligning to a human genome index, increasing `-p` from 1 to 8 increases the\nmemory footprint by a few hundred megabytes.  This option is only available if\n`centrifuge` is linked with the `pthreads` library (i.e. 
if `BOWTIE_PTHREADS=0` is\nnot specified at build time).\n\n    --reorder\n\nGuarantees that output records are printed in an order corresponding to the\norder of the reads in the original input file, even when `-p` is set greater\nthan 1.  Specifying `--reorder` and setting `-p` greater than 1 causes Centrifuge\nto run somewhat slower and use somewhat more memory than if `--reorder` were\nnot specified.  Has no effect if `-p` is set to 1, since output order will\nnaturally correspond to input order in that case.\n\n    --mm\n\nUse memory-mapped I/O to load the index, rather than typical file I/O.\nMemory-mapping allows many concurrent `centrifuge` processes on the same computer to\nshare the same memory image of the index (i.e. you pay the memory overhead just\nonce).  This facilitates memory-efficient parallelization of `centrifuge` in\nsituations where using `-p` is not possible or not preferable.\n\n#### Other options\n\n    --qc-filter\n\nFilter out reads for which the QSEQ filter field is non-zero.  Only has an\neffect when read format is `--qseq`.  Default: off.\n\n    --seed <int>\n\nUse `<int>` as the seed for pseudo-random number generator.  Default: 0.\n\n    --non-deterministic\n\nNormally, Centrifuge re-initializes its pseudo-random generator for each read.  It\nseeds the generator with a number derived from (a) the read name, (b) the\nnucleotide sequence, (c) the quality sequence, (d) the value of the `--seed`\noption.  This means that if two reads are identical (same name, same\nnucleotides, same qualities) Centrifuge will find and report the same classification(s)\nfor both, even if there was ambiguity.  When `--non-deterministic` is specified,\nCentrifuge re-initializes its pseudo-random generator for each read using the\ncurrent time.  This means that Centrifuge will not necessarily report the same\nclassification for two identical reads.  
This is counter-intuitive for some users,\nbut might be more appropriate in situations where the input consists of many\nidentical reads.\n\n    --version\n\nPrint version information and quit.\n\n    -h/--help\n\nPrint usage information and quit.\n\nThe `centrifuge-build` indexer\n===========================\n\n`centrifuge-build` builds a Centrifuge index from a set of DNA sequences.\n`centrifuge-build` outputs a set of 3 files with suffixes `.1.cf`, `.2.cf`, and\n`.3.cf`.  These files together\nconstitute the index: they are all that is needed to align reads to that\nreference.  The original sequence FASTA files are no longer used by Centrifuge\nonce the index is built.\n\nUse of Karkkainen's [blockwise algorithm] allows `centrifuge-build` to trade off\nbetween running time and memory usage. `centrifuge-build` has two options\ngoverning how it makes this trade: `--bmax`/`--bmaxdivn`,\nand `--dcv`.  By default, `centrifuge-build` will automatically search for the\nsettings that yield the best running time without exhausting memory.  This\nbehavior can be disabled using the `-a`/`--noauto` option.\n\nThe indexer provides options pertaining to the \"shape\" of the index, e.g.\n`--offrate` governs the fraction of [Burrows-Wheeler]\nrows that are \"marked\" (i.e., the density of the suffix-array sample; see the\noriginal [FM Index] paper for details).  All of these options are potentially\nprofitable trade-offs depending on the application.  They have been set to\ndefaults that are reasonable for most cases according to our experiments.  See\n[Performance tuning] for details.\n\nThe Centrifuge index is based on the [FM Index] of Ferragina and Manzini, which in\nturn is based on the [Burrows-Wheeler] transform.  
The algorithm used to build\nthe index is based on the [blockwise algorithm] of Karkkainen.\n\n[Blockwise algorithm]: http://portal.acm.org/citation.cfm?id=1314852\n[Burrows-Wheeler]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\n\nCommand Line\n------------\n\nUsage:\n\n    centrifuge-build [options]* --conversion-table <table_in> --taxonomy-tree <taxonomy_in> --name-table <table_in2> <reference_in> <cf_base>\n\n### Main arguments\n\nA comma-separated list of FASTA files containing the reference sequences to be\naligned to, or, if `-c` is specified, the sequences\nthemselves. E.g., `<reference_in>` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`,\nor, if `-c` is specified, this might be\n`GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA`.\n\nThe basename of the index files to write.  By default, `centrifuge-build` writes\nfiles named `NAME.1.cf`, `NAME.2.cf`, and `NAME.3.cf`, where `NAME` is `<cf_base>`.\n\n### Options\n\n    -f\n\nThe reference input files (specified as `<reference_in>`) are FASTA files\n(usually having extension `.fa`, `.mfa`, `.fna` or similar).\n\n    -c\n\nThe reference sequences are given on the command line.  I.e. `<reference_in>` is\na comma-separated list of sequences rather than a list of FASTA files.\n\n    -a/--noauto\n\nDisable the default behavior whereby `centrifuge-build` automatically selects\nvalues for the `--bmax`, `--dcv` and `--packed` parameters according to\navailable memory.  Instead, user may specify values for those parameters.  If\nmemory is exhausted during indexing, an error message will be printed; it is up\nto the user to try new parameters.\n\n    -p/--threads <int>\n\nLaunch `NTHREADS` parallel search threads (default: 1).\n\n    --conversion-table <file>\n\nList of UIDs (unique ID) and corresponding taxonomic IDs.\n\n    --taxonomy-tree <file>\n\nTaxonomic tree (e.g. nodes.dmp).\n\n    --name-table <file>\n\nName table (e.g. 
names.dmp).\n\n    --size-table <file>\n\nList of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.\n\n    --bmax <int>\n\nThe maximum number of suffixes allowed in a block.  Allowing more suffixes per\nblock makes indexing faster, but increases peak memory usage.  Setting this\noption overrides any previous setting for `--bmax`, or `--bmaxdivn`. \nDefault (in terms of the `--bmaxdivn` parameter) is `--bmaxdivn` 4.  This is\nconfigured automatically by default; use `-a`/`--noauto` to configure manually.\n\n    --bmaxdivn <int>\n\nThe maximum number of suffixes allowed in a block, expressed as a fraction of\nthe length of the reference.  Setting this option overrides any previous setting\nfor `--bmax`, or `--bmaxdivn`.  Default: `--bmaxdivn` 4.  This is\nconfigured automatically by default; use `-a`/`--noauto` to configure manually.\n\n    --dcv <int>\n\nUse `<int>` as the period for the difference-cover sample.  A larger period\nyields less memory overhead, but may make suffix sorting slower, especially if\nrepeats are present.  Must be a power of 2 no greater than 4096.  Default: 1024.\n This is configured automatically by default; use `-a`/`--noauto` to configure\nmanually.\n\n    --nodc\n\nDisable use of the difference-cover sample.  Suffix sorting becomes\nquadratic-time in the worst case (where the worst case is an extremely\nrepetitive reference).  Default: off.\n\n    -o/--offrate <int>\n\nTo map alignments back to positions on the reference sequences, it's necessary\nto annotate (\"mark\") some or all of the [Burrows-Wheeler] rows with their\ncorresponding location on the genome. \n`-o`/`--offrate` governs how many rows get marked:\nthe indexer will mark every 2^`<int>` rows.  Marking more rows makes\nreference-position lookups faster, but requires more memory to hold the\nannotations at runtime.  The default is 4 (every 16th row is marked; for human\ngenome, annotations occupy about 680 megabytes).  
\n\n    -t/--ftabchars <int>\n\nThe ftab is the lookup table used to calculate an initial [Burrows-Wheeler]\nrange with respect to the first `<int>` characters of the query.  A larger\n`<int>` yields a larger lookup table but faster query times.  The ftab has size\n4^(`<int>`+1) bytes.  The default setting is 10 (ftab is 4MB).\n\n    --seed <int>\n\nUse `<int>` as the seed for pseudo-random number generator.\n\n    --kmer-count <int>\n\nUse `<int>` as kmer-size for counting the distinct number of k-mers in the input sequences.\n\n    -q/--quiet\n\n`centrifuge-build` is verbose by default.  With this option `centrifuge-build` will\nprint only error messages.\n\n    -h/--help\n\nPrint usage information and quit.\n\n    --version\n\nPrint version information and quit.\n\nThe `centrifuge-inspect` index inspector\n=====================================\n\n`centrifuge-inspect` extracts information from a Centrifuge index about what kind of\nindex it is and what reference sequences were used to build it. When run without\nany options, the tool will output a FASTA file containing the sequences of the\noriginal references (with all non-`A`/`C`/`G`/`T` characters converted to `N`s).\n It can also be used to extract just the reference sequence names using the\n`-n`/`--names` option or a more verbose summary using the `-s`/`--summary`\noption.\n\nCommand Line\n------------\n\nUsage:\n\n    centrifuge-inspect [options]* <cf_base>\n\n### Main arguments\n\nThe basename of the index to be inspected.  
The basename is name of any of the\nindex files but with the `.X.cf` suffix omitted.\n`centrifuge-inspect` first looks in the current directory for the index files, then\nin the directory specified in the `Centrifuge_INDEXES` environment variable.\n\n### Options\n\n    -a/--across <int>\n\nWhen printing FASTA output, output a newline character every `<int>` bases\n(default: 60).\n\n    -n/--names\n\nPrint reference sequence names, one per line, and quit.\n\n    -s/--summary\n\nPrint a summary that includes information about index settings, as well as the\nnames and lengths of the input sequences.  The summary has this format: \n\n    Colorspace\t<0 or 1>\n    SA-Sample\t1 in <sample>\n    FTab-Chars\t<chars>\n    Sequence-1\t<name>\t<len>\n    Sequence-2\t<name>\t<len>\n    ...\n    Sequence-N\t<name>\t<len>\n\nFields are separated by tabs.  Colorspace is always set to 0 for Centrifuge.\n\n    --conversion-table\n\nPrint a list of UIDs (unique ID) and corresponding taxonomic IDs.\n\n    --taxonomy-tree\n\nPrint taxonomic tree.\n\n    --name-table\n\nPrint name table.\n\n    --size-table\n\nPrint a list of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.\n\n    -v/--verbose\n\nPrint verbose output (for debugging).\n\n    --version\n\nPrint version information and quit.\n\n    -h/--help\n\nPrint usage information and quit.\n\nGetting started with Centrifuge\n===================================================\n\nCentrifuge comes with some example files to get you started.  The example files\nare not scientifically significant; these files will simply let you start running Centrifuge and\ndownstream tools right away.\n\nFirst follow the manual instructions to [obtain Centrifuge].  Set the `CENTRIFUGE_HOME`\nenvironment variable to point to the new Centrifuge directory containing the\n`centrifuge`, `centrifuge-build` and `centrifuge-inspect` binaries.  
This is important,\nas the `CENTRIFUGE_HOME` variable is used in the commands below to refer to that\ndirectory.\n\nIndexing a reference genome\n---------------------------\n\nTo create an index for two small sequences included with Centrifuge, create a new temporary directory (it doesn't matter where), change into that directory, and run:\n\n    $CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test\n\nThe command should print many lines of output then quit. When the command\ncompletes, the current directory will contain three new files that all start with\n`test` and end with `.1.cf`, `.2.cf`, or `.3.cf`.  These files constitute the index - you're done!\n\nYou can use `centrifuge-build` to create an index for a set of FASTA files obtained\nfrom any source, including sites such as [UCSC], [NCBI], and [Ensembl]. When\nindexing multiple FASTA files, specify all the files using commas to separate\nfile names.  For more details on how to create an index with `centrifuge-build`,\nsee the [manual section on index building].  You may also want to bypass this\nprocess by obtaining a pre-built index.\n\n[UCSC]: http://genome.ucsc.edu/cgi-bin/hgGateway\n[NCBI]: http://www.ncbi.nlm.nih.gov/sites/genome\n[Ensembl]: http://www.ensembl.org/\n\nClassifying example reads\n----------------------\n\nStay in the directory created in the previous step, which now contains the\n`test` index files.  
Next, run:\n\n    $CENTRIFUGE_HOME/centrifuge -f -x test $CENTRIFUGE_HOME/example/reads/input.fa\n\nThis runs the Centrifuge classifier, which classifies a set of unpaired reads against the\ngenomes using the index generated in the previous step.\nThe classification results are reported to stdout, and a\nshort classification summary is written to centrifuge-species_report.tsv.\n\nYou will see something like this:\n\n    readID  seqID taxID     score\t2ndBestScore\thitLength\tnumMatches\n    C_1 gi|7     9913      4225\t4225\t\t80\t\t2\n    C_1 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_2 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_2 gi|7     9913      4225\t4225\t\t80\t\t2\n    C_3 gi|7     9913      4225\t4225\t\t80\t\t2\n    C_3 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_4 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_4 gi|7     9913      4225\t4225\t\t80\t\t2\n    1_1 gi|4     9646      4225\t0\t\t80\t\t1\n    1_2 gi|4     9646      4225\t0\t\t80\t\t1\n    2_1 gi|7     9913      4225\t0\t\t80\t\t1\n    2_2 gi|7     9913      4225\t0\t\t80\t\t1\n    2_3 gi|7     9913      4225\t0\t\t80\t\t1\n    2_4 gi|7     9913      4225\t0\t\t80\t\t1\n    2_5 gi|7     9913      4225\t0\t\t80\t\t1\n    2_6 gi|7     9913      4225\t0\t\t80\t\t1\n"
  },
  {
    "path": "MANUAL.markdown",
    "content": "\n\n<!--\n ! This manual is written in \"markdown\" format and thus contains some\n ! distracting formatting clutter.  See 'MANUAL' for an easier-to-read version\n ! of this text document, or see the HTML manual online.\n ! -->\n\nIntroduction\n============\n\nWhat is Centrifuge?\n-----------------\n\n[Centrifuge] is a novel microbial classification engine that enables\nrapid, accurate, and sensitive labeling of reads and quantification of\nspecies on desktop computers.  The system uses a novel indexing scheme\nbased on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini\n(FM) index, optimized specifically for the metagenomic classification\nproblem. Centrifuge requires a relatively small index (5.8 GB for all\ncomplete bacterial and viral genomes plus the human genome) and\nclassifies sequences at a very high speed, allowing it to process the\nmillions of reads from a typical high-throughput DNA sequencing run\nwithin a few minutes.  Together these advances enable timely and\naccurate analysis of large metagenomics data sets on conventional\ndesktop computers.\n\n[Centrifuge]:     http://www.ccb.jhu.edu/software/centrifuge\n\n[Burrows-Wheeler Transform]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\n[FM Index]:        http://en.wikipedia.org/wiki/FM-index\n\n[GPLv3 license]:   http://www.gnu.org/licenses/gpl-3.0.html\n\n\nObtaining Centrifuge\n==================\n\nDownload Centrifuge and binaries from the Releases sections on the right side.\nBinaries are available for Intel architectures (`x86_64`) running Linux, and Mac OS X.\n\nBuilding from source\n--------------------\n\nBuilding Centrifuge from source requires a GNU-like environment with GCC, GNU Make\nand other basics.  It should be possible to build Centrifuge on most vanilla Linux\ninstallations or on a Mac installation with [Xcode] installed.  Centrifuge can\nalso be built on Windows using [Cygwin] or [MinGW] (MinGW recommended). 
For a \nMinGW build, the choice of compiler is important because it\ndetermines whether 32-bit or 64-bit code can be compiled successfully. If \nboth 32-bit and 64-bit builds are needed on the same machine, a multilib \nMinGW has to be properly installed. [MSYS], the [zlib] library, and, depending on \narchitecture, the [pthreads] library are also required. We recommend a 64-bit\nbuild since it has clear advantages for real-life research problems. To \nsimplify the MinGW setup, it may be worth investigating popular MinGW personal \nbuilds, since these come already prepared with most of the toolchains needed.\n\nFirst, download the [source package] from the Releases section on the right side.\nUnzip the file, change to the unzipped directory, and build the\nCentrifuge tools by running GNU `make` (usually with the command `make`, but\nsometimes with `gmake`) with no arguments.  If building with MinGW, run `make`\nfrom the MSYS environment.\n\nCentrifuge uses multithreading to speed up \nexecution on SMP architectures where this is possible. On POSIX \nplatforms (like Linux, Mac OS, etc.) it needs the pthread library. Although\nit is possible to use the pthread library on a non-POSIX platform like Windows, for\nperformance reasons Centrifuge will try to use Windows native multithreading\nif possible.\n\nTo support SRA data access in Centrifuge, please download and install the [NCBI-NGS] toolkit.\nWhen running `make`, specify additional variables as follows:\n`make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory`,\nwhere `NCBI_NGS_DIR` and `NCBI_VDB_DIR` will be used in the Makefile for the -I and -L compilation options.\nFor example, $(NCBI_NGS_DIR)/include and $(NCBI_NGS_DIR)/lib64 will be used.  
\n\n[Cygwin]:   http://www.cygwin.com/\n[MinGW]:    http://www.mingw.org/\n[MSYS]:     http://www.mingw.org/wiki/msys\n[zlib]:     http://cygwin.com/packages/mingw-zlib/\n[pthreads]: http://sourceware.org/pthreads-win32/\n[GnuWin32]: http://gnuwin32.sf.net/packages/coreutils.htm\n[Xcode]:    http://developer.apple.com/xcode/\n[Github site]: https://github.com/infphilo/centrifuge\n[NCBI-NGS]: https://github.com/ncbi/ngs/wiki/Downloads\n\nRunning Centrifuge\n=============\n\nAdding to PATH\n--------------\n\nBy adding your new Centrifuge directory to your [PATH environment variable], you\nensure that whenever you run `centrifuge`, `centrifuge-build`, `centrifuge-download` or `centrifuge-inspect`\nfrom the command line, you will get the version you just installed without\nhaving to specify the entire path.  This is recommended for most users.  To do\nthis, follow your operating system's instructions for adding the directory to\nyour [PATH].\n\nIf you would like to install Centrifuge by copying the Centrifuge executable files\nto an existing directory in your [PATH], make sure that you copy all the\nexecutables, including `centrifuge`, `centrifuge-class`, `centrifuge-build`, `centrifuge-build-bin`, `centrifuge-download`, `centrifuge-inspect`\nand `centrifuge-inspect-bin`. Furthermore, you need the programs\nin the scripts/ folder if you opt for genome compression during database construction.\n\n[PATH environment variable]: http://en.wikipedia.org/wiki/PATH_(variable)\n[PATH]: http://en.wikipedia.org/wiki/PATH_(variable)\n\n\nBefore running Centrifuge\n-----------------\n\nClassification is considerably different from alignment in that classification is performed on a large set of genomes as opposed to on just one reference genome as in alignment.  Currently, an enormous number of complete genomes are available in GenBank (e.g. >4,000 bacterial genomes, >10,000 viral genomes, …).  
These genomes are organized in a taxonomic tree where each genome is located at the bottom of the tree, at the strain or subspecies level.  On the taxonomic tree, genomes have ancestors usually situated at the species level, and those ancestors also have ancestors at the genus level and so on up through the family level, the order level, class level, phylum, kingdom, and finally the root level.\n\nGiven the gigantic number of genomes available, which continues to expand at a rapid rate, and the development of the taxonomic tree, which continues to evolve with new advancements in research, we have designed Centrifuge to be flexible and general enough to reflect this huge database.  We provide several standard indexes that will meet most users’ needs (see the side panel - Indexes).  In our approach our indexes not only include raw genome sequences, but also genome names/sizes and taxonomic trees.  This enables users to perform additional analyses on Centrifuge’s classification output without the need to download extra database sources.  This also eliminates the potential issue of discrepancy between the indexes we provide and the databases users may otherwise download.  We plan to provide a couple of additional standard indexes in the near future, and update the indexes on a regular basis.\n\nWe encourage first-time users to take a look at and follow a [`small example`] that illustrates how to build an index, how to run Centrifuge using the index, how to interpret the classification results, and how to extract additional genomic information from the index.  For those who choose to build customized indexes, please take a close look at the following description.\n\nDatabase download and index building\n-----------------\n\nCentrifuge indexes can be built with arbitrary sequences. Standard choices are\nall of the complete bacterial and viral genomes, or the sequences that\nare part of the BLAST nt database. 
Centrifuge always needs the\nnodes.dmp file from the NCBI taxonomy dump to build the taxonomy tree,\nas well as a sequence ID to taxonomy ID map. The map is a tab-separated\nfile that pairs each sequence ID with its taxonomy ID.\n\nTo map sequence identifiers to taxonomy IDs, and taxonomy IDs to their names and \nparents, three files are necessary in addition to the sequence files:\n\n - taxonomy tree: typically nodes.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their parents\n - names file: typically names.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their scientific name\n - a tab-separated sequence ID to taxonomy ID mapping\n\nWhen using the provided scripts to download the genomes, these files are automatically downloaded or generated. \nWhen using a custom taxonomy or sequence files, please refer to the section `TODO` to learn more about their format.\n\n### Building index on all complete bacterial and viral genomes\n\nUse `centrifuge-download` to download genomes from NCBI. The following two commands download\nthe NCBI taxonomy to `taxonomy/` in the current directory, and all complete archaeal,\nbacterial and viral genomes to `library/`. Low-complexity regions in the genomes are masked after\ndownload (parameter `-m`) using blast+'s `dustmasker`. 
`centrifuge-download` outputs tab-separated \nsequence ID to taxonomy ID mappings to standard out, which are required by `centrifuge-build`.\n\n    centrifuge-download -o taxonomy taxonomy\n    centrifuge-download -o library -m -d \"archaea,bacteria,viral\" refseq > seqid2taxid.map\n\nTo build the index, first concatenate all downloaded sequences into a single file, and then\nrun `centrifuge-build`:\n    \n    cat library/*/*.fna > input-sequences.fna\n\n    ## build centrifuge index with 4 threads\n    centrifuge-build -p 4 --conversion-table seqid2taxid.map \\\n                     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n                     input-sequences.fna abv\n\nAfter building the index, all files except the `*.[123].cf` index files may be removed; i.e., the files in\nthe `library/` and `taxonomy/` directories are no longer needed.\n\n### Adding human or mouse genome to the index\n\nThe human and mouse genomes can also be downloaded using `centrifuge-download`; add their sequences to\nthe library folder before building the index. They are in the\ndomain \"vertebrate_mammalian\" (argument `-d`), are assembled at the chromosome level (argument `-a`)\nand categorized as reference genomes by RefSeq (`-c`). The argument `-t` takes a comma-separated\nlist of taxonomy IDs - e.g. 
`9606` for human and `10090` for mouse:\n\n    # download mouse and human reference genomes\n    centrifuge-download -o library -d \"vertebrate_mammalian\" -a \"Chromosome\" -t 9606,10090 -c 'reference genome' refseq >> seqid2taxid.map\n    # only human\n    centrifuge-download -o library -d \"vertebrate_mammalian\" -a \"Chromosome\" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map\n    # only mouse\n    centrifuge-download -o library -d \"vertebrate_mammalian\" -a \"Chromosome\" -t 10090 -c 'reference genome' refseq >> seqid2taxid.map\n\n\n### nt database\n\nNCBI BLAST's nt database contains all spliced non-redundant coding\nsequences from multiple databases, inferred from genomic\nsequences. Traditionally used with BLAST, a download of the FASTA is\nprovided on the NCBI homepage. Building an index with any database \nrequires the user to create a sequence ID to taxonomy ID map; for nt, the map \ncan be generated from a GI taxid dump:\n\n    wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz\n    gunzip nt.gz && mv -v nt nt.fa\n\n    # Get mapping file\n    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz\n    gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map\n\n    # build index using 16 cores and a small bucket size, which will require less memory\n    centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map \\\n                     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n                     nt.fa nt\n\n\n\n### Custom database\n\nTo build a custom database, you need to provide the following four files to `centrifuge-build`:\n\n  - `--conversion-table`: tab-separated file mapping sequence IDs to taxonomy IDs. Sequence IDs are the header up to the first space or second pipe (`|`).  \n  - `--taxonomy-tree`: `\\t|\\t`-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree. 
When using NCBI taxonomy IDs, this will be the `nodes.dmp` from `ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz`.\n  - `--name-table`: `\\t|\\t`-separated file mapping taxonomy IDs to a name. A further column (typically column 4) must specify `scientific name`. When using NCBI taxonomy IDs, `names.dmp` is the appropriate file.\n  - reference sequences: The ID of each sequence is the header up to the first space or second pipe (`|`).\n\nWhen using custom taxonomy IDs, use only positive integers (greater than or equal to `1`) and use `1` for the root of the tree.\n\n#### More info on `--taxonomy-tree` and `--name-table`\n\nThe format of these files is based on `nodes.dmp` and `names.dmp` from the NCBI taxonomy database dump. \n\n- Field terminator is `\\t|\\t`\n- Row terminator is `\\t|\\n`\n\nThe `taxonomy-tree` / nodes.dmp file consists of taxonomy nodes. The description for each node includes the following\nfields:\n\n    tax_id                  -- node id in GenBank taxonomy database\n    parent tax_id           -- parent node id in GenBank taxonomy database\n    rank                    -- rank of this node (superkingdom, kingdom, ..., no rank)\n\nFurther fields are ignored.\n\nThe `name-table` / names.dmp is the taxonomy names file:\n\n    tax_id                  -- the id of node associated with this name\n    name_txt                -- name itself\n    unique name             -- the unique variant of this name if name not unique\n    name class              -- (scientific name, synonym, common name, ...)\n\n`name class` **has** to be `scientific name` to be included in the build. 
All other lines are ignored.\n\n#### Example\n\n*Conversion table `ex.conv`*:\n\n    Seq1\t11\n    Seq2\t12\n    Seq3\t13\n    Seq4\t11\n\n\n*Taxonomy tree `ex.tree`*:\n\n    1\t|\t1\t|\troot\n    10\t|\t1\t|\tkingdom\n    11\t|\t10\t|\tspecies\n    12\t|\t10\t|\tspecies\n    13\t|\t1\t|\tspecies\n\n*Name table `ex.name`*:\n\n    1\t|\troot\t|\t\t|\tscientific name\t|\n    10\t|\tBacteria\t|\t\t|\tscientific name\t|\n    11\t|\tBacterium A\t|\t\t|\tscientific name\t|\n    12\t|\tBacterium B\t|\t\t|\tscientific name\t|\n    13\t|\tSome other species\t|\t\t|\tscientific name\t|\n\n*Reference sequences `ex.fa`*:\n\n    >Seq1\n    AAAACGTACGA.....\n    >Seq2\n    AAAACGTACGA.....\n    >Seq3\n    AAAACGTACGA.....\n    >Seq4\n    AAAACGTACGA.....\n\nTo build the database, call\n\n    centrifuge-build --conversion-table ex.conv \\\n                     --taxonomy-tree ex.tree --name-table ex.name \\\n                     ex.fa ex\n\nwhich results in three index files named `ex.1.cf`, `ex.2.cf` and `ex.3.cf`.\n\n\n### Centrifuge classification output\n\nThe following example shows classification assignments for a read.  
The assignment output has 8 columns.\n\n    readID    seqID   taxID score\t   2ndBestScore\t   hitLength\tqueryLength\tnumMatches\n    1_1\t      gi|4    9646  4225\t   0\t\t       80\t80\t\t1\n\n    The first column is the read ID from a raw sequencing read (e.g., 1_1 in the example).\n    The second column is the sequence ID of the genomic sequence to which the read is classified (e.g., gi|4).\n    The third column is the taxonomic ID of the genomic sequence in the second column (e.g., 9646).\n    The fourth column is the score for the classification, which is the weighted sum of hits (e.g., 4225).\n    The fifth column is the score for the next best classification (e.g., 0).\n    The sixth column is an approximate number of base pairs of the read that match the genomic sequence (e.g., 80).\n    The seventh column is the length of a read or the combined length of mate pairs (e.g., 80).\n    The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).\n\n### Centrifuge summary output (the default filename is centrifuge_report.tsv)\n\nThe following example shows a classification summary for each genome or taxonomic unit.  
The summary output has 7 columns.\n\n    name      \t      \t      \t\t     \t     \t      \t     \ttaxID\ttaxRank\t   genomeSize \tnumReads   numUniqueReads   abundance\n    Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis\t36870\tleaf\t   703004\t\t5981\t   5964\t            0.0152317\n\n    The first column is the name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).\n    The second column is the taxonomic ID (e.g., 36870).\n    The third column is the taxonomic rank (e.g., leaf).\n    The fourth column is the length of the genome sequence (e.g., 703004).\n    The fifth column is the number of reads classified to this genomic sequence including multi-classified reads (e.g., 5981).\n    The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).\n    The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).\n\nAs the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments.  In order to perform more conservative analyses, users may want to discard assignments for reads whose matching length (hitLength, the 6th column in the output of Centrifuge) is 40% or less of the read length (queryLength, the 7th column).  It may also be helpful to use the score (4th column) for filtering out some assignments.  Our future research plans include developing methods that estimate confidence scores for assignments.\n\n### Kraken-style report\n\n`centrifuge-kreport` can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:\n\n`centrifuge-kreport -x <centrifuge index> <centrifuge out file>`\n\n\nInspecting the Centrifuge index\n-----------------------\n\nThe index can be inspected with `centrifuge-inspect`.  
To extract raw sequences:\n\n    centrifuge-inspect <centrifuge index>\n\nExtract the sequence ID to taxonomy ID conversion table from the index\n\n    centrifuge-inspect --conversion-table <centrifuge index>\n\nExtract the taxonomy tree from the index:\n\n    centrifuge-inspect --taxonomy-tree <centrifuge index>\n\nExtract the lengths of the sequences from the index (each row has two columns: taxonomic ID and length):\n\n    centrifuge-inspect --size-table <centrifuge index>\n\nExtract the names from the index (each row has two columns: taxonomic ID and name):\n\n    centrifuge-inspect --name-table <centrifuge index>\n    \n\n\nWrapper\n-------\n\nThe `centrifuge`, `centrifuge-build` and `centrifuge-inspect` executables are actually \nwrapper scripts that call binary programs as appropriate. Also, the `centrifuge` wrapper\nprovides some key functionality, like the ability to handle compressed inputs,\nand the functionality for [`--un`], [`--al`] and related options.\n\nIt is recommended that you always run the centrifuge wrappers and not run the\nbinaries directly.\n\nPerformance tuning\n------------------\n\n1.  If your computer has multiple processors/cores, use `-p NTHREADS`\n\n    The [`-p`] option causes Centrifuge to launch a specified number of parallel\n    search threads.  Each thread runs on a different processor/core and all\n    threads find alignments in parallel, increasing alignment throughput by\n    approximately a multiple of the number of threads (though in practice,\n    speedup is somewhat worse than linear).\n\nCommand Line\n------------\n\n\n### Usage\n\n    centrifuge [options]* -x <centrifuge-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [--report-file <report file name> -S <classification output file name>]\n\n### Main arguments\n\n<table><tr><td>\n\n[`-x`]: #centrifuge-options-x\n\n    -x <centrifuge-idx>\n\n</td><td>\n\nThe basename of the index for the reference genomes.  
The basename is the name of\nany of the index files up to but not including the final `.1.cf` / etc.  \n`centrifuge` looks for the specified index first in the current directory,\nthen in the directory specified in the `CENTRIFUGE_INDEXES` environment variable.\n\n</td></tr><tr><td>\n\n[`-1`]: #centrifuge-options-1\n\n    -1 <m1>\n\n</td><td>\n\nComma-separated list of files containing mate 1s (filename usually includes\n`_1`), e.g. `-1 flyA_1.fq,flyB_1.fq`.  Sequences specified with this option must\ncorrespond file-for-file and read-for-read with those specified in `<m2>`. Reads\nmay be a mix of different lengths. If `-` is specified, `centrifuge` will read the\nmate 1s from the \"standard in\" or \"stdin\" filehandle.\n\n</td></tr><tr><td>\n\n[`-2`]: #centrifuge-options-2\n\n    -2 <m2>\n\n</td><td>\n\nComma-separated list of files containing mate 2s (filename usually includes\n`_2`), e.g. `-2 flyA_2.fq,flyB_2.fq`.  Sequences specified with this option must\ncorrespond file-for-file and read-for-read with those specified in `<m1>`. Reads\nmay be a mix of different lengths. If `-` is specified, `centrifuge` will read the\nmate 2s from the \"standard in\" or \"stdin\" filehandle.\n\n</td></tr><tr><td>\n\n[`-U`]: #centrifuge-options-U\n\n    -U <r>\n\n</td><td>\n\nComma-separated list of files containing unpaired reads to be aligned, e.g.\n`lane1.fq,lane2.fq,lane3.fq,lane4.fq`.  Reads may be a mix of different lengths.\nIf `-` is specified, `centrifuge` gets the reads from the \"standard in\" or \"stdin\"\nfilehandle.\n\n</td></tr><tr><td>\n\n[`--sra-acc`]: #hisat2-options-sra-acc\n\n    --sra-acc <SRA accession number>\n\n</td><td>\n\nComma-separated list of SRA accession numbers, e.g. `--sra-acc SRR353653,SRR353654`.\nInformation about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=<b>sra-acc</b>&retmode=xml,\nwhere <b>sra-acc</b> is SRA accession number.  
If users run Centrifuge on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at [SRA-MANUAL]).\n\n[SRA-MANUAL]:\t     https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration\n\n</td></tr><tr><td>\n\n[`-S`]: #centrifuge-options-S\n\n    -S <filename>\n\n</td><td>\n\nFile to write classification results to.  By default, assignments are written to the\n\"standard out\" or \"stdout\" filehandle (i.e. the console).\n\n</td></tr><tr><td>\n\n[`--report-file`]: #centrifuge-options-report-file\n\n    --report-file <filename>\n\n</td><td>\n\nFile to write a classification summary to (default: centrifuge_report.tsv).\n\n</td></tr></table>\n\n### Options\n\n#### Input options\n\n<table>\n<tr><td id=\"centrifuge-options-q\">\n\n[`-q`]: #centrifuge-options-q\n\n    -q\n\n</td><td>\n\nReads (specified with `<m1>`, `<m2>`, `<s>`) are FASTQ files.  FASTQ files\nusually have extension `.fq` or `.fastq`.  FASTQ is the default format.  See\nalso: [`--solexa-quals`] and [`--int-quals`].\n\n</td></tr>\n<tr><td id=\"centrifuge-options-qseq\">\n\n[`--qseq`]: #centrifuge-options-qseq\n\n    --qseq\n\n</td><td>\n\nReads (specified with `<m1>`, `<m2>`, `<s>`) are QSEQ files.  QSEQ files usually\nend in `_qseq.txt`.  See also: [`--solexa-quals`] and [`--int-quals`].\n\n</td></tr>\n<tr><td id=\"centrifuge-options-f\">\n\n[`-f`]: #centrifuge-options-f\n\n    -f\n\n</td><td>\n\nReads (specified with `<m1>`, `<m2>`, `<s>`) are FASTA files.  FASTA files\nusually have extension `.fa`, `.fasta`, `.mfa`, `.fna` or similar.  FASTA files\ndo not have a way of specifying quality values, so when `-f` is set, the result\nis as if `--ignore-quals` is also set.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-r\">\n\n[`-r`]: #centrifuge-options-r\n\n    -r\n\n</td><td>\n\nReads (specified with `<m1>`, `<m2>`, `<s>`) are files with one input sequence\nper line, without any other information (no read names, no qualities).  
When\n`-r` is set, the result is as if `--ignore-quals` is also set.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-c\">\n\n[`-c`]: #centrifuge-options-c\n\n    -c\n\n</td><td>\n\nThe read sequences are given on the command line.  I.e. `<m1>`, `<m2>` and\n`<singles>` are comma-separated lists of reads rather than lists of read files.\nThere is no way to specify read names or qualities, so `-c` also implies\n`--ignore-quals`.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-s\">\n\n[`-s`/`--skip`]: #centrifuge-options-s\n[`-s`]: #centrifuge-options-s\n\n    -s/--skip <int>\n\n</td><td>\n\nSkip (i.e. do not align) the first `<int>` reads or pairs in the input.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-u\">\n\n[`-u`/`--qupto`]: #centrifuge-options-u\n[`-u`]: #centrifuge-options-u\n\n    -u/--qupto <int>\n\n</td><td>\n\nAlign the first `<int>` reads or read pairs from the input (after the\n[`-s`/`--skip`] reads or pairs have been skipped), then stop.  Default: no limit.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-5\">\n\n[`-5`/`--trim5`]: #centrifuge-options-5\n[`-5`]: #centrifuge-options-5\n\n    -5/--trim5 <int>\n\n</td><td>\n\nTrim `<int>` bases from the 5' (left) end of each read before alignment (default: 0).\n\n</td></tr>\n<tr><td id=\"centrifuge-options-3\">\n\n[`-3`/`--trim3`]: #centrifuge-options-3\n[`-3`]: #centrifuge-options-3\n\n    -3/--trim3 <int>\n\n</td><td>\n\nTrim `<int>` bases from the 3' (right) end of each read before alignment (default:\n0).\n\n</td></tr><tr><td id=\"centrifuge-options-phred33-quals\">\n\n[`--phred33`]: #centrifuge-options-phred33-quals\n\n    --phred33\n\n</td><td>\n\nInput qualities are ASCII chars equal to the [Phred quality] plus 33.  
This is\nalso called the \"Phred+33\" encoding, which is used by the very latest Illumina\npipelines.\n\n[Phred quality]: http://en.wikipedia.org/wiki/Phred_quality_score\n\n</td></tr>\n<tr><td id=\"centrifuge-options-phred64-quals\">\n\n[`--phred64`]: #centrifuge-options-phred64-quals\n\n    --phred64\n\n</td><td>\n\nInput qualities are ASCII chars equal to the [Phred quality] plus 64.  This is\nalso called the \"Phred+64\" encoding.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-solexa-quals\">\n\n[`--solexa-quals`]: #centrifuge-options-solexa-quals\n\n    --solexa-quals\n\n</td><td>\n\nConvert input qualities from [Solexa][Phred quality] (which can be negative) to\n[Phred][Phred quality] (which can't).  This scheme was used in older Illumina GA\nPipeline versions (prior to 1.3).  Default: off.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-int-quals\">\n\n[`--int-quals`]: #centrifuge-options-int-quals\n\n    --int-quals\n\n</td><td>\n\nQuality values are represented in the read input file as space-separated ASCII\nintegers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....\n Integers are treated as being on the [Phred quality] scale unless\n[`--solexa-quals`] is also specified. Default: off.\n\n</td></tr></table>\n\n#### Classification\n\n<table>\n\n<tr><td id=\"centrifuge-options-min-hitlen\">\n\n[`--min-hitlen`]: #centrifuge-options-min-hitlen\n\n    --min-hitlen <int>\n\n</td><td>\n\nMinimum length of partial hits, which must be greater than 15 (default: 22).\n\n</td></tr>\n\n<tr><td id=\"centrifuge-options-k\">\n\n[`-k`]: #centrifuge-options-k\n\n    -k <int>\n\n</td><td>\n\nIt searches for at most `<int>` distinct, primary assignments for each read or pair.  
\nPrimary assignments are those whose assignment score is equal to or higher than that of any other assignment.\nIf there are more primary assignments than this value,\nthe search will merge some of the assignments into a higher taxonomic rank.\nThe assignment score for a paired-end assignment equals the sum of the assignment scores of the individual mates.\nDefault: 5\n\n</td></tr>\n\n<tr><td id=\"centrifuge-options-host-taxids\">\n\n[`--host-taxids`]: #centrifuge-options-host-taxids\n\n    --host-taxids\n\n</td><td>\n\nA comma-separated list of taxonomic IDs that will be preferred in the classification procedure.\nThe descendants of these IDs will also be preferred.  In case some of a read's assignments correspond to\nthese taxonomic IDs, only those corresponding assignments will be reported.\n\n</td></tr>\n\n<tr><td id=\"centrifuge-options-exclude-taxids\">\n\n[`--exclude-taxids`]: #centrifuge-options-exclude-taxids\n\n    --exclude-taxids\n\n</td><td>\n\nA comma-separated list of taxonomic IDs that will be excluded in the classification procedure.\nThe descendants of these IDs will also be excluded.\n\n</td></tr>\n\n</table>\n\n\n<!--\n#### Alignment options\n\n<table>\n\n<tr><td id=\"centrifuge-options-n-ceil\">\n\n[`--n-ceil`]: #centrifuge-options-n-ceil\n\n    --n-ceil <func>\n\n</td><td>\n\nSets a function governing the maximum number of ambiguous characters (usually\n`N`s and/or `.`s) allowed in a read as a function of read length.  For instance,\nspecifying `-L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`,\nwhere x is the read length.  See also: [setting function options].  Reads\nexceeding this ceiling are [filtered out].  
Default: `L,0,0.15`.\n\n[filtered out]: #filtering\n\n</td></tr>\n\n<tr><td id=\"centrifuge-options-ignore-quals\">\n\n[`--ignore-quals`]: #centrifuge-options-ignore-quals\n\n    --ignore-quals\n\n</td><td>\n\nWhen calculating a mismatch penalty, always consider the quality value at the\nmismatched position to be the highest possible, regardless of the actual value. \nI.e. input is treated as though all quality values are high.  This is also the\ndefault behavior when the input doesn't specify quality values (e.g. in [`-f`],\n[`-r`], or [`-c`] modes).\n\n</td></tr>\n<tr><td id=\"centrifuge-options-nofw\">\n\n[`--nofw`]: #centrifuge-options-nofw\n\n    --nofw/--norc\n\n</td><td>\n\nIf `--nofw` is specified, `centrifuge` will not attempt to align unpaired reads to\nthe forward (Watson) reference strand.  If `--norc` is specified, `centrifuge` will\nnot attempt to align unpaired reads against the reverse-complement (Crick)\nreference strand. In paired-end mode, `--nofw` and `--norc` pertain to the\nfragments; i.e. specifying `--nofw` causes `centrifuge` to explore only those\npaired-end configurations corresponding to fragments from the reverse-complement\n(Crick) strand.  Default: both strands enabled. \n\n</td></tr>\n\n</table>\n\n#### Paired-end options\n\n<table>\n\n<tr><td id=\"centrifuge-options-fr\">\n\n[`--fr`/`--rf`/`--ff`]: #centrifuge-options-fr\n[`--fr`]: #centrifuge-options-fr\n[`--rf`]: #centrifuge-options-fr\n[`--ff`]: #centrifuge-options-fr\n\n    --fr/--rf/--ff\n\n</td><td>\n\nThe upstream/downstream mate orientations for a valid paired-end alignment\nagainst the forward reference strand.  E.g., if `--fr` is specified and there is\na candidate paired-end alignment where mate 1 appears upstream of the reverse\ncomplement of mate 2 and the fragment length constraints ([`-I`] and [`-X`]) are\nmet, that alignment is valid.  
Also, if mate 2 appears upstream of the reverse\ncomplement of mate 1 and all other constraints are met, that too is valid.\n`--rf` likewise requires that an upstream mate1 be reverse-complemented and a\ndownstream mate2 be forward-oriented. ` --ff` requires both an upstream mate 1\nand a downstream mate 2 to be forward-oriented.  Default: `--fr` (appropriate\nfor Illumina's Paired-end Sequencing Assay).\n\n</td></tr></table>\n-->\n\n#### Output options\n\n<table>\n\n<tr><td id=\"centrifuge-options-t\">\n\n[`-t`/`--time`]: #centrifuge-options-t\n[`-t`]: #centrifuge-options-t\n\n    -t/--time\n\n</td><td>\n\nPrint the wall-clock time required to load the index files and align the reads. \nThis is printed to the \"standard error\" (\"stderr\") filehandle.  Default: off.\n\n</td></tr>\n\n<!--\n<tr><td id=\"centrifuge-options-un\">\n\n[`--un`]: #centrifuge-options-un\n[`--un-gz`]: #centrifuge-options-un\n[`--un-bz2`]: #centrifuge-options-un\n\n    --un <path>\n    --un-gz <path>\n    --un-bz2 <path>\n\n</td><td>\n\nWrite unpaired reads that fail to align to file at `<path>`.  These reads\ncorrespond to the SAM records with the FLAGS `0x4` bit set and neither the\n`0x40` nor `0x80` bits set.  If `--un-gz` is specified, output will be gzip\ncompressed. If `--un-bz2` is specified, output will be bzip2 compressed.  Reads\nwritten in this way will appear exactly as they did in the input file, without\nany modification (same sequence, same name, same quality string, same quality\nencoding).  Reads will not necessarily appear in the same order as they did in\nthe input.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-al\">\n\n[`--al`]: #centrifuge-options-al\n[`--al-gz`]: #centrifuge-options-al\n[`--al-bz2`]: #centrifuge-options-al\n\n    --al <path>\n    --al-gz <path>\n    --al-bz2 <path>\n\n</td><td>\n\nWrite unpaired reads that align at least once to file at `<path>`.  These reads\ncorrespond to the SAM records with the FLAGS `0x4`, `0x40`, and `0x80` bits\nunset.  
If `--al-gz` is specified, output will be gzip compressed. If `--al-bz2`\nis specified, output will be bzip2 compressed.  Reads written in this way will\nappear exactly as they did in the input file, without any modification (same\nsequence, same name, same quality string, same quality encoding).  Reads will\nnot necessarily appear in the same order as they did in the input.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-un-conc\">\n\n[`--un-conc`]: #centrifuge-options-un-conc\n[`--un-conc-gz`]: #centrifuge-options-un-conc\n[`--un-conc-bz2`]: #centrifuge-options-un-conc\n\n    --un-conc <path>\n    --un-conc-gz <path>\n    --un-conc-bz2 <path>\n\n</td><td>\n\nWrite paired-end reads that fail to align concordantly to file(s) at `<path>`.\nThese reads correspond to the SAM records with the FLAGS `0x4` bit set and\neither the `0x40` or `0x80` bit set (depending on whether it's mate #1 or #2).\n`.1` and `.2` strings are added to the filename to distinguish which file\ncontains mate #1 and mate #2.  If a percent symbol, `%`, is used in `<path>`,\nthe percent symbol is replaced with `1` or `2` to make the per-mate filenames.\nOtherwise, `.1` or `.2` are added before the final dot in `<path>` to make the\nper-mate filenames.  Reads written in this way will appear exactly as they did\nin the input files, without any modification (same sequence, same name, same\nquality string, same quality encoding).  Reads will not necessarily appear in\nthe same order as they did in the inputs.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-al-conc\">\n\n[`--al-conc`]: #centrifuge-options-al-conc\n[`--al-conc-gz`]: #centrifuge-options-al-conc\n[`--al-conc-bz2`]: #centrifuge-options-al-conc\n\n    --al-conc <path>\n    --al-conc-gz <path>\n    --al-conc-bz2 <path>\n\n</td><td>\n\nWrite paired-end reads that align concordantly at least once to file(s) at\n`<path>`. 
These reads correspond to the SAM records with the FLAGS `0x4` bit\nunset and either the `0x40` or `0x80` bit set (depending on whether it's mate #1\nor #2). `.1` and `.2` strings are added to the filename to distinguish which\nfile contains mate #1 and mate #2.  If a percent symbol, `%`, is used in\n`<path>`, the percent symbol is replaced with `1` or `2` to make the per-mate\nfilenames. Otherwise, `.1` or `.2` are added before the final dot in `<path>` to\nmake the per-mate filenames.  Reads written in this way will appear exactly as\nthey did in the input files, without any modification (same sequence, same name,\nsame quality string, same quality encoding).  Reads will not necessarily appear\nin the same order as they did in the inputs.\n\n</td></tr>\n-->\n\n<tr><td id=\"centrifuge-options-quiet\">\n\n[`--quiet`]: #centrifuge-options-quiet\n\n    --quiet\n\n</td><td>\n\nPrint nothing besides alignments and serious errors.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-met-file\">\n\n[`--met-file`]: #centrifuge-options-met-file\n\n    --met-file <path>\n\n</td><td>\n\nWrite `centrifuge` metrics to file `<path>`.  Having alignment metric can be useful\nfor debugging certain problems, especially performance issues.  See also:\n[`--met`].  Default: metrics disabled.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-met-stderr\">\n\n[`--met-stderr`]: #centrifuge-options-met-stderr\n\n    --met-stderr\n\n</td><td>\n\nWrite `centrifuge` metrics to the \"standard error\" (\"stderr\") filehandle.  This is\nnot mutually exclusive with [`--met-file`].  Having alignment metric can be\nuseful for debugging certain problems, especially performance issues.  See also:\n[`--met`].  Default: metrics disabled.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-met\">\n\n[`--met`]: #centrifuge-options-met\n\n    --met <int>\n\n</td><td>\n\nWrite a new `centrifuge` metrics record every `<int>` seconds.  Only matters if\neither [`--met-stderr`] or [`--met-file`] are specified.  
Default: 1.\n\n</td></tr>\n</table>\n\n#### Performance options\n\n<table><tr>\n\n<td id=\"centrifuge-options-o\">\n\n[`-o`/`--offrate`]: #centrifuge-options-o\n[`-o`]: #centrifuge-options-o\n[`--offrate`]: #centrifuge-options-o\n\n    -o/--offrate <int>\n\n</td><td>\n\nOverride the offrate of the index with `<int>`.  If `<int>` is greater\nthan the offrate used to build the index, then some row markings are\ndiscarded when the index is read into memory.  This reduces the memory\nfootprint of the aligner but requires more time to calculate text\noffsets.  `<int>` must be greater than the value used to build the\nindex.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-p\">\n\n[`-p`/`--threads`]: #centrifuge-options-p\n[`-p`]: #centrifuge-options-p\n\n    -p/--threads NTHREADS\n\n</td><td>\n\nLaunch `NTHREADS` parallel search threads (default: 1).  Threads will run on\nseparate processors/cores and synchronize when parsing reads and outputting\nalignments.  Searching for alignments is highly parallel, and speedup is close\nto linear.  Increasing `-p` increases Centrifuge's memory footprint. E.g. when\naligning to a human genome index, increasing `-p` from 1 to 8 increases the\nmemory footprint by a few hundred megabytes.  This option is only available if\n`centrifuge` is linked with the `pthreads` library (i.e. if `BOWTIE_PTHREADS=0` is\nnot specified at build time).\n\n</td></tr>\n<tr><td id=\"centrifuge-options-reorder\">\n\n[`--reorder`]: #centrifuge-options-reorder\n\n    --reorder\n\n</td><td>\n\nGuarantees that output records are printed in an order corresponding to the\norder of the reads in the original input file, even when [`-p`] is set greater\nthan 1.  Specifying `--reorder` and setting [`-p`] greater than 1 causes Centrifuge\nto run somewhat slower and use somewhat more memory than if `--reorder` were\nnot specified.  
Has no effect if [`-p`] is set to 1, since output order will\nnaturally correspond to input order in that case.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-mm\">\n\n[`--mm`]: #centrifuge-options-mm\n\n    --mm\n\n</td><td>\n\nUse memory-mapped I/O to load the index, rather than typical file I/O.\nMemory-mapping allows many concurrent `centrifuge` processes on the same computer to\nshare the same memory image of the index (i.e. you pay the memory overhead just\nonce).  This facilitates memory-efficient parallelization of `centrifuge` in\nsituations where using [`-p`] is not possible or not preferable.\n\n</td></tr></table>\n\n#### Other options\n\n<table>\n<tr><td id=\"centrifuge-options-qc-filter\">\n\n[`--qc-filter`]: #centrifuge-options-qc-filter\n\n    --qc-filter\n\n</td><td>\n\nFilter out reads for which the QSEQ filter field is non-zero.  Only has an\neffect when read format is [`--qseq`].  Default: off.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-seed\">\n\n[`--seed`]: #centrifuge-options-seed\n\n    --seed <int>\n\n</td><td>\n\nUse `<int>` as the seed for the pseudo-random number generator.  Default: 0.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-non-deterministic\">\n\n[`--non-deterministic`]: #centrifuge-options-non-deterministic\n\n    --non-deterministic\n\n</td><td>\n\nNormally, Centrifuge re-initializes its pseudo-random generator for each read.  It\nseeds the generator with a number derived from (a) the read name, (b) the\nnucleotide sequence, (c) the quality sequence, (d) the value of the [`--seed`]\noption.  This means that if two reads are identical (same name, same\nnucleotides, same qualities) Centrifuge will find and report the same classification(s)\nfor both, even if there was ambiguity.  When `--non-deterministic` is specified,\nCentrifuge re-initializes its pseudo-random generator for each read using the\ncurrent time.  This means that Centrifuge will not necessarily report the same\nclassification for two identical reads.  
This is counter-intuitive for some users,\nbut might be more appropriate in situations where the input consists of many\nidentical reads.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-version\">\n\n[`--version`]: #centrifuge-options-version\n\n    --version\n\n</td><td>\n\nPrint version information and quit.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-h\">\n\n    -h/--help\n\n</td><td>\n\nPrint usage information and quit.\n\n</td></tr></table>\n\n\nThe `centrifuge-build` indexer\n===========================\n\n`centrifuge-build` builds a Centrifuge index from a set of DNA sequences.\n`centrifuge-build` outputs a set of 3 files with suffixes `.1.cf`, `.2.cf`, and\n`.3.cf`.  These files together\nconstitute the index: they are all that is needed to align reads to that\nreference.  The original sequence FASTA files are no longer used by Centrifuge\nonce the index is built.\n\nUse of Karkkainen's [blockwise algorithm] allows `centrifuge-build` to trade off\nbetween running time and memory usage. `centrifuge-build` has two options\ngoverning how it makes this trade: [`--bmax`]/[`--bmaxdivn`],\nand [`--dcv`].  By default, `centrifuge-build` will automatically search for the\nsettings that yield the best running time without exhausting memory.  This\nbehavior can be disabled using the [`-a`/`--noauto`] option.\n\nThe indexer provides options pertaining to the \"shape\" of the index, e.g.\n[`--offrate`](#centrifuge-build-options-o) governs the fraction of [Burrows-Wheeler]\nrows that are \"marked\" (i.e., the density of the suffix-array sample; see the\noriginal [FM Index] paper for details).  All of these options are potentially\nprofitable trade-offs depending on the application.  They have been set to\ndefaults that are reasonable for most cases according to our experiments.  See\n[Performance tuning] for details.\n\nThe Centrifuge index is based on the [FM Index] of Ferragina and Manzini, which in\nturn is based on the [Burrows-Wheeler] transform.  
The algorithm used to build\nthe index is based on the [blockwise algorithm] of Karkkainen.\n\n[Blockwise algorithm]: http://portal.acm.org/citation.cfm?id=1314852\n[Burrows-Wheeler]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\n[Performance tuning]: #performance-tuning\n\nCommand Line\n------------\n\nUsage:\n\n    centrifuge-build [options]* --conversion-table <table_in> --taxonomy-tree <taxonomy_in> --name-table <table_in2> <reference_in> <cf_base>\n\n### Main arguments\n\n<table><tr><td>\n\n    <reference_in>\n\n</td><td>\n\nA comma-separated list of FASTA files containing the reference sequences to be\naligned to, or, if [`-c`](#centrifuge-build-options-c) is specified, the sequences\nthemselves. E.g., `<reference_in>` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`,\nor, if [`-c`](#centrifuge-build-options-c) is specified, this might be\n`GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA`.\n\n</td></tr><tr><td>\n\n    <cf_base>\n\n</td><td>\n\nThe basename of the index files to write.  By default, `centrifuge-build` writes\nfiles named `NAME.1.cf`, `NAME.2.cf`, and `NAME.3.cf`, where `NAME` is `<cf_base>`.\n\n</td></tr></table>\n\n### Options\n\n<table><tr><td>\n\n    -f\n\n</td><td>\n\nThe reference input files (specified as `<reference_in>`) are FASTA files\n(usually having extension `.fa`, `.mfa`, `.fna` or similar).\n\n</td></tr><tr><td id=\"centrifuge-build-options-c\">\n\n    -c\n\n</td><td>\n\nThe reference sequences are given on the command line.  I.e. `<reference_in>` is\na comma-separated list of sequences rather than a list of FASTA files.\n\n</td></tr>\n<tr><td id=\"centrifuge-build-options-a\">\n\n[`-a`/`--noauto`]: #centrifuge-build-options-a\n\n    -a/--noauto\n\n</td><td>\n\nDisable the default behavior whereby `centrifuge-build` automatically selects\nvalues for the [`--bmax`], [`--dcv`] and [`--packed`] parameters according to\navailable memory.  Instead, user may specify values for those parameters.  
If\nmemory is exhausted during indexing, an error message will be printed; it is up\nto the user to try new parameters.\n\n</td></tr><tr><td id=\"centrifuge-build-options-p\">\n\n[`-p`]: #centrifuge-build-options-p\n\n    -p/--threads <int>\n\n</td><td>\n\nLaunch `<int>` parallel threads (default: 1).\n\n</td></tr><tr><td id=\"centrifuge-build-options-conversion-table\">\n\n[`--conversion-table`]: #centrifuge-build-options-conversion-table\n\n    --conversion-table <file>\n\n</td><td>\n\nList of UIDs (unique IDs) and corresponding taxonomic IDs.\n\n</td></tr><tr><td id=\"centrifuge-build-options-taxonomy-tree\">\n\n[`--taxonomy-tree`]: #centrifuge-build-options-taxonomy-tree\n\n    --taxonomy-tree <file>\n\n</td><td>\n\nTaxonomic tree (e.g. nodes.dmp).\n\n</td></tr><tr><td id=\"centrifuge-build-options-name-table\">\n\n[`--name-table`]: #centrifuge-build-options-name-table\n\n    --name-table <file>\n\n</td><td>\n\nName table (e.g. names.dmp).\n\n</td></tr><tr><td id=\"centrifuge-build-options-size-table\">\n\n[`--size-table`]: #centrifuge-build-options-size-table\n\n    --size-table <file>\n\n</td><td>\n\nList of taxonomic IDs and lengths of the sequences belonging to each taxonomic ID.\n\n</td></tr><tr><td id=\"centrifuge-build-options-bmax\">\n\n[`--bmax`]: #centrifuge-build-options-bmax\n\n    --bmax <int>\n\n</td><td>\n\nThe maximum number of suffixes allowed in a block.  Allowing more suffixes per\nblock makes indexing faster, but increases peak memory usage.  Setting this\noption overrides any previous setting for [`--bmax`], or [`--bmaxdivn`]. \nDefault (in terms of the [`--bmaxdivn`] parameter) is [`--bmaxdivn`] 4.  
This is\nconfigured automatically by default; use [`-a`/`--noauto`] to configure manually.\n\n</td></tr><tr><td id=\"centrifuge-build-options-bmaxdivn\">\n\n[`--bmaxdivn`]: #centrifuge-build-options-bmaxdivn\n\n    --bmaxdivn <int>\n\n</td><td>\n\nThe maximum number of suffixes allowed in a block, expressed as a fraction of\nthe length of the reference.  Setting this option overrides any previous setting\nfor [`--bmax`], or [`--bmaxdivn`].  Default: [`--bmaxdivn`] 4.  This is\nconfigured automatically by default; use [`-a`/`--noauto`] to configure manually.\n\n</td></tr><tr><td id=\"centrifuge-build-options-dcv\">\n\n[`--dcv`]: #centrifuge-build-options-dcv\n\n    --dcv <int>\n\n</td><td>\n\nUse `<int>` as the period for the difference-cover sample.  A larger period\nyields less memory overhead, but may make suffix sorting slower, especially if\nrepeats are present.  Must be a power of 2 no greater than 4096.  Default: 1024.\n This is configured automatically by default; use [`-a`/`--noauto`] to configure\nmanually.\n\n</td></tr><tr><td id=\"centrifuge-build-options-nodc\">\n\n[`--nodc`]: #centrifuge-build-options-nodc\n\n    --nodc\n\n</td><td>\n\nDisable use of the difference-cover sample.  Suffix sorting becomes\nquadratic-time in the worst case (where the worst case is an extremely\nrepetitive reference).  Default: off.\n\n</td></tr><tr><td id=\"centrifuge-build-options-o\">\n\n    -o/--offrate <int>\n\n</td><td>\n\nTo map alignments back to positions on the reference sequences, it's necessary\nto annotate (\"mark\") some or all of the [Burrows-Wheeler] rows with their\ncorresponding location on the genome. \n[`-o`/`--offrate`](#centrifuge-build-options-o) governs how many rows get marked:\nthe indexer will mark every 2^`<int>` rows.  Marking more rows makes\nreference-position lookups faster, but requires more memory to hold the\nannotations at runtime.  The default is 4 (every 16th row is marked; for human\ngenome, annotations occupy about 680 megabytes).  
\n\n</td></tr><tr><td>\n\n    -t/--ftabchars <int>\n\n</td><td>\n\nThe ftab is the lookup table used to calculate an initial [Burrows-Wheeler]\nrange with respect to the first `<int>` characters of the query.  A larger\n`<int>` yields a larger lookup table but faster query times.  The ftab has size\n4^(`<int>`+1) bytes.  The default setting is 10 (ftab is 4MB).\n\n</td></tr><tr><td>\n\n    --seed <int>\n\n</td><td>\n\nUse `<int>` as the seed for pseudo-random number generator.\n\n</td></tr><tr><td>\n\n    --kmer-count <int>\n\n</td><td>\n\nUse `<int>` as kmer-size for counting the distinct number of k-mers in the input sequences.\n\n</td></tr><tr><td>\n\n    -q/--quiet\n\n</td><td>\n\n`centrifuge-build` is verbose by default.  With this option `centrifuge-build` will\nprint only error messages.\n\n</td></tr><tr><td>\n\n    -h/--help\n\n</td><td>\n\nPrint usage information and quit.\n\n</td></tr><tr><td>\n\n    --version\n\n</td><td>\n\nPrint version information and quit.\n\n</td></tr></table>\n\nThe `centrifuge-inspect` index inspector\n=====================================\n\n`centrifuge-inspect` extracts information from a Centrifuge index about what kind of\nindex it is and what reference sequences were used to build it. When run without\nany options, the tool will output a FASTA file containing the sequences of the\noriginal references (with all non-`A`/`C`/`G`/`T` characters converted to `N`s).\n It can also be used to extract just the reference sequence names using the\n[`-n`/`--names`] option or a more verbose summary using the [`-s`/`--summary`]\noption.\n\nCommand Line\n------------\n\nUsage:\n\n    centrifuge-inspect [options]* <cf_base>\n\n### Main arguments\n\n<table><tr><td>\n\n    <cf_base>\n\n</td><td>\n\nThe basename of the index to be inspected.  
The basename is the name of any of the\nindex files but with the `.X.cf` suffix omitted.\n`centrifuge-inspect` first looks in the current directory for the index files, then\nin the directory specified in the `Centrifuge_INDEXES` environment variable.\n\n</td></tr></table>\n\n### Options\n\n<table><tr><td>\n\n    -a/--across <int>\n\n</td><td>\n\nWhen printing FASTA output, output a newline character every `<int>` bases\n(default: 60).\n\n</td></tr><tr><td id=\"centrifuge-inspect-options-n\">\n\n[`-n`/`--names`]: #centrifuge-inspect-options-n\n\n    -n/--names\n\n</td><td>\n\nPrint reference sequence names, one per line, and quit.\n\n</td></tr><tr><td id=\"centrifuge-inspect-options-s\">\n\n[`-s`/`--summary`]: #centrifuge-inspect-options-s\n\n    -s/--summary\n\n</td><td>\n\nPrint a summary that includes information about index settings, as well as the\nnames and lengths of the input sequences.  The summary has this format: \n\n    Colorspace\t<0 or 1>\n    SA-Sample\t1 in <sample>\n    FTab-Chars\t<chars>\n    Sequence-1\t<name>\t<len>\n    Sequence-2\t<name>\t<len>\n    ...\n    Sequence-N\t<name>\t<len>\n\nFields are separated by tabs.  
Colorspace is always set to 0 for Centrifuge.\n\n</td></tr><tr><td id=\"centrifuge-inspect-options-conversion-table\">\n\n[`--conversion-table`]: #centrifuge-inspect-options-conversion-table\n\n    --conversion-table\n\n</td><td>\n\nPrint a list of UIDs (unique IDs) and corresponding taxonomic IDs.\n\n</td></tr><tr><td id=\"centrifuge-inspect-options-taxonomy-tree\">\n\n[`--taxonomy-tree`]: #centrifuge-inspect-options-taxonomy-tree\n\n    --taxonomy-tree\n\n</td><td>\n\nPrint the taxonomic tree.\n\n</td></tr><tr><td id=\"centrifuge-inspect-options-name-table\">\n\n[`--name-table`]: #centrifuge-inspect-options-name-table\n\n    --name-table\n\n</td><td>\n\nPrint the name table.\n\n</td></tr><tr><td id=\"centrifuge-inspect-options-size-table\">\n\n[`--size-table`]: #centrifuge-inspect-options-size-table\n\n    --size-table\n\n</td><td>\n\nPrint a list of taxonomic IDs and the lengths of the sequences belonging to each taxonomic ID.\n\n</td></tr><tr><td>\n\n    -v/--verbose\n\n</td><td>\n\nPrint verbose output (for debugging).\n\n</td></tr><tr><td>\n\n    --version\n\n</td><td>\n\nPrint version information and quit.\n\n</td></tr><tr><td>\n\n    -h/--help\n\n</td><td>\n\nPrint usage information and quit.\n\n</td></tr></table>\n\n[`small example`]: #centrifuge-example\n\nGetting started with Centrifuge\n===================================================\n\nCentrifuge comes with some example files to get you started.  The example files\nare not scientifically significant; these files will simply let you start running Centrifuge and\ndownstream tools right away.\n\nFirst follow the manual instructions to [obtain Centrifuge].  Set the `CENTRIFUGE_HOME`\nenvironment variable to point to the new Centrifuge directory containing the\n`centrifuge`, `centrifuge-build` and `centrifuge-inspect` binaries.  
This is important,\nas the `CENTRIFUGE_HOME` variable is used in the commands below to refer to that\ndirectory.\n\n[obtain Centrifuge]: #obtaining-centrifuge\n\nIndexing a reference genome\n---------------------------\n\nTo create an index for two small sequences included with Centrifuge, create a new temporary directory (it doesn't matter where), change into that directory, and run:\n\n    $CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test\n\nThe command should print many lines of output and then quit. When the command\ncompletes, the current directory will contain three new files that all start with\n`test` and end with `.1.cf`, `.2.cf`, or `.3.cf`.  These files constitute the index - you're done!\n\nYou can use `centrifuge-build` to create an index for a set of FASTA files obtained\nfrom any source, including sites such as [UCSC], [NCBI], and [Ensembl]. When\nindexing multiple FASTA files, specify all the files using commas to separate\nfile names.  For more details on how to create an index with `centrifuge-build`,\nsee the [manual section on index building].  You can also bypass this\nprocess by obtaining a pre-built index.\n\n[UCSC]: http://genome.ucsc.edu/cgi-bin/hgGateway\n[NCBI]: http://www.ncbi.nlm.nih.gov/sites/genome\n[Ensembl]: http://www.ensembl.org/\n[manual section on index building]: #the-centrifuge-build-indexer\n[using a pre-built index]: #using-a-pre-built-index\n\nClassifying example reads\n----------------------\n\nStay in the directory created in the previous step, which now contains the\n`test` index files.  
Next, run:\n\n    $CENTRIFUGE_HOME/centrifuge -f -x test $CENTRIFUGE_HOME/example/reads/input.fa\n\nThis runs the Centrifuge classifier, which classifies a set of unpaired reads to\nthe genomes using the index generated in the previous step.\nThe classification results are reported to stdout, and a\nshort classification summary is written to `centrifuge-species_report.tsv`.\n\nYou will see something like this:\n\n    readID  seqID taxID     score\t2ndBestScore\thitLength\tnumMatches\n    C_1 gi|7     9913      4225\t4225\t\t80\t\t2\n    C_1 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_2 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_2 gi|7     9913      4225\t4225\t\t80\t\t2\n    C_3 gi|7     9913      4225\t4225\t\t80\t\t2\n    C_3 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_4 gi|4     9646      4225\t4225\t\t80\t\t2\n    C_4 gi|7     9913      4225\t4225\t\t80\t\t2\n    1_1 gi|4     9646      4225\t0\t\t80\t\t1\n    1_2 gi|4     9646      4225\t0\t\t80\t\t1\n    2_1 gi|7     9913      4225\t0\t\t80\t\t1\n    2_2 gi|7     9913      4225\t0\t\t80\t\t1\n    2_3 gi|7     9913      4225\t0\t\t80\t\t1\n    2_4 gi|7     9913      4225\t0\t\t80\t\t1\n    2_5 gi|7     9913      4225\t0\t\t80\t\t1\n    2_6 gi|7     9913      4225\t0\t\t80\t\t1\n"
  },
  {
    "path": "Makefile",
    "content": "#\n# Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n#\n# This file is part of Centrifuge, which is copied and modified from Makefile in the Bowtie2 package.\n#\n# Centrifuge is free software: you can redistribute it and/or modify\n# it under the terms of the GNU General Public License as published by\n# the Free Software Foundation, either version 3 of the License, or\n# (at your option) any later version.\n#\n# Centrifuge is distributed in the hope that it will be useful,\n# but WITHOUT ANY WARRANTY; without even the implied warranty of\n# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n# GNU General Public License for more details.\n#\n# You should have received a copy of the GNU General Public License\n# along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.\n#\n#\n# Makefile for centrifuge-bin, centrifuge-build, centrifuge-inspect\n#\n\nINC =\nGCC_PREFIX = $(shell dirname `which gcc`)\nGCC_SUFFIX =\nCC = $(GCC_PREFIX)/gcc$(GCC_SUFFIX)\nCPP = $(GCC_PREFIX)/g++$(GCC_SUFFIX)\nCXX = $(CPP) #-fdiagnostics-color=always\nHEADERS = $(wildcard *.h)\nBOWTIE_MM = 1\nBOWTIE_SHARED_MEM = 0\n\n# Detect Cygwin or MinGW\nWINDOWS = 0\nCYGWIN = 0\nMINGW = 0\nifneq (,$(findstring CYGWIN,$(shell uname)))\n\tWINDOWS = 1 \n\tCYGWIN = 1\n\t# POSIX memory-mapped files not currently supported on Windows\n\tBOWTIE_MM = 0\n\tBOWTIE_SHARED_MEM = 0\nelse\n\tifneq (,$(findstring MINGW,$(shell uname)))\n\t\tWINDOWS = 1\n\t\tMINGW = 1\n\t\t# POSIX memory-mapped files not currently supported on Windows\n\t\tBOWTIE_MM = 0\n\t\tBOWTIE_SHARED_MEM = 0\n\tendif\nendif\n\nMACOS = 0\nifneq (,$(findstring Darwin,$(shell uname)))\n\tMACOS = 1\nendif\n\nPOPCNT_CAPABILITY ?= 1\nifeq (1, $(POPCNT_CAPABILITY))\n    EXTRA_FLAGS += -DPOPCNT_CAPABILITY\n    INC += -I third_party\nendif\n\nMM_DEF = \n\nifeq (1,$(BOWTIE_MM))\n\tMM_DEF = -DBOWTIE_MM\nendif\n\nSHMEM_DEF = \n\nifeq (1,$(BOWTIE_SHARED_MEM))\n\tSHMEM_DEF = -DBOWTIE_SHARED_MEM\nendif\n\nPTHREAD_PKG 
=\nPTHREAD_LIB = \n\nifeq (1,$(MINGW))\n\tPTHREAD_LIB = \nelse\n\tPTHREAD_LIB = -lpthread\nendif\n\nSEARCH_LIBS = \nBUILD_LIBS = \nINSPECT_LIBS =\n\nifeq (1,$(MINGW))\n\tBUILD_LIBS = \n\tINSPECT_LIBS = \nendif\n\nUSE_SRA = 0\nSRA_DEF =\nSRA_LIB =\nSEARCH_INC = \nifeq (1,$(USE_SRA))\n\tSRA_DEF = -DUSE_SRA\n\tSRA_LIB = -lncbi-ngs-c++-static -lngs-c++-static -lncbi-vdb-static -ldl\n\tSEARCH_INC += -I$(NCBI_NGS_DIR)/include -I$(NCBI_VDB_DIR)/include\n\tSEARCH_LIBS += -L$(NCBI_NGS_DIR)/lib64 -L$(NCBI_VDB_DIR)/lib64\nendif\n\nLIBS = $(PTHREAD_LIB)\n\nSHARED_CPPS = ccnt_lut.cpp ref_read.cpp alphabet.cpp shmem.cpp \\\n\tedit.cpp bt2_idx.cpp \\\n\treference.cpp ds.cpp limit.cpp \\\n\trandom_source.cpp tinythread.cpp\nSEARCH_CPPS = qual.cpp pat.cpp \\\n\tread_qseq.cpp ref_coord.cpp mask.cpp \\\n\tpe.cpp aligner_seed_policy.cpp \\\n\tscoring.cpp presets.cpp \\\n\tsimple_func.cpp random_util.cpp outq.cpp\n\nBUILD_CPPS = diff_sample.cpp\n\nCENTRIFUGE_CPPS_MAIN = $(SEARCH_CPPS) centrifuge_main.cpp\nCENTRIFUGE_BUILD_CPPS_MAIN = $(BUILD_CPPS) centrifuge_build_main.cpp\nCENTRIFUGE_COMPRESS_CPPS_MAIN = $(BUILD_CPPS) \\\n\taligner_seed.cpp \\\n\taligner_sw.cpp \\\n\taligner_cache.cpp \\\n\tdp_framer.cpp \\\n\taligner_bt.cpp sse_util.cpp \\\n\taligner_swsse.cpp \\\n\taligner_swsse_loc_i16.cpp \\\n\taligner_swsse_ee_i16.cpp \\\n\taligner_swsse_loc_u8.cpp \\\n\taligner_swsse_ee_u8.cpp \\\n\tscoring.cpp \\\n\tmask.cpp \\\n\tqual.cpp\n\nCENTRIFUGE_REPORT_CPPS_MAIN=$(BUILD_CPPS)\n\nSEARCH_FRAGMENTS = $(wildcard search_*_phase*.c)\nVERSION = $(shell cat VERSION)\nGIT_VERSION = $(VERSION)\n#GIT_VERSION = $(shell command -v git 2>&1 > /dev/null && git describe --long --tags --dirty --always --abbrev=10 || cat VERSION)\n\n# Convert BITS=?? 
to a -m flag\nBITS=32\nifeq (x86_64,$(shell uname -m))\nBITS=64\nendif\n# msys will always be 32 bit so look at the cpu arch instead.\nifneq (,$(findstring AMD64,$(PROCESSOR_ARCHITEW6432)))\n\tifeq (1,$(MINGW))\n\t\tBITS=64\n\tendif\nendif\nBITS_FLAG =\n\nifeq (32,$(BITS))\n\tBITS_FLAG = -m32\nendif\n\nifeq (64,$(BITS))\n\tBITS_FLAG = -m64\nendif\nSSE_FLAG=-msse2\n\nDEBUG_FLAGS    = -O0 -g3 $(BITS_FLAG) $(SSE_FLAG) -std=c++11\nDEBUG_DEFS     = -DCOMPILER_OPTIONS=\"\\\"$(DEBUG_FLAGS) $(EXTRA_FLAGS)\\\"\"\nRELEASE_FLAGS  = -O3 $(BITS_FLAG) $(SSE_FLAG) -funroll-loops -g3 -std=c++11\nRELEASE_DEFS   = -DCOMPILER_OPTIONS=\"\\\"$(RELEASE_FLAGS) $(EXTRA_FLAGS)\\\"\"\nNOASSERT_FLAGS = -DNDEBUG\nFILE_FLAGS     = -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE\nCFLAGS         = \n#CFLAGS         = -fdiagnostics-color=always\n\nifeq (1,$(USE_SRA))\n\tifeq (1, $(MACOS))\n\t\tDEBUG_FLAGS += -mmacosx-version-min=10.6\n\t\tRELEASE_FLAGS += -mmacosx-version-min=10.6\n\tendif\nendif\n\n\nCENTRIFUGE_BIN_LIST = centrifuge-build-bin \\\n\tcentrifuge-class \\\n\tcentrifuge-inspect-bin\n\nCENTRIFUGE_BIN_LIST_AUX = centrifuge-build-bin-debug \\\n\tcentrifuge-class-debug \\\n\tcentrifuge-inspect-bin-debug\n\nCENTRIFUGE_SCRIPT_LIST = \tcentrifuge \\\n\tcentrifuge-build \\\n\tcentrifuge-inspect \\\n\tcentrifuge-download \\\n\tcentrifuge-kreport \\\n\t$(wildcard centrifuge-*.pl)\n\n\nGENERAL_LIST = $(wildcard scripts/*.sh) \\\n\t$(wildcard scripts/*.pl) \\\n\t$(wildcard *.py) \\\n\t$(wildcard *.pl) \\\n\tdoc/manual.inc.html \\\n\tdoc/README \\\n\tdoc/style.css \\\n\t$(wildcard example/index/*.cf) \\\n\t$(wildcard example/reads/*.fa) \\\n\t$(wildcard example/reference/*) \\\n\tindices/Makefile \\\n\t$(PTHREAD_PKG) \\\n\t$(CENTRIFUGE_SCRIPT_LIST) \\\n\tAUTHORS \\\n\tLICENSE \\\n\tNEWS \\\n\tMANUAL \\\n\tMANUAL.markdown \\\n\tTUTORIAL \\\n\tVERSION\n\nifeq (1,$(WINDOWS))\n\tCENTRIFUGE_BIN_LIST := $(CENTRIFUGE_BIN_LIST) centrifuge.bat centrifuge-build.bat centrifuge-inspect.bat 
\nendif\n\n# This is helpful on Windows under MinGW/MSYS, where Make might go for\n# the Windows FIND tool instead.\nFIND=$(shell which find)\n\nSRC_PKG_LIST = $(wildcard *.h) \\\n\t$(wildcard *.hh) \\\n\t$(wildcard *.c) \\\n\t$(wildcard *.cpp) \\\n\t$(wildcard third_party/*.h) \\\n\t$(wildcard third_party/*.cpp) \\\n\tdoc/strip_markdown.pl \\\n\tMakefile \\\n\t$(GENERAL_LIST)\n\nBIN_PKG_LIST = $(GENERAL_LIST)\n\n.PHONY: all allall both both-debug\n\nall: $(CENTRIFUGE_BIN_LIST)\n\nallall: $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)\n\nboth: centrifuge-class centrifuge-build-bin\n\nboth-debug: centrifuge-class-debug centrifuge-build-bin-debug\n\nDEFS=-fno-strict-aliasing \\\n     -DCENTRIFUGE_VERSION=\"\\\"$(GIT_VERSION)\\\"\" \\\n     -DBUILD_HOST=\"\\\"`hostname`\\\"\" \\\n     -DBUILD_TIME=\"\\\"`date`\\\"\" \\\n     -DCOMPILER_VERSION=\"\\\"`$(CXX) -v 2>&1 | tail -1`\\\"\" \\\n     $(FILE_FLAGS) \\\n\t $(CFLAGS) \\\n     $(PREF_DEF) \\\n     $(MM_DEF) \\\n     $(SHMEM_DEF)\n\n#\n# centrifuge targets\n#\n\ncentrifuge-class: centrifuge.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)\n\t$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) $(SRA_DEF) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \\\n\t$(INC) $(SEARCH_INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_CPPS_MAIN) \\\n\t$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)\n\ncentrifuge-class-debug: centrifuge.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)\n\t$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) $(SRA_DEF) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \\\n\t$(INC) $(SRA_LIB) $(SEARCH_INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_CPPS_MAIN) \\\n\t$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)\n\ncentrifuge-build-bin: centrifuge_build.cpp $(SHARED_CPPS) $(HEADERS)\n\t$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \\\n\t$(INC) 
\\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_BUILD_CPPS_MAIN) \\\n\t$(LIBS) $(BUILD_LIBS)\n\ncentrifuge-build-bin-debug: centrifuge_build.cpp $(SHARED_CPPS) $(HEADERS)\n\t$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \\\n\t$(INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_BUILD_CPPS_MAIN) \\\n\t$(LIBS) $(BUILD_LIBS)\n\ncentrifuge-compress-bin: centrifuge_compress.cpp $(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) $(HEADERS)\n\t$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \\\n\t$(INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) \\\n\t$(LIBS) $(BUILD_LIBS)\n\ncentrifuge-compress-bin-debug: centrifuge_compress.cpp $(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) $(HEADERS)\n\t$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \\\n\t$(INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) \\\n\t$(LIBS) $(BUILD_LIBS)\n\ncentrifuge-report-bin: centrifuge_report.cpp $(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) $(HEADERS)\n\t$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \\\n\t$(INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) \\\n\t$(LIBS) $(BUILD_LIBS)\n\ncentrifuge-report-bin-debug: centrifuge_report.cpp $(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) $(HEADERS)\n\t$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \\\n\t$(INC) \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) \\\n\t$(LIBS) $(BUILD_LIBS)\n\n#centrifuge-RemoveN: centrifuge-RemoveN.cpp \n#\t$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \\\n#\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \\\n#\t$(INC) 
\\\n#\t-o $@ $< \n\n\n#\n# centrifuge-inspect targets\n#\n\ncentrifuge-inspect-bin: centrifuge_inspect.cpp $(HEADERS) $(SHARED_CPPS)\n\t$(CXX) $(RELEASE_FLAGS) \\\n\t$(RELEASE_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \\\n\t$(INC) -I . \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) \\\n\t$(LIBS) $(INSPECT_LIBS)\n\ncentrifuge-inspect-bin-debug: centrifuge_inspect.cpp $(HEADERS) $(SHARED_CPPS) \n\t$(CXX) $(DEBUG_FLAGS) \\\n\t$(DEBUG_DEFS) $(EXTRA_FLAGS) \\\n\t$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \\\n\t$(INC) -I . \\\n\t-o $@ $< \\\n\t$(SHARED_CPPS) \\\n\t$(LIBS) $(INSPECT_LIBS)\n\n\ncentrifuge: ;\n\ncentrifuge.bat:\n\techo \"@echo off\" > centrifuge.bat\n\techo \"perl %~dp0/centrifuge %*\" >> centrifuge.bat\n\ncentrifuge-build.bat:\n\techo \"@echo off\" > centrifuge-build.bat\n\techo \"python %~dp0/centrifuge-build %*\" >> centrifuge-build.bat\n\ncentrifuge-inspect.bat:\n\techo \"@echo off\" > centrifuge-inspect.bat\n\techo \"python %~dp0/centrifuge-inspect %*\" >> centrifuge-inspect.bat\n\n\n.PHONY: centrifuge-src\ncentrifuge-src: $(SRC_PKG_LIST)\n\tmkdir .src.tmp\n\tmkdir .src.tmp/centrifuge-$(VERSION)\n\tzip tmp.zip $(SRC_PKG_LIST)\n\tmv tmp.zip .src.tmp/centrifuge-$(VERSION)\n\tcd .src.tmp/centrifuge-$(VERSION) ; unzip tmp.zip ; rm -f tmp.zip\n\tcd .src.tmp ; zip -r centrifuge-$(VERSION)-source.zip centrifuge-$(VERSION)\n\tcp .src.tmp/centrifuge-$(VERSION)-source.zip .\n\trm -rf .src.tmp\n\n.PHONY: centrifuge-bin\ncentrifuge-bin: $(BIN_PKG_LIST) $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX) \n\trm -rf .bin.tmp\n\tmkdir .bin.tmp\n\tmkdir .bin.tmp/centrifuge-$(VERSION)\n\tif [ -f centrifuge.exe ] ; then \\\n\t\tzip tmp.zip $(BIN_PKG_LIST) $(addsuffix .exe,$(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)) ; \\\n\telse \\\n\t\tzip tmp.zip $(BIN_PKG_LIST) $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX) ; \\\n\tfi\n\tmv tmp.zip .bin.tmp/centrifuge-$(VERSION)\n\tcd .bin.tmp/centrifuge-$(VERSION) ; unzip tmp.zip 
; rm -f tmp.zip\n\tcd .bin.tmp ; zip -r centrifuge-$(VERSION)-$(BITS).zip centrifuge-$(VERSION)\n\tcp .bin.tmp/centrifuge-$(VERSION)-$(BITS).zip .\n\trm -rf .bin.tmp\n\n.PHONY: doc\ndoc: doc/manual.inc.html MANUAL\n\ndoc/manual.inc.html: MANUAL.markdown\n\tpandoc -T \"Centrifuge Manual\" -o $@ \\\n\t --from markdown --to HTML --toc $^\n\tperl -i -ne \\\n\t '$$w=0 if m|^</body>|;print if $$w;$$w=1 if m|^<body>|;' $@\n\nMANUAL: MANUAL.markdown\n\tperl doc/strip_markdown.pl < $^ > $@\n\nprefix=/usr/local\n\n.PHONY: install\ninstall: all\n\tmkdir -p $(prefix)/bin\n\tmkdir -p $(prefix)/share/centrifuge/indices\n\tinstall -m 0644 indices/Makefile $(prefix)/share/centrifuge/indices\n\tinstall -d -m 0755 $(prefix)/share/centrifuge/doc\n\tinstall -m 0644 doc/* $(prefix)/share/centrifuge/doc\n\tfor file in $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_SCRIPT_LIST); do \\\n\t\tinstall -m 0755 $$file $(prefix)/bin ; \\\n\tdone\n\n.PHONY: uninstall\nuninstall: all\n\tfor file in $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_SCRIPT_LIST); do \\\n\t\trm -v $(prefix)/bin/$$file ; \\\n\t\trm -v $(prefix)/share/centrifuge; \\\n\tdone\n\n\n.PHONY: clean\nclean:\n\trm -f $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX) \\\n\t$(addsuffix .exe,$(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)) \\\n\tcentrifuge-src.zip centrifuge-bin.zip\n\trm -f core.* .tmp.head\n\trm -rf *.dSYM\npush-doc: doc/manual.inc.html\n\tscp doc/*.*html igm1:/data1/igm3/www/ccb.jhu.edu/html/software/centrifuge/\n"
  },
  {
    "path": "NEWS",
    "content": "Centrifuge NEWS\n=============\n\n"
  },
  {
    "path": "README.md",
    "content": "# Centrifuge\nClassifier for metagenomic sequences\n\n[Centrifuge] is a novel microbial classification engine that enables\nrapid, accurate and sensitive labeling of reads and quantification of\nspecies on desktop computers.  The system uses a novel indexing scheme\nbased on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini\n(FM) index, optimized specifically for the metagenomic classification\nproblem. Centrifuge requires a relatively small index (4.7 GB for all\ncomplete bacterial and viral genomes plus the human genome) and\nclassifies sequences at very high speed, allowing it to process the\nmillions of reads from a typical high-throughput DNA sequencing run\nwithin a few minutes.  Together these advances enable timely and\naccurate analysis of large metagenomics data sets on conventional\ndesktop computers\n\nThe Centrifuge hompage is  http://www.ccb.jhu.edu/software/centrifuge\n\nThe Centrifuge paper is available at https://genome.cshlp.org/content/26/12/1721\n\nThe Centrifuge poster is available at http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf\n\nFor more details on installing and running Centrifuge, look at MANUAL\n\n## Quick guide\n### Installation from source\n\n    git clone https://github.com/infphilo/centrifuge\n    cd centrifuge\n    make\n    sudo make install prefix=/usr/local\n\n### Building indexes\n\nWe provide several indexes on the Centrifuge homepage at http://www.ccb.jhu.edu/software/centrifuge.\nCentrifuge needs sequence and taxonomy files,  as well as sequence ID to taxonomy ID mapping. \nSee the MANUAL files for details. We provide a Makefile that simplifies the building of several\nstandard and custom indices\n\n    cd indices\n    make p+h+v                   # bacterial, human, and viral genomes [~12G]\n    make p_compressed            # bacterial genomes compressed at the species level [~4.2G]\n    make p_compressed+h+v        # combination of the two above [~8G]\n"
  },
  {
    "path": "TUTORIAL",
    "content": "See section toward end of MANUAL entited \"Getting started with Bowtie 2: Lambda\nphage example\".  Or, for tutorial for latest Bowtie 2 version, visit:\n\nhttp://bowtie-bio.sf.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example\n"
  },
  {
    "path": "VERSION",
    "content": "1.0.4\n"
  },
  {
    "path": "aligner_bt.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"aligner_bt.h\"\n#include \"mask.h\"\n\nusing namespace std;\n\n#define CHECK_ROW_COL(rowc, colc) \\\n\tif(rowc >= 0 && colc >= 0) { \\\n\t\tif(!sawcell_[colc].insert(rowc)) { \\\n\t\t\t/* was already in there */ \\\n\t\t\tabort = true; \\\n\t\t\treturn; \\\n\t\t} \\\n\t\tassert(local || prob_.cper_->debugCell(rowc, colc, hefc)); \\\n\t}\n\n/**\n * Fill in a triangle of the DP table and backtrace from the given cell to\n * a cell in the previous checkpoint, or to the terminal cell.\n */\nvoid BtBranchTracer::triangleFill(\n\tint64_t rw,          // row of cell to backtrace from\n\tint64_t cl,          // column of cell to backtrace from\n\tint hef,             // cell to backtrace from is H (0), E (1), or F (2)\n\tTAlScore targ,       // score of cell to backtrace from\n\tTAlScore targ_final, // score of alignment we're looking for\n\tRandomSource& rnd,   // pseudo-random generator\n\tint64_t& row_new,    // out: row we ended up in after backtrace\n\tint64_t& col_new,    // out: column we ended up in after backtrace\n\tint& hef_new,        // out: H/E/F after backtrace\n\tTAlScore& targ_new,  // out: score up to cell we ended up in\n\tbool& done,          // out: finished tracing out an 
alignment?\n\tbool& abort)         // out: aborted b/c cell was seen before?\n{\n\tassert_geq(rw, 0);\n\tassert_geq(cl, 0);\n\tassert_range(0, 2, hef);\n\tassert_lt(rw, (int64_t)prob_.qrylen_);\n\tassert_lt(cl, (int64_t)prob_.reflen_);\n\tassert(prob_.usecp_ && prob_.fill_);\n\tint64_t row = rw, col = cl;\n\tconst int64_t colmin = 0;\n\tconst int64_t rowmin = 0;\n\tconst int64_t colmax = prob_.reflen_ - 1;\n\tconst int64_t rowmax = prob_.qrylen_ - 1;\n\tassert_leq(prob_.reflen_, (TRefOff)sawcell_.size());\n\tassert_leq(col, (int64_t)prob_.cper_->hicol());\n\tassert_geq(col, (int64_t)prob_.cper_->locol());\n\tassert_geq(prob_.cper_->per(), 2);\n\tsize_t mod = (row + col) & prob_.cper_->lomask();\n\tassert_lt(mod, prob_.cper_->per());\n\t// Allocate room for diags\n\tsize_t depth = mod+1;\n\tassert_leq(depth, prob_.cper_->per());\n\tsize_t breadth = depth;\n\ttri_.resize(depth);\n\t// Allocate room for each diag\n\tfor(size_t i = 0; i < depth; i++) {\n\t\ttri_[i].resize(breadth - i);\n\t}\n\tbool upperleft = false;\n\tsize_t off = (row + col) >> prob_.cper_->perpow2();\n\tif(off == 0) {\n\t\tupperleft = true;\n\t} else {\n\t\toff--;\n\t}\n\tconst TAlScore sc_rdo = prob_.sc_->readGapOpen();\n\tconst TAlScore sc_rde = prob_.sc_->readGapExtend();\n\tconst TAlScore sc_rfo = prob_.sc_->refGapOpen();\n\tconst TAlScore sc_rfe = prob_.sc_->refGapExtend();\n\tconst bool local = !prob_.sc_->monotone;\n\tint64_t row_lo = row - (int64_t)mod;\n\tconst CpQuad *prev2 = NULL, *prev1 = NULL;\n\tif(!upperleft) {\n\t\t// Read-only pointer to cells in diagonal -2.  Start one row above the\n\t\t// target row.\n\t\tprev2 = prob_.cper_->qdiag1sPtr() + (off * prob_.cper_->nrow() + row_lo - 1);\n\t\t// Read-only pointer to cells in diagonal -1.  
Start one row above the\n\t\t// target row\n\t\tprev1 = prob_.cper_->qdiag2sPtr() + (off * prob_.cper_->nrow() + row_lo - 1);\n#ifndef NDEBUG\n\t\tif(row >= (int64_t)mod) {\n\t\t\tsize_t rowc = row - mod, colc = col;\n\t\t\tif(rowc > 0 && prob_.cper_->isCheckpointed(rowc-1, colc)) {\n\t\t\t\tTAlScore al = prev1[0].sc[0];\n\t\t\t\tif(al == MIN_I16) al = MIN_I64;\n\t\t\t\tassert_eq(prob_.cper_->scoreTriangle(rowc-1, colc, 0), al);\n\t\t\t}\n\t\t\tif(rowc > 0 && colc > 0 && prob_.cper_->isCheckpointed(rowc-1, colc-1)) {\n\t\t\t\tTAlScore al = prev2[0].sc[0];\n\t\t\t\tif(al == MIN_I16) al = MIN_I64;\n\t\t\t\tassert_eq(prob_.cper_->scoreTriangle(rowc-1, colc-1, 0), al);\n\t\t\t}\n\t\t}\n#endif\n\t}\n\t// Pointer to cells in current diagonal\n\t// For each diagonal we need to fill in\n\tfor(size_t i = 0; i < depth; i++) {\n\t\tCpQuad * cur = tri_[i].ptr();\n\t\tCpQuad * curc = cur;\n\t\tsize_t doff = mod - i; // # diagonals we are away from target diag\n\t\t//assert_geq(row, (int64_t)doff);\n\t\tint64_t rowc = row - doff;\n\t\tint64_t colc = col;\n\t\tsize_t neval = 0; // # cells evaluated in this diag\n\t\tASSERT_ONLY(const CpQuad *last = NULL);\n\t\t// Fill this diagonal from upper right to lower left\n\t\tfor(size_t j = 0; j < breadth; j++) {\n\t\t\tif(rowc >= rowmin && rowc <= rowmax &&\n\t\t\t   colc >= colmin && colc <= colmax)\n\t\t\t{\n\t\t\t\tneval++;\n\t\t\t\tint64_t fromend = prob_.qrylen_ - rowc - 1;\n\t\t\t\tbool allowGaps = fromend >= prob_.sc_->gapbar && rowc >= prob_.sc_->gapbar;\n\t\t\t\t// Fill this cell\n\t\t\t\t// Some things we might want to calculate about this cell up front:\n\t\t\t\t// 1. 
How many matches are possible from this cell to the cell in\n\t\t\t\t//    row, col, in case this allows us to prune\n\t\t\t\t// Get character from read\n\t\t\t\tint qc = prob_.qry_[rowc];\n\t\t\t\t// Get quality value from read\n\t\t\t\tint qq = prob_.qual_[rowc];\n\t\t\t\tassert_geq(qq, 33);\n\t\t\t\t// Get character from reference\n\t\t\t\tint rc = prob_.ref_[colc];\n\t\t\t\tassert_range(0, 16, rc);\n\t\t\t\tint16_t sc_diag = prob_.sc_->score(qc, rc, qq - 33);\n\t\t\t\tint16_t sc_h_up = MIN_I16;\n\t\t\t\tint16_t sc_f_up = MIN_I16;\n\t\t\t\tint16_t sc_h_lf = MIN_I16;\n\t\t\t\tint16_t sc_e_lf = MIN_I16;\n\t\t\t\tif(allowGaps) {\n\t\t\t\t\tif(rowc > 0) {\n\t\t\t\t\t\tassert(local || prev1[j+0].sc[2] < 0);\n\t\t\t\t\t\tif(prev1[j+0].sc[0] > MIN_I16) {\n\t\t\t\t\t\t\tsc_h_up = prev1[j+0].sc[0] - sc_rfo;\n\t\t\t\t\t\t\tif(local) sc_h_up = max<int16_t>(sc_h_up, 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(prev1[j+0].sc[2] > MIN_I16) {\n\t\t\t\t\t\t\tsc_f_up = prev1[j+0].sc[2] - sc_rfe;\n\t\t\t\t\t\t\tif(local) sc_f_up = max<int16_t>(sc_f_up, 0);\n\t\t\t\t\t\t}\n#ifndef NDEBUG\n\t\t\t\t\t\tTAlScore hup = prev1[j+0].sc[0];\n\t\t\t\t\t\tTAlScore fup = prev1[j+0].sc[2];\n\t\t\t\t\t\tif(hup == MIN_I16) hup = MIN_I64;\n\t\t\t\t\t\tif(fup == MIN_I16) fup = MIN_I64;\n\t\t\t\t\t\tif(local) {\n\t\t\t\t\t\t\thup = max<int16_t>(hup, 0);\n\t\t\t\t\t\t\tfup = max<int16_t>(fup, 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(prob_.cper_->isCheckpointed(rowc-1, colc)) {\n\t\t\t\t\t\t\tassert_eq(hup, prob_.cper_->scoreTriangle(rowc-1, colc, 0));\n\t\t\t\t\t\t\tassert_eq(fup, prob_.cper_->scoreTriangle(rowc-1, colc, 2));\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t}\n\t\t\t\t\tif(colc > 0) {\n\t\t\t\t\t\tassert(local || prev1[j+1].sc[1] < 0);\n\t\t\t\t\t\tif(prev1[j+1].sc[0] > MIN_I16) {\n\t\t\t\t\t\t\tsc_h_lf = prev1[j+1].sc[0] - sc_rdo;\n\t\t\t\t\t\t\tif(local) sc_h_lf = max<int16_t>(sc_h_lf, 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(prev1[j+1].sc[1] > MIN_I16) {\n\t\t\t\t\t\t\tsc_e_lf = prev1[j+1].sc[1] - 
sc_rde;\n\t\t\t\t\t\t\tif(local) sc_e_lf = max<int16_t>(sc_e_lf, 0);\n\t\t\t\t\t\t}\n#ifndef NDEBUG\n\t\t\t\t\t\tTAlScore hlf = prev1[j+1].sc[0];\n\t\t\t\t\t\tTAlScore elf = prev1[j+1].sc[1];\n\t\t\t\t\t\tif(hlf == MIN_I16) hlf = MIN_I64;\n\t\t\t\t\t\tif(elf == MIN_I16) elf = MIN_I64;\n\t\t\t\t\t\tif(local) {\n\t\t\t\t\t\t\thlf = max<int16_t>(hlf, 0);\n\t\t\t\t\t\t\telf = max<int16_t>(elf, 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(prob_.cper_->isCheckpointed(rowc, colc-1)) {\n\t\t\t\t\t\t\tassert_eq(hlf, prob_.cper_->scoreTriangle(rowc, colc-1, 0));\n\t\t\t\t\t\t\tassert_eq(elf, prob_.cper_->scoreTriangle(rowc, colc-1, 1));\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tassert(rowc <= 1 || colc <= 0 || prev2 != NULL);\n\t\t\t\tint16_t sc_h_dg = ((rowc > 0 && colc > 0) ? prev2[j+0].sc[0] : 0);\n\t\t\t\tif(colc == 0 && rowc > 0 && !local) {\n\t\t\t\t\tsc_h_dg = MIN_I16;\n\t\t\t\t}\n\t\t\t\tif(sc_h_dg > MIN_I16) {\n\t\t\t\t\tsc_h_dg += sc_diag;\n\t\t\t\t}\n\t\t\t\tif(local) sc_h_dg = max<int16_t>(sc_h_dg, 0);\n\t\t\t\t// cerr << sc_diag << \" \" << sc_h_dg << \" \" << sc_h_up << \" \" << sc_f_up << \" \" << sc_h_lf << \" \" << sc_e_lf << endl;\n\t\t\t\tint mask = 0;\n\t\t\t\t// Calculate best ways into H, E, F cells starting with H.\n\t\t\t\t// Mask bits:\n\t\t\t\t// H: 1=diag, 2=hhoriz, 4=ehoriz, 8=hvert, 16=fvert\n\t\t\t\t// E: 32=hhoriz, 64=ehoriz\n\t\t\t\t// F: 128=hvert, 256=fvert\n\t\t\t\tint16_t sc_best = sc_h_dg;\n\t\t\t\tif(sc_h_dg > MIN_I64) {\n\t\t\t\t\tmask = 1;\n\t\t\t\t}\n\t\t\t\tif(colc > 0 && sc_h_lf >= sc_best && sc_h_lf > MIN_I64) {\n\t\t\t\t\tif(sc_h_lf > sc_best) mask = 0;\n\t\t\t\t\tmask |= 2;\n\t\t\t\t\tsc_best = sc_h_lf;\n\t\t\t\t}\n\t\t\t\tif(colc > 0 && sc_e_lf >= sc_best && sc_e_lf > MIN_I64) {\n\t\t\t\t\tif(sc_e_lf > sc_best) mask = 0;\n\t\t\t\t\tmask |= 4;\n\t\t\t\t\tsc_best = sc_e_lf;\n\t\t\t\t}\n\t\t\t\tif(rowc > 0 && sc_h_up >= sc_best && sc_h_up > MIN_I64) {\n\t\t\t\t\tif(sc_h_up > sc_best) mask = 0;\n\t\t\t\t\tmask |= 
8;\n\t\t\t\t\tsc_best = sc_h_up;\n\t\t\t\t}\n\t\t\t\tif(rowc > 0 && sc_f_up >= sc_best && sc_f_up > MIN_I64) {\n\t\t\t\t\tif(sc_f_up > sc_best) mask = 0;\n\t\t\t\t\tmask |= 16;\n\t\t\t\t\tsc_best = sc_f_up;\n\t\t\t\t}\n\t\t\t\t// Calculate best way into E cell\n\t\t\t\tint16_t sc_e_best = sc_h_lf;\n\t\t\t\tif(colc > 0) {\n\t\t\t\t\tif(sc_h_lf >= sc_e_lf && sc_h_lf > MIN_I64) {\n\t\t\t\t\t\tif(sc_h_lf == sc_e_lf) {\n\t\t\t\t\t\t\tmask |= 64;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tmask |= 32;\n\t\t\t\t\t} else if(sc_e_lf > MIN_I64) {\n\t\t\t\t\t\tsc_e_best = sc_e_lf;\n\t\t\t\t\t\tmask |= 64;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(sc_e_best > sc_best) {\n\t\t\t\t\tsc_best = sc_e_best;\n\t\t\t\t\tmask &= ~31; // don't go diagonal\n\t\t\t\t}\n\t\t\t\t// Calculate best way into F cell\n\t\t\t\tint16_t sc_f_best = sc_h_up;\n\t\t\t\tif(rowc > 0) {\n\t\t\t\t\tif(sc_h_up >= sc_f_up && sc_h_up > MIN_I64) {\n\t\t\t\t\t\tif(sc_h_up == sc_f_up) {\n\t\t\t\t\t\t\tmask |= 256;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tmask |= 128;\n\t\t\t\t\t} else if(sc_f_up > MIN_I64) {\n\t\t\t\t\t\tsc_f_best = sc_f_up;\n\t\t\t\t\t\tmask |= 256;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(sc_f_best > sc_best) {\n\t\t\t\t\tsc_best = sc_f_best;\n\t\t\t\t\tmask &= ~127; // don't go horizontal or diagonal\n\t\t\t\t}\n\t\t\t\t// Install results in cur\n\t\t\t\tassert(!prob_.sc_->monotone || sc_best <= 0);\n\t\t\t\tassert(!prob_.sc_->monotone || sc_e_best <= 0);\n\t\t\t\tassert(!prob_.sc_->monotone || sc_f_best <= 0);\n\t\t\t\tcurc->sc[0] = sc_best;\n\t\t\t\tassert( local || sc_e_best < 0);\n\t\t\t\tassert( local || sc_f_best < 0);\n\t\t\t\tassert(!local || sc_e_best >= 0 || sc_e_best == MIN_I16);\n\t\t\t\tassert(!local || sc_f_best >= 0 || sc_f_best == MIN_I16);\n\t\t\t\tcurc->sc[1] = sc_e_best;\n\t\t\t\tcurc->sc[2] = sc_f_best;\n\t\t\t\tcurc->sc[3] = mask;\n\t\t\t\t// cerr << curc->sc[0] << \" \" << curc->sc[1] << \" \" << curc->sc[2] << \" \" << curc->sc[3] << endl;\n\t\t\t\tASSERT_ONLY(last = curc);\n#ifndef 
NDEBUG\n\t\t\t\tif(prob_.cper_->isCheckpointed(rowc, colc)) {\n\t\t\t\t\tif(local) {\n\t\t\t\t\t\tsc_e_best = max<int16_t>(sc_e_best, 0);\n\t\t\t\t\t\tsc_f_best = max<int16_t>(sc_f_best, 0);\n\t\t\t\t\t}\n\t\t\t\t\tTAlScore sc_best64   = sc_best;   if(sc_best   == MIN_I16) sc_best64   = MIN_I64;\n\t\t\t\t\tTAlScore sc_e_best64 = sc_e_best; if(sc_e_best == MIN_I16) sc_e_best64 = MIN_I64;\n\t\t\t\t\tTAlScore sc_f_best64 = sc_f_best; if(sc_f_best == MIN_I16) sc_f_best64 = MIN_I64;\n\t\t\t\t\tassert_eq(prob_.cper_->scoreTriangle(rowc, colc, 0), sc_best64);\n\t\t\t\t\tassert_eq(prob_.cper_->scoreTriangle(rowc, colc, 1), sc_e_best64);\n\t\t\t\t\tassert_eq(prob_.cper_->scoreTriangle(rowc, colc, 2), sc_f_best64);\n\t\t\t\t}\n#endif\n\t\t\t}\n\t\t\t// Update row, col\n\t\t\tassert_lt(rowc, (int64_t)prob_.qrylen_);\n\t\t\trowc++;\n\t\t\tcolc--;\n\t\t\tcurc++;\n\t\t} // for(size_t j = 0; j < breadth; j++)\n\t\tif(i == depth-1) {\n\t\t\t// Final iteration\n\t\t\tassert(last != NULL);\n\t\t\tassert_eq(1, neval);\n\t\t\tassert_neq(0, last->sc[3]);\n\t\t\tassert_eq(targ, last->sc[hef]);\n\t\t} else {\n\t\t\tbreadth--;\n\t\t\tprev2 = prev1 + 1;\n\t\t\tprev1 = cur;\n\t\t}\n\t} // for(size_t i = 0; i < depth; i++)\n\t//\n\t// Now backtrack through the triangle.  
Abort as soon as we enter a cell\n\t// that was visited by a previous backtrace.\n\t//\n\tint64_t rowc = row, colc = col;\n\tsize_t curid;\n\tint hefc = hef;\n\tif(bs_.empty()) {\n\t\t// Start an initial branch\n\t\tCHECK_ROW_COL(rowc, colc);\n\t\tcurid = bs_.alloc();\n\t\tassert_eq(0, curid);\n\t\tEdit e;\n\t\tbs_[curid].init(\n\t\t\tprob_,\n\t\t\t0,      // parent ID\n\t\t\t0,      // penalty\n\t\t\t0,      // score_en\n\t\t\trowc,   // row\n\t\t\tcolc,   // col\n\t\t\te,      // edit\n\t\t\t0,      // hef\n\t\t\ttrue,   // I am the root\n\t\t\tfalse); // don't try to extend with exact matches\n\t\tbs_[curid].len_ = 0;\n\t} else {\n\t\tcurid = bs_.size()-1;\n\t}\n\tsize_t idx_orig = (row + col) >> prob_.cper_->perpow2();\n\twhile(true) {\n\t\t// What depth are we?\n\t\tsize_t mod = (rowc + colc) & prob_.cper_->lomask();\n\t\tassert_lt(mod, prob_.cper_->per());\n\t\tCpQuad * cur = tri_[mod].ptr();\n\t\tint64_t row_off = rowc - row_lo - mod;\n\t\tassert(!local || cur[row_off].sc[0] > 0);\n\t\tassert_geq(row_off, 0);\n\t\tint mask = cur[row_off].sc[3];\n\t\tassert_gt(mask, 0);\n\t\tint sel = -1;\n\t\t// Select what type of move to make, which depends on whether we're\n\t\t// currently in H, E, F:\n\t\tif(hefc == 0) {\n\t\t\tif(       (mask & 1) != 0) {\n\t\t\t\t// diagonal\n\t\t\t\tsel = 0;\n\t\t\t} else if((mask & 8) != 0) {\n\t\t\t\t// up to H\n\t\t\t\tsel = 3;\n\t\t\t} else if((mask & 16) != 0) {\n\t\t\t\t// up to F\n\t\t\t\tsel = 4;\n\t\t\t} else if((mask & 2) != 0) {\n\t\t\t\t// left to H\n\t\t\t\tsel = 1;\n\t\t\t} else if((mask & 4) != 0) {\n\t\t\t\t// left to E\n\t\t\t\tsel = 2;\n\t\t\t}\n\t\t} else if(hefc == 1) {\n\t\t\tif(       (mask & 32) != 0) {\n\t\t\t\t// left to H\n\t\t\t\tsel = 5;\n\t\t\t} else if((mask & 64) != 0) {\n\t\t\t\t// left to E\n\t\t\t\tsel = 6;\n\t\t\t}\n\t\t} else {\n\t\t\tassert_eq(2, hefc);\n\t\t\tif(       (mask & 128) != 0) {\n\t\t\t\t// up to H\n\t\t\t\tsel = 7;\n\t\t\t} else if((mask & 256) != 0) {\n\t\t\t\t// up to F\n\t\t\t\tsel 
= 8;\n\t\t\t}\n\t\t}\n\t\tassert_geq(sel, 0);\n\t\t// Get character from read\n\t\tint qc = prob_.qry_[rowc], qq = prob_.qual_[rowc];\n\t\t// Get character from reference\n\t\tint rc = prob_.ref_[colc];\n\t\tassert_range(0, 16, rc);\n\t\t// Now that we know what type of move to make, make it, updating our\n\t\t// row and column and updating the branch.\n\t\tif(sel == 0) {\n\t\t\tassert_geq(rowc, 0);\n\t\t\tassert_geq(colc, 0);\n\t\t\tTAlScore scd = prob_.sc_->score(qc, rc, qq - 33);\n\t\t\tif((rc & (1 << qc)) == 0) {\n\t\t\t\t// Mismatch\n\t\t\t\tsize_t id = curid;\n\t\t\t\t// Check if the previous branch was the initial (bottommost)\n\t\t\t\t// branch with no matches.  If so, the mismatch should be added\n\t\t\t\t// to the initial branch, instead of starting a new branch.\n\t\t\t\tbool empty = (bs_[curid].len_ == 0 && curid == 0);\n\t\t\t\tif(!empty) {\n\t\t\t\t\tid = bs_.alloc();\n\t\t\t\t}\n\t\t\t\tEdit e((int)rowc, mask2dna[rc], \"ACGTN\"[qc], EDIT_TYPE_MM);\n\t\t\t\tassert_lt(scd, 0);\n\t\t\t\tTAlScore score_en = bs_[curid].score_st_ + scd;\n\t\t\t\tbs_[id].init(\n\t\t\t\t\tprob_,\n\t\t\t\t\tcurid,    // parent ID\n\t\t\t\t\t-scd,     // penalty\n\t\t\t\t\tscore_en, // score_en\n\t\t\t\t\trowc,     // row\n\t\t\t\t\tcolc,     // col\n\t\t\t\t\te,        // edit\n\t\t\t\t\thefc,     // hef\n\t\t\t\t\tempty,    // root?\n\t\t\t\t\tfalse);   // don't try to extend with exact matches\n\t\t\t\t//assert(!local || bs_[id].score_st_ >= 0);\n\t\t\t\tcurid = id;\n\t\t\t} else {\n\t\t\t\t// Match\n\t\t\t\tbs_[curid].score_st_ += prob_.sc_->match();\n\t\t\t\tbs_[curid].len_++;\n\t\t\t\tassert_leq((int64_t)bs_[curid].len_, bs_[curid].row_ + 1);\n\t\t\t}\n\t\t\trowc--;\n\t\t\tcolc--;\n\t\t\tassert(local || bs_[curid].score_st_ >= targ_final);\n\t\t\thefc = 0;\n\t\t} else if((sel >= 1 && sel <= 2) || (sel >= 5 && sel <= 6)) {\n\t\t\tassert_gt(colc, 0);\n\t\t\t// Read gap\n\t\t\tsize_t id = bs_.alloc();\n\t\t\tEdit e((int)rowc+1, mask2dna[rc], '-', 
EDIT_TYPE_READ_GAP);\n\t\t\tTAlScore gapp = prob_.sc_->readGapOpen();\n\t\t\tif(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isReadGap()) {\n\t\t\t\tgapp = prob_.sc_->readGapExtend();\n\t\t\t}\n\t\t\tTAlScore score_en = bs_[curid].score_st_ - gapp;\n\t\t\tbs_[id].init(\n\t\t\t\tprob_,\n\t\t\t\tcurid,    // parent ID\n\t\t\t\tgapp,     // penalty\n\t\t\t\tscore_en, // score_en\n\t\t\t\trowc,     // row\n\t\t\t\tcolc-1,   // col\n\t\t\t\te,        // edit\n\t\t\t\thefc,     // hef\n\t\t\t\tfalse,    // root?\n\t\t\t\tfalse);   // don't try to extend with exact matches\n\t\t\tcolc--;\n\t\t\tcurid = id;\n\t\t\tassert( local || bs_[curid].score_st_ >= targ_final);\n\t\t\t//assert(!local || bs_[curid].score_st_ >= 0);\n\t\t\tif(sel == 1 || sel == 5) {\n\t\t\t\thefc = 0;\n\t\t\t} else {\n\t\t\t\thefc = 1;\n\t\t\t}\n\t\t} else {\n\t\t\tassert_gt(rowc, 0);\n\t\t\t// Reference gap\n\t\t\tsize_t id = bs_.alloc();\n\t\t\tEdit e((int)rowc, '-', \"ACGTN\"[qc], EDIT_TYPE_REF_GAP);\n\t\t\tTAlScore gapp = prob_.sc_->refGapOpen();\n\t\t\tif(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isRefGap()) {\n\t\t\t\tgapp = prob_.sc_->refGapExtend();\n\t\t\t}\n\t\t\tTAlScore score_en = bs_[curid].score_st_ - gapp;\n\t\t\tbs_[id].init(\n\t\t\t\tprob_,\n\t\t\t\tcurid,    // parent ID\n\t\t\t\tgapp,     // penalty\n\t\t\t\tscore_en, // score_en\n\t\t\t\trowc-1,   // row\n\t\t\t\tcolc,     // col\n\t\t\t\te,        // edit\n\t\t\t\thefc,     // hef\n\t\t\t\tfalse,    // root?\n\t\t\t\tfalse);   // don't try to extend with exact matches\n\t\t\trowc--;\n\t\t\tcurid = id;\n\t\t\t//assert(!local || bs_[curid].score_st_ >= 0);\n\t\t\tif(sel == 3 || sel == 7) {\n\t\t\t\thefc = 0;\n\t\t\t} else {\n\t\t\t\thefc = 2;\n\t\t\t}\n\t\t}\n\t\tCHECK_ROW_COL(rowc, colc);\n\t\tsize_t mod_new = (rowc + colc) & prob_.cper_->lomask();\n\t\tsize_t idx = (rowc + colc) >> prob_.cper_->perpow2();\n\t\tassert_lt(mod_new, prob_.cper_->per());\n\t\tint64_t row_off_new = rowc - row_lo - 
mod_new;\n\t\tCpQuad * cur_new = NULL;\n\t\tif(colc >= 0 && rowc >= 0 && idx == idx_orig) {\n\t\t\tcur_new = tri_[mod_new].ptr();\n\t\t}\n\t\tbool hit_new_tri = (idx < idx_orig && colc >= 0 && rowc >= 0);\n\t\t// Check whether we made it to the top row or to a cell with score 0\n\t\tif(colc < 0 || rowc < 0 ||\n\t\t   (cur_new != NULL && (local && cur_new[row_off_new].sc[0] == 0)))\n\t\t{\n\t\t\tdone = true;\n\t\t\tassert(bs_[curid].isSolution(prob_));\n\t\t\taddSolution(curid);\n#ifndef NDEBUG\n\t\t\t// A check to see if any two adjacent branches in the backtrace\n\t\t\t// overlap.  If they do, the whole alignment will be filtered out\n\t\t\t// in trySolution(...)\n\t\t\tsize_t cur = curid;\n\t\t\tif(!bs_[cur].root_) {\n\t\t\t\tsize_t next = bs_[cur].parentId_;\n\t\t\t\twhile(!bs_[next].root_) {\n\t\t\t\t\tassert_neq(cur, next);\n\t\t\t\t\tif(bs_[next].len_ != 0 || bs_[cur].len_ == 0) {\n\t\t\t\t\t\tassert(!bs_[cur].overlap(prob_, bs_[next]));\n\t\t\t\t\t}\n\t\t\t\t\tcur = next;\n\t\t\t\t\tnext = bs_[cur].parentId_;\n\t\t\t\t}\n\t\t\t}\n#endif\n\t\t\treturn;\n\t\t}\n\t\tif(hit_new_tri) {\n\t\t\tassert(rowc < 0 || colc < 0 || prob_.cper_->isCheckpointed(rowc, colc));\n\t\t\trow_new = rowc; col_new = colc;\n\t\t\thef_new = hefc;\n\t\t\tdone = false;\n\t\t\tif(rowc < 0 || colc < 0) {\n\t\t\t\tassert(local);\n\t\t\t\ttarg_new = 0;\n\t\t\t} else {\n\t\t\t\ttarg_new = prob_.cper_->scoreTriangle(rowc, colc, hefc);\n\t\t\t}\n\t\t\tif(local && targ_new == 0) {\n\t\t\t\tdone = true;\n\t\t\t\tassert(bs_[curid].isSolution(prob_));\n\t\t\t\taddSolution(curid);\n\t\t\t}\n\t\t\tassert((row_new >= 0 && col_new >= 0) || done);\n\t\t\treturn;\n\t\t}\n\t}\n\tassert(false);\n}\n\n#ifndef NDEBUG\n#define DEBUG_CHECK(ss, row, col, hef) { \\\n\tif(prob_.cper_->debug() && row >= 0 && col >= 0) { \\\n\t\tTAlScore s = ss; \\\n\t\tif(s == MIN_I16) s = MIN_I64; \\\n\t\tif(local && s < 0) s = 0; \\\n\t\tTAlScore deb = prob_.cper_->debugCell(row, col, hef); \\\n\t\tif(local && deb < 0) deb = 0; 
\\\n\t\tassert_eq(s, deb); \\\n\t} \\\n}\n#else\n#define DEBUG_CHECK(ss, row, col, hef)\n#endif\n\n\n/**\n * Fill in a square of the DP table and backtrace from the given cell to\n * a cell in the previous checkpoint, or to the terminal cell.\n */\nvoid BtBranchTracer::squareFill(\n\tint64_t rw,          // row of cell to backtrace from\n\tint64_t cl,          // column of cell to backtrace from\n\tint hef,             // cell to backtrace from is H (0), E (1), or F (2)\n\tTAlScore targ,       // score of cell to backtrace from\n\tTAlScore targ_final, // score of alignment we're looking for\n\tRandomSource& rnd,   // pseudo-random generator\n\tint64_t& row_new,    // out: row we ended up in after backtrace\n\tint64_t& col_new,    // out: column we ended up in after backtrace\n\tint& hef_new,        // out: H/E/F after backtrace\n\tTAlScore& targ_new,  // out: score up to cell we ended up in\n\tbool& done,          // out: finished tracing out an alignment?\n\tbool& abort)         // out: aborted b/c cell was seen before?\n{\n\tassert_geq(rw, 0);\n\tassert_geq(cl, 0);\n\tassert_range(0, 2, hef);\n\tassert_lt(rw, (int64_t)prob_.qrylen_);\n\tassert_lt(cl, (int64_t)prob_.reflen_);\n\tassert(prob_.usecp_ && prob_.fill_);\n\tconst bool is8_ = prob_.cper_->is8_;\n\tint64_t row = rw, col = cl;\n\tassert_leq(prob_.reflen_, (TRefOff)sawcell_.size());\n\tassert_leq(col, (int64_t)prob_.cper_->hicol());\n\tassert_geq(col, (int64_t)prob_.cper_->locol());\n\tassert_geq(prob_.cper_->per(), 2);\n\tsize_t xmod = col & prob_.cper_->lomask();\n\tsize_t ymod = row & prob_.cper_->lomask();\n\tsize_t xdiv = col >> prob_.cper_->perpow2();\n\tsize_t ydiv = row >> prob_.cper_->perpow2();\n\tsize_t sq_ncol = xmod+1, sq_nrow = ymod+1;\n\tsq_.resize(sq_ncol * sq_nrow);\n\tbool upper = ydiv == 0;\n\tbool left  = xdiv == 0;\n\tconst TAlScore sc_rdo = prob_.sc_->readGapOpen();\n\tconst TAlScore sc_rde = prob_.sc_->readGapExtend();\n\tconst TAlScore sc_rfo = prob_.sc_->refGapOpen();\n\tconst 
TAlScore sc_rfe = prob_.sc_->refGapExtend();\n\tconst bool local = !prob_.sc_->monotone;\n\tconst CpQuad *qup = NULL;\n\tconst __m128i *qlf = NULL;\n\tsize_t per = prob_.cper_->per_;\n\tASSERT_ONLY(size_t nrow = prob_.cper_->nrow());\n\tsize_t ncol = prob_.cper_->ncol();\n\tassert_eq(prob_.qrylen_, nrow);\n\tassert_eq(prob_.reflen_, (TRefOff)ncol);\n\tsize_t niter = prob_.cper_->niter_;\n\tif(!upper) {\n\t\tqup = prob_.cper_->qrows_.ptr() + (ncol * (ydiv-1)) + xdiv * per;\n\t}\n\tif(!left) {\n\t\t// Set up the column pointers to point to the first __m128i word in the\n\t\t// relevant column\n\t\tsize_t off = (niter << 2) * (xdiv-1);\n\t\tqlf = prob_.cper_->qcols_.ptr() + off;\n\t}\n\tsize_t xedge = xdiv * per; // absolute offset of leftmost cell in square\n\tsize_t yedge = ydiv * per; // absolute offset of topmost cell in square\n\tsize_t xi = xedge, yi = yedge; // iterators for columns, rows\n\tsize_t ii = 0; // iterator into packed square\n\t// Iterate over rows, then over columns\n\tsize_t m128mod = yi % prob_.cper_->niter_;\n\tsize_t m128div = yi / prob_.cper_->niter_;\n\tint16_t sc_h_dg_lastrow = MIN_I16;\n\tfor(size_t i = 0; i <= ymod; i++, yi++) {\n\t\tassert_lt(yi, nrow);\n \t\txi = xedge;\n\t\t// Handling for first column is done outside the loop\n\t\tsize_t fromend = prob_.qrylen_ - yi - 1;\n\t\tbool allowGaps = fromend >= (size_t)prob_.sc_->gapbar && yi >= (size_t)prob_.sc_->gapbar;\n\t\t// Get character, quality from read\n\t\tint qc = prob_.qry_[yi], qq = prob_.qual_[yi];\n\t\tassert_geq(qq, 33);\n\t\tint16_t sc_h_lf_last = MIN_I16;\n\t\tint16_t sc_e_lf_last = MIN_I16;\n\t\tfor(size_t j = 0; j <= xmod; j++, xi++) {\n\t\t\tassert_lt(xi, ncol);\n\t\t\t// Get character from reference\n\t\t\tint rc = prob_.ref_[xi];\n\t\t\tassert_range(0, 16, rc);\n\t\t\tint16_t sc_diag = prob_.sc_->score(qc, rc, qq - 33);\n\t\t\tint16_t sc_h_up = MIN_I16, sc_f_up = MIN_I16,\n\t\t\t        sc_h_lf = MIN_I16, sc_e_lf = MIN_I16,\n\t\t\t\t\tsc_h_dg = MIN_I16;\n\t\t\tint16_t 
sc_h_up_c = MIN_I16, sc_f_up_c = MIN_I16,\n\t\t\t        sc_h_lf_c = MIN_I16, sc_e_lf_c = MIN_I16,\n\t\t\t\t\tsc_h_dg_c = MIN_I16;\n\t\t\tif(yi == 0) {\n\t\t\t\t// If I'm in the first row, set it to 0\n\t\t\t\tsc_h_dg = 0;\n\t\t\t} else if(xi == 0) {\n\t\t\t\t// In the first column, leave it at min unless in local mode\n\t\t\t\tif(local) {\n\t\t\t\t\tsc_h_dg = 0;\n\t\t\t\t}\n\t\t\t} else if(i == 0 && j == 0) {\n\t\t\t\t// Otherwise, if I'm in the upper-left square corner, I can get\n\t\t\t\t// it from the checkpoint \n\t\t\t\tsc_h_dg = qup[-1].sc[0];\n\t\t\t} else if(j == 0) {\n\t\t\t\t// Otherwise, if I'm in the leftmost cell of this row, I can\n\t\t\t\t// get it from sc_h_lf in first column of previous row\n\t\t\t\tsc_h_dg = sc_h_dg_lastrow;\n\t\t\t} else {\n\t\t\t\t// Otherwise, I can get it from qup\n\t\t\t\tsc_h_dg = qup[j-1].sc[0];\n\t\t\t}\n\t\t\tif(yi > 0 && xi > 0) DEBUG_CHECK(sc_h_dg, yi-1, xi-1, 2);\n\t\t\t\n\t\t\t// If we're in the leftmost column, calculate sc_h_lf regardless of\n\t\t\t// allowGaps.\n\t\t\tif(j == 0 && xi > 0) {\n\t\t\t\t// Get values for left neighbors from the checkpoint\n\t\t\t\tif(is8_) {\n\t\t\t\t\tsize_t vecoff = (m128mod << 6) + m128div;\n\t\t\t\t\tsc_e_lf = ((uint8_t*)(qlf + 0))[vecoff];\n\t\t\t\t\tsc_h_lf = ((uint8_t*)(qlf + 2))[vecoff];\n\t\t\t\t\tif(local) {\n\t\t\t\t\t\t// No adjustment\n\t\t\t\t\t} else {\n\t\t\t\t\t\tif(sc_h_lf == 0) sc_h_lf = MIN_I16;\n\t\t\t\t\t\telse sc_h_lf -= 0xff;\n\t\t\t\t\t\tif(sc_e_lf == 0) sc_e_lf = MIN_I16;\n\t\t\t\t\t\telse sc_e_lf -= 0xff;\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tsize_t vecoff = (m128mod << 5) + m128div;\n\t\t\t\t\tsc_e_lf = ((int16_t*)(qlf + 0))[vecoff];\n\t\t\t\t\tsc_h_lf = ((int16_t*)(qlf + 2))[vecoff];\n\t\t\t\t\tif(local) {\n\t\t\t\t\t\tsc_h_lf += 0x8000; assert_geq(sc_h_lf, 0);\n\t\t\t\t\t\tsc_e_lf += 0x8000; assert_geq(sc_e_lf, 0);\n\t\t\t\t\t} else {\n\t\t\t\t\t\tif(sc_h_lf != MIN_I16) sc_h_lf -= 0x7fff;\n\t\t\t\t\t\tif(sc_e_lf != MIN_I16) sc_e_lf -= 
0x7fff;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tDEBUG_CHECK(sc_e_lf, yi, xi-1, 0);\n\t\t\t\tDEBUG_CHECK(sc_h_lf, yi, xi-1, 2);\n\t\t\t\tsc_h_dg_lastrow = sc_h_lf;\n\t\t\t}\n\t\t\t\n\t\t\tif(allowGaps) {\n\t\t\t\tif(j == 0 /* at left edge */ && xi > 0 /* not extreme */) {\n\t\t\t\t\tsc_h_lf_c = sc_h_lf;\n\t\t\t\t\tsc_e_lf_c = sc_e_lf;\n\t\t\t\t\tif(sc_h_lf_c != MIN_I16) sc_h_lf_c -= sc_rdo;\n\t\t\t\t\tif(sc_e_lf_c != MIN_I16) sc_e_lf_c -= sc_rde;\n\t\t\t\t\tassert_leq(sc_h_lf_c, prob_.cper_->perf_);\n\t\t\t\t\tassert_leq(sc_e_lf_c, prob_.cper_->perf_);\n\t\t\t\t} else if(xi > 0) {\n\t\t\t\t\t// Get values for left neighbors from the previous iteration\n\t\t\t\t\tif(sc_h_lf_last != MIN_I16) {\n\t\t\t\t\t\tsc_h_lf = sc_h_lf_last;\n\t\t\t\t\t\tsc_h_lf_c = sc_h_lf - sc_rdo;\n\t\t\t\t\t}\n\t\t\t\t\tif(sc_e_lf_last != MIN_I16) {\n\t\t\t\t\t\tsc_e_lf = sc_e_lf_last;\n\t\t\t\t\t\tsc_e_lf_c = sc_e_lf - sc_rde;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(yi > 0 /* not extreme */) {\n\t\t\t\t\t// Get column values\n\t\t\t\t\tassert(qup != NULL);\n\t\t\t\t\tassert(local || qup[j].sc[2] < 0);\n\t\t\t\t\tif(qup[j].sc[0] > MIN_I16) {\n\t\t\t\t\t\tDEBUG_CHECK(qup[j].sc[0], yi-1, xi, 2);\n\t\t\t\t\t\tsc_h_up = qup[j].sc[0];\n\t\t\t\t\t\tsc_h_up_c = sc_h_up - sc_rfo;\n\t\t\t\t\t}\n\t\t\t\t\tif(qup[j].sc[2] > MIN_I16) {\n\t\t\t\t\t\tDEBUG_CHECK(qup[j].sc[2], yi-1, xi, 1);\n\t\t\t\t\t\tsc_f_up = qup[j].sc[2];\n\t\t\t\t\t\tsc_f_up_c = sc_f_up - sc_rfe;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(local) {\n\t\t\t\t\tsc_h_up_c = max<int16_t>(sc_h_up_c, 0);\n\t\t\t\t\tsc_f_up_c = max<int16_t>(sc_f_up_c, 0);\n\t\t\t\t\tsc_h_lf_c = max<int16_t>(sc_h_lf_c, 0);\n\t\t\t\t\tsc_e_lf_c = max<int16_t>(sc_e_lf_c, 0);\n\t\t\t\t}\n\t\t\t}\n\t\t\t\n\t\t\tif(sc_h_dg > MIN_I16) {\n\t\t\t\tsc_h_dg_c = sc_h_dg + sc_diag;\n\t\t\t}\n\t\t\tif(local) sc_h_dg_c = max<int16_t>(sc_h_dg_c, 0);\n\t\t\t\n\t\t\tint mask = 0;\n\t\t\t// Calculate best ways into H, E, F cells starting with H.\n\t\t\t// Mask bits:\n\t\t\t// H: 1=diag, 
2=hhoriz, 4=ehoriz, 8=hvert, 16=fvert\n\t\t\t// E: 32=hhoriz, 64=ehoriz\n\t\t\t// F: 128=hvert, 256=fvert\n\t\t\tint16_t sc_best = sc_h_dg_c;\n\t\t\tif(sc_h_dg_c > MIN_I64) {\n\t\t\t\tmask = 1;\n\t\t\t}\n\t\t\tif(xi > 0 && sc_h_lf_c >= sc_best && sc_h_lf_c > MIN_I64) {\n\t\t\t\tif(sc_h_lf_c > sc_best) mask = 0;\n\t\t\t\tmask |= 2;\n\t\t\t\tsc_best = sc_h_lf_c;\n\t\t\t}\n\t\t\tif(xi > 0 && sc_e_lf_c >= sc_best && sc_e_lf_c > MIN_I64) {\n\t\t\t\tif(sc_e_lf_c > sc_best) mask = 0;\n\t\t\t\tmask |= 4;\n\t\t\t\tsc_best = sc_e_lf_c;\n\t\t\t}\n\t\t\tif(yi > 0 && sc_h_up_c >= sc_best && sc_h_up_c > MIN_I64) {\n\t\t\t\tif(sc_h_up_c > sc_best) mask = 0;\n\t\t\t\tmask |= 8;\n\t\t\t\tsc_best = sc_h_up_c;\n\t\t\t}\n\t\t\tif(yi > 0 && sc_f_up_c >= sc_best && sc_f_up_c > MIN_I64) {\n\t\t\t\tif(sc_f_up_c > sc_best) mask = 0;\n\t\t\t\tmask |= 16;\n\t\t\t\tsc_best = sc_f_up_c;\n\t\t\t}\n\t\t\t// Calculate best way into E cell\n\t\t\tint16_t sc_e_best = sc_h_lf_c;\n\t\t\tif(xi > 0) {\n\t\t\t\tif(sc_h_lf_c >= sc_e_lf_c && sc_h_lf_c > MIN_I64) {\n\t\t\t\t\tif(sc_h_lf_c == sc_e_lf_c) {\n\t\t\t\t\t\tmask |= 64;\n\t\t\t\t\t}\n\t\t\t\t\tmask |= 32;\n\t\t\t\t} else if(sc_e_lf_c > MIN_I64) {\n\t\t\t\t\tsc_e_best = sc_e_lf_c;\n\t\t\t\t\tmask |= 64;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(sc_e_best > sc_best) {\n\t\t\t\tsc_best = sc_e_best;\n\t\t\t\tmask &= ~31; // don't go diagonal\n\t\t\t}\n\t\t\t// Calculate best way into F cell\n\t\t\tint16_t sc_f_best = sc_h_up_c;\n\t\t\tif(yi > 0) {\n\t\t\t\tif(sc_h_up_c >= sc_f_up_c && sc_h_up_c > MIN_I64) {\n\t\t\t\t\tif(sc_h_up_c == sc_f_up_c) {\n\t\t\t\t\t\tmask |= 256;\n\t\t\t\t\t}\n\t\t\t\t\tmask |= 128;\n\t\t\t\t} else if(sc_f_up_c > MIN_I64) {\n\t\t\t\t\tsc_f_best = sc_f_up_c;\n\t\t\t\t\tmask |= 256;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(sc_f_best > sc_best) {\n\t\t\t\tsc_best = sc_f_best;\n\t\t\t\tmask &= ~127; // don't go horizontal or diagonal\n\t\t\t}\n\t\t\t// Install results in cur\n\t\t\tassert( local || sc_best <= 0);\n\t\t\tsq_[ii+j].sc[0] = 
sc_best;\n\t\t\tassert( local || sc_e_best < 0);\n\t\t\tassert( local || sc_f_best < 0);\n\t\t\tassert(!local || sc_e_best >= 0 || sc_e_best == MIN_I16);\n\t\t\tassert(!local || sc_f_best >= 0 || sc_f_best == MIN_I16);\n\t\t\tsq_[ii+j].sc[1] = sc_e_best;\n\t\t\tsq_[ii+j].sc[2] = sc_f_best;\n\t\t\tsq_[ii+j].sc[3] = mask;\n\t\t\tDEBUG_CHECK(sq_[ii+j].sc[0], yi, xi, 2); // H\n\t\t\tDEBUG_CHECK(sq_[ii+j].sc[1], yi, xi, 0); // E\n\t\t\tDEBUG_CHECK(sq_[ii+j].sc[2], yi, xi, 1); // F\n\t\t\t// Update sc_h_lf_last, sc_e_lf_last\n\t\t\tsc_h_lf_last = sc_best;\n\t\t\tsc_e_lf_last = sc_e_best;\n\t\t}\n\t\t// Update m128mod, m128div\n\t\tm128mod++;\n\t\tif(m128mod == prob_.cper_->niter_) {\n\t\t\tm128mod = 0;\n\t\t\tm128div++;\n\t\t}\n\t\t// Advance ii to the start of the next row of sq_\n\t\tii += sq_ncol;\n\t\t// Point qup at the row of sq_ we just filled\n\t\tqup = sq_.ptr() + sq_ncol * i;\n\t}\n\tassert_eq(targ, sq_[ymod * sq_ncol + xmod].sc[hef]);\n\t//\n\t// Now backtrack through the square.  Abort as soon as we enter a cell\n\t// that was visited by a previous backtrace.\n\t//\n\tint64_t rowc = row, colc = col;\n\tsize_t curid;\n\tint hefc = hef;\n\tif(bs_.empty()) {\n\t\t// Start an initial branch\n\t\tCHECK_ROW_COL(rowc, colc);\n\t\tcurid = bs_.alloc();\n\t\tassert_eq(0, curid);\n\t\tEdit e;\n\t\tbs_[curid].init(\n\t\t\tprob_,\n\t\t\t0,      // parent ID\n\t\t\t0,      // penalty\n\t\t\t0,      // score_en\n\t\t\trowc,   // row\n\t\t\tcolc,   // col\n\t\t\te,      // edit\n\t\t\t0,      // hef\n\t\t\ttrue,   // root?\n\t\t\tfalse); // don't try to extend with exact matches\n\t\tbs_[curid].len_ = 0;\n\t} else {\n\t\tcurid = bs_.size()-1;\n\t}\n\tsize_t ymodTimesNcol = ymod * sq_ncol;\n\twhile(true) {\n\t\t// Locate the current cell within the square\n\t\tassert_eq(ymodTimesNcol, ymod * sq_ncol);\n\t\tCpQuad * cur = sq_.ptr() + ymodTimesNcol + xmod;\n\t\tint mask = cur->sc[3];\n\t\tassert_gt(mask, 0);\n\t\tint sel = -1;\n\t\t// Select what type of move to make, which depends on whether we're\n\t\t// currently in H, E, F:\n\t\tif(hefc == 0) 
{\n\t\t\tif(       (mask & 1) != 0) {\n\t\t\t\t// diagonal\n\t\t\t\tsel = 0;\n\t\t\t} else if((mask & 8) != 0) {\n\t\t\t\t// up to H\n\t\t\t\tsel = 3;\n\t\t\t} else if((mask & 16) != 0) {\n\t\t\t\t// up to F\n\t\t\t\tsel = 4;\n\t\t\t} else if((mask & 2) != 0) {\n\t\t\t\t// left to H\n\t\t\t\tsel = 1;\n\t\t\t} else if((mask & 4) != 0) {\n\t\t\t\t// left to E\n\t\t\t\tsel = 2;\n\t\t\t}\n\t\t} else if(hefc == 1) {\n\t\t\tif(       (mask & 32) != 0) {\n\t\t\t\t// left to H\n\t\t\t\tsel = 5;\n\t\t\t} else if((mask & 64) != 0) {\n\t\t\t\t// left to E\n\t\t\t\tsel = 6;\n\t\t\t}\n\t\t} else {\n\t\t\tassert_eq(2, hefc);\n\t\t\tif(       (mask & 128) != 0) {\n\t\t\t\t// up to H\n\t\t\t\tsel = 7;\n\t\t\t} else if((mask & 256) != 0) {\n\t\t\t\t// up to F\n\t\t\t\tsel = 8;\n\t\t\t}\n\t\t}\n\t\tassert_geq(sel, 0);\n\t\t// Get character from read\n\t\tint qc = prob_.qry_[rowc], qq = prob_.qual_[rowc];\n\t\t// Get character from reference\n\t\tint rc = prob_.ref_[colc];\n\t\tassert_range(0, 16, rc);\n\t\tbool xexit = false, yexit = false;\n\t\t// Now that we know what type of move to make, make it, updating our\n\t\t// row and column and updating the branch.\n\t\tif(sel == 0) {\n\t\t\tassert_geq(rowc, 0);\n\t\t\tassert_geq(colc, 0);\n\t\t\tTAlScore scd = prob_.sc_->score(qc, rc, qq - 33);\n\t\t\tif((rc & (1 << qc)) == 0) {\n\t\t\t\t// Mismatch\n\t\t\t\tsize_t id = curid;\n\t\t\t\t// Check if the previous branch was the initial (bottommost)\n\t\t\t\t// branch with no matches.  
If so, the mismatch should be added\n\t\t\t\t// to the initial branch, instead of starting a new branch.\n\t\t\t\tbool empty = (bs_[curid].len_ == 0 && curid == 0);\n\t\t\t\tif(!empty) {\n\t\t\t\t\tid = bs_.alloc();\n\t\t\t\t}\n\t\t\t\tEdit e((int)rowc, mask2dna[rc], \"ACGTN\"[qc], EDIT_TYPE_MM);\n\t\t\t\tassert_lt(scd, 0);\n\t\t\t\tTAlScore score_en = bs_[curid].score_st_ + scd;\n\t\t\t\tbs_[id].init(\n\t\t\t\t\tprob_,\n\t\t\t\t\tcurid,    // parent ID\n\t\t\t\t\t-scd,     // penalty\n\t\t\t\t\tscore_en, // score_en\n\t\t\t\t\trowc,     // row\n\t\t\t\t\tcolc,     // col\n\t\t\t\t\te,        // edit\n\t\t\t\t\thefc,     // hef\n\t\t\t\t\tempty,    // root?\n\t\t\t\t\tfalse);   // don't try to extend with exact matches\n\t\t\t\tcurid = id;\n\t\t\t\t//assert(!local || bs_[curid].score_st_ >= 0);\n\t\t\t} else {\n\t\t\t\t// Match\n\t\t\t\tbs_[curid].score_st_ += prob_.sc_->match();\n\t\t\t\tbs_[curid].len_++;\n\t\t\t\tassert_leq((int64_t)bs_[curid].len_, bs_[curid].row_ + 1);\n\t\t\t}\n\t\t\tif(xmod == 0) xexit = true;\n\t\t\tif(ymod == 0) yexit = true;\n\t\t\trowc--; ymod--; ymodTimesNcol -= sq_ncol;\n\t\t\tcolc--; xmod--;\n\t\t\tassert(local || bs_[curid].score_st_ >= targ_final);\n\t\t\thefc = 0;\n\t\t} else if((sel >= 1 && sel <= 2) || (sel >= 5 && sel <= 6)) {\n\t\t\tassert_gt(colc, 0);\n\t\t\t// Read gap\n\t\t\tsize_t id = bs_.alloc();\n\t\t\tEdit e((int)rowc+1, mask2dna[rc], '-', EDIT_TYPE_READ_GAP);\n\t\t\tTAlScore gapp = prob_.sc_->readGapOpen();\n\t\t\tif(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isReadGap()) {\n\t\t\t\tgapp = prob_.sc_->readGapExtend();\n\t\t\t}\n\t\t\t//assert(!local || bs_[curid].score_st_ >= gapp);\n\t\t\tTAlScore score_en = bs_[curid].score_st_ - gapp;\n\t\t\tbs_[id].init(\n\t\t\t\tprob_,\n\t\t\t\tcurid,    // parent ID\n\t\t\t\tgapp,     // penalty\n\t\t\t\tscore_en, // score_en\n\t\t\t\trowc,     // row\n\t\t\t\tcolc-1,   // col\n\t\t\t\te,        // edit\n\t\t\t\thefc,     // hef\n\t\t\t\tfalse,    // 
root?\n\t\t\t\tfalse);   // don't try to extend with exact matches\n\t\t\tif(xmod == 0) xexit = true;\n\t\t\tcolc--; xmod--;\n\t\t\tcurid = id;\n\t\t\tassert( local || bs_[curid].score_st_ >= targ_final);\n\t\t\t//assert(!local || bs_[curid].score_st_ >= 0);\n\t\t\tif(sel == 1 || sel == 5) {\n\t\t\t\thefc = 0;\n\t\t\t} else {\n\t\t\t\thefc = 1;\n\t\t\t}\n\t\t} else {\n\t\t\tassert_gt(rowc, 0);\n\t\t\t// Reference gap\n\t\t\tsize_t id = bs_.alloc();\n\t\t\tEdit e((int)rowc, '-', \"ACGTN\"[qc], EDIT_TYPE_REF_GAP);\n\t\t\tTAlScore gapp = prob_.sc_->refGapOpen();\n\t\t\tif(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isRefGap()) {\n\t\t\t\tgapp = prob_.sc_->refGapExtend();\n\t\t\t}\n\t\t\t//assert(!local || bs_[curid].score_st_ >= gapp);\n\t\t\tTAlScore score_en = bs_[curid].score_st_ - gapp;\n\t\t\tbs_[id].init(\n\t\t\t\tprob_,\n\t\t\t\tcurid,    // parent ID\n\t\t\t\tgapp,     // penalty\n\t\t\t\tscore_en, // score_en\n\t\t\t\trowc-1,   // row\n\t\t\t\tcolc,     // col\n\t\t\t\te,        // edit\n\t\t\t\thefc,     // hef\n\t\t\t\tfalse,    // root?\n\t\t\t\tfalse);   // don't try to extend with exact matches\n\t\t\tif(ymod == 0) yexit = true;\n\t\t\trowc--; ymod--; ymodTimesNcol -= sq_ncol;\n\t\t\tcurid = id;\n\t\t\tassert( local || bs_[curid].score_st_ >= targ_final);\n\t\t\t//assert(!local || bs_[curid].score_st_ >= 0);\n\t\t\tif(sel == 3 || sel == 7) {\n\t\t\t\thefc = 0;\n\t\t\t} else {\n\t\t\t\thefc = 2;\n\t\t\t}\n\t\t}\n\t\tCHECK_ROW_COL(rowc, colc);\n\t\tCpQuad * cur_new = NULL;\n\t\tif(!xexit && !yexit) {\n\t\t\tcur_new = sq_.ptr() + ymodTimesNcol + xmod;\n\t\t}\n\t\t// Check whether we made it to the top row or to a cell with score 0\n\t\tif(colc < 0 || rowc < 0 ||\n\t\t   (cur_new != NULL && local && cur_new->sc[0] == 0))\n\t\t{\n\t\t\tdone = true;\n\t\t\tassert(bs_[curid].isSolution(prob_));\n\t\t\taddSolution(curid);\n#ifndef NDEBUG\n\t\t\t// A check to see if any two adjacent branches in the backtrace\n\t\t\t// overlap.  
If they do, the whole alignment will be filtered out\n\t\t\t// in trySolution(...)\n\t\t\tsize_t cur = curid;\n\t\t\tif(!bs_[cur].root_) {\n\t\t\t\tsize_t next = bs_[cur].parentId_;\n\t\t\t\twhile(!bs_[next].root_) {\n\t\t\t\t\tassert_neq(cur, next);\n\t\t\t\t\tif(bs_[next].len_ != 0 || bs_[cur].len_ == 0) {\n\t\t\t\t\t\tassert(!bs_[cur].overlap(prob_, bs_[next]));\n\t\t\t\t\t}\n\t\t\t\t\tcur = next;\n\t\t\t\t\tnext = bs_[cur].parentId_;\n\t\t\t\t}\n\t\t\t}\n#endif\n\t\t\treturn;\n\t\t}\n\t\tassert(!xexit || hefc == 0 || hefc == 1);\n\t\tassert(!yexit || hefc == 0 || hefc == 2);\n\t\tif(xexit || yexit) {\n\t\t\t//assert(rowc < 0 || colc < 0 || prob_.cper_->isCheckpointed(rowc, colc));\n\t\t\trow_new = rowc; col_new = colc;\n\t\t\thef_new = hefc;\n\t\t\tdone = false;\n\t\t\tif(rowc < 0 || colc < 0) {\n\t\t\t\tassert(local);\n\t\t\t\ttarg_new = 0;\n\t\t\t} else {\n\t\t\t\t// TODO: Don't use scoreSquare\n\t\t\t\ttarg_new = prob_.cper_->scoreSquare(rowc, colc, hefc);\n\t\t\t\tassert(local || targ_new >= targ);\n\t\t\t\tassert(local || targ_new >= targ_final);\n\t\t\t}\n\t\t\tif(local && targ_new == 0) {\n\t\t\t\tassert_eq(0, hefc);\n\t\t\t\tdone = true;\n\t\t\t\tassert(bs_[curid].isSolution(prob_));\n\t\t\t\taddSolution(curid);\n\t\t\t}\n\t\t\tassert((row_new >= 0 && col_new >= 0) || done);\n\t\t\treturn;\n\t\t}\n\t}\n\tassert(false);\n}\n\n/**\n * Caller gives us score_en, row and col.  We figure out score_st and len_\n * by comparing characters from the strings.\n *\n * If this branch comes after a mismatch, (row, col) describe the cell that the\n * mismatch occurs in.  len_ is initially set to 1, and the next cell we test\n * is the next cell up and to the left (row-1, col-1).\n *\n * If this branch comes after a read gap, (row, col) describe the leftmost cell\n * involved in the gap.  
len_ is initially set to 0, and the next cell we test\n * is the current cell (row, col).\n *\n * If this branch comes after a reference gap, (row, col) describe the upper\n * cell involved in the gap.  len_ is initially set to 0, and the next cell we\n * test is the current cell (row, col).\n */\nvoid BtBranch::init(\n\tconst BtBranchProblem& prob,\n\tsize_t parentId,\n\tTAlScore penalty,\n\tTAlScore score_en,\n\tint64_t row,\n\tint64_t col,\n\tEdit e,\n\tint hef,\n\tbool root,\n\tbool extend)\n{\n\tscore_en_ = score_en;\n\tpenalty_ = penalty;\n\tscore_st_ = score_en_;\n\trow_ = row;\n\tcol_ = col;\n\tparentId_ = parentId;\n\te_ = e;\n\troot_ = root;\n\tassert(!root_ || parentId == 0);\n\tassert_lt(row, (int64_t)prob.qrylen_);\n\tassert_lt(col, (int64_t)prob.reflen_);\n\t// First match to check is diagonally above and to the left of the cell\n\t// where the edit occurs\n\tint64_t rowc = row;\n\tint64_t colc = col;\n\tlen_ = 0;\n\tif(e.inited() && e.isMismatch()) {\n\t\trowc--; colc--;\n\t\tlen_ = 1;\n\t}\n\tint64_t match = prob.sc_->match();\n\tbool cp = prob.usecp_;\n\tsize_t iters = 0;\n\tcurtailed_ = false;\n\tif(extend) {\n\t\twhile(rowc >= 0 && colc >= 0) {\n\t\t\tint rfm = prob.ref_[colc];\n\t\t\tassert_range(0, 16, rfm);\n\t\t\tint rdc = prob.qry_[rowc];\n\t\t\tbool matches = (rfm & (1 << rdc)) != 0;\n\t\t\tif(!matches) {\n\t\t\t\t// What's the mismatch penalty?\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Get score from checkpointer\n\t\t\tscore_st_ += match;\n\t\t\tif(cp && rowc - 1 >= 0 && colc - 1 >= 0 &&\n\t\t\t   prob.cper_->isCheckpointed(rowc - 1, colc - 1))\n\t\t\t{\n\t\t\t\t// Possibly prune\n\t\t\t\tint16_t cpsc;\n\t\t\t\tcpsc = prob.cper_->scoreTriangle(rowc - 1, colc - 1, hef);\n\t\t\t\tif(cpsc + score_st_ < prob.targ_) {\n\t\t\t\t\tcurtailed_ = true;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t\titers++;\n\t\t\trowc--; colc--;\n\t\t}\n\t}\n\tassert_geq(rowc, -1);\n\tassert_geq(colc, -1);\n\tlen_ = (int64_t)row - rowc;\n\tassert_leq((int64_t)len_, 
row_+1);\n\tassert_leq((int64_t)len_, col_+1);\n\tassert_leq((int64_t)score_st_, (int64_t)prob.qrylen_ * match);\n}\n\n/**\n * Given a potential branch to add to the queue, see if we can follow the\n * branch a little further first.  If it's still valid, or if we reach a\n * choice between valid outgoing paths, go ahead and add it to the queue.\n */\nvoid BtBranchTracer::examineBranch(\n\tint64_t row,\n\tint64_t col,\n\tconst Edit& e,\n\tTAlScore pen,  // penalty associated with edit\n\tTAlScore sc,\n\tsize_t parentId)\n{\n\tsize_t id = bs_.alloc();\n\tbs_[id].init(prob_, parentId, pen, sc, row, col, e, 0, false, true);\n\tif(bs_[id].isSolution(prob_)) {\n\t\tassert(bs_[id].isValid(prob_));\n\t\taddSolution(id);\n\t} else {\n\t\t// Check if this branch is legit\n\t\tif(bs_[id].isValid(prob_)) {\n\t\t\tadd(id);\n\t\t} else {\n\t\t\tbs_.pop();\n\t\t}\n\t}\n}\n\n/**\n * Take all possible ways of leaving the given branch and add them to the\n * branch queue.\n */\nvoid BtBranchTracer::addOffshoots(size_t bid) {\n\tBtBranch& b = bs_[bid];\n\tTAlScore sc = b.score_en_;\n\tint64_t match = prob_.sc_->match();\n\tint64_t scoreFloor = prob_.sc_->monotone ? 
MIN_I64 : 0;\n\tbool cp = prob_.usecp_; // Are there any checkpoints?\n\tASSERT_ONLY(TAlScore perfectScore = prob_.sc_->perfectScore(prob_.qrylen_));\n\tassert_leq(prob_.targ_, perfectScore);\n\t// For each cell in the branch\n\tfor(size_t i = 0 ; i < b.len_; i++) {\n\t\tassert_leq((int64_t)i, b.row_+1);\n\t\tassert_leq((int64_t)i, b.col_+1);\n\t\tint64_t row = b.row_ - i, col = b.col_ - i;\n\t\tint64_t bonusLeft = (row + 1) * match;\n\t\tint64_t fromend = prob_.qrylen_ - row - 1;\n\t\tbool allowGaps = fromend >= prob_.sc_->gapbar && row >= prob_.sc_->gapbar;\n\t\tif(allowGaps && row >= 0 && col >= 0) {\n\t\t\tif(col > 0) {\n\t\t\t\t// Try a read gap - it's either an extension or an open\n\t\t\t\tbool extend = b.e_.inited() && b.e_.isReadGap() && i == 0;\n\t\t\t\tTAlScore rdgapPen = extend ?\n\t\t\t\t\tprob_.sc_->readGapExtend() : prob_.sc_->readGapOpen();\n\t\t\t\tbool prune = false;\n\t\t\t\tassert_gt(rdgapPen, 0);\n\t\t\t\tif(cp && prob_.cper_->isCheckpointed(row, col - 1)) {\n\t\t\t\t\t// Possibly prune\n\t\t\t\t\tint16_t cpsc = (int16_t)prob_.cper_->scoreTriangle(row, col - 1, 0);\n\t\t\t\t\tassert_leq(cpsc, perfectScore);\n\t\t\t\t\tassert_geq(prob_.sc_->readGapOpen(), prob_.sc_->readGapExtend());\n\t\t\t\t\tTAlScore bonus = prob_.sc_->readGapOpen() - prob_.sc_->readGapExtend();\n\t\t\t\t\tassert_geq(bonus, 0);\n\t\t\t\t\tif(cpsc + bonus + sc - rdgapPen < prob_.targ_) {\n\t\t\t\t\t\tprune = true;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(prune) {\n\t\t\t\t\tif(extend) { nrdexPrune_++; } else { nrdopPrune_++; }\n\t\t\t\t} else if(sc - rdgapPen >= scoreFloor && sc - rdgapPen + bonusLeft >= prob_.targ_) {\n\t\t\t\t\t// Yes, we can introduce a read gap here\n\t\t\t\t\tEdit e((int)row + 1, mask2dna[(int)prob_.ref_[col]], '-', EDIT_TYPE_READ_GAP);\n\t\t\t\t\tassert(e.isReadGap());\n\t\t\t\t\texamineBranch(row, col - 1, e, rdgapPen, sc - rdgapPen, bid);\n\t\t\t\t\tif(extend) { nrdex_++; } else { nrdop_++; }\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(row > 0) {\n\t\t\t\t// Try a 
reference gap - it's either an extension or an open\n\t\t\t\tbool extend = b.e_.inited() && b.e_.isRefGap() && i == 0;\n\t\t\t\tTAlScore rfgapPen = (b.e_.inited() && b.e_.isRefGap()) ?\n\t\t\t\t\tprob_.sc_->refGapExtend() : prob_.sc_->refGapOpen();\n\t\t\t\tbool prune = false;\n\t\t\t\tassert_gt(rfgapPen, 0);\n\t\t\t\tif(cp && prob_.cper_->isCheckpointed(row - 1, col)) {\n\t\t\t\t\t// Possibly prune\n\t\t\t\t\tint16_t cpsc = (int16_t)prob_.cper_->scoreTriangle(row - 1, col, 0);\n\t\t\t\t\tassert_leq(cpsc, perfectScore);\n\t\t\t\t\tassert_geq(prob_.sc_->refGapOpen(), prob_.sc_->refGapExtend());\n\t\t\t\t\tTAlScore bonus = prob_.sc_->refGapOpen() - prob_.sc_->refGapExtend();\n\t\t\t\t\tassert_geq(bonus, 0);\n\t\t\t\t\tif(cpsc + bonus + sc - rfgapPen < prob_.targ_) {\n\t\t\t\t\t\tprune = true;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(prune) {\n\t\t\t\t\tif(extend) { nrfexPrune_++; } else { nrfopPrune_++; }\n\t\t\t\t} else if(sc - rfgapPen >= scoreFloor && sc - rfgapPen + bonusLeft >= prob_.targ_) {\n\t\t\t\t\t// Yes, we can introduce a ref gap here\n\t\t\t\t\tEdit e((int)row, '-', \"ACGTN\"[(int)prob_.qry_[row]], EDIT_TYPE_REF_GAP);\n\t\t\t\t\tassert(e.isRefGap());\n\t\t\t\t\texamineBranch(row - 1, col, e, rfgapPen, sc - rfgapPen, bid);\n\t\t\t\t\tif(extend) { nrfex_++; } else { nrfop_++; }\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\t// If we're at the top of the branch but not yet at the top of\n\t\t// the DP table, a mismatch branch is also possible.\n\t\tif(i == b.len_ && !b.curtailed_ && row >= 0 && col >= 0) {\n\t\t\tint rfm = prob_.ref_[col];\n\t\t\tassert_lt(row, (int64_t)prob_.qrylen_);\n\t\t\tint rdc = prob_.qry_[row];\n\t\t\tint rdq = prob_.qual_[row];\n\t\t\tint scdiff = prob_.sc_->score(rdc, rfm, rdq - 33);\n\t\t\tassert_lt(scdiff, 0); // at end of branch, so can't match\n\t\t\tbool prune = false;\n\t\t\tif(cp && row > 0 && col > 0 && prob_.cper_->isCheckpointed(row - 1, col - 1)) {\n\t\t\t\t// Possibly prune\n\t\t\t\tint16_t cpsc = prob_.cper_->scoreTriangle(row - 1, col 
- 1, 0);\n\t\t\t\tassert_leq(cpsc, perfectScore);\n\t\t\t\tassert_leq(cpsc + scdiff + sc, perfectScore);\n\t\t\t\tif(cpsc + scdiff + sc < prob_.targ_) {\n\t\t\t\t\tprune = true;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(prune) {\n\t\t\t\tnmm_++;\n\t\t\t} else  {\n\t\t\t\t// Yes, we can introduce a mismatch here\n\t\t\t\tif(sc + scdiff >= scoreFloor && sc + scdiff + bonusLeft >= prob_.targ_) {\n\t\t\t\t\tEdit e((int)row, mask2dna[rfm], \"ACGTN\"[rdc], EDIT_TYPE_MM);\n\t\t\t\t\tbool nmm = (mask2dna[rfm] == 'N' || rdc > 4);\n\t\t\t\t\tassert_neq(e.chr, e.qchr);\n\t\t\t\t\tassert_lt(scdiff, 0);\n\t\t\t\t\texamineBranch(row - 1, col - 1, e, -scdiff, sc + scdiff, bid);\n\t\t\t\t\tif(nmm) { nnmm_++; } else { nmm_++; }\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tsc += match;\n\t}\n}\n\n/**\n * Sort unsorted branches, merge them with master sorted list.\n */\nvoid BtBranchTracer::flushUnsorted() {\n\tif(unsorted_.empty()) {\n\t\treturn;\n\t}\n\tunsorted_.sort();\n\tunsorted_.reverse();\n#ifndef NDEBUG\n\tfor(size_t i = 1; i < unsorted_.size(); i++) {\n\t\tassert_leq(bs_[unsorted_[i].second].score_st_, bs_[unsorted_[i-1].second].score_st_);\n\t}\n#endif\n\tEList<size_t> *src2 = sortedSel_ ? &sorted1_ : &sorted2_;\n\tEList<size_t> *dest = sortedSel_ ? 
&sorted2_ : &sorted1_;\n\t// Merge src1 and src2 into dest\n\tdest->clear();\n\tsize_t cur1 = 0, cur2 = cur_;\n\twhile(cur1 < unsorted_.size() || cur2 < src2->size()) {\n\t\t// Take from 1 or 2 next?\n\t\tbool take1 = true;\n\t\tif(cur1 == unsorted_.size()) {\n\t\t\ttake1 = false;\n\t\t} else if(cur2 == src2->size()) {\n\t\t\ttake1 = true;\n\t\t} else {\n\t\t\tassert_neq(unsorted_[cur1].second, (*src2)[cur2]);\n\t\t\ttake1 = bs_[unsorted_[cur1].second] < bs_[(*src2)[cur2]];\n\t\t}\n\t\tif(take1) {\n\t\t\tdest->push_back(unsorted_[cur1++].second); // Take from list 1\n\t\t} else {\n\t\t\tdest->push_back((*src2)[cur2++]); // Take from list 2\n\t\t}\n\t}\n\tassert_eq(cur1, unsorted_.size());\n\tassert_eq(cur2, src2->size());\n\tsortedSel_ = !sortedSel_;\n\tcur_ = 0;\n\tunsorted_.clear();\n}\n\n/**\n * Try all the solutions accumulated so far.  Solutions might be rejected\n * if they, for instance, overlap a previous solution, have too many Ns,\n * fail to overlap a core diagonal, etc.\n */\nbool BtBranchTracer::trySolutions(\n\tbool lookForOlap,\n\tSwResult& res,\n\tsize_t& off,\n\tsize_t& nrej,\n\tRandomSource& rnd,\n\tbool& success)\n{\n\tif(solutions_.size() > 0) {\n\t\tfor(size_t i = 0; i < solutions_.size(); i++) {\n\t\t\tint ret = trySolution(solutions_[i], lookForOlap, res, off, nrej, rnd);\n\t\t\tif(ret == BT_FOUND) {\n\t\t\t\tsuccess = true;\n\t\t\t\treturn true; // there were solutions and one was good\n\t\t\t}\n\t\t}\n\t\tsolutions_.clear();\n\t\tsuccess = false;\n\t\treturn true; // there were solutions but none were good\n\t}\n\treturn false; // there were no solutions to check\n}\n\n/**\n * Given the id of a branch that completes a successful backtrace, turn the\n * chain of branches into \n */\nint BtBranchTracer::trySolution(\n\tsize_t id,\n\tbool lookForOlap,\n\tSwResult& res,\n\tsize_t& off,\n\tsize_t& nrej,\n\tRandomSource& rnd)\n{\n#if 0\n\tAlnScore score;\n\tBtBranch *br = &bs_[id];\n\t// 'br' corresponds to the leftmost edit in a 
right-to-left\n\t// chain of edits.  \n\tEList<Edit>& ned = res.alres.ned();\n\tconst BtBranch *cur = br, *prev = NULL;\n\tsize_t ns = 0, nrefns = 0;\n\tsize_t ngap = 0;\n\twhile(true) {\n\t\tif(cur->e_.inited()) {\n\t\t\tif(cur->e_.isMismatch()) {\n\t\t\t\tif(cur->e_.qchr == 'N' || cur->e_.chr == 'N') {\n\t\t\t\t\tns++;\n\t\t\t\t}\n\t\t\t} else if(cur->e_.isGap()) {\n\t\t\t\tngap++;\n\t\t\t}\n\t\t\tif(cur->e_.chr == 'N') {\n\t\t\t\tnrefns++;\n\t\t\t}\n\t\t\tned.push_back(cur->e_);\n\t\t}\n\t\tif(cur->root_) {\n\t\t\tbreak;\n\t\t}\n\t\tcur = &bs_[cur->parentId_];\n\t}\n\tif(ns > prob_.nceil_) {\n\t\t// Alignment has too many Ns in it!\n\t\tres.reset();\n\t\tassert(res.alres.ned().empty());\n\t\tnrej++;\n\t\treturn BT_REJECTED_N;\n\t}\n\t// Update 'seenPaths_'\n\tcur = br;\n\tbool rejSeen = false; // set =true if we overlap prev path\n\tbool rejCore = true; // set =true if we don't touch core diag\n\twhile(true) {\n\t\t// Consider row, col, len, then do something\n\t\tint64_t row = cur->row_, col = cur->col_;\n\t\tassert_lt(row, (int64_t)prob_.qrylen_);\n\t\tsize_t fromend = prob_.qrylen_ - row - 1;\n\t\tsize_t diag = fromend + col;\n\t\t// Calculate the diagonal within the *trimmed* rectangle,\n\t\t// i.e. 
the rectangle we dealt with in align, gather and\n\t\t// backtrack.\n\t\tint64_t diagi = col - row;\n\t\t// Now adjust to the diagonal within the *untrimmed*\n\t\t// rectangle by adding on the amount trimmed from the left.\n\t\tdiagi += prob_.rect_->triml;\n\t\tassert_lt(diag, seenPaths_.size());\n\t\t// Does it overlap a core diagonal?\n\t\tif(diagi >= 0) {\n\t\t\tsize_t diag = (size_t)diagi;\n\t\t\tif(diag >= prob_.rect_->corel &&\n\t\t\t   diag <= prob_.rect_->corer)\n\t\t\t{\n\t\t\t\t// Yes it does - it's OK\n\t\t\t\trejCore = false;\n\t\t\t}\n\t\t}\n\t\tif(lookForOlap) {\n\t\t\tint64_t newlo, newhi;\n\t\t\tif(cur->len_ == 0) {\n\t\t\t\tif(prev != NULL && prev->len_ > 0) {\n\t\t\t\t\t// If there's a gap at the base of a non-0 length branch, the\n\t\t\t\t\t// gap will appear to overlap the branch if we give it length 1.\n\t\t\t\t\tnewhi = newlo = 0;\n\t\t\t\t} else {\n\t\t\t\t\t// Read or ref gap with no matches coming off of it\n\t\t\t\t\tnewlo = row;\n\t\t\t\t\tnewhi = row + 1;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// Diagonal with matches\n\t\t\t\tnewlo = row - (cur->len_ - 1);\n\t\t\t\tnewhi = row + 1;\n\t\t\t}\n\t\t\tassert_geq(newlo, 0);\n\t\t\tassert_geq(newhi, 0);\n\t\t\t// Does the diagonal cover cells?\n\t\t\tif(newhi > newlo) {\n\t\t\t\t// Check whether there is any overlap with previously traversed\n\t\t\t\t// cells\n\t\t\t\tbool added = false;\n\t\t\t\tconst size_t sz = seenPaths_[diag].size();\n\t\t\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\t\t\t// Does the new interval overlap this already-seen\n\t\t\t\t\t// interval?  Also of interest: does it abut this\n\t\t\t\t\t// already-seen interval?  
If so, we should merge them.\n\t\t\t\t\tsize_t lo = seenPaths_[diag][i].first;\n\t\t\t\t\tsize_t hi = seenPaths_[diag][i].second;\n\t\t\t\t\tassert_lt(lo, hi);\n\t\t\t\t\tsize_t lo_sm = newlo, hi_sm = newhi;\n\t\t\t\t\tif(hi - lo < hi_sm - lo_sm) {\n\t\t\t\t\t\tswap(lo, lo_sm);\n\t\t\t\t\t\tswap(hi, hi_sm);\n\t\t\t\t\t}\n\t\t\t\t\tif((lo <= lo_sm && hi > lo_sm) ||\n\t\t\t\t\t   (lo <  hi_sm && hi >= hi_sm))\n\t\t\t\t\t{\n\t\t\t\t\t\t// One or both of the shorter interval's end points\n\t\t\t\t\t\t// are contained in the longer interval - so they\n\t\t\t\t\t\t// overlap.\n\t\t\t\t\t\trejSeen = true;\n\t\t\t\t\t\t// Merge them into one longer interval\n\t\t\t\t\t\tseenPaths_[diag][i].first = min(lo, lo_sm);\n\t\t\t\t\t\tseenPaths_[diag][i].second = max(hi, hi_sm);\n#ifndef NDEBUG\n\t\t\t\t\t\tfor(int64_t ii = seenPaths_[diag][i].first;\n\t\t\t\t\t\t\tii < (int64_t)seenPaths_[diag][i].second;\n\t\t\t\t\t\t\tii++)\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t//cerr << \"trySolution rejected (\" << ii << \", \" << (ii + col - row) << \")\" << endl;\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tadded = true;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t} else if(hi == lo_sm || lo == hi_sm) {\n\t\t\t\t\t\t// Merge them into one longer interval\n\t\t\t\t\t\tseenPaths_[diag][i].first = min(lo, lo_sm);\n\t\t\t\t\t\tseenPaths_[diag][i].second = max(hi, hi_sm);\n#ifndef NDEBUG\n\t\t\t\t\t\tfor(int64_t ii = seenPaths_[diag][i].first;\n\t\t\t\t\t\t\tii < (int64_t)seenPaths_[diag][i].second;\n\t\t\t\t\t\t\tii++)\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t//cerr << \"trySolution rejected (\" << ii << \", \" << (ii + col - row) << \")\" << endl;\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tadded = true;\n\t\t\t\t\t\t// Keep going in case it overlaps one of the other\n\t\t\t\t\t\t// intervals\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(!added) {\n\t\t\t\t\tseenPaths_[diag].push_back(make_pair(newlo, newhi));\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\t// After the merging that may have occurred above, it's no\n\t\t// longer guaranteed that all the overlapping 
intervals in\n\t\t// the list have been merged.  That's OK though.  We'll\n\t\t// still get correct answers to overlap queries.\n\t\tif(cur->root_) {\n\t\t\tassert_eq(0, cur->parentId_);\n\t\t\tbreak;\n\t\t}\n\t\tprev = cur;\n\t\tcur = &bs_[cur->parentId_];\n\t} // while(cur->e_.inited())\n\tif(rejSeen) {\n\t\tres.reset();\n\t\tassert(res.alres.ned().empty());\n\t\tnrej++;\n\t\treturn BT_NOT_FOUND;\n\t}\n\tif(rejCore) {\n\t\tres.reset();\n\t\tassert(res.alres.ned().empty());\n\t\tnrej++;\n\t\treturn BT_REJECTED_CORE_DIAG;\n\t}\n\toff = br->leftmostCol();\n\tscore.score_ = prob_.targ_;\n\tscore.ns_    = ns;\n\tscore.gaps_  = ngap;\n\tres.alres.setScore(score);\n\tres.alres.setRefNs(nrefns);\n\tsize_t trimBeg = br->uppermostRow();\n\tsize_t trimEnd = prob_.qrylen_ - prob_.row_ - 1;\n\tassert_leq(trimBeg, prob_.qrylen_);\n\tassert_leq(trimEnd, prob_.qrylen_);\n\tTRefOff refoff = off + prob_.refoff_ + prob_.rect_->refl;\n\tres.alres.setShape(\n\t\tprob_.refid_,                   // ref id\n\t\trefoff,                         // 0-based ref offset\n\t\tprob_.treflen(),                // ref length\n\t\tprob_.fw_,                      // aligned to Watson?\n\t\tprob_.qrylen_,                  // read length\n\t\ttrue,                           // pretrim soft?\n\t\t0,                              // pretrim 5' end\n\t\t0,                              // pretrim 3' end\n\t\ttrue,                           // alignment trim soft?\n\t\tprob_.fw_ ? trimBeg : trimEnd,  // alignment trim 5' end\n\t\tprob_.fw_ ? trimEnd : trimBeg); // alignment trim 3' end\n#endif\n\treturn BT_FOUND;\n}\n\n/**\n * Get the next valid alignment given a backtrace problem.  Return false\n * if there is no valid solution.  Use a backtracking search to find the\n * solution.  
This can be very slow.\n */\nbool BtBranchTracer::nextAlignmentBacktrace(\n\tsize_t maxiter,\n\tSwResult& res,\n\tsize_t& off,\n\tsize_t& nrej,\n\tsize_t& niter,\n\tRandomSource& rnd)\n{\n\tassert(!empty() || !emptySolution());\n\tassert(prob_.inited());\n\t// There's a subtle case where we might fail to backtrace in\n\t// local-alignment mode.  The basic fact to remember is that when we're\n\t// backtracing from the highest-scoring cell in the table, we're guaranteed\n\t// to be able to backtrace without ever dipping below 0.  But if we're\n\t// backtracing from a cell other than the highest-scoring cell in the\n\t// table, we might dip below 0.  Dipping below 0 implies that there's a\n\t// shorter local alignment with a better score.  In which case, it's\n\t// perfectly fair for us to abandon any path that dips below the floor, and\n\t// this might result in the queue becoming empty before we finish.\n\tbool result = false;\n\tniter = 0;\n\twhile(!empty()) {\n\t\tif(trySolutions(true, res, off, nrej, rnd, result)) {\n\t\t\treturn result;\n\t\t}\n\t\tif(niter++ >= maxiter) {\n\t\t\tbreak;\n\t\t}\n\t\tsize_t brid = best(rnd); // put best branch in 'brid'\n\t\tassert(!seen_.contains(brid));\n\t\tASSERT_ONLY(seen_.insert(brid));\n#if 0\n\t\tBtBranch *br = &bs_[brid];\n\t\tcerr << brid\n\t\t     << \": targ:\" << prob_.targ_\n\t\t     << \", sc:\" << br->score_st_\n\t\t     << \", row:\" << br->uppermostRow()\n\t\t\t << \", nmm:\" << nmm_\n\t\t\t << \", nnmm:\" << nnmm_\n\t\t\t << \", nrdop:\" << nrdop_\n\t\t\t << \", nrfop:\" << nrfop_\n\t\t\t << \", nrdex:\" << nrdex_\n\t\t\t << \", nrfex:\" << nrfex_\n\t\t\t << \", nrdop_pr: \" << nrdopPrune_\n\t\t\t << \", nrfop_pr: \" << nrfopPrune_\n\t\t\t << \", nrdex_pr: \" << nrdexPrune_\n\t\t\t << \", nrfex_pr: \" << nrfexPrune_\n\t\t\t << endl;\n#endif\n\t\taddOffshoots(brid);\n\t}\n\tif(trySolutions(true, res, off, nrej, rnd, result)) {\n\t\treturn result;\n\t}\n\treturn false;\n}\n\n/**\n * Get the next valid alignment 
given a backtrace problem.  Return false\n * if there is no valid solution.  Use a triangle-fill backtrace to find\n * the solution.  This is usually fast (it's O(m + n)).\n */\nbool BtBranchTracer::nextAlignmentFill(\n\tsize_t maxiter,\n\tSwResult& res,\n\tsize_t& off,\n\tsize_t& nrej,\n\tsize_t& niter,\n\tRandomSource& rnd)\n{\n\tassert(prob_.inited());\n\tassert(!emptySolution());\n\tbool result = false;\n\tif(trySolutions(false, res, off, nrej, rnd, result)) {\n\t\treturn result;\n\t}\n\treturn false;\n}\n\n/**\n * Get the next valid alignment given the backtrace problem.  Return false\n * if there is no valid solution, e.g., if \n */\nbool BtBranchTracer::nextAlignment(\n\tsize_t maxiter,\n\tSwResult& res,\n\tsize_t& off,\n\tsize_t& nrej,\n\tsize_t& niter,\n\tRandomSource& rnd)\n{\n\tif(prob_.fill_) {\n\t\treturn nextAlignmentFill(\n\t\t\tmaxiter,\n\t\t\tres,\n\t\t\toff,\n\t\t\tnrej,\n\t\t\tniter,\n\t\t\trnd);\n\t} else {\n\t\treturn nextAlignmentBacktrace(\n\t\t\tmaxiter,\n\t\t\tres,\n\t\t\toff,\n\t\t\tnrej,\n\t\t\tniter,\n\t\t\trnd);\n\t}\n}\n\n#ifdef MAIN_ALIGNER_BT\n\n#include <iostream>\n\nint main(int argc, char **argv) {\n\tsize_t off = 0;\n\tRandomSource rnd(77);\n\tBtBranchTracer tr;\n\tScoring sc = Scoring::base1();\n\tSwResult res;\n\ttr.init(\n\t\t\"ACGTACGT\", // in: read sequence\n\t\t\"IIIIIIII\", // in: quality sequence\n\t\t8,          // in: read sequence length\n\t\t\"ACGTACGT\", // in: reference sequence\n\t\t8,          // in: reference sequence length\n\t\t0,          // in: reference id\n\t\t0,          // in: reference offset\n\t\ttrue,       // in: orientation\n\t\tsc,         // in: scoring scheme\n\t\t0,          // in: N ceiling\n\t\t8,          // in: alignment score\n\t\t7,          // start in this row\n\t\t7,          // start in this column\n\t\trnd);       // random gen, to choose among equal paths\n\tsize_t nrej = 0;\n\ttr.nextAlignment(\n\t\tres,\n\t\toff,\n\t\tnrej,\n\t\trnd);\n}\n\n#endif /*def MAIN_ALIGNER_BT*/\n"
  },
  {
    "path": "aligner_bt.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_BT_H_\n#define ALIGNER_BT_H_\n\n#include <utility>\n#include <stdint.h>\n#include \"aligner_sw_common.h\"\n#include \"aligner_result.h\"\n#include \"scoring.h\"\n#include \"edit.h\"\n#include \"limit.h\"\n#include \"dp_framer.h\"\n#include \"sse_util.h\"\n\n/* Say we've filled in a DP matrix in a cost-only manner, not saving the scores\n * for each of the cells.  At the end, we obtain a list of candidate cells and\n * we'd like to backtrace from them.  The per-cell scores are gone, but we have\n * to re-create the correct path somehow.  Hopefully we can do this without\n * recreating most or al of the score matrix, since this takes too much memory.\n *\n * Approach 1: Naively refill the matrix.\n *\n *  Just refill the matrix, perhaps backwards starting from the backtrace cell.\n *  Since this involves recreating all or most of the score matrix, this is not\n *  a good approach.\n *\n * Approach 2: Naive backtracking.\n *\n *  Conduct a search through the space of possible backtraces, rooted at the\n *  candidate cell.  
To speed things along, we can prioritize paths that have a\n *  high score and that align more characters from the read.\n *\n *  The approach is simple, but it's neither fast nor memory-efficient in\n *  general.\n *\n * Approach 3: Refilling with checkpoints.\n *\n *  Refill the matrix \"backwards\" starting from the candidate cell, but use\n *  checkpoints to ensure that only a series of relatively small triangles or\n *  rectangles need to be refilled.  The checkpoints must include elements from\n *  the H, E and F matrices; not just H.  After each refill, we backtrace\n *  through the refilled area, then discard/reuse the fill memory.  I call each\n *  such fill/backtrace a mini-fill/backtrace.\n *\n *  If there's only one path to be found, then this is O(m+n).  But what if\n *  there are many?  And what if we would like to avoid paths that overlap in\n *  one or more cells?  There are two ways we can make this more efficient:\n *\n *   1. Remember the re-calculated E/F/H values and try to retrieve them\n *   2. 
Keep a record of cells that have already been traversed\n *\n *  Legend:\n *\n *  1: Candidate cell\n *  2: Final cell from first mini-fill/backtrace\n *  3: Final cell from second mini-fill/backtrace (third not shown)\n *  +: Checkpointed cell\n *  *: Cell filled from first or second mini-fill/backtrace\n *  -: Unfilled cell\n *\n *        ---++--------++--------++----\n *        --++--------++*-------++-----\n *        -++--(etc)-++**------++------\n *        ++--------+3***-----++-------\n *        +--------++****----++--------\n *        --------++*****---++--------+\n *        -------++******--++--------++\n *        ------++*******-++*-------++-\n *        -----++********++**------++--\n *        ----++********2+***-----++---\n *        ---++--------++****----++----\n *        --++--------++*****---++-----\n *        -++--------++*****1--++------\n *        ++--------++--------++-------\n *\n * Approach 4: Backtracking with checkpoints.\n *\n *  Conduct a search through the space of possible backtraces, rooted at the\n *  candidate cell.  Use \"checkpoints\" to prune.  That is, when a backtrace\n *  moves through a cell with a checkpointed score, consider the score\n *  accumulated so far and the cell's saved score; abort if those two scores\n *  add to something less than a valid score.  Note we're only checkpointing H\n *  in this case (possibly; see \"subtle point\"), not E or F.\n *\n *  Subtle point: checkpoint scores are a result of moving forward through\n *  the matrix whereas backtracking scores result from moving backward.  This\n *  matters because the two paths that meet up at a cell might have both\n *  factored in a gap open penalty for the same gap, in which case we will\n *  underestimate the overall score and prune a good path.  
Here are two ideas\n *  for how to resolve this:\n *\n *   Idea 1: when we combine the forward and backward scores to find an overall\n *   score, and our backtrack procedure *just* made a horizontal or vertical\n *   move, add in a \"bonus\" equal to the gap open penalty of the appropriate\n *   type (read gap open for horizontal, ref gap open for vertical). This might\n *   overcompensate, since\n *\n *   Idea 2: keep the E and F values for the checkpoints around, in addition to\n *   the H values.  When it comes time to combine the score from the forward\n *   and backward paths, we consider the last move we made in the backward\n *   backtrace.  If it's a read gap (horizontal move), then we calculate the\n *   overall score as:\n *\n *     max(Score-backward + H-forward, Score-backward + E-forward + read-open)\n *\n *   If it's a reference gap (vertical move), then we calculate the overall\n *   score as:\n *\n *     max(Score-backward + H-forward, Score-backward + F-forward + ref-open)\n *\n *   What does it mean to abort a backtrack?  If we're starting a new branch\n *   and there is a checkpoint in the bottommost cell of the branch, and the\n *   overall score is less than the target, then we can simply ignore the\n *   branch.  If the checkpoint occurs in the middle of a string of matches, we\n *   need to curtail the branch such that it doesn't include the checkpointed\n *   cell and we won't ever try to enter the checkpointed cell, e.g., on a\n *   mismatch.\n *\n * Approaches 3 and 4 seem reasonable, and could be combined.  
For simplicity,\n * we implement only approach 4 for now.\n *\n * Checkpoint information is propagated from the fill process to the backtracer\n * via a \n */\n\nenum {\n\tBT_NOT_FOUND = 1,      // could not obtain the backtrace because it\n\t                       // overlapped a previous solution\n\tBT_FOUND,              // obtained a valid backtrace\n\tBT_REJECTED_N,         // backtrace rejected because it had too many Ns\n\tBT_REJECTED_CORE_DIAG  // backtrace rejected because it failed to overlap a\n\t                       // core diagonal\n};\n\n/**\n * Parameters for a matrix of potential backtrace problems to solve.\n * Encapsulates information about:\n *\n * The problem given a particular reference substring:\n *\n * - The query string (nucleotides and qualities)\n * - The reference substring (incl. orientation, offset into overall sequence)\n * - Checkpoints (i.e. values of matrix cells)\n * - Scoring scheme and other thresholds\n *\n * The problem given a particular reference substring AND a particular row and\n * column from which to backtrace:\n *\n * - The row and column\n * - The target score\n */\nclass BtBranchProblem {\n\npublic:\n\n\t/**\n\t * Create new uninitialized problem.\n\t */\n\tBtBranchProblem() { reset(); }\n\n\t/**\n\t * Initialize a new problem.\n\t */\n\tvoid initRef(\n\t\tconst char          *qry,    // query string (along rows)\n\t\tconst char          *qual,   // query quality string (along rows)\n\t\tsize_t               qrylen, // query string (along rows) length\n\t\tconst char          *ref,    // reference string (along columns)\n\t\tTRefOff              reflen, // in-rectangle reference string length\n\t\tTRefOff              treflen,// total reference string length\n\t\tTRefId               refid,  // reference id\n\t\tTRefOff              refoff, // reference offset\n\t\tbool                 fw,     // orientation of problem\n\t\tconst DPRect*        rect,   // dynamic programming rectangle filled out\n\t\tconst 
Checkpointer*  cper,   // checkpointer\n\t\tconst Scoring       *sc,     // scoring scheme\n\t\tsize_t               nceil)  // max # Ns allowed in alignment\n\t{\n\t\tqry_     = qry;\n\t\tqual_    = qual;\n\t\tqrylen_  = qrylen;\n\t\tref_     = ref;\n\t\treflen_  = reflen;\n\t\ttreflen_ = treflen;\n\t\trefid_   = refid;\n\t\trefoff_  = refoff;\n\t\tfw_      = fw;\n\t\trect_    = rect;\n\t\tcper_    = cper;\n\t\tsc_      = sc;\n\t\tnceil_   = nceil;\n\t}\n\n\t/**\n\t * Initialize a new problem.\n\t */\n\tvoid initBt(\n\t\tsize_t   row,   // row\n\t\tsize_t   col,   // column\n\t\tbool     fill,  // use a filling rather than a backtracking strategy\n\t\tbool     usecp, // use checkpoints to short-circuit while backtracking\n\t\tTAlScore targ)  // target score\n\t{\n\t\trow_    = row;\n\t\tcol_    = col;\n\t\ttarg_   = targ;\n\t\tfill_   = fill;\n\t\tusecp_  = usecp;\n\t\tif(fill) {\n\t\t\tassert(usecp_);\n\t\t}\n\t}\n\n\t/**\n\t * Reset to uninitialized state.\n\t */\n\tvoid reset() {\n\t\tqry_ = qual_ = ref_ = NULL;\n\t\tcper_ = NULL;\n\t\trect_ = NULL;\n\t\tsc_ = NULL;\n\t\tqrylen_ = reflen_ = treflen_ = refid_ = refoff_ = row_ = col_ = targ_ = nceil_ = 0;\n\t\tfill_ = fw_ = usecp_ = false;\n\t}\n\t\n\t/**\n\t * Return true iff the BtBranchProblem has been initialized.\n\t */\n\tbool inited() const {\n\t\treturn qry_ != NULL;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Sanity-check the problem.\n\t */\n\tbool repOk() const {\n\t\tassert_gt(qrylen_, 0);\n\t\tassert_gt(reflen_, 0);\n\t\tassert_gt(treflen_, 0);\n\t\tassert_lt(row_, qrylen_);\n\t\tassert_lt((TRefOff)col_, reflen_);\n\t\treturn true;\n\t}\n#endif\n\t\n\tsize_t reflen() const { return reflen_; }\n\tsize_t treflen() const { return treflen_; }\n\nprotected:\n\n\tconst char         *qry_;    // query string (along rows)\n\tconst char         *qual_;   // query quality string (along rows)\n\tsize_t              qrylen_; // query string (along rows) length\n\tconst char         *ref_;    // reference string (along 
columns)\n\tTRefOff             reflen_; // in-rectangle reference string length\n\tTRefOff             treflen_;// total reference string length\n\tTRefId              refid_;  // reference id\n\tTRefOff             refoff_; // reference offset\n\tbool                fw_;     // orientation of problem\n\tconst DPRect*       rect_;   // dynamic programming rectangle filled out\n\tsize_t              row_;    // starting row\n\tsize_t              col_;    // starting column\n\tTAlScore            targ_;   // target score\n\tconst Checkpointer *cper_;   // checkpointer\n\tbool                fill_;   // use mini-fills\n\tbool                usecp_;  // use checkpointing?\n\tconst Scoring      *sc_;     // scoring scheme\n\tsize_t              nceil_;  // max # Ns allowed in alignment\n\t\n\tfriend class BtBranch;\n\tfriend class BtBranchQ;\n\tfriend class BtBranchTracer;\n};\n\n/**\n * Encapsulates a \"branch\" which is a diagonal of cells (possibly of length 0)\n * in the matrix where all the cells are matches.  These stretches are linked\n * together by edits to form a full backtrace path through the matrix.  Lengths\n * are measured w/r/t to the number of rows traversed by the path, so a branch\n * that represents a read gap extension could have length = 0.\n *\n * At the end of the day, the full backtrace path is represented as a list of\n * BtBranch's where each BtBranch represents a stretch of matching cells (and\n * up to one mismatching cell at its bottom extreme) ending in an edit (or in\n * the bottommost row, in which case the edit is uninitialized).  Each\n * BtBranch's row and col fields indicate the bottommost cell involved in the\n * diagonal stretch of matches, and the len_ field indicates the length of the\n * stretch of matches.  
Note that the edits themselves also correspond to\n * movement through the matrix.\n *\n * A related issue is how we record which cells have been visited so that we\n * never report a pair of paths both traversing the same (row, col) of the\n * overall DP matrix.  This gets a little tricky because we have to take into\n * account the cells covered by *edits* in addition to the cells covered by the\n * stretches of matches.  For instance: imagine a mismatch.  That takes up a\n * cell of the DP matrix, but it may or may not be preceded by a string of\n * matches.  It's hard to imagine how to represent this unless we let the\n * mismatch \"count toward\" the len_ of the branch and let (row, col) refer to\n * the cell where the mismatch occurs.\n *\n * We need BtBranches to \"live forever\" so that we can make some BtBranches\n * parents of others using parent pointers.  For this reason, BtBranch's are\n * stored in an EFactory object in the BtBranchTracer class.\n */\nclass BtBranch {\n\npublic:\n\n\tBtBranch() { reset(); }\n\n\tBtBranch(\n\t\tconst BtBranchProblem& prob,\n\t\tsize_t parentId,\n\t\tTAlScore penalty,\n\t\tTAlScore score_en,\n\t\tint64_t row,\n\t\tint64_t col,\n\t\tEdit e,\n\t\tint hef,\n\t\tbool root,\n\t\tbool extend)\n\t{\n\t\tinit(prob, parentId, penalty, score_en, row, col, e, hef, root, extend);\n\t}\n\t\n\t/**\n\t * Reset to uninitialized state.\n\t */\n\tvoid reset() {\n\t\tparentId_ = 0;\n\t\tscore_st_ = score_en_ = len_ = row_ = col_ = 0;\n\t\tcurtailed_ = false;\n\t\te_.reset();\n\t}\n\t\n\t/**\n\t * Caller gives us score_en, row and col.  
We figure out score_st and len_\n\t * by comparing characters from the strings.\n\t */\n\tvoid init(\n\t\tconst BtBranchProblem& prob,\n\t\tsize_t parentId,\n\t\tTAlScore penalty,\n\t\tTAlScore score_en,\n\t\tint64_t row,\n\t\tint64_t col,\n\t\tEdit e,\n\t\tint hef,\n\t\tbool root,\n\t\tbool extend);\n\t\n\t/**\n\t * Return true iff this branch ends in a solution to the backtrace problem.\n\t */\n\tbool isSolution(const BtBranchProblem& prob) const {\n\t\tconst bool end2end = prob.sc_->monotone;\n\t\treturn score_st_ == prob.targ_ && (!end2end || endsInFirstRow());\n\t}\n\t\n\t/**\n\t * Return true iff this branch could potentially lead to a valid alignment.\n\t */\n\tbool isValid(const BtBranchProblem& prob) const {\n\t\tint64_t scoreFloor = prob.sc_->monotone ? MIN_I64 : 0;\n\t\tif(score_st_ < scoreFloor) {\n\t\t\t// Dipped below the score floor\n\t\t\treturn false;\n\t\t}\n\t\tif(isSolution(prob)) {\n\t\t\t// It's a solution, so it's also valid\n\t\t\treturn true;\n\t\t}\n\t\tif((int64_t)len_ > row_) {\n\t\t\t// Went all the way to the top row\n\t\t\t//assert_leq(score_st_, prob.targ_);\n\t\t\treturn score_st_ == prob.targ_;\n\t\t} else {\n\t\t\tint64_t match = prob.sc_->match();\n\t\t\tint64_t bonusLeft = (row_ + 1 - len_) * match;\n\t\t\treturn score_st_ + bonusLeft >= prob.targ_;\n\t\t}\n\t}\n\t\n\t/**\n\t * Return true iff this branch overlaps with the given branch.\n\t */\n\tbool overlap(const BtBranchProblem& prob, const BtBranch& bt) const {\n\t\t// Calculate this branch's diagonal\n\t\tassert_lt(row_, (int64_t)prob.qrylen_);\n\t\tsize_t fromend = prob.qrylen_ - row_ - 1;\n\t\tsize_t diag = fromend + col_;\n\t\tint64_t lo = 0, hi = row_ + 1;\n\t\tif(len_ == 0) {\n\t\t\tlo = row_;\n\t\t} else {\n\t\t\tlo = row_ - (len_ - 1);\n\t\t}\n\t\t// Calculate other branch's diagonal\n\t\tassert_lt(bt.row_, (int64_t)prob.qrylen_);\n\t\tsize_t ofromend = prob.qrylen_ - bt.row_ - 1;\n\t\tsize_t odiag = ofromend + bt.col_;\n\t\tif(diag != odiag) {\n\t\t\treturn 
false;\n\t\t}\n\t\tint64_t olo = 0, ohi = bt.row_ + 1;\n\t\tif(bt.len_ == 0) {\n\t\t\tolo = bt.row_;\n\t\t} else {\n\t\t\tolo = bt.row_ - (bt.len_ - 1);\n\t\t}\n\t\tint64_t losm = olo, hism = ohi;\n\t\tif(hi - lo < ohi - olo) {\n\t\t\tswap(lo, losm);\n\t\t\tswap(hi, hism);\n\t\t}\n\t\tif((lo <= losm && hi > losm) || (lo <  hism && hi >= hism)) {\n\t\t\treturn true;\n\t\t}\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return true iff this branch is higher priority than the branch 'o'.\n\t */\n\tbool operator<(const BtBranch& o) const {\n\t\t// Prioritize uppermost above score\n\t\tif(uppermostRow() != o.uppermostRow()) {\n\t\t\treturn uppermostRow() < o.uppermostRow();\n\t\t}\n\t\tif(score_st_ != o.score_st_) return score_st_ > o.score_st_;\n\t\tif(row_      != o.row_)      return row_ < o.row_;\n\t\tif(col_      != o.col_)      return col_ > o.col_;\n\t\tif(parentId_ != o.parentId_) return parentId_ > o.parentId_;\n\t\tassert(false);\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return true iff the topmost cell involved in this branch is in the top\n\t * row.\n\t */\n\tbool endsInFirstRow() const {\n\t\tassert_leq((int64_t)len_, row_ + 1);\n\t\treturn (int64_t)len_ == row_+1;\n\t}\n\t\n\t/**\n\t * Return the uppermost row covered by this branch.\n\t */\n\tsize_t uppermostRow() const {\n\t\tassert_geq(row_ + 1, (int64_t)len_);\n\t\treturn row_ + 1 - (int64_t)len_;\n\t}\n\n\t/**\n\t * Return the leftmost column covered by this branch.\n\t */\n\tsize_t leftmostCol() const {\n\t\tassert_geq(col_ + 1, (int64_t)len_);\n\t\treturn col_ + 1 - (int64_t)len_;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Sanity-check this BtBranch.\n\t */\n\tbool repOk() const {\n\t\tassert(root_ || e_.inited());\n\t\tassert_gt(len_, 0);\n\t\tassert_geq(col_ + 1, (int64_t)len_);\n\t\tassert_geq(row_ + 1, (int64_t)len_);\n\t\treturn true;\n\t}\n#endif\n\nprotected:\n\n\t// ID of the parent branch.\n\tsize_t   parentId_;\n\n\t// Penalty associated with the edit at the bottom of this branch (0 if\n\t// there is 
no edit)\n\tTAlScore penalty_;\n\t\n\t// Score at the beginning of the branch\n\tTAlScore score_st_;\n\t\n\t// Score at the end of the branch (taking the edit into account)\n\tTAlScore score_en_;\n\t\n\t// Length of the branch.  That is, the total number of diagonal cells\n\t// involved in all the matches and in the edit (if any).  Should always be\n\t// > 0.\n\tsize_t   len_;\n\t\n\t// The row of the final (bottommost) cell in the branch.  This might be the\n\t// bottommost match if the branch has no associated edit.  Otherwise, it's\n\t// the cell occupied by the edit.\n\tint64_t  row_;\n\t\n\t// The column of the final (bottommost) cell in the branch.\n\tint64_t  col_;\n\t\n\t// The edit at the bottom of the branch.  If this is the bottommost branch\n\t// in the alignment and it does not end in an edit, then this remains\n\t// uninitialized.\n\tEdit     e_;\n\t\n\t// True iff this is the bottommost branch in the alignment.  We can't just\n\t// use row_ to tell us this because local alignments don't necessarily end\n\t// in the last row.\n\tbool     root_;\n\t\n\tbool     curtailed_;  // true -> pruned at a checkpoint where we otherwise\n\t                      // would have had a match\n\nfriend class BtBranchQ;\nfriend class BtBranchTracer;\n\n};\n\n/**\n * Instantiate and solve best-first branch-based backtraces.\n */\nclass BtBranchTracer {\n\npublic:\n\n\texplicit BtBranchTracer() :\n\t\tprob_(), bs_(), seenPaths_(DP_CAT), sawcell_(DP_CAT), doTri_() { }\n\n\t/**\n\t * Add a branch to the queue.\n\t */\n\tvoid add(size_t id) {\n\t\tassert(!bs_[id].isSolution(prob_));\n\t\tunsorted_.push_back(make_pair(bs_[id].score_st_, id));\n\t}\n\t\n\t/**\n\t * Add a branch to the list of solutions.\n\t */\n\tvoid addSolution(size_t id) {\n\t\tassert(bs_[id].isSolution(prob_));\n\t\tsolutions_.push_back(id);\n\t}\n\n\t/**\n\t * Given a potential branch to add to the queue, see if we can follow the\n\t * branch a little further first.  
If it's still valid, or if we reach a\n\t * choice between valid outgoing paths, go ahead and add it to the queue.\n\t */\n\tvoid examineBranch(\n\t\tint64_t row,\n\t\tint64_t col,\n\t\tconst Edit& e,\n\t\tTAlScore pen,\n\t\tTAlScore sc,\n\t\tsize_t parentId);\n\n\t/**\n\t * Take all possible ways of leaving the given branch and add them to the\n\t * branch queue.\n\t */\n\tvoid addOffshoots(size_t bid);\n\t\n\t/**\n\t * Get the best branch and remove it from the priority queue.\n\t */\n\tsize_t best(RandomSource& rnd) {\n\t\tassert(!empty());\n\t\tflushUnsorted();\n\t\tassert_gt(sortedSel_ ? sorted1_.size() : sorted2_.size(), cur_);\n\t\t// Perhaps shuffle everyone who's tied for first?\n\t\tsize_t id = sortedSel_ ? sorted1_[cur_] : sorted2_[cur_];\n\t\tcur_++;\n\t\treturn id;\n\t}\n\t\n\t/**\n\t * Return true iff there are no branches left to try.\n\t */\n\tbool empty() const {\n\t\treturn size() == 0;\n\t}\n\t\n\t/**\n\t * Return the size, i.e. the total number of branches contained.\n\t */\n\tsize_t size() const {\n\t\treturn unsorted_.size() +\n\t\t       (sortedSel_ ? sorted1_.size() : sorted2_.size()) - cur_;\n\t}\n\n\t/**\n\t * Return true iff there are no solutions left to try.\n\t */\n\tbool emptySolution() const {\n\t\treturn sizeSolution() == 0;\n\t}\n\t\n\t/**\n\t * Return the size of the solution set so far.\n\t */\n\tsize_t sizeSolution() const {\n\t\treturn solutions_.size();\n\t}\n\t\n\t/**\n\t * Sort unsorted branches, merge them with master sorted list.\n\t */\n\tvoid flushUnsorted();\n\t\n#ifndef NDEBUG\n\t/**\n\t * Sanity-check the queue.\n\t */\n\tbool repOk() const {\n\t\tassert_lt(cur_, (sortedSel_ ? sorted1_.size() : sorted2_.size()));\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Initialize the tracer with respect to a new read.  
This involves\n\t * resetting all the state relating to the set of cells already visited\n\t */\n\tvoid initRef(\n\t\tconst char*         rd,     // in: read sequence\n\t\tconst char*         qu,     // in: quality sequence\n\t\tsize_t              rdlen,  // in: read sequence length\n\t\tconst char*         rf,     // in: reference sequence\n\t\tsize_t              rflen,  // in: in-rectangle reference sequence length\n\t\tTRefOff             trflen, // in: total reference sequence length\n\t\tTRefId              refid,  // in: reference id\n\t\tTRefOff             refoff, // in: reference offset\n\t\tbool                fw,     // in: orientation\n\t\tconst DPRect       *rect,   // in: DP rectangle\n\t\tconst Checkpointer *cper,   // in: checkpointer\n\t\tconst Scoring&      sc,     // in: scoring scheme\n\t\tsize_t              nceil)  // in: N ceiling\n\t{\n\t\tprob_.initRef(rd, qu, rdlen, rf, rflen, trflen, refid, refoff, fw, rect, cper, &sc, nceil);\n\t\tconst size_t ndiag = rflen + rdlen - 1;\n\t\tseenPaths_.resize(ndiag);\n\t\tfor(size_t i = 0; i < ndiag; i++) {\n\t\t\tseenPaths_[i].clear();\n\t\t}\n\t\t// clear each of the per-column sets\n\t\tif(sawcell_.size() < rflen) {\n\t\t\tsize_t isz = sawcell_.size();\n\t\t\tsawcell_.resize(rflen);\n\t\t\tfor(size_t i = isz; i < rflen; i++) {\n\t\t\t\tsawcell_[i].setCat(DP_CAT);\n\t\t\t}\n\t\t}\n\t\tfor(size_t i = 0; i < rflen; i++) {\n\t\t\tsawcell_[i].setCat(DP_CAT);\n\t\t\tsawcell_[i].clear(); // clear the set\n\t\t}\n\t}\n\t\n\t/**\n\t * Initialize with a new backtrace.\n\t */\n\tvoid initBt(\n\t\tTAlScore       escore, // in: alignment score\n\t\tsize_t         row,    // in: start in this row\n\t\tsize_t         col,    // in: start in this column\n\t\tbool           fill,   // in: use mini-filling?\n\t\tbool           usecp,  // in: use checkpointing?\n\t\tbool           doTri,  // in: triangle-shaped mini-fills?\n\t\tRandomSource&  rnd)    // in: random gen, to choose among equal 
paths\n\t{\n\t\tprob_.initBt(row, col, fill, usecp, escore);\n\t\tEdit e; e.reset();\n\t\tunsorted_.clear();\n\t\tsolutions_.clear();\n\t\tsorted1_.clear();\n\t\tsorted2_.clear();\n\t\tcur_ = 0;\n\t\tnmm_ = 0;         // number of mismatches attempted\n\t\tnnmm_ = 0;        // number of mismatches involving N attempted\n\t\tnrdop_ = 0;       // number of read gap opens attempted\n\t\tnrfop_ = 0;       // number of ref gap opens attempted\n\t\tnrdex_ = 0;       // number of read gap extensions attempted\n\t\tnrfex_ = 0;       // number of ref gap extensions attempted\n\t\tnmmPrune_ = 0;    // number of mismatches pruned\n\t\tnnmmPrune_ = 0;   // number of mismatches involving N pruned\n\t\tnrdopPrune_ = 0;  // number of read gap opens pruned\n\t\tnrfopPrune_ = 0;  // number of ref gap opens pruned\n\t\tnrdexPrune_ = 0;  // number of read gap extensions pruned\n\t\tnrfexPrune_ = 0;  // number of ref gap extensions pruned\n\t\trow_ = row;\n\t\tcol_ = col;\n\t\tdoTri_ = doTri;\n\t\tbs_.clear();\n\t\tif(!prob_.fill_) {\n\t\t\tsize_t id = bs_.alloc();\n\t\t\tbs_[id].init(\n\t\t\t\tprob_,\n\t\t\t\t0,     // parent id\n\t\t\t\t0,     // penalty\n\t\t\t\t0,     // starting score\n\t\t\t\trow,   // row\n\t\t\t\tcol,   // column\n\t\t\t\te,\n\t\t\t\t0,\n\t\t\t\ttrue,  // this is the root\n\t\t\t\ttrue); // this should be extended with exact matches\n\t\t\tif(bs_[id].isSolution(prob_)) {\n\t\t\t\taddSolution(id);\n\t\t\t} else {\n\t\t\t\tadd(id);\n\t\t\t}\n\t\t} else {\n\t\t\tint64_t row = row_, col = col_;\n\t\t\tTAlScore targsc = prob_.targ_;\n\t\t\tint hef = 0;\n\t\t\tbool done = false, abort = false;\n\t\t\tsize_t depth = 0;\n\t\t\twhile(!done && !abort) {\n\t\t\t\t// Accumulate edits as we go.  We can do this by adding\n\t\t\t\t// BtBranches to the bs_ structure.  
Each step of the backtrace\n\t\t\t\t// either involves an edit (thereby starting a new branch) or\n\t\t\t\t// extends the previous branch by one more position.\n\t\t\t\t//\n\t\t\t\t// Note: if the BtBranches are in line, then trySolution can be\n\t\t\t\t// used to populate the SwResult and check for various\n\t\t\t\t// situations where we might reject the alignment (i.e. due to\n\t\t\t\t// a cell having been visited previously).\n\t\t\t\tif(doTri_) {\n\t\t\t\t\ttriangleFill(\n\t\t\t\t\t\trow,          // row of cell to backtrace from\n\t\t\t\t\t\tcol,          // column of cell to backtrace from\n\t\t\t\t\t\thef,          // cell to bt from: H (0), E (1), or F (2)\n\t\t\t\t\t\ttargsc,       // score of cell to backtrace from\n\t\t\t\t\t\tprob_.targ_,  // score of alignment we're looking for\n\t\t\t\t\t\trnd,          // pseudo-random generator\n\t\t\t\t\t\trow,          // out: row we ended up in after bt\n\t\t\t\t\t\tcol,          // out: column we ended up in after bt\n\t\t\t\t\t\thef,          // out: H/E/F after backtrace\n\t\t\t\t\t\ttargsc,       // out: score up to cell we ended up in\n\t\t\t\t\t\tdone,         // out: finished tracing out an alignment?\n\t\t\t\t\t\tabort);       // out: aborted b/c cell was seen before?\n\t\t\t\t} else {\n\t\t\t\t\tsquareFill(\n\t\t\t\t\t\trow,          // row of cell to backtrace from\n\t\t\t\t\t\tcol,          // column of cell to backtrace from\n\t\t\t\t\t\thef,          // cell to bt from: H (0), E (1), or F (2)\n\t\t\t\t\t\ttargsc,       // score of cell to backtrace from\n\t\t\t\t\t\tprob_.targ_,  // score of alignment we're looking for\n\t\t\t\t\t\trnd,          // pseudo-random generator\n\t\t\t\t\t\trow,          // out: row we ended up in after bt\n\t\t\t\t\t\tcol,          // out: column we ended up in after bt\n\t\t\t\t\t\thef,          // out: H/E/F after backtrace\n\t\t\t\t\t\ttargsc,       // out: score up to cell we ended up in\n\t\t\t\t\t\tdone,         // out: finished tracing out an 
alignment?\n\t\t\t\t\t\tabort);       // out: aborted b/c cell was seen before?\n\t\t\t\t}\n\t\t\t\tif(depth >= ndep_.size()) {\n\t\t\t\t\tndep_.resize(depth+1);\n\t\t\t\t\tndep_[depth] = 1;\n\t\t\t\t} else {\n\t\t\t\t\tndep_[depth]++;\n\t\t\t\t}\n\t\t\t\tdepth++;\n\t\t\t\tassert((row >= 0 && col >= 0) || done);\n\t\t\t}\n\t\t}\n\t\tASSERT_ONLY(seen_.clear());\n\t}\n\t\n\t/**\n\t * Get the next valid alignment given the backtrace problem.  Return false\n\t * if there is no valid solution, e.g., if \n\t */\n\tbool nextAlignment(\n\t\tsize_t maxiter,\n\t\tSwResult& res,\n\t\tsize_t& off,\n\t\tsize_t& nrej,\n\t\tsize_t& niter,\n\t\tRandomSource& rnd);\n\t\n\t/**\n\t * Return true iff this tracer has been initialized\n\t */\n\tbool inited() const {\n\t\treturn prob_.inited();\n\t}\n\t\n\t/**\n\t * Return true iff the mini-fills are triangle-shaped.\n\t */\n\tbool doTri() const { return doTri_; }\n\n\t/**\n\t * Fill in a triangle of the DP table and backtrace from the given cell to\n\t * a cell in the previous checkpoint, or to the terminal cell.\n\t */\n\tvoid triangleFill(\n\t\tint64_t rw,          // row of cell to backtrace from\n\t\tint64_t cl,          // column of cell to backtrace from\n\t\tint hef,             // cell to backtrace from is H (0), E (1), or F (2)\n\t\tTAlScore targ,       // score of cell to backtrace from\n\t\tTAlScore targ_final, // score of alignment we're looking for\n\t\tRandomSource& rnd,   // pseudo-random generator\n\t\tint64_t& row_new,    // out: row we ended up in after backtrace\n\t\tint64_t& col_new,    // out: column we ended up in after backtrace\n\t\tint& hef_new,        // out: H/E/F after backtrace\n\t\tTAlScore& targ_new,  // out: score up to cell we ended up in\n\t\tbool& done,          // out: finished tracing out an alignment?\n\t\tbool& abort);        // out: aborted b/c cell was seen before?\n\n\t/**\n\t * Fill in a square of the DP table and backtrace from the given cell to\n\t * a cell in the previous checkpoint, or to 
the terminal cell.\n\t */\n\tvoid squareFill(\n\t\tint64_t rw,          // row of cell to backtrace from\n\t\tint64_t cl,          // column of cell to backtrace from\n\t\tint hef,             // cell to backtrace from is H (0), E (1), or F (2)\n\t\tTAlScore targ,       // score of cell to backtrace from\n\t\tTAlScore targ_final, // score of alignment we're looking for\n\t\tRandomSource& rnd,   // pseudo-random generator\n\t\tint64_t& row_new,    // out: row we ended up in after backtrace\n\t\tint64_t& col_new,    // out: column we ended up in after backtrace\n\t\tint& hef_new,        // out: H/E/F after backtrace\n\t\tTAlScore& targ_new,  // out: score up to cell we ended up in\n\t\tbool& done,          // out: finished tracing out an alignment?\n\t\tbool& abort);        // out: aborted b/c cell was seen before?\n\nprotected:\n\n\t/**\n\t * Get the next valid alignment given a backtrace problem.  Return false\n\t * if there is no valid solution.  Use a backtracking search to find the\n\t * solution.  This can be very slow.\n\t */\n\tbool nextAlignmentBacktrace(\n\t\tsize_t maxiter,\n\t\tSwResult& res,\n\t\tsize_t& off,\n\t\tsize_t& nrej,\n\t\tsize_t& niter,\n\t\tRandomSource& rnd);\n\n\t/**\n\t * Get the next valid alignment given a backtrace problem.  Return false\n\t * if there is no valid solution.  Use a triangle-fill backtrace to find\n\t * the solution.  This is usually fast (it's O(m + n)).\n\t */\n\tbool nextAlignmentFill(\n\t\tsize_t maxiter,\n\t\tSwResult& res,\n\t\tsize_t& off,\n\t\tsize_t& nrej,\n\t\tsize_t& niter,\n\t\tRandomSource& rnd);\n\n\t/**\n\t * Try all the solutions accumulated so far.  
Solutions might be rejected\n\t * if they, for instance, overlap a previous solution, have too many Ns,\n\t * fail to overlap a core diagonal, etc.\n\t */\n\tbool trySolutions(\n\t\tbool lookForOlap,\n\t\tSwResult& res,\n\t\tsize_t& off,\n\t\tsize_t& nrej,\n\t\tRandomSource& rnd,\n\t\tbool& success);\n\t\n\t/**\n\t * See if a given solution branch works as a solution (i.e. doesn't overlap\n\t * another one, have too many Ns, fail to overlap a core diagonal, etc.)\n\t */\n\tint trySolution(\n\t\tsize_t id,\n\t\tbool lookForOlap,\n\t\tSwResult& res,\n\t\tsize_t& off,\n\t\tsize_t& nrej,\n\t\tRandomSource& rnd);\n\n\tBtBranchProblem    prob_; // problem configuration\n\tEFactory<BtBranch> bs_;   // global BtBranch factory\n\t\n\t// already reported alignments going through these diagonal segments\n\tELList<std::pair<size_t, size_t> > seenPaths_;\n\tELSet<size_t> sawcell_; // cells already backtraced through\n\t\n\tEList<std::pair<TAlScore, size_t> > unsorted_;  // unsorted list of as-yet-unflished BtBranches\n\tEList<size_t> sorted1_;   // list of BtBranch, sorted by score\n\tEList<size_t> sorted2_;   // list of BtBranch, sorted by score\n\tEList<size_t> solutions_; // list of solution branches\n\tbool          sortedSel_; // true -> 1, false -> 2\n\tsize_t        cur_;       // cursor into sorted list to start from\n\t\n\tsize_t        nmm_;         // number of mismatches attempted\n\tsize_t        nnmm_;        // number of mismatches involving N attempted\n\tsize_t        nrdop_;       // number of read gap opens attempted\n\tsize_t        nrfop_;       // number of ref gap opens attempted\n\tsize_t        nrdex_;       // number of read gap extensions attempted\n\tsize_t        nrfex_;       // number of ref gap extensions attempted\n\t\n\tsize_t        nmmPrune_;    // \n\tsize_t        nnmmPrune_;   // \n\tsize_t        nrdopPrune_;  // \n\tsize_t        nrfopPrune_;  // \n\tsize_t        nrdexPrune_;  // \n\tsize_t        nrfexPrune_;  // \n\t\n\tsize_t        
row_;         // row\n\tsize_t        col_;         // column\n\n\tbool           doTri_;      // true -> fill in triangles; false -> squares\n\tEList<CpQuad>  sq_;         // square to fill when doing mini-fills\n\tELList<CpQuad> tri_;        // triangle to fill when doing mini-fills\n\tEList<size_t>  ndep_;       // # triangles mini-filled at various depths\n\n#ifndef NDEBUG\n\tESet<size_t>  seen_;        // seen branch ids; should never see same twice\n#endif\n};\n\n#endif /*ndef ALIGNER_BT_H_*/\n"
  },
  {
    "path": "aligner_cache.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"aligner_cache.h\"\n#include \"tinythread.h\"\n\n#ifdef ALIGNER_CACHE_MAIN\n\n#include <iostream>\n#include <getopt.h>\n#include <string>\n#include \"random_source.h\"\n\nusing namespace std;\n\nenum {\n\tARG_TESTS = 256\n};\n\nstatic const char *short_opts = \"vCt\";\nstatic struct option long_opts[] = {\n\t{(char*)\"verbose\",  no_argument, 0, 'v'},\n\t{(char*)\"tests\",    no_argument, 0, ARG_TESTS},\n};\n\nstatic void printUsage(ostream& os) {\n\tos << \"Usage: sawhi-cache [options]*\" << endl;\n\tos << \"Options:\" << endl;\n\tos << \"  --tests       run unit tests\" << endl;\n\tos << \"  -v/--verbose  talkative mode\" << endl;\n}\n\nint gVerbose = 0;\n\nstatic void add(\n\tRedBlack<QKey, QVal>& t,\n\tPool& p,\n\tconst char *dna)\n{\n\tQKey qk;\n\tqk.init(BTDnaString(dna, true));\n\tt.add(p, qk, NULL);\n}\n\n/**\n * Small tests for the AlignmentCache.\n */\nstatic void aligner_cache_tests() {\n\tRedBlack<QKey, QVal> rb(1024);\n\tPool p(64 * 1024, 1024);\n\t// Small test\n\tadd(rb, p, \"ACGTCGATCGT\");\n\tadd(rb, p, \"ACATCGATCGT\");\n\tadd(rb, p, \"ACGACGATCGT\");\n\tadd(rb, p, \"ACGTAGATCGT\");\n\tadd(rb, p, \"ACGTCAATCGT\");\n\tadd(rb, p, \"ACGTCGCTCGT\");\n\tadd(rb, p, 
\"ACGTCGAACGT\");\n\tassert_eq(7, rb.size());\n\trb.clear();\n\tp.clear();\n\t// Another small test\n\tadd(rb, p, \"ACGTCGATCGT\");\n\tadd(rb, p, \"CCGTCGATCGT\");\n\tadd(rb, p, \"TCGTCGATCGT\");\n\tadd(rb, p, \"GCGTCGATCGT\");\n\tadd(rb, p, \"AAGTCGATCGT\");\n\tassert_eq(5, rb.size());\n\trb.clear();\n\tp.clear();\n\t// Regression test (attempt to make it smaller)\n\tadd(rb, p, \"CCTA\");\n\tadd(rb, p, \"AGAA\");\n\tadd(rb, p, \"TCTA\");\n\tadd(rb, p, \"GATC\");\n\tadd(rb, p, \"CTGC\");\n\tadd(rb, p, \"TTGC\");\n\tadd(rb, p, \"GCCG\");\n\tadd(rb, p, \"GGAT\");\n\trb.clear();\n\tp.clear();\n\t// Regression test\n\tadd(rb, p, \"CCTA\");\n\tadd(rb, p, \"AGAA\");\n\tadd(rb, p, \"TCTA\");\n\tadd(rb, p, \"GATC\");\n\tadd(rb, p, \"CTGC\");\n\tadd(rb, p, \"CATC\");\n\tadd(rb, p, \"CAAA\");\n\tadd(rb, p, \"CTAT\");\n\tadd(rb, p, \"CTCA\");\n\tadd(rb, p, \"TTGC\");\n\tadd(rb, p, \"GCCG\");\n\tadd(rb, p, \"GGAT\");\n\tassert_eq(12, rb.size());\n\trb.clear();\n\tp.clear();\n\t// Larger random test\n\tEList<BTDnaString> strs;\n\tchar buf[5];\n\tfor(int i = 0; i < 4; i++) {\n\t\tfor(int j = 0; j < 4; j++) {\n\t\t\tfor(int k = 0; k < 4; k++) {\n\t\t\t\tfor(int m = 0; m < 4; m++) {\n\t\t\t\t\tbuf[0] = \"ACGT\"[i];\n\t\t\t\t\tbuf[1] = \"ACGT\"[j];\n\t\t\t\t\tbuf[2] = \"ACGT\"[k];\n\t\t\t\t\tbuf[3] = \"ACGT\"[m];\n\t\t\t\t\tbuf[4] = '\\0';\n\t\t\t\t\tstrs.push_back(BTDnaString(buf, true));\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\t// Add all of the 4-mers in several different random orders\n\tRandomSource rand;\n\tfor(uint32_t runs = 0; runs < 100; runs++) {\n\t\trb.clear();\n\t\tp.clear();\n\t\tassert_eq(0, rb.size());\n\t\trand.init(runs);\n\t\tEList<bool> used;\n\t\tused.resize(256);\n\t\tfor(int i = 0; i < 256; i++) used[i] = false;\n\t\tfor(int i = 0; i < 256; i++) {\n\t\t\tint r = rand.nextU32() % (256-i);\n\t\t\tint unused = 0;\n\t\t\tbool added = false;\n\t\t\tfor(int j = 0; j < 256; j++) {\n\t\t\t\tif(!used[j] && unused == r) {\n\t\t\t\t\tused[j] = true;\n\t\t\t\t\tQKey 
qk;\n\t\t\t\t\tqk.init(strs[j]);\n\t\t\t\t\trb.add(p, qk, NULL);\n\t\t\t\t\tadded = true;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tif(!used[j]) unused++;\n\t\t\t}\n\t\t\tassert(added);\n\t\t}\n\t}\n}\n\n/**\n * A way of feeding simple tests to the seed alignment infrastructure.\n */\nint main(int argc, char **argv) {\n\tint option_index = 0;\n\tint next_option;\n\tdo {\n\t\tnext_option = getopt_long(argc, argv, short_opts, long_opts, &option_index);\n\t\tswitch (next_option) {\n\t\t\tcase 'v':       gVerbose = true; break;\n\t\t\tcase ARG_TESTS: aligner_cache_tests(); return 0;\n\t\t\tcase -1: break;\n\t\t\tdefault: {\n\t\t\t\tcerr << \"Unknown option: \" << (char)next_option << endl;\n\t\t\t\tprintUsage(cerr);\n\t\t\t\texit(1);\n\t\t\t}\n\t\t}\n\t} while(next_option != -1);\n}\n#endif\n"
  },
  {
    "path": "aligner_cache.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_CACHE_H_\n#define ALIGNER_CACHE_H_\n\n/**\n * CACHEING\n *\n * By caching the results of some alignment sub-problems, we hope to\n * enable a \"fast path\" for read alignment whereby answers are mostly\n * looked up rather than calculated from scratch.  This is particularly\n * effective when the input is sorted or otherwise grouped in a way\n * that brings together reads with (at least some) seed sequences in\n * common.\n *\n * But the cache is also where results are held, regardless of whether\n * the results are maintained & re-used across reads.\n *\n * The cache consists of two linked potions:\n *\n * 1. A multimap from seed strings (i.e. read substrings) to reference strings\n *    that are within some edit distance (roughly speaking).  This is the \"seed\n *    multimap\".\n *\n *    Key:   Read substring (2-bit-per-base encoded + length)\n *    Value: Set of reference substrings (i.e. keys into the suffix\n *           array multimap).\n *\n * 2. A multimap from reference strings to the corresponding elements of the\n *    suffix array.  Elements are filled in with reference-offset info as it's\n *    calculated.  
This is the \"suffix array multimap\"\n *\n *    Key:   Reference substring (2-bit-per-base encoded + length)\n *    Value: (a) top from BWT, (b) length of range, (c) offset of first\n *           range element in \n *\n * For both multimaps, we use a combo Red-Black tree and EList.  The payload in\n * the Red-Black tree nodes points to a range in the EList.\n */\n\n#include <iostream>\n#include \"ds.h\"\n#include \"read.h\"\n#include \"threading.h\"\n#include \"mem_ids.h\"\n#include \"simple_func.h\"\n#include \"btypes.h\"\n\n#define CACHE_PAGE_SZ (16 * 1024)\n\ntypedef PListSlice<TIndexOffU, CACHE_PAGE_SZ> TSlice;\n\n/**\n * Key for the query multimap: the read substring and its length.\n */\nstruct QKey {\n\n\t/**\n\t * Initialize invalid QKey.\n\t */\n\tQKey() { reset(); }\n\n\t/**\n\t * Initialize QKey from DNA string.\n\t */\n\tQKey(const BTDnaString& s ASSERT_ONLY(, BTDnaString& tmp)) {\n\t\tinit(s ASSERT_ONLY(, tmp));\n\t}\n\t\n\t/**\n\t * Initialize QKey from DNA string.  Rightmost character is placed in the\n\t * least significant bitpair.\n\t */\n\tbool init(\n\t\tconst BTDnaString& s\n\t\tASSERT_ONLY(, BTDnaString& tmp))\n\t{\n\t\tseq = 0;\n\t\tlen = (uint32_t)s.length();\n\t\tASSERT_ONLY(tmp.clear());\n\t\tif(len > 32) {\n\t\t\tlen = 0xffffffff;\n\t\t\treturn false; // wasn't cacheable\n\t\t} else {\n\t\t\t// Rightmost char of 's' goes in the least significant bitpair\n\t\t\tfor(size_t i = 0; i < 32 && i < s.length(); i++) {\n\t\t\t\tint c = (int)s.get(i);\n\t\t\t\tassert_range(0, 4, c);\n\t\t\t\tif(c == 4) {\n\t\t\t\t\tlen = 0xffffffff;\n\t\t\t\t\treturn false;\n\t\t\t\t}\n\t\t\t\tseq = (seq << 2) | s.get(i);\n\t\t\t}\n\t\t\tASSERT_ONLY(toString(tmp));\n\t\t\tassert(sstr_eq(tmp, s));\n\t\t\tassert_leq(len, 32);\n\t\t\treturn true; // was cacheable\n\t\t}\n\t}\n\t\n\t/**\n\t * Convert this key to a DNA string.\n\t */\n\tvoid toString(BTDnaString& s) {\n\t\ts.resize(len);\n\t\tuint64_t sq = seq;\n\t\tfor(int i = (len)-1; i >= 0; i--) 
{\n\t\t\ts.set((uint32_t)(sq & 3), i);\n\t\t\tsq >>= 2;\n\t\t}\n\t}\n\t\n\t/**\n\t * Return true iff the read substring is cacheable.\n\t */\n\tbool cacheable() const { return len != 0xffffffff; }\n\t\n\t/**\n\t * Reset to uninitialized state.\n\t */\n\tvoid reset() { seq = 0; len = 0xffffffff; }\n\n\t/**\n\t * True -> my key is less than the given key.\n\t */\n\tbool operator<(const QKey& o) const {\n\t\treturn seq < o.seq || (seq == o.seq && len < o.len);\n\t}\n\n\t/**\n\t * True -> my key is greater than the given key.\n\t */\n\tbool operator>(const QKey& o) const {\n\t\treturn !(*this < o || *this == o);\n\t}\n\n\t/**\n\t * True -> my key is equal to the given key.\n\t */\n\tbool operator==(const QKey& o) const {\n\t\treturn seq == o.seq && len == o.len;\n\t}\n\n\n\t/**\n\t * True -> my key is not equal to the given key.\n\t */\n\tbool operator!=(const QKey& o) const {\n\t\treturn !(*this == o);\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that this is a valid, initialized QKey.\n\t */\n\tbool repOk() const {\n\t\treturn len != 0xffffffff;\n\t}\n#endif\n\n\tuint64_t seq; // sequence\n\tuint32_t len; // length of sequence\n};\n\ntemplate <typename index_t>\nclass AlignmentCache;\n\n/**\n * Payload for the query multimap: a range of elements in the reference\n * string list.\n */\ntemplate <typename index_t>\nclass QVal {\n\npublic:\n\n\tQVal() { reset(); }\n\n\t/**\n\t * Return the offset of the first reference substring in the qlist.\n\t */\n\tindex_t offset() const { return i_; }\n\n\t/**\n\t * Return the number of reference substrings associated with a read\n\t * substring.\n\t */\n\tindex_t numRanges() const {\n\t\tassert(valid());\n\t\treturn rangen_;\n\t}\n\n\t/**\n\t * Return the number of elements associated with all associated\n\t * reference substrings.\n\t */\n\tindex_t numElts() const {\n\t\tassert(valid());\n\t\treturn eltn_;\n\t}\n\t\n\t/**\n\t * Return true iff the read substring is not associated with any\n\t * reference substrings.\n\t */\n\tbool 
empty() const {\n\t\tassert(valid());\n\t\treturn numRanges() == 0;\n\t}\n\n\t/**\n\t * Return true iff the QVal is valid.\n\t */\n\tbool valid() const { return rangen_ != (index_t)OFF_MASK; }\n\t\n\t/**\n\t * Reset to invalid state.\n\t */\n\tvoid reset() {\n\t\ti_ = 0; rangen_ = eltn_ = (index_t)OFF_MASK;\n\t}\n\t\n\t/**\n\t * Initialize Qval.\n\t */\n\tvoid init(index_t i, index_t ranges, index_t elts) {\n\t\ti_ = i; rangen_ = ranges; eltn_ = elts;\n\t}\n\t\n\t/**\n\t * Tally another range with given number of elements.\n\t */\n\tvoid addRange(index_t numElts) {\n\t\trangen_++;\n\t\teltn_ += numElts;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that this QVal is internally consistent and consistent\n\t * with the contents of the given cache.\n\t */\n\tbool repOk(const AlignmentCache<index_t>& ac) const;\n#endif\n\nprotected:\n\n\tindex_t i_;      // idx of first elt in qlist\n\tindex_t rangen_; // # ranges (= # associated reference substrings)\n\tindex_t eltn_;   // # elements (total)\n};\n\n/**\n * Key for the suffix array multimap: the reference substring and its\n * length.  
Same as QKey so I typedef it.\n */\ntypedef QKey SAKey;\n\n/**\n * Payload for the suffix array multimap: (a) the top element of the\n * range in BWT, (b) the offset of the first elt in the salist, (c)\n * length of the range.\n */\ntemplate <typename index_t>\nstruct SAVal {\n\n\tSAVal() : topf(), topb(), i(), len(OFF_MASK) { }\n\n\t/**\n\t * Return true iff the SAVal is valid.\n\t */\n\tbool valid() { return len != (index_t)OFF_MASK; }\n\n#ifndef NDEBUG\n\t/**\n\t * Check that this SAVal is internally consistent and consistent\n\t * with the contents of the given cache.\n\t */\n\tbool repOk(const AlignmentCache<index_t>& ac) const;\n#endif\n\t\n\t/**\n\t * Initialize the SAVal.\n\t */\n\tvoid init(\n\t\tindex_t tf,\n\t\tindex_t tb,\n\t\tindex_t ii,\n\t\tindex_t ln)\n\t{\n\t\ttopf = tf;\n\t\ttopb = tb;\n\t\ti = ii;\n\t\tlen = ln;\n\t}\n\n\tindex_t topf;  // top in BWT\n\tindex_t topb;  // top in BWT'\n\tindex_t i;     // idx of first elt in salist\n\tindex_t len;   // length of range\n};\n\n/**\n * One data structure that encapsulates all of the cached information\n * associated with a particular reference substring.  
This is useful\n * for summarizing what info should be added to the cache for a partial\n * alignment.\n */\ntemplate <typename index_t>\nclass SATuple {\n\npublic:\n\n\tSATuple() { reset(); };\n\n\tSATuple(SAKey k, index_t tf, index_t tb, TSlice o) {\n\t\tinit(k, tf, tb, o);\n\t}\n\t\n\tvoid init(SAKey k, index_t tf, index_t tb, TSlice o) {\n\t\tkey = k; topf = tf; topb = tb; offs = o;\n\t}\n\n\t/**\n\t * Initialize this SATuple from a subrange of the SATuple 'src'.\n\t */\n\tvoid init(const SATuple& src, index_t first, index_t last) {\n\t\tassert_neq((index_t)OFF_MASK, src.topb);\n\t\tkey = src.key;\n\t\ttopf = (index_t)(src.topf + first);\n\t\ttopb = (index_t)OFF_MASK; // unknown!\n\t\toffs.init(src.offs, first, last);\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that this SATuple is internally consistent and that its\n\t * PListSlice is consistent with its backing PList.\n\t */\n\tbool repOk() const {\n\t\tassert(offs.repOk());\n\t\treturn true;\n\t}\n#endif\n\n\t/**\n\t * Function for ordering SATuples.  This is used when prioritizing which to\n\t * explore first when extending seed hits into full alignments.  
Smaller\n\t * ranges get higher priority and we use 'top' to break ties, though any\n\t * way of breaking a tie would be fine.\n\t */\n\tbool operator<(const SATuple& o) const {\n\t\tif(offs.size() < o.offs.size()) {\n\t\t\treturn true;\n\t\t}\n\t\tif(offs.size() > o.offs.size()) {\n\t\t\treturn false;\n\t\t}\n\t\treturn topf < o.topf;\n\t}\n\tbool operator>(const SATuple& o) const {\n\t\tif(offs.size() < o.offs.size()) {\n\t\t\treturn false;\n\t\t}\n\t\tif(offs.size() > o.offs.size()) {\n\t\t\treturn true;\n\t\t}\n\t\treturn topf > o.topf;\n\t}\n\t\n\tbool operator==(const SATuple& o) const {\n\t\treturn key == o.key && topf == o.topf && topb == o.topb && offs == o.offs;\n\t}\n\n\tvoid reset() { topf = topb = (index_t)OFF_MASK; offs.reset(); }\n\t\n\t/**\n\t * Set the length to be at most the original length.\n\t */\n\tvoid setLength(index_t nlen) {\n\t\tassert_leq(nlen, offs.size());\n\t\toffs.setLength(nlen);\n\t}\n\t\n\t/**\n\t * Return the number of times this reference substring occurs in the\n\t * reference, which is also the size of the 'offs' TSlice.\n\t */\n\tindex_t size() const { return (index_t)offs.size(); }\n\n\t// bot/length of SA range equals offs.size()\n\tSAKey   key;  // sequence key\n\tindex_t topf;  // top in BWT index\n\tindex_t topb;  // top in BWT' index\n\tTSlice  offs; // offsets\n};\n\n/**\n * Encapsulate the data structures and routines that constitute a\n * particular cache, i.e., a particular stratum of the cache system,\n * which might comprise many strata.\n *\n * Each thread has a \"current-read\" AlignmentCache which is used to\n * build and store subproblem results as alignment is performed.  When\n * we're finished with a read, we might copy the cached results for\n * that read (and perhaps a bundle of other recently-aligned reads) to\n * a higher-level \"across-read\" cache.  
Higher-level caches may or may\n * not be shared among threads.\n *\n * A cache consists chiefly of two multimaps, each implemented as a\n * Red-Black tree map backed by an EList.  A 'version' counter is\n * incremented every time the cache is cleared.\n */\ntemplate <typename index_t>\nclass AlignmentCache {\n\n\ttypedef RedBlackNode<QKey,  QVal<index_t> >  QNode;\n\ttypedef RedBlackNode<SAKey, SAVal<index_t> > SANode;\n\n\ttypedef PList<SAKey, CACHE_PAGE_SZ> TQList;\n\ttypedef PList<index_t, CACHE_PAGE_SZ> TSAList;\n\npublic:\n\n\tAlignmentCache(\n\t\tuint64_t bytes,\n\t\tbool shared) :\n\t\tpool_(bytes, CACHE_PAGE_SZ, CA_CAT),\n\t\tqmap_(CACHE_PAGE_SZ, CA_CAT),\n\t\tqlist_(CA_CAT),\n\t\tsamap_(CACHE_PAGE_SZ, CA_CAT),\n\t\tsalist_(CA_CAT),\n\t\tshared_(shared),\n        mutex_m(),\n\t\tversion_(0)\n\t{\n\t}\n\n\t/**\n\t * Given a QVal, populate the given EList of SATuples with records\n\t * describing all of the cached information about the QVal's\n\t * reference substrings.\n\t */\n\ttemplate <int S>\n\tvoid queryQval(\n\t\tconst QVal<index_t>& qv,\n\t\tEList<SATuple<index_t>, S>& satups,\n\t\tindex_t& nrange,\n\t\tindex_t& nelt,\n\t\tbool getLock = true)\n\t{\n        ThreadSafe ts(lockPtr(), shared_ && getLock);\n\t\tassert(qv.repOk(*this));\n\t\tconst index_t refi = qv.offset();\n\t\tconst index_t reff = refi + qv.numRanges();\n\t\t// For each reference sequence sufficiently similar to the\n\t\t// query sequence in the QKey...\n\t\tfor(index_t i = refi; i < reff; i++) {\n\t\t\t// Get corresponding SAKey, containing similar reference\n\t\t\t// sequence & length\n\t\t\tSAKey sak = qlist_.get(i);\n\t\t\t// Shouldn't have identical keys in qlist_\n\t\t\tassert(i == refi || qlist_.get(i) != qlist_.get(i-1));\n\t\t\t// Get corresponding SANode\n\t\t\tSANode *n = samap_.lookup(sak);\n\t\t\tassert(n != NULL);\n\t\t\tconst SAVal<index_t>& sav = n->payload;\n\t\t\tassert(sav.repOk(*this));\n\t\t\tif(sav.len > 0) 
{\n\t\t\t\tnrange++;\n\t\t\t\tsatups.expand();\n\t\t\t\tsatups.back().init(sak, sav.topf, sav.topb, TSlice(salist_, sav.i, sav.len));\n\t\t\t\tnelt += sav.len;\n#ifndef NDEBUG\n\t\t\t\t// Shouldn't add consecutive identical entries to satups\n\t\t\t\tif(i > refi) {\n\t\t\t\t\tconst SATuple<index_t> b1 = satups.back();\n\t\t\t\t\tconst SATuple<index_t> b2 = satups[satups.size()-2];\n\t\t\t\t\tassert(b1.key != b2.key || b1.topf != b2.topf || b1.offs != b2.offs);\n\t\t\t\t}\n#endif\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Return true iff the cache has no entries in it.\n\t */\n\tbool empty() const {\n\t\tbool ret = qmap_.empty();\n\t\tassert(!ret || qlist_.empty());\n\t\tassert(!ret || samap_.empty());\n\t\tassert(!ret || salist_.empty());\n\t\treturn ret;\n\t}\n\t\n\t/**\n\t * Add a new query key ('qk', usually a 2-bit encoded substring of\n\t * the read) as the key in a new Red-Black node in the qmap and\n\t * return a pointer to the node's QVal.\n\t *\n\t * The expectation is that the caller is about to set about finding\n\t * associated reference substrings, and that there will be future\n\t * calls to addOnTheFly to add associations to reference substrings\n\t * found.\n\t */\n\tQVal<index_t>* add(\n\t\tconst QKey& qk,\n\t\tbool *added,\n\t\tbool getLock = true)\n\t{\n        ThreadSafe ts(lockPtr(), shared_ && getLock);\n\t\tassert(qk.cacheable());\n\t\tQNode *n = qmap_.add(pool(), qk, added);\n\t\treturn (n != NULL ? 
&n->payload : NULL);\n\t}\n\n\t/**\n\t * Add a new association between a read sequence (represented by\n\t * 'qv') and a reference sequence ('sak').\n\t */\n\tbool addOnTheFly(\n\t\tQVal<index_t>& qv, // qval that points to the range of reference substrings\n\t\tconst SAKey& sak,  // the key holding the reference substring\n\t\tindex_t topf,      // top range elt in BWT index\n\t\tindex_t botf,      // bottom range elt in BWT index\n\t\tindex_t topb,      // top range elt in BWT' index\n\t\tindex_t botb,      // bottom range elt in BWT' index\n\t\tbool getLock = true);\n\n\t/**\n\t * Clear the cache, i.e. turn it over.  All HitGens referring to\n\t * ranges in this cache will become invalid and the corresponding\n\t * reads will have to be re-aligned.\n\t */\n\tvoid clear(bool getLock = true) {\n        ThreadSafe ts(lockPtr(), shared_ && getLock);\n\t\tpool_.clear();\n\t\tqmap_.clear();\n\t\tqlist_.clear();\n\t\tsamap_.clear();\n\t\tsalist_.clear();\n\t\tversion_++;\n\t}\n\n\t/**\n\t * Return the number of keys in the query multimap.\n\t */\n\tindex_t qNumKeys() const { return (index_t)qmap_.size(); }\n\n\t/**\n\t * Return the number of keys in the suffix array multimap.\n\t */\n\tindex_t saNumKeys() const { return (index_t)samap_.size(); }\n\n\t/**\n\t * Return the number of elements in the reference substring list.\n\t */\n\tindex_t qSize() const { return (index_t)qlist_.size(); }\n\n\t/**\n\t * Return the number of elements in the SA range list.\n\t */\n\tindex_t saSize() const { return (index_t)salist_.size(); }\n\n\t/**\n\t * Return the pool.\n\t */\n\tPool& pool() { return pool_; }\n\t\n\t/**\n\t * Return the lock object.\n\t */\n\tMUTEX_T& lock() {\n\t    return mutex_m;\n\t}\n\n\t/**\n\t * Return a const pointer to the lock object.  
This allows us to\n\t * write const member functions that grab the lock.\n\t */\n\tMUTEX_T* lockPtr() const {\n\t    return const_cast<MUTEX_T*>(&mutex_m);\n\t}\n\t\n\t/**\n\t * Return true iff this cache is shared among threads.\n\t */\n\tbool shared() const { return shared_; }\n\t\n\t/**\n\t * Return the current \"version\" of the cache, i.e. the total number\n\t * of times it has turned over since its creation.\n\t */\n\tuint32_t version() const { return version_; }\n\nprotected:\n\n\tPool                   pool_;   // dispenses memory pages\n\tRedBlack<QKey, QVal<index_t> >   qmap_;   // map from query substrings to reference substrings\n\tTQList                 qlist_;  // list of reference substrings\n\tRedBlack<SAKey, SAVal<index_t> > samap_;  // map from reference substrings to SA ranges\n\tTSAList                salist_; // list of SA ranges\n\t\n\tbool     shared_;  // true -> this cache is global\n\tMUTEX_T mutex_m;    // mutex used for synchronization in case the cache is shared.\n\tuint32_t version_; // cache version\n};\n\n/**\n * Interface used to query and update a pair of caches: one thread-\n * local and unsynchronized, another shared and synchronized.  One or\n * both can be NULL.\n */\ntemplate <typename index_t>\nclass AlignmentCacheIface {\n\npublic:\n\n\tAlignmentCacheIface(\n\t\tAlignmentCache<index_t> *current,\n\t\tAlignmentCache<index_t> *local,\n\t\tAlignmentCache<index_t> *shared) :\n\t\tqk_(),\n\t\tqv_(NULL),\n\t\tcacheable_(false),\n\t\trangen_(0),\n\t\teltsn_(0),\n\t\tcurrent_(current),\n\t\tlocal_(local),\n\t\tshared_(shared)\n\t{\n\t\tassert(current_ != NULL);\n\t}\n\n#if 0\n\t/**\n\t * Query the relevant set of caches, looking for a QVal to go with\n\t * the provided QKey.  If the QVal is found in a cache other than\n\t * the current-read cache, it is copied into the current-read cache\n\t * first and the QVal pointer for the current-read cache is\n\t * returned.  
This function never returns a pointer from any cache\n\t * other than the current-read cache.  If the QVal could not be\n\t * found in any cache OR if the QVal was found in a cache other\n\t * than the current-read cache but could not be copied into the\n\t * current-read cache, NULL is returned.\n\t */\n\tQVal* queryCopy(const QKey& qk, bool getLock = true) {\n\t\tassert(qk.cacheable());\n\t\tAlignmentCache* caches[3] = { current_, local_, shared_ };\n\t\tfor(int i = 0; i < 3; i++) {\n\t\t\tif(caches[i] == NULL) continue;\n\t\t\tQVal* qv = caches[i]->query(qk, getLock);\n\t\t\tif(qv != NULL) {\n\t\t\t\tif(i == 0) return qv;\n\t\t\t\tif(!current_->copy(qk, *qv, *caches[i], getLock)) {\n\t\t\t\t\t// Exhausted memory in the current cache while\n\t\t\t\t\t// attempting to copy in the qk\n\t\t\t\t\treturn NULL;\n\t\t\t\t}\n\t\t\t\tQVal* curqv = current_->query(qk, getLock);\n\t\t\t\tassert(curqv != NULL);\n\t\t\t\treturn curqv;\n\t\t\t}\n\t\t}\n\t\treturn NULL;\n\t}\n\n\t/**\n\t * Query the relevant set of caches, looking for a QVal to go with\n\t * the provided QKey.  If a QVal is found and which is non-NULL,\n\t * *which is set to 0 if the qval was found in the current-read\n\t * cache, 1 if it was found in the local across-read cache, and 2\n\t * if it was found in the shared across-read cache.\n\t */\n\tinline QVal* query(\n\t\tconst QKey& qk,\n\t\tAlignmentCache** which,\n\t\tbool getLock = true)\n\t{\n\t\tassert(qk.cacheable());\n\t\tAlignmentCache* caches[3] = { current_, local_, shared_ };\n\t\tfor(int i = 0; i < 3; i++) {\n\t\t\tif(caches[i] == NULL) continue;\n\t\t\tQVal* qv = caches[i]->query(qk, getLock);\n\t\t\tif(qv != NULL) {\n\t\t\t\tif(which != NULL) *which = caches[i];\n\t\t\t\treturn qv;\n\t\t\t}\n\t\t}\n\t\treturn NULL;\n\t}\n#endif\n\n\t/**\n\t * This function is called whenever we start to align a new read or\n\t * read substring.  
We make a key for it and store the key in qk_.\n\t * If the sequence is uncacheable, we don't actually add it to the\n\t * map but the corresponding reference substrings are still added\n\t * to the qlist_.\n\t *\n\t * Returns:\n\t *  -1 if out of memory\n\t *  0 otherwise: the caller must search for the key (the cache\n\t *    lookup below is commented out, so a cached result is never\n\t *    returned)\n\t */\n\tint beginAlign(\n\t\tconst BTDnaString& seq,\n\t\tconst BTString& qual,\n\t\tQVal<index_t>& qv,              // out: filled in if we find it in the cache\n\t\tbool getLock = true)\n\t{\n\t\tassert(repOk());\n\t\tqk_.init(seq ASSERT_ONLY(, tmpdnastr_));\n\t\t//if(qk_.cacheable() && (qv_ = current_->query(qk_, getLock)) != NULL) {\n\t\t//\t// qv_ holds the answer\n\t\t//\tassert(qv_->valid());\n\t\t//\tqv = *qv_;\n\t\t//\tresetRead();\n\t\t//\treturn 1; // found in cache\n\t\t//} else\n\t\tif(qk_.cacheable()) {\n\t\t\t// Make a QNode for this key and possibly add the QNode to the\n\t\t\t// Red-Black map; but if 'seq' isn't cacheable, just create the\n\t\t\t// QNode (without adding it to the map).\n\t\t\tqv_ = current_->add(qk_, &cacheable_, getLock);\n\t\t} else {\n\t\t\tqv_ = &qvbuf_;\n\t\t}\n\t\tif(qv_ == NULL) {\n\t\t\tresetRead();\n \t\t\treturn -1; // Out of memory\n\t\t}\n\t\tqv_->reset();\n\t\treturn 0; // Need to search for it\n\t}\n\tASSERT_ONLY(BTDnaString tmpdnastr_);\n\t\n\t/**\n\t * Called when the caller is finished aligning a read (and so is\n\t * finished adding associated reference strings).  
Returns a copy of the\n\t * final QVal object and resets the alignment state of the\n\t * current-read cache.\n\t *\n\t * Also, if the alignment is cacheable, it commits it to the next\n\t * cache up in the cache hierarchy.\n\t */\n\tQVal<index_t> finishAlign(bool getLock = true) {\n\t\tif(!qv_->valid()) {\n\t\t\tqv_->init(0, 0, 0);\n\t\t}\n\t\t// Copy this pointer because we're about to reset the qv_ field\n\t\t// to NULL\n\t\tQVal<index_t>* qv = qv_;\n\t\t// Commit the contents of the current-read cache to the next\n\t\t// cache up in the hierarchy.\n\t\t// If qk is cacheable, then it must be in the cache\n#if 0\n\t\tif(qk_.cacheable()) {\n\t\t\tAlignmentCache* caches[3] = { current_, local_, shared_ };\n\t\t\tASSERT_ONLY(AlignmentCache* which);\n\t\t\tASSERT_ONLY(QVal* qv2 = query(qk_, &which, true));\n\t\t\tassert(qv2 == qv);\n\t\t\tassert(which == current_);\n\t\t\tfor(int i = 1; i < 3; i++) {\n\t\t\t\tif(caches[i] != NULL) {\n\t\t\t\t\t// Copy this key/value pair to the higher\n\t\t\t\t\t// level cache and, if its memory is exhausted,\n\t\t\t\t\t// clear the cache and try again.\n\t\t\t\t\tcaches[i]->clearCopy(qk_, *qv_, *current_, getLock);\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n#endif\n\t\t// Reset the state in this iface in preparation for the next\n\t\t// alignment.\n\t\tresetRead();\n\t\tassert(repOk());\n\t\treturn *qv;\n\t}\n\n\t/**\n\t * A call to this member indicates that the caller has finished\n\t * with the last read (if any) and is ready to work on the next.\n\t * This gives the cache a chance to reset some of its state if\n\t * necessary.\n\t */\n\tvoid nextRead() {\n\t\tcurrent_->clear();\n\t\tresetRead();\n\t\tassert(!aligning());\n\t}\n\t\n\t/**\n\t * Return true iff we're in the middle of aligning a sequence.\n\t */\n\tbool aligning() const {\n\t\treturn qv_ != NULL;\n\t}\n\t\n\t/**\n\t * Clears the current-read, local and shared caches.\n\t */\n\tvoid clear() {\n\t\tif(current_ != NULL) current_->clear();\n\t\tif(local_   != NULL) 
local_->clear();\n\t\tif(shared_  != NULL) shared_->clear();\n\t}\n\t\n\t/**\n\t * Add an alignment to the running list of alignments being\n\t * compiled for the current read in the local cache.\n\t */\n\tbool addOnTheFly(\n\t\tconst BTDnaString& rfseq, // reference sequence close to read seq\n\t\tindex_t topf,            // top in BWT index\n\t\tindex_t botf,            // bot in BWT index\n\t\tindex_t topb,            // top in BWT' index\n\t\tindex_t botb,            // bot in BWT' index\n\t\tbool getLock = true)      // true -> lock is not held by caller\n\t{\n\t\t\n\t\tassert(aligning());\n\t\tassert(repOk());\n\t\tASSERT_ONLY(BTDnaString tmp);\n\t\tSAKey sak(rfseq ASSERT_ONLY(, tmp));\n\t\t//assert(sak.cacheable());\n\t\tif(current_->addOnTheFly((*qv_), sak, topf, botf, topb, botb, getLock)) {\n\t\t\trangen_++;\n\t\t\teltsn_ += (botf-topf);\n\t\t\treturn true;\n\t\t}\n\t\treturn false;\n\t}\n\n\t/**\n\t * Given a QVal, populate the given EList of SATuples with records\n\t * describing all of the cached information about the QVal's\n\t * reference substrings.\n\t */\n\ttemplate<int S>\n\tvoid queryQval(\n\t\tconst QVal<index_t>& qv,\n\t\tEList<SATuple<index_t>, S>& satups,\n\t\tindex_t& nrange,\n\t\tindex_t& nelt,\n\t\tbool getLock = true)\n\t{\n\t\tcurrent_->queryQval(qv, satups, nrange, nelt, getLock);\n\t}\n\n\t/**\n\t * Return a pointer to the current-read cache object.\n\t */\n\tconst AlignmentCache<index_t>* currentCache() const { return current_; }\n\t\n\tindex_t curNumRanges() const { return rangen_; }\n\tindex_t curNumElts()   const { return eltsn_;  }\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that AlignmentCacheIface is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert(current_ != NULL);\n\t\tassert_geq(eltsn_, rangen_);\n\t\tif(qv_ == NULL) {\n\t\t\tassert_eq(0, rangen_);\n\t\t\tassert_eq(0, eltsn_);\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Return the alignment cache for the current read.\n\t */\n\tconst 
AlignmentCache<index_t>& current() {\n\t\treturn *current_;\n\t}\n\nprotected:\n\n\t/**\n\t * Reset fields encoding info about the in-process read.\n\t */\n\tvoid resetRead() {\n\t\tcacheable_ = false;\n\t\trangen_ = eltsn_ = 0;\n\t\tqv_ = NULL;\n\t}\n\n\tQKey qk_;  // key representation for current read substring\n\tQVal<index_t> *qv_; // pointer to value representation for current read substring\n\tQVal<index_t> qvbuf_; // buffer for when key is uncacheable but we need a qv\n\tbool cacheable_; // true iff the read substring currently being aligned is cacheable\n\t\n\tindex_t rangen_; // number of ranges since last alignment job began\n\tindex_t eltsn_;  // number of elements since last alignment job began\n\n\tAlignmentCache<index_t> *current_; // cache dedicated to the current read\n\tAlignmentCache<index_t> *local_;   // local, unsynchronized cache\n\tAlignmentCache<index_t> *shared_;  // shared, synchronized cache\n};\n\n#ifndef NDEBUG\n/**\n * Check that this QVal is internally consistent and consistent\n * with the contents of the given cache.\n */\ntemplate <typename index_t>\nbool QVal<index_t>::repOk(const AlignmentCache<index_t>& ac) const {\n\tif(rangen_ > 0) {\n\t\tassert_lt(i_, ac.qSize());\n\t\tassert_leq(i_ + rangen_, ac.qSize());\n\t}\n\tassert_geq(eltn_, rangen_);\n\treturn true;\n}\n#endif\n\n#ifndef NDEBUG\n/**\n * Check that this SAVal is internally consistent and consistent\n * with the contents of the given cache.\n */\ntemplate <typename index_t>\nbool SAVal<index_t>::repOk(const AlignmentCache<index_t>& ac) const {\n\tassert(len == 0 || i < ac.saSize());\n\tassert_leq(i + len, ac.saSize());\n\treturn true;\n}\n#endif\n\n/**\n * Add a new association between a read sequence (represented by\n * 'qv') and a reference sequence ('sak').\n */\ntemplate <typename index_t>\nbool AlignmentCache<index_t>::addOnTheFly(\n\t\t\t\t\t\t\t\t QVal<index_t>& qv, // qval that points to the range of reference substrings\n\t\t\t\t\t\t\t\t const SAKey& sak,  // the key holding the 
reference substring\n\t\t\t\t\t\t\t\t index_t topf,      // top range elt in BWT index\n\t\t\t\t\t\t\t\t index_t botf,      // bottom range elt in BWT index\n\t\t\t\t\t\t\t\t index_t topb,      // top range elt in BWT' index\n\t\t\t\t\t\t\t\t index_t botb,      // bottom range elt in BWT' index\n\t\t\t\t\t\t\t\t bool getLock)\n{\n    ThreadSafe ts(lockPtr(), shared_ && getLock);\n\tbool added = true;\n\t// If this is the first reference sequence we're associating with\n\t// the query sequence, initialize the QVal.\n\tif(!qv.valid()) {\n\t\tqv.init((index_t)qlist_.size(), 0, 0);\n\t}\n\tqv.addRange(botf-topf); // update tally for # ranges and # elts\n\tif(!qlist_.add(pool(), sak)) {\n\t\treturn false; // Exhausted pool memory\n\t}\n#ifndef NDEBUG\n\tfor(index_t i = qv.offset(); i < qlist_.size(); i++) {\n\t\tif(i > qv.offset()) {\n\t\t\tassert(qlist_.get(i) != qlist_.get(i-1));\n\t\t}\n\t}\n#endif\n\tassert_eq(qv.offset() + qv.numRanges(), qlist_.size());\n\tSANode *s = samap_.add(pool(), sak, &added);\n\tif(s == NULL) {\n\t\treturn false; // Exhausted pool memory\n\t}\n\tassert(s->key.repOk());\n\tif(added) {\n\t\ts->payload.i = (index_t)salist_.size();\n\t\ts->payload.len = botf - topf;\n\t\ts->payload.topf = topf;\n\t\ts->payload.topb = topb;\n\t\tfor(size_t j = 0; j < (botf-topf); j++) {\n\t\t\tif(!salist_.add(pool(), (index_t)0xffffffff)) {\n\t\t\t\t// Change the payload's len field\n\t\t\t\ts->payload.len = (uint32_t)j;\n\t\t\t\treturn false; // Exhausted pool memory\n\t\t\t}\n\t\t}\n\t\tassert(s->payload.repOk(*this));\n\t}\n\t// Now that we know all allocations have succeeded, we can do a few final\n\t// updates\n\t\n\treturn true; \n}\n\n#endif /*ALIGNER_CACHE_H_*/\n"
  },
  {
    "path": "aligner_metrics.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_METRICS_H_\n#define ALIGNER_METRICS_H_\n\n#include <math.h>\n#include <iostream>\n#include \"alphabet.h\"\n#include \"timer.h\"\n#include \"sstring.h\"\n\nusing namespace std;\n\n/**\n * Borrowed from http://www.johndcook.com/standard_deviation.html,\n * which in turn is borrowed from Knuth.\n */\nclass RunningStat {\npublic:\n\tRunningStat() : m_n(0), m_tot(0.0) { }\n\n\tvoid clear() {\n\t\tm_n = 0;\n\t\tm_tot = 0.0;\n\t}\n\n\tvoid push(float x) {\n\t\tm_n++;\n\t\tm_tot += x;\n\t\t// See Knuth TAOCP vol 2, 3rd edition, page 232\n\t\tif (m_n == 1) {\n\t\t\tm_oldM = m_newM = x;\n\t\t\tm_oldS = 0.0;\n\t\t} else {\n\t\t\tm_newM = m_oldM + (x - m_oldM)/m_n;\n\t\t\tm_newS = m_oldS + (x - m_oldM)*(x - m_newM);\n\t\t\t// set up for next iteration\n\t\t\tm_oldM = m_newM;\n\t\t\tm_oldS = m_newS;\n\t\t}\n\t}\n\n\tint num() const {\n\t\treturn m_n;\n\t}\n\n\tdouble tot() const {\n\t\treturn m_tot;\n\t}\n\n\tdouble mean() const {\n\t\treturn (m_n > 0) ? m_newM : 0.0;\n\t}\n\n\tdouble variance() const {\n\t\treturn ( (m_n > 1) ? 
m_newS/(m_n - 1) : 0.0 );\n\t}\n\n\tdouble stddev() const {\n\t\treturn sqrt(variance());\n\t}\n\nprivate:\n\tint m_n;\n\tdouble m_tot;\n\tdouble m_oldM, m_newM, m_oldS, m_newS;\n};\n\n/**\n * Encapsulates a set of metrics that we would like an aligner to keep\n * track of, so that we can possibly use it to diagnose performance\n * issues.\n */\nclass AlignerMetrics {\n\npublic:\n\n\tAlignerMetrics() :\n\t\tcurBacktracks_(0),\n\t\tcurBwtOps_(0),\n\t\tfirst_(true),\n\t\tcurIsLowEntropy_(false),\n\t\tcurIsHomoPoly_(false),\n\t\tcurHadRanges_(false),\n\t\tcurNumNs_(0),\n\t\treads_(0),\n\t\thomoReads_(0),\n\t\tlowEntReads_(0),\n\t\thiEntReads_(0),\n\t\talignedReads_(0),\n\t\tunalignedReads_(0),\n\t\tthreeOrMoreNReads_(0),\n\t\tlessThanThreeNRreads_(0),\n\t\tbwtOpsPerRead_(),\n\t\tbacktracksPerRead_(),\n\t\tbwtOpsPerHomoRead_(),\n\t\tbacktracksPerHomoRead_(),\n\t\tbwtOpsPerLoEntRead_(),\n\t\tbacktracksPerLoEntRead_(),\n\t\tbwtOpsPerHiEntRead_(),\n\t\tbacktracksPerHiEntRead_(),\n\t\tbwtOpsPerAlignedRead_(),\n\t\tbacktracksPerAlignedRead_(),\n\t\tbwtOpsPerUnalignedRead_(),\n\t\tbacktracksPerUnalignedRead_(),\n\t\tbwtOpsPer0nRead_(),\n\t\tbacktracksPer0nRead_(),\n\t\tbwtOpsPer1nRead_(),\n\t\tbacktracksPer1nRead_(),\n\t\tbwtOpsPer2nRead_(),\n\t\tbacktracksPer2nRead_(),\n\t\tbwtOpsPer3orMoreNRead_(),\n\t\tbacktracksPer3orMoreNRead_(),\n\t\ttimer_(cout, \"\", false)\n\t\t{ }\n\n\tvoid printSummary() {\n\t\tif(!first_) {\n\t\t\tfinishRead();\n\t\t}\n\t\tcout << \"AlignerMetrics:\" << endl;\n\t\tcout << \"  # Reads:             \" << reads_ << endl;\n\t\tfloat hopct = (reads_ > 0) ? (((float)homoReads_)/((float)reads_)) : (0.0f);\n\t\thopct *= 100.0f;\n\t\tcout << \"  % homo-polymeric:    \" << (hopct) << endl;\n\t\tfloat lopct = (reads_ > 0) ? ((float)lowEntReads_/(float)(reads_)) : (0.0f);\n\t\tlopct *= 100.0f;\n\t\tcout << \"  % low-entropy:       \" << (lopct) << endl;\n\t\tfloat unpct = (reads_ > 0) ? 
((float)unalignedReads_/(float)(reads_)) : (0.0f);\n\t\tunpct *= 100.0f;\n\t\tcout << \"  % unaligned:         \" << (unpct) << endl;\n\t\tfloat npct = (reads_ > 0) ? ((float)threeOrMoreNReads_/(float)(reads_)) : (0.0f);\n\t\tnpct *= 100.0f;\n\t\tcout << \"  % with 3 or more Ns: \" << (npct) << endl;\n\t\tcout << endl;\n\t\tcout << \"  Total BWT ops:    avg: \" << bwtOpsPerRead_.mean() << \", stddev: \" << bwtOpsPerRead_.stddev() << endl;\n\t\tcout << \"  Total Backtracks: avg: \" << backtracksPerRead_.mean() << \", stddev: \" << backtracksPerRead_.stddev() << endl;\n\t\ttime_t elapsed = timer_.elapsed();\n\t\tcout << \"  BWT ops per second:    \" << (bwtOpsPerRead_.tot()/elapsed) << endl;\n\t\tcout << \"  Backtracks per second: \" << (backtracksPerRead_.tot()/elapsed) << endl;\n\t\tcout << endl;\n\t\tcout << \"  Homo-poly:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPerHomoRead_.mean() << \", stddev: \" << bwtOpsPerHomoRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPerHomoRead_.mean() << \", stddev: \" << backtracksPerHomoRead_.stddev() << endl;\n\t\tcout << \"  Low-entropy:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPerLoEntRead_.mean() << \", stddev: \" << bwtOpsPerLoEntRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPerLoEntRead_.mean() << \", stddev: \" << backtracksPerLoEntRead_.stddev() << endl;\n\t\tcout << \"  High-entropy:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPerHiEntRead_.mean() << \", stddev: \" << bwtOpsPerHiEntRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPerHiEntRead_.mean() << \", stddev: \" << backtracksPerHiEntRead_.stddev() << endl;\n\t\tcout << endl;\n\t\tcout << \"  Unaligned:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPerUnalignedRead_.mean() << \", stddev: \" << bwtOpsPerUnalignedRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPerUnalignedRead_.mean() << \", 
stddev: \" << backtracksPerUnalignedRead_.stddev() << endl;\n\t\tcout << \"  Aligned:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPerAlignedRead_.mean() << \", stddev: \" << bwtOpsPerAlignedRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPerAlignedRead_.mean() << \", stddev: \" << backtracksPerAlignedRead_.stddev() << endl;\n\t\tcout << endl;\n\t\tcout << \"  0 Ns:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPer0nRead_.mean() << \", stddev: \" << bwtOpsPer0nRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPer0nRead_.mean() << \", stddev: \" << backtracksPer0nRead_.stddev() << endl;\n\t\tcout << \"  1 N:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPer1nRead_.mean() << \", stddev: \" << bwtOpsPer1nRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPer1nRead_.mean() << \", stddev: \" << backtracksPer1nRead_.stddev() << endl;\n\t\tcout << \"  2 Ns:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPer2nRead_.mean() << \", stddev: \" << bwtOpsPer2nRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPer2nRead_.mean() << \", stddev: \" << backtracksPer2nRead_.stddev() << endl;\n\t\tcout << \"  >2 Ns:\" << endl;\n\t\tcout << \"    BWT ops:    avg: \" << bwtOpsPer3orMoreNRead_.mean() << \", stddev: \" << bwtOpsPer3orMoreNRead_.stddev() << endl;\n\t\tcout << \"    Backtracks: avg: \" << backtracksPer3orMoreNRead_.mean() << \", stddev: \" << backtracksPer3orMoreNRead_.stddev() << endl;\n\t\tcout << endl;\n\t}\n\n\t/**\n\t *\n\t */\n\tvoid nextRead(const BTDnaString& read) {\n\t\tif(!first_) {\n\t\t\tfinishRead();\n\t\t}\n\t\tfirst_ = false;\n\t\t//float ent = entropyDna5(read);\n\t\tfloat ent = 0.0f;\n\t\tcurIsLowEntropy_ = (ent < 0.75f);\n\t\tcurIsHomoPoly_ = (ent < 0.001f);\n\t\tcurHadRanges_ = false;\n\t\tcurBwtOps_ = 0;\n\t\tcurBacktracks_ = 0;\n\t\t// Count Ns\n\t\tcurNumNs_ = 0;\n\t\tconst size_t len = 
read.length();\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tif((int)read[i] == 4) curNumNs_++;\n\t\t}\n\t}\n\n\t/**\n\t *\n\t */\n\tvoid setReadHasRange() {\n\t\tcurHadRanges_ = true;\n\t}\n\n\t/**\n\t * Commit the running statistics for this read to\n\t */\n\tvoid finishRead() {\n\t\treads_++;\n\t\tif(curIsHomoPoly_) homoReads_++;\n\t\telse if(curIsLowEntropy_) lowEntReads_++;\n\t\telse hiEntReads_++;\n\t\tif(curHadRanges_) alignedReads_++;\n\t\telse unalignedReads_++;\n\t\tbwtOpsPerRead_.push((float)curBwtOps_);\n\t\tbacktracksPerRead_.push((float)curBacktracks_);\n\t\t// Drill down by entropy\n\t\tif(curIsHomoPoly_) {\n\t\t\tbwtOpsPerHomoRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPerHomoRead_.push((float)curBacktracks_);\n\t\t} else if(curIsLowEntropy_) {\n\t\t\tbwtOpsPerLoEntRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPerLoEntRead_.push((float)curBacktracks_);\n\t\t} else {\n\t\t\tbwtOpsPerHiEntRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPerHiEntRead_.push((float)curBacktracks_);\n\t\t}\n\t\t// Drill down by whether it aligned\n\t\tif(curHadRanges_) {\n\t\t\tbwtOpsPerAlignedRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPerAlignedRead_.push((float)curBacktracks_);\n\t\t} else {\n\t\t\tbwtOpsPerUnalignedRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPerUnalignedRead_.push((float)curBacktracks_);\n\t\t}\n\t\tif(curNumNs_ == 0) {\n\t\t\tlessThanThreeNRreads_++;\n\t\t\tbwtOpsPer0nRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPer0nRead_.push((float)curBacktracks_);\n\t\t} else if(curNumNs_ == 1) {\n\t\t\tlessThanThreeNRreads_++;\n\t\t\tbwtOpsPer1nRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPer1nRead_.push((float)curBacktracks_);\n\t\t} else if(curNumNs_ == 2) {\n\t\t\tlessThanThreeNRreads_++;\n\t\t\tbwtOpsPer2nRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPer2nRead_.push((float)curBacktracks_);\n\t\t} else 
{\n\t\t\tthreeOrMoreNReads_++;\n\t\t\tbwtOpsPer3orMoreNRead_.push((float)curBwtOps_);\n\t\t\tbacktracksPer3orMoreNRead_.push((float)curBacktracks_);\n\t\t}\n\t}\n\n\t// Running-total of the number of backtracks and BWT ops for the\n\t// current read\n\tuint32_t curBacktracks_;\n\tuint32_t curBwtOps_;\n\nprotected:\n\n\tbool first_;\n\n\t// true iff the current read is low entropy\n\tbool curIsLowEntropy_;\n\t// true if current read is all 1 char (or very close)\n\tbool curIsHomoPoly_;\n\t// true iff the current read has had one or more ranges reported\n\tbool curHadRanges_;\n\t// number of Ns in current read\n\tint curNumNs_;\n\n\t// # reads\n\tuint32_t reads_;\n\t// # homo-poly reads\n\tuint32_t homoReads_;\n\t// # low-entropy reads\n\tuint32_t lowEntReads_;\n\t// # high-entropy reads\n\tuint32_t hiEntReads_;\n\t// # reads with alignments\n\tuint32_t alignedReads_;\n\t// # reads without alignments\n\tuint32_t unalignedReads_;\n\t// # reads with 3 or more Ns\n\tuint32_t threeOrMoreNReads_;\n\t// # reads with < 3 Ns\n\tuint32_t lessThanThreeNRreads_;\n\n\t// Distribution of BWT operations per read\n\tRunningStat bwtOpsPerRead_;\n\tRunningStat backtracksPerRead_;\n\n\t// Distribution of BWT operations per homo-poly read\n\tRunningStat bwtOpsPerHomoRead_;\n\tRunningStat backtracksPerHomoRead_;\n\n\t// Distribution of BWT operations per low-entropy read\n\tRunningStat bwtOpsPerLoEntRead_;\n\tRunningStat backtracksPerLoEntRead_;\n\n\t// Distribution of BWT operations per high-entropy read\n\tRunningStat bwtOpsPerHiEntRead_;\n\tRunningStat backtracksPerHiEntRead_;\n\n\t// Distribution of BWT operations per read that \"aligned\" (for\n\t// which a range was arrived at - range may not have necessarily\n\t// lead to an alignment)\n\tRunningStat bwtOpsPerAlignedRead_;\n\tRunningStat backtracksPerAlignedRead_;\n\n\t// Distribution of BWT operations per read that didn't align\n\tRunningStat bwtOpsPerUnalignedRead_;\n\tRunningStat backtracksPerUnalignedRead_;\n\n\t// 
Distribution of BWT operations/backtracks per read with no Ns\n\tRunningStat bwtOpsPer0nRead_;\n\tRunningStat backtracksPer0nRead_;\n\n\t// Distribution of BWT operations/backtracks per read with one N\n\tRunningStat bwtOpsPer1nRead_;\n\tRunningStat backtracksPer1nRead_;\n\n\t// Distribution of BWT operations/backtracks per read with two Ns\n\tRunningStat bwtOpsPer2nRead_;\n\tRunningStat backtracksPer2nRead_;\n\n\t// Distribution of BWT operations/backtracks per read with three or\n\t// more Ns\n\tRunningStat bwtOpsPer3orMoreNRead_;\n\tRunningStat backtracksPer3orMoreNRead_;\n\n\tTimer timer_;\n};\n\n#endif /* ALIGNER_METRICS_H_ */\n"
  },
  {
    "path": "aligner_result.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_RESULT_H_\n#define ALIGNER_RESULT_H_\n\n#include <utility>\n#include <limits>\n#include <vector>\n#include \"mem_ids.h\"\n#include \"ref_coord.h\"\n#include \"read.h\"\n#include \"filebuf.h\"\n#include \"ds.h\"\n#include \"edit.h\"\n#include \"limit.h\"\n\ntypedef int64_t TAlScore;\n\n#define VALID_AL_SCORE(x)   ((x).score_ > MIN_I64)\n#define VALID_SCORE(x)      ((x) > MIN_I64)\n#define INVALIDATE_SCORE(x) ((x) = MIN_I64)\n\n/**\n * A generic score object for an alignment.  Used for accounting during\n * SW and elsewhere.  Encapsulates the score, the number of N positions\n * and the number gaps in the alignment.\n *\n * The scale for 'score' is such that a perfect alignment score is 0\n * and a score with non-zero penalty is less than 0.  
So differences\n * between scores work as expected, but interpreting an individual\n * score (larger is better) as a penalty (smaller is better) requires\n * taking the absolute value.\n */\nclass AlnScore {\n\npublic:\n\n\t/**\n\t * Gapped scores are invalid until proven valid.\n\t */\n\tinline AlnScore() {\n\t\treset();\n\t\tinvalidate();\n\t\tassert(!valid());\n\t}\n\n\t/**\n\t * Construct a score with the given value.\n\t */\n\tinline AlnScore(TAlScore score) {\n\t\tscore_ = score;\n\t}\n\t\n\t/**\n\t * Reset the score.\n\t */\n\tvoid reset() {\n\t\tscore_ = 0;\n\t}\n\n\t/**\n\t * Return an invalid AlnScore.\n\t */\n\tinline static AlnScore INVALID() {\n\t\tAlnScore s;\n\t\ts.invalidate();\n\t\tassert(!s.valid());\n\t\treturn s;\n\t}\n\t\n\t/**\n\t * Return true iff this score has a valid value.\n\t */\n\tinline bool valid() const {\n\t\treturn score_ != MIN_I64;\n\t}\n\n\t/**\n\t * Make this score invalid (and therefore <= all other scores).\n\t */\n\tinline void invalidate() {\n\t\tscore_ = MIN_I64;\n\t\tassert(!valid());\n\t}\n\t\n\n\t/**\n\t * Scores are equal iff both are valid and have the same value.\n\t */\n\tinline bool operator==(const AlnScore& o) const {\n\t\t// Profiling shows cache misses on following line\n\t\treturn VALID_AL_SCORE(*this) && VALID_AL_SCORE(o) && score_ == o.score_;\n\t}\n\n\t/**\n\t * Return true iff the two scores are unequal.\n\t */\n\tinline bool operator!=(const AlnScore& o) const {\n\t\treturn !(*this == o);\n\t}\n\n\t/**\n\t * Return true iff this score is >= score o.\n\t */\n\tinline bool operator>=(const AlnScore& o) const {\n\t\tif(!VALID_AL_SCORE(o)) {\n\t\t\tif(!VALID_AL_SCORE(*this)) {\n\t\t\t\t// both invalid\n\t\t\t\treturn false;\n\t\t\t} else {\n\t\t\t\t// I'm valid, other is invalid\n\t\t\t\treturn true;\n\t\t\t}\n\t\t} else if(!VALID_AL_SCORE(*this)) {\n\t\t\t// I'm invalid, other is valid\n\t\t\treturn false;\n\t\t}\n\t\treturn score_ >= o.score_;\n\t}\n\n\t/**\n\t * Return true iff this score is < score o.\n\t 
*/\n\tinline bool operator<(const AlnScore& o) const {\n\t\treturn !operator>=(o);\n\t}\n\n\t/**\n\t * Return true iff this score is <= score o.\n\t */\n\tinline bool operator<=(const AlnScore& o) const {\n\t\treturn operator<(o) || operator==(o);\n\t}\n    \n    /**\n     * Return true iff this score is > score o.\n     */\n    inline bool operator>(const AlnScore& o) const {\n        return !operator<=(o);\n    }\n\n\tTAlScore score()   const { return  score_; }\n\n    // Score accumulated so far (penalties are subtracted starting at 0)\n\tTAlScore score_;\n};\n\nstatic inline ostream& operator<<(ostream& os, const AlnScore& o) {\n\tos << o.score();\n\treturn os;\n}\n\n// Forward declaration\nclass BitPairReference;\n\n\n/**\n * Encapsulates an alignment result.  The result comprises:\n *\n * 1. All the nucleotide edits for both mates ('ned').\n * 2. All \"edits\" where an ambiguous reference char is resolved to an\n *    unambiguous char ('aed').\n * 3. The score for the alignment, including summary information about the\n *    number of gaps and Ns involved.\n * 4. The reference id, strand, and 0-based offset of the leftmost character\n *    involved in the alignment.\n * 5. Information about trimming prior to alignment and whether it was hard or\n *    soft.\n * 6. Information about trimming during alignment and whether it was hard or\n *    soft.  Local-alignment trimming is usually soft when aligning nucleotide\n *    reads.\n *\n * Note that the AlnRes, together with the Read and an AlnSetSumm (*and* the\n * opposite mate's AlnRes and Read in the case of a paired-end alignment),\n * should contain enough information to print an entire alignment record.\n *\n * TRIMMING\n *\n * Accounting for trimming is tricky.  Trimming affects:\n *\n * 1. The values of the trim* and pretrim* fields.\n * 2. The offsets of the Edits in the EList<Edit>s.\n * 3. The read extent, if the trimming is soft.\n * 4. 
The read extent and the read sequence and length, if trimming is hard.\n *\n * Handling 1. is not too difficult.  2., 3., and 4. are handled in setShape().\n */\nclass AlnRes {\n\npublic:\n\n\tAlnRes()\n\t{\n\t\treset();\n\t}\n    \n    AlnRes(const AlnRes& other)\n    {\n        score_ = other.score_;\n        max_score_ = other.max_score_;\n        uid_ = other.uid_;\n        tid_ = other.tid_;\n        taxRank_ = other.taxRank_;\n        summedHitLen_ = other.summedHitLen_;\n\t\treadPositions_ = other.readPositions_;\n\t\tisFw_ = other.isFw_;\n    }\n    \n    AlnRes& operator=(const AlnRes& other) {\n        if(this == &other) return *this;\n        score_ = other.score_;\n        max_score_ = other.max_score_;\n        uid_ = other.uid_;\n        tid_ = other.tid_;\n        taxRank_ = other.taxRank_;\n        summedHitLen_ = other.summedHitLen_;\n\t\treadPositions_ = other.readPositions_;\n\t\tisFw_ = other.isFw_;\n        return *this;\n    }\n    \n    ~AlnRes() {}\n\n\t/**\n\t * Clear all contents.\n\t */\n\tvoid reset() {\n        score_ = 0;\n        max_score_ = 0;\n        uid_ = \"\";\n        tid_ = 0;\n        taxRank_ = RANK_UNKNOWN;\n        summedHitLen_ = 0.0;\n\t\treadPositions_.clear();\n    }\n    \n\t/**\n\t * Set alignment score for this alignment.\n\t */\n\tvoid setScore(TAlScore score) {\n\t\tscore_ = score;\n\t}\n\n\tTAlScore           score()          const { return score_;     }\n    TAlScore           max_score()      const { return max_score_; }\n    string             uid()            const { return uid_;   }\n    uint64_t           taxID()          const { return tid_;   }\n    uint8_t            taxRank()        const { return taxRank_; }\n    double             summedHitLen()   const { return summedHitLen_; }\n\n\tconst EList<pair<uint32_t,uint32_t> >& readPositionsPtr() const { return readPositions_; }\n\n\tconst pair<uint32_t,uint32_t> readPositions(size_t i) const { return readPositions_[i]; }\n\tsize_t nReadPositions() const { 
return readPositions_.size(); }\n\n\tbool               isFw()           const { return isFw_;      }\n    \n   /**\n\t * Print the sequence for the read that aligned using A, C, G and\n\t * T.  This will simply print the read sequence (or its reverse\n\t * complement).\n\t */\n \tvoid printSeq(\n\t\tconst Read& rd,\n\t\tconst BTDnaString* dns,\n\t\tBTString& o) const\n    {\n        assert(!rd.patFw.empty());\n        ASSERT_ONLY(size_t written = 0);\n        // Print decoded nucleotides\n        assert(dns != NULL);\n        size_t len = dns->length();\n        size_t st = 0;\n        size_t en = len;\n        for(size_t i = st; i < en; i++) {\n            int c = dns->get(i);\n            assert_range(0, 3, c);\n            o.append(\"ACGT\"[c]);\n            ASSERT_ONLY(written++);\n        }\n    }\n\n\t/**\n\t * Print the quality string for the read that aligned.  This will\n\t * simply print the read qualities (or their reverse).\n\t */\n \tvoid printQuals(\n\t\tconst Read& rd,\n\t\tconst BTString* dqs,\n\t\tBTString& o) const\n    {\n        assert(dqs != NULL);\n        size_t len = dqs->length();\n        // Print decoded qualities from upstream to downstream Watson\n        for(size_t i = 1; i < len-1; i++) {\n            o.append(dqs->get(i));\n        }\n    }\n\t\n\n\t/**\n\t * Initialize new AlnRes.\n\t */\n\tvoid init(\n              TAlScore score,           // alignment score\n              TAlScore max_score,\n              const string& uniqueID,\n              uint64_t taxID,\n              uint8_t taxRank,\n\t\t\t  double summedHitLen,\n\t\t\t  const EList<pair<uint32_t, uint32_t> >& readPositions,\n\t\t\t  bool isFw)\n    {\n        score_  = score;\n        max_score_ = max_score;\n        uid_ = uniqueID;\n        tid_ = taxID;\n        taxRank_ = taxRank;\n        summedHitLen_ = summedHitLen;\n\t\treadPositions_ = readPositions;\n\t\tisFw_ = isFw;\n    }\n\nprotected:\n\tTAlScore     score_;        //\n    TAlScore     max_score_;\n    
string       uid_;\n    uint64_t     tid_;\n    uint8_t      taxRank_;\n    double       summedHitLen_; // sum of the length of all partial hits, divided by the number of genome matches\n\tbool         isFw_;\n  \n\tEList<pair<uint32_t, uint32_t> > readPositions_;\n};\n\ntypedef uint64_t TNumAlns;\n\n/**\n * Encapsulates a concise summary of a set of alignment results for a\n * given pair or mate.  Referring to the fields of this object should\n * provide enough information to print output records for the read.\n */\nclass AlnSetSumm {\n\npublic:\n\n\tAlnSetSumm() { reset(); }\n\n\t/**\n\t * Given an unpaired read (in either rd1 or rd2) or a read pair\n\t * (mate 1 in rd1, mate 2 in rd2).\n\t */\n\texplicit AlnSetSumm(\n\t\tconst Read* rd1,\n\t\tconst Read* rd2,\n\t\tconst EList<AlnRes>* rs)\n\t{\n\t\tinit(rd1, rd2, rs);\n\t}\n\n\texplicit AlnSetSumm(\n\t\tAlnScore best,\n\t\tAlnScore secbest)\n\t{\n\t\tinit(best, secbest);\n\t}\n\t\n\t/**\n\t * Set to uninitialized state.\n\t */\n\tvoid reset() {\n\t\tbest_.invalidate();\n\t\tsecbest_.invalidate();\n\t}\n\t\n    /**\n     * Given all the paired and unpaired results involving mates #1 and #2,\n     * calculate best and second-best scores for both mates.  
These are\n     * used for future MAPQ calculations.\n     */\n\tvoid init(\n\t\tconst Read* rd1,\n\t\tconst Read* rd2,\n\t\tconst EList<AlnRes>* rs)\n    {\n        assert(rd1 != NULL || rd2 != NULL);\n        assert(rs != NULL);\n        AlnScore best, secbest;\n        size_t szs = 0;\n        best.invalidate();    secbest.invalidate();\n        szs = rs->size();\n        //assert_gt(szs[j], 0);\n        for(size_t i = 0; i < rs->size(); i++) {\n            AlnScore sc = (*rs)[i].score();\n            if(sc > best) {\n                secbest = best;\n                best = sc;\n                assert(VALID_AL_SCORE(best));\n            } else if(sc > secbest) {\n                secbest = sc;\n                assert(VALID_AL_SCORE(best));\n                assert(VALID_AL_SCORE(secbest));\n            }\n        }\n        if(szs > 0) {\n            init(best, secbest);\n        } else {\n            reset();\n        }\n    }\n\n\t\n\t/**\n\t * Initialize given fields.  See constructor for how fields are set.\n\t */\n\tvoid init(\n\t\tAlnScore best,\n\t\tAlnScore secbest)\n\t{\n\t\tbest_         = best;\n\t\tsecbest_      = secbest;\n\t\tassert(repOk());\n\t}\n\t\n\t/**\n\t * Return true iff there is no valid best alignment score, i.e. the\n\t * summary is empty.\n\t */\n\tbool empty() const {\n\t\tassert(repOk());\n\t\treturn !VALID_AL_SCORE(best_);\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that the summary is internally consistent.\n\t */\n\tbool repOk() const {\n\t\treturn true;\n\t}\n#endif\n\t\n\tAlnScore best()         const { return best_;         }\n\tAlnScore secbest()      const { return secbest_;      }\n\n\nprotected:\n\t\n\tAlnScore best_;         // best full-alignment score found for this read\n\tAlnScore secbest_;      // second-best\n};\n\n#endif\n"
  },
  {
    "path": "aligner_seed.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"aligner_cache.h\"\n#include \"aligner_seed.h\"\n#include \"search_globals.h\"\n#include \"bt2_idx.h\"\n\nusing namespace std;\n\n/**\n * Construct a constraint with no edits of any kind allowed.\n */\nConstraint Constraint::exact() {\n\tConstraint c;\n\tc.edits = c.mms = c.ins = c.dels = c.penalty = 0;\n\treturn c;\n}\n\n/**\n * Construct a constraint where the only constraint is a total\n * penalty constraint.\n */\nConstraint Constraint::penaltyBased(int pen) {\n\tConstraint c;\n\tc.penalty = pen;\n\treturn c;\n}\n\n/**\n * Construct a constraint where the only constraint is a total\n * penalty constraint related to the length of the read.\n */\nConstraint Constraint::penaltyFuncBased(const SimpleFunc& f) {\n\tConstraint c;\n\tc.penFunc = f;\n\treturn c;\n}\n\n/**\n * Construct a constraint that allows up to 'mms' mismatches and no\n * gaps.\n */\nConstraint Constraint::mmBased(int mms) {\n\tConstraint c;\n\tc.mms = mms;\n\tc.edits = c.dels = c.ins = 0;\n\treturn c;\n}\n\n/**\n * Construct a constraint where the only constraint is a cap on the\n * total number of edits.\n */\nConstraint Constraint::editBased(int edits) {\n\tConstraint c;\n\tc.edits = edits;\n\tc.dels = c.ins = c.mms = 
0;\n\treturn c;\n}\n\n//\n// Some static methods for constructing some standard SeedPolicies\n//\n\n/**\n * Given a read, depth and orientation, extract a seed data structure\n * from the read and fill in the steps & zones arrays.  The Seed\n * contains the sequence and quality values.\n */\nbool\nSeed::instantiate(\n\tconst Read& read,\n\tconst BTDnaString& seq, // seed read sequence\n\tconst BTString& qual,   // seed quality sequence\n\tconst Scoring& pens,\n\tint depth,\n\tint seedoffidx,\n\tint seedtypeidx,\n\tbool fw,\n\tInstantiatedSeed& is) const\n{\n\tassert(overall != NULL);\n\tint seedlen = len;\n\tif((int)read.length() < seedlen) {\n\t\t// Shrink seed length to fit read if necessary\n\t\tseedlen = (int)read.length();\n\t}\n\tassert_gt(seedlen, 0);\n\tis.steps.resize(seedlen);\n\tis.zones.resize(seedlen);\n\t// Fill in 'steps' and 'zones'\n\t//\n\t// The 'steps' list indicates which read character should be\n\t// incorporated at each step of the search process.  Often we will\n\t// simply proceed from one end to the other, in which case the\n\t// 'steps' list is ascending or descending.  In some cases (e.g.\n\t// the 2mm case), we might want to switch directions at least once\n\t// during the search, in which case 'steps' will jump in the\n\t// middle.  When an element of the 'steps' list is negative, this\n\t// indicates that the next\n\t//\n\t// The 'zones' list indicates which zone constraint is active at\n\t// each step.  Each element of the 'zones' list is a pair; the\n\t// first pair element indicates the applicable zone when\n\t// considering either mismatch or delete (ref gap) events, while\n\t// the second pair element indicates the applicable zone when\n\t// considering insertion (read gap) events.  
When either pair\n\t// element is a negative number, that indicates that we are about\n\t// to leave the zone for good, at which point we may need to\n\t// evaluate whether we have reached the zone's budget.\n\t//\n\tswitch(type) {\n\t\tcase SEED_TYPE_EXACT: {\n\t\t\tfor(int k = 0; k < seedlen; k++) {\n\t\t\t\tis.steps[k] = -(seedlen - k);\n\t\t\t\t// Zone 0 all the way\n\t\t\t\tis.zones[k].first = is.zones[k].second = 0;\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase SEED_TYPE_LEFT_TO_RIGHT: {\n\t\t\tfor(int k = 0; k < seedlen; k++) {\n\t\t\t\tis.steps[k] = k+1;\n\t\t\t\t// Zone 0 from 0 up to ceil(len/2), then 1\n\t\t\t\tis.zones[k].first = is.zones[k].second = ((k < (seedlen+1)/2) ? 0 : 1);\n\t\t\t}\n\t\t\t// Zone 1 ends at the RHS\n\t\t\tis.zones[seedlen-1].first = is.zones[seedlen-1].second = -1;\n\t\t\tbreak;\n\t\t}\n\t\tcase SEED_TYPE_RIGHT_TO_LEFT: {\n\t\t\tfor(int k = 0; k < seedlen; k++) {\n\t\t\t\tis.steps[k] = -(seedlen - k);\n\t\t\t\t// Zone 0 from 0 up to floor(len/2), then 1\n\t\t\t\tis.zones[k].first  = ((k < seedlen/2) ? 0 : 1);\n\t\t\t\t// Inserts: Zone 0 from 0 up to ceil(len/2)-1, then 1\n\t\t\t\tis.zones[k].second = ((k < (seedlen+1)/2+1) ? 
0 : 1);\n\t\t\t}\n\t\t\tis.zones[seedlen-1].first = is.zones[seedlen-1].second = -1;\n\t\t\tbreak;\n\t\t}\n\t\tcase SEED_TYPE_INSIDE_OUT: {\n\t\t\t// Zone 0 from ceil(N/4) up to N-floor(N/4)\n\t\t\tint step = 0;\n\t\t\tfor(int k = (seedlen+3)/4; k < seedlen - (seedlen/4); k++) {\n\t\t\t\tis.zones[step].first = is.zones[step].second = 0;\n\t\t\t\tis.steps[step++] = k+1;\n\t\t\t}\n\t\t\t// Zone 1 from N-floor(N/4) up\n\t\t\tfor(int k = seedlen - (seedlen/4); k < seedlen; k++) {\n\t\t\t\tis.zones[step].first = is.zones[step].second = 1;\n\t\t\t\tis.steps[step++] = k+1;\n\t\t\t}\n\t\t\t// No Zone 1 if seedlen is short (like 2)\n\t\t\t//assert_eq(1, is.zones[step-1].first);\n\t\t\tis.zones[step-1].first = is.zones[step-1].second = -1;\n\t\t\t// Zone 2 from ((seedlen+3)/4)-1 down to 0\n\t\t\tfor(int k = ((seedlen+3)/4)-1; k >= 0; k--) {\n\t\t\t\tis.zones[step].first = is.zones[step].second = 2;\n\t\t\t\tis.steps[step++] = -(k+1);\n\t\t\t}\n\t\t\tassert_eq(2, is.zones[step-1].first);\n\t\t\tis.zones[step-1].first = is.zones[step-1].second = -2;\n\t\t\tassert_eq(seedlen, step);\n\t\t\tbreak;\n\t\t}\n\t\tdefault:\n\t\t\tthrow 1;\n\t}\n\t// Instantiate constraints\n\tfor(int i = 0; i < 3; i++) {\n\t\tis.cons[i] = zones[i];\n\t\tis.cons[i].instantiate(read.length());\n\t}\n\tis.overall = *overall;\n\tis.overall.instantiate(read.length());\n\t// Take a sweep through the seed sequence.  Consider where the Ns\n\t// occur and how zones are laid out.  Calculate the maximum number\n\t// of positions we can jump over initially (e.g. 
with the ftab) and\n\t// perhaps set this function's return value to false, indicating\n\t// that the arrangements of Ns prevents the seed from aligning.\n\tbool streak = true;\n\tis.maxjump = 0;\n\tbool ret = true;\n\tbool ltr = (is.steps[0] > 0); // true -> left-to-right\n\tfor(size_t i = 0; i < is.steps.size(); i++) {\n\t\tassert_neq(0, is.steps[i]);\n\t\tint off = is.steps[i];\n\t\toff = abs(off)-1;\n\t\tConstraint& cons = is.cons[abs(is.zones[i].first)];\n\t\tint c = seq[off];  assert_range(0, 4, c);\n\t\tint q = qual[off];\n\t\tif(ltr != (is.steps[i] > 0) || // changed direction\n\t\t   is.zones[i].first != 0 ||   // changed zone\n\t\t   is.zones[i].second != 0)    // changed zone\n\t\t{\n\t\t\tstreak = false;\n\t\t}\n\t\tif(c == 4) {\n\t\t\t// Induced mismatch\n\t\t\tif(cons.canN(q, pens)) {\n\t\t\t\tcons.chargeN(q, pens);\n\t\t\t} else {\n\t\t\t\t// Seed disqualified due to arrangement of Ns\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\tif(streak) is.maxjump++;\n\t}\n\tis.seedoff = depth;\n\tis.seedoffidx = seedoffidx;\n\tis.fw = fw;\n\tis.s = *this;\n\treturn ret;\n}\n\n/**\n * Return a set consisting of 1 seed encapsulating an exact matching\n * strategy.\n */\nvoid\nSeed::zeroMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {\n\toall.init();\n\t// Seed policy 1: left-to-right search\n\tpols.expand();\n\tpols.back().len = ln;\n\tpols.back().type = SEED_TYPE_EXACT;\n\tpols.back().zones[0] = Constraint::exact();\n\tpols.back().zones[1] = Constraint::exact();\n\tpols.back().zones[2] = Constraint::exact(); // not used\n\tpols.back().overall = &oall;\n}\n\n/**\n * Return a set of 2 seeds encapsulating a half-and-half 1mm strategy.\n */\nvoid\nSeed::oneMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {\n\toall.init();\n\t// Seed policy 1: left-to-right search\n\tpols.expand();\n\tpols.back().len = ln;\n\tpols.back().type = SEED_TYPE_LEFT_TO_RIGHT;\n\tpols.back().zones[0] = Constraint::exact();\n\tpols.back().zones[1] = 
Constraint::mmBased(1);\n\tpols.back().zones[2] = Constraint::exact(); // not used\n\tpols.back().overall = &oall;\n\t// Seed policy 2: right-to-left search\n\tpols.expand();\n\tpols.back().len = ln;\n\tpols.back().type = SEED_TYPE_RIGHT_TO_LEFT;\n\tpols.back().zones[0] = Constraint::exact();\n\tpols.back().zones[1] = Constraint::mmBased(1);\n\tpols.back().zones[1].mmsCeil = 0;\n\tpols.back().zones[2] = Constraint::exact(); // not used\n\tpols.back().overall = &oall;\n}\n\n/**\n * Return a set of 3 seeds encapsulating search roots for:\n *\n * 1. Starting from the left-hand side and searching toward the\n *    right-hand side allowing 2 mismatches in the right half.\n * 2. Starting from the right-hand side and searching toward the\n *    left-hand side allowing 2 mismatches in the left half.\n * 3. Starting (effectively) from the center and searching out toward\n *    both the left and right-hand sides, allowing one mismatch on\n *    either side.\n *\n * This is not exhaustive.  Some 2-mismatch cases are missed; if you\n * imagine the seed as divided into four successive quarters A, B, C\n * and D, the cases we miss are when mismatches occur in A and C or B\n * and D.\n */\nvoid\nSeed::twoMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {\n\toall.init();\n\t// Seed policy 1: left-to-right search\n\tpols.expand();\n\tpols.back().len = ln;\n\tpols.back().type = SEED_TYPE_LEFT_TO_RIGHT;\n\tpols.back().zones[0] = Constraint::exact();\n\tpols.back().zones[1] = Constraint::mmBased(2);\n\tpols.back().zones[2] = Constraint::exact(); // not used\n\tpols.back().overall = &oall;\n\t// Seed policy 2: right-to-left search\n\tpols.expand();\n\tpols.back().len = ln;\n\tpols.back().type = SEED_TYPE_RIGHT_TO_LEFT;\n\tpols.back().zones[0] = Constraint::exact();\n\tpols.back().zones[1] = Constraint::mmBased(2);\n\tpols.back().zones[1].mmsCeil = 1; // Must have used at least 1 mismatch\n\tpols.back().zones[2] = Constraint::exact(); // not used\n\tpols.back().overall = 
&oall;\n\t// Seed policy 3: inside-out search\n\tpols.expand();\n\tpols.back().len = ln;\n\tpols.back().type = SEED_TYPE_INSIDE_OUT;\n\tpols.back().zones[0] = Constraint::exact();\n\tpols.back().zones[1] = Constraint::mmBased(1);\n\tpols.back().zones[1].mmsCeil = 0; // Must have used at least 1 mismatch\n\tpols.back().zones[2] = Constraint::mmBased(1);\n\tpols.back().zones[2].mmsCeil = 0; // Must have used at least 1 mismatch\n\tpols.back().overall = &oall;\n}\n\n/**\n * Types of actions that can be taken by the SeedAligner.\n */\nenum {\n\tSA_ACTION_TYPE_RESET = 1,\n\tSA_ACTION_TYPE_SEARCH_SEED, // 2\n\tSA_ACTION_TYPE_FTAB,        // 3\n\tSA_ACTION_TYPE_FCHR,        // 4\n\tSA_ACTION_TYPE_MATCH,       // 5\n\tSA_ACTION_TYPE_EDIT         // 6\n};\n\n#define MIN(x, y) (((x) < (y)) ? (x) : (y))\n\n#ifdef ALIGNER_SEED_MAIN\n\n#include <getopt.h>\n#include <string>\n\n/**\n * Parse an int out of the given argument string; if no digits can be\n * parsed, output the given error message and throw.\n */\nstatic int parseInt(const char *errmsg, const char *arg) {\n\tlong l;\n\tchar *endPtr = NULL;\n\tl = strtol(arg, &endPtr, 10);\n\t// strtol leaves endPtr == arg when no conversion was performed\n\tif (endPtr != arg) {\n\t\treturn (int32_t)l;\n\t}\n\tcerr << errmsg << endl;\n\tthrow 1;\n\treturn -1;\n}\n\nenum {\n\tARG_NOFW = 256,\n\tARG_NORC,\n\tARG_MM,\n\tARG_SHMEM,\n\tARG_TESTS,\n\tARG_RANDOM_TESTS,\n\tARG_SEED\n};\n\nstatic const char *short_opts = \"vCt\";\nstatic struct option long_opts[] = {\n\t{(char*)\"verbose\",  no_argument,       0, 'v'},\n\t{(char*)\"color\",    no_argument,       0, 'C'},\n\t{(char*)\"timing\",   no_argument,       0, 't'},\n\t{(char*)\"nofw\",     no_argument,       0, ARG_NOFW},\n\t{(char*)\"norc\",     no_argument,       0, ARG_NORC},\n\t{(char*)\"mm\",       no_argument,       0, ARG_MM},\n\t{(char*)\"shmem\",    no_argument,       0, ARG_SHMEM},\n\t{(char*)\"tests\",    no_argument,       0, ARG_TESTS},\n\t{(char*)\"random\",   required_argument, 0, 
ARG_RANDOM_TESTS},\n\t{(char*)\"seed\",     required_argument, 0, ARG_SEED},\n\t{0, 0, 0, 0} // getopt_long requires an all-zero terminating entry\n};\n\nstatic void printUsage(ostream& os) {\n\tos << \"Usage: ac [options]* <index> <patterns>\" << endl;\n\tos << \"Options:\" << endl;\n\tos << \"  --mm                memory-mapped mode\" << endl;\n\tos << \"  --shmem             shared memory mode\" << endl;\n\tos << \"  --nofw              don't align forward-oriented read\" << endl;\n\tos << \"  --norc              don't align reverse-complemented read\" << endl;\n\tos << \"  -t/--timing         show timing information\" << endl;\n\tos << \"  -C/--color          colorspace mode\" << endl;\n\tos << \"  -v/--verbose        talkative mode\" << endl;\n}\n\nbool gNorc = false;\nbool gNofw = false;\nbool gColor = false;\nint gVerbose = 0;\nint gGapBarrier = 1;\nbool gColorExEnds = true;\nint gSnpPhred = 30;\nbool gReportOverhangs = true;\n\nextern void aligner_seed_tests();\nextern void aligner_random_seed_tests(\n\tint num_tests,\n\tuint32_t qslo,\n\tuint32_t qshi,\n\tbool color,\n\tuint32_t seed);\n\n/**\n * A way of feeding simple tests to the seed alignment infrastructure.\n */\nint main(int argc, char **argv) {\n\tbool useMm = false;\n\tbool useShmem = false;\n\tbool mmSweep = false;\n\tbool noRefNames = false;\n\tbool sanity = false;\n\tbool timing = false;\n\tint option_index = 0;\n\tint seed = 777;\n\tint next_option;\n\tdo {\n\t\tnext_option = getopt_long(\n\t\t\targc, argv, short_opts, long_opts, &option_index);\n\t\tswitch (next_option) {\n\t\t\tcase 'v':       gVerbose = true; break;\n\t\t\tcase 'C':       gColor   = true; break;\n\t\t\tcase 't':       timing   = true; break;\n\t\t\tcase ARG_NOFW:  gNofw    = true; break;\n\t\t\tcase ARG_NORC:  gNorc    = true; break;\n\t\t\tcase ARG_MM:    useMm    = true; break;\n\t\t\tcase ARG_SHMEM: useShmem = true; break;\n\t\t\tcase ARG_SEED:  seed = parseInt(\"\", optarg); break;\n\t\t\tcase ARG_TESTS: {\n\t\t\t\taligner_seed_tests();\n\t\t\t\taligner_random_seed_tests(\n\t\t\t\t\t100,  
   // num references\n\t\t\t\t\t100,   // queries per reference lo\n\t\t\t\t\t400,   // queries per reference hi\n\t\t\t\t\tfalse, // true -> generate colorspace reference/reads\n\t\t\t\t\t18);   // pseudo-random seed\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tcase ARG_RANDOM_TESTS: {\n\t\t\t\tseed = parseInt(\"\", optarg);\n\t\t\t\taligner_random_seed_tests(\n\t\t\t\t\t100,   // num references\n\t\t\t\t\t100,   // queries per reference lo\n\t\t\t\t\t400,   // queries per reference hi\n\t\t\t\t\tfalse, // true -> generate colorspace reference/reads\n\t\t\t\t\tseed); // pseudo-random seed\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tcase -1: break;\n\t\t\tdefault: {\n\t\t\t\tcerr << \"Unknown option: \" << (char)next_option << endl;\n\t\t\t\tprintUsage(cerr);\n\t\t\t\texit(1);\n\t\t\t}\n\t\t}\n\t} while(next_option != -1);\n\tchar *reffn;\n\tif(optind >= argc) {\n\t\tcerr << \"No reference; quitting...\" << endl;\n\t\treturn 1;\n\t}\n\treffn = argv[optind++];\n\tif(optind >= argc) {\n\t\tcerr << \"No reads; quitting...\" << endl;\n\t\treturn 1;\n\t}\n\tstring ebwtBase(reffn);\n\tBitPairReference ref(\n\t\tebwtBase,    // base path\n\t\tgColor,      // whether we expect it to be colorspace\n\t\tsanity,      // whether to sanity-check reference as it's loaded\n\t\tNULL,        // fasta files to sanity check reference against\n\t\tNULL,        // another way of specifying original sequences\n\t\tfalse,       // true -> infiles (2 args ago) contains raw seqs\n\t\tuseMm,       // use memory mapping to load index?\n\t\tuseShmem,    // use shared memory (not memory mapping)\n\t\tmmSweep,     // touch all the pages after memory-mapping the index\n\t\tgVerbose,    // verbose\n\t\tgVerbose);   // verbose but just for startup messages\n\tTimer *t = new Timer(cerr, \"Time loading fw index: \", timing);\n\tEbwt ebwtFw(\n\t\tebwtBase,\n\t\tgColor,      // index is colorspace\n\t\t0,           // don't need entireReverse for fw index\n\t\ttrue,        // index is for the forward direction\n\t\t-1,  
        // offrate (irrelevant)\n\t\tuseMm,       // whether to use memory-mapped files\n\t\tuseShmem,    // whether to use shared memory\n\t\tmmSweep,     // sweep memory-mapped files\n\t\t!noRefNames, // load names?\n\t\tfalse,       // load SA sample?\n\t\ttrue,        // load ftab?\n\t\ttrue,        // load rstarts?\n\t\tNULL,        // reference map, or NULL if none is needed\n\t\tgVerbose,    // whether to be talkative\n\t\tgVerbose,    // talkative during initialization\n\t\tfalse,       // handle memory exceptions, don't pass them up\n\t\tsanity);\n\tdelete t;\n\tt = new Timer(cerr, \"Time loading bw index: \", timing);\n\tEbwt ebwtBw(\n\t\tebwtBase + \".rev\",\n\t\tgColor,      // index is colorspace\n\t\t1,           // need entireReverse\n\t\tfalse,       // index is for the backward direction\n\t\t-1,          // offrate (irrelevant)\n\t\tuseMm,       // whether to use memory-mapped files\n\t\tuseShmem,    // whether to use shared memory\n\t\tmmSweep,     // sweep memory-mapped files\n\t\t!noRefNames, // load names?\n\t\tfalse,       // load SA sample?\n\t\ttrue,        // load ftab?\n\t\tfalse,       // load rstarts?\n\t\tNULL,        // reference map, or NULL if none is needed\n\t\tgVerbose,    // whether to be talkative\n\t\tgVerbose,    // talkative during initialization\n\t\tfalse,       // handle memory exceptions, don't pass them up\n\t\tsanity);\n\tdelete t;\n\tfor(int i = optind; i < argc; i++) {\n\t}\n}\n#endif\n"
  },
  {
    "path": "aligner_seed.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_SEED_H_\n#define ALIGNER_SEED_H_\n\n#include <iostream>\n#include <utility>\n#include <limits>\n#include \"qual.h\"\n#include \"ds.h\"\n#include \"sstring.h\"\n#include \"alphabet.h\"\n#include \"edit.h\"\n#include \"read.h\"\n// Threading is necessary to synchronize the classes that dump\n// intermediate alignment results to files.  
Otherwise, all data herein\n// is constant and shared, or per-thread.\n#include \"threading.h\"\n#include \"aligner_result.h\"\n#include \"aligner_cache.h\"\n#include \"scoring.h\"\n#include \"mem_ids.h\"\n#include \"simple_func.h\"\n#include \"btypes.h\"\n\n/**\n * A constraint to apply to an alignment zone, or to an overall\n * alignment.\n *\n * The constraint can put both caps and ceilings on the number and\n * types of edits allowed.\n */\nstruct Constraint {\n\t\n\tConstraint() { init(); }\n\t\n\t/**\n\t * Initialize Constraint to be fully permissive.\n\t */\n\tvoid init() {\n\t\tedits = mms = ins = dels = penalty = editsCeil = mmsCeil =\n\t\tinsCeil = delsCeil = penaltyCeil = MAX_I;\n\t\tpenFunc.reset();\n\t\tinstantiated = false;\n\t}\n\t\n\t/**\n\t * Return true iff penalties and constraints prevent us from\n\t * adding any edits.\n\t */\n\tbool mustMatch() {\n\t\tassert(instantiated);\n\t\treturn (mms == 0 && edits == 0) ||\n\t\t        penalty == 0 ||\n\t\t       (mms == 0 && dels == 0 && ins == 0);\n\t}\n\t\n\t/**\n\t * Return true iff a mismatch of the given quality is permitted.\n\t */\n\tbool canMismatch(int q, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\treturn (mms > 0 || edits > 0) &&\n\t\t       penalty >= cm.mm(q);\n\t}\n\n\t/**\n\t * Return true iff an N mismatch of the given quality is permitted.\n\t */\n\tbool canN(int q, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\treturn (mms > 0 || edits > 0) &&\n\t\t       penalty >= cm.n(q);\n\t}\n\t\n\t/**\n\t * Return true iff a mismatch of *any* quality (even qual=1) is\n\t * permitted.\n\t */\n\tbool canMismatch() {\n\t\tassert(instantiated);\n\t\treturn (mms > 0 || edits > 0) && penalty > 0;\n\t}\n\n\t/**\n\t * Return true iff an N mismatch of *any* quality is permitted.  Unlike\n\t * canMismatch(), this does not check the remaining penalty.\n\t */\n\tbool canN() {\n\t\tassert(instantiated);\n\t\treturn (mms > 0 || edits > 0);\n\t}\n\t\n\t/**\n\t * Return true iff a deletion of the given extension (0=open, 1=1st\n\t * extension, etc) is 
permitted.\n\t */\n\tbool canDelete(int ex, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\treturn (dels > 0 && edits > 0) &&\n\t\t       penalty >= cm.del(ex);\n\t}\n\n\t/**\n\t * Return true iff a deletion of any extension is permitted.\n\t */\n\tbool canDelete() {\n\t\tassert(instantiated);\n\t\treturn (dels > 0 || edits > 0) &&\n\t\t       penalty > 0;\n\t}\n\t\n\t/**\n\t * Return true iff an insertion of the given extension (0=open,\n\t * 1=1st extension, etc) is permitted.\n\t */\n\tbool canInsert(int ex, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\treturn (ins > 0 || edits > 0) &&\n\t\t       penalty >= cm.ins(ex);\n\t}\n\n\t/**\n\t * Return true iff an insertion of any extension is permitted.\n\t */\n\tbool canInsert() {\n\t\tassert(instantiated);\n\t\treturn (ins > 0 || edits > 0) &&\n\t\t       penalty > 0;\n\t}\n\t\n\t/**\n\t * Return true iff a gap of any extension is permitted\n\t */\n\tbool canGap() {\n\t\tassert(instantiated);\n\t\treturn ((ins > 0 || dels > 0) || edits > 0) && penalty > 0;\n\t}\n\t\n\t/**\n\t * Charge a mismatch of the given quality.\n\t */\n\tvoid chargeMismatch(int q, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\tif(mms == 0) { assert_gt(edits, 0); edits--; }\n\t\telse mms--;\n\t\tpenalty -= cm.mm(q);\n\t\tassert_geq(mms, 0);\n\t\tassert_geq(edits, 0);\n\t\tassert_geq(penalty, 0);\n\t}\n\t\n\t/**\n\t * Charge an N mismatch of the given quality.\n\t */\n\tvoid chargeN(int q, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\tif(mms == 0) { assert_gt(edits, 0); edits--; }\n\t\telse mms--;\n\t\tpenalty -= cm.n(q);\n\t\tassert_geq(mms, 0);\n\t\tassert_geq(edits, 0);\n\t\tassert_geq(penalty, 0);\n\t}\n\t\n\t/**\n\t * Charge a deletion of the given extension.\n\t */\n\tvoid chargeDelete(int ex, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\tdels--;\n\t\tedits--;\n\t\tpenalty -= cm.del(ex);\n\t\tassert_geq(dels, 0);\n\t\tassert_geq(edits, 0);\n\t\tassert_geq(penalty, 0);\n\t}\n\n\t/**\n\t * Charge an insertion 
of the given extension.\n\t */\n\tvoid chargeInsert(int ex, const Scoring& cm) {\n\t\tassert(instantiated);\n\t\tins--;\n\t\tedits--;\n\t\tpenalty -= cm.ins(ex);\n\t\tassert_geq(ins, 0);\n\t\tassert_geq(edits, 0);\n\t\tassert_geq(penalty, 0);\n\t}\n\t\n\t/**\n\t * Once the constrained area is completely explored, call this\n\t * function to check whether there were *at least* as many\n\t * dissimilarities as required by the constraint.  Bounds like this\n\t * are helpful to resolve instances where two search roots would\n\t * otherwise overlap in what alignments they can find.\n\t */\n\tbool acceptable() {\n\t\tassert(instantiated);\n\t\treturn edits   <= editsCeil &&\n\t\t       mms     <= mmsCeil   &&\n\t\t       ins     <= insCeil   &&\n\t\t       dels    <= delsCeil  &&\n\t\t       penalty <= penaltyCeil;\n\t}\n\t\n\t/**\n\t * Instantiate a constraint w/r/t the read length and the constant\n\t * and linear coefficients for the penalty function.\n\t */\n\tstatic int instantiate(size_t rdlen, const SimpleFunc& func) {\n\t\treturn func.f<int>((double)rdlen);\n\t}\n\t\n\t/**\n\t * Instantiate this constraint w/r/t the read length.\n\t */\n\tvoid instantiate(size_t rdlen) {\n\t\tassert(!instantiated);\n\t\tif(penFunc.initialized()) {\n\t\t\tpenalty = Constraint::instantiate(rdlen, penFunc);\n\t\t}\n\t\tinstantiated = true;\n\t}\n\t\n\tint edits;      // # edits permitted\n\tint mms;        // # mismatches permitted\n\tint ins;        // # insertions permitted\n\tint dels;       // # deletions permitted\n\tint penalty;    // penalty total permitted\n\tint editsCeil;  // <= this many edits can be left at the end\n\tint mmsCeil;    // <= this many mismatches can be left at the end\n\tint insCeil;    // <= this many inserts can be left at the end\n\tint delsCeil;   // <= this many deletions can be left at the end\n\tint penaltyCeil;// <= this much leftover penalty can be left at the end\n\tSimpleFunc penFunc;// penalty function; function of read len\n\tbool 
 instantiated; // whether constraint is instantiated w/r/t read len\n\t\n\t//\n\t// Some static methods for constructing some standard Constraints\n\t//\n\n\t/**\n\t * Construct a constraint with no edits of any kind allowed.\n\t */\n\tstatic Constraint exact();\n\t\n\t/**\n\t * Construct a constraint where the only constraint is a total\n\t * penalty constraint.\n\t */\n\tstatic Constraint penaltyBased(int pen);\n\n\t/**\n\t * Construct a constraint where the only constraint is a total\n\t * penalty constraint related to the length of the read.\n\t */\n\tstatic Constraint penaltyFuncBased(const SimpleFunc& func);\n\n\t/**\n\t * Construct a constraint based on the given number of permitted\n\t * mismatches.\n\t */\n\tstatic Constraint mmBased(int mms);\n\n\t/**\n\t * Construct a constraint based on the given number of permitted\n\t * edits.\n\t */\n\tstatic Constraint editBased(int edits);\n};\n\n/**\n * We divide seed search strategies into three categories:\n *\n * 1. A left-to-right search where the left half of the read is\n *    constrained to match exactly and the right half is subject to\n *    some looser constraint (e.g. 1mm or 2mm)\n * 2. Same as 1, but going right to left with the exact matching half\n *    on the right.\n * 3. 
Inside-out search where the center half of the read is\n *    constrained to match exactly, and the extreme quarters of the\n *    read are subject to a looser constraint.\n */\nenum {\n\tSEED_TYPE_EXACT = 1,\n\tSEED_TYPE_LEFT_TO_RIGHT,\n\tSEED_TYPE_RIGHT_TO_LEFT,\n\tSEED_TYPE_INSIDE_OUT\n};\n\nstruct InstantiatedSeed;\n\n/**\n * Policy dictating how to size and arrange seeds along the length of\n * the read, and what constraints to force on the zones of the seed.\n * We assume that seeds are plopped down at regular intervals from the\n * 5' to 3' ends, with the first seed flush to the 5' end.\n *\n * If the read is shorter than a single seed, one seed is used and it\n * is shrunk to accommodate the read.\n */\nstruct Seed {\n\n\tint len;             // length of a seed\n\tint type;            // dictates anchor portion, direction of search\n\tConstraint *overall; // for the overall alignment\n\n\tSeed() { init(0, 0, NULL); }\n\n\t/**\n\t * Construct and initialize this seed with given length and type.\n\t */\n\tSeed(int ln, int ty, Constraint* oc) {\n\t\tinit(ln, ty, oc);\n\t}\n\n\t/**\n\t * Initialize this seed with given length and type.\n\t */\n\tvoid init(int ln, int ty, Constraint* oc) {\n\t\tlen = ln;\n\t\ttype = ty;\n\t\toverall = oc;\n\t}\n\t\n\t// If the seed is split into halves, we just use zones[0] and\n\t// zones[1]; 0 is the near half and 1 is the far half.  If the seed\n\t// is split into thirds (i.e. inside-out) then 0 is the center, 1\n\t// is the far portion on the left, and 2 is the far portion on the\n\t// right.\n\tConstraint zones[3];\n\n\t/**\n\t * Once the constrained seed is completely explored, call this\n\t * function to check whether there were *at least* as many\n\t * dissimilarities as required by all constraints.  
Bounds like this\n\t * are helpful to resolve instances where two search roots would\n\t * otherwise overlap in what alignments they can find.\n\t */\n\tbool acceptable() {\n\t\tassert(overall != NULL);\n\t\treturn zones[0].acceptable() &&\n\t\t       zones[1].acceptable() &&\n\t\t       zones[2].acceptable() &&\n\t\t       overall->acceptable();\n\t}\n\n\t/**\n\t * Given a read, depth and orientation, extract a seed data structure\n\t * from the read and fill in the steps & zones arrays.  The Seed\n\t * contains the sequence and quality values.\n\t */\n\tbool instantiate(\n\t\tconst Read& read,\n\t\tconst BTDnaString& seq, // already-extracted seed sequence\n\t\tconst BTString& qual,   // already-extracted seed quality sequence\n\t\tconst Scoring& pens,\n\t\tint depth,\n\t\tint seedoffidx,\n\t\tint seedtypeidx,\n\t\tbool fw,\n\t\tInstantiatedSeed& si) const;\n\n\t/**\n\t * Return a list of Seed objects encapsulating\n\t */\n\tstatic void mmSeeds(\n\t\tint mms,\n\t\tint ln,\n\t\tEList<Seed>& pols,\n\t\tConstraint& oall)\n\t{\n\t\tif(mms == 0) {\n\t\t\tzeroMmSeeds(ln, pols, oall);\n\t\t} else if(mms == 1) {\n\t\t\toneMmSeeds(ln, pols, oall);\n\t\t} else if(mms == 2) {\n\t\t\ttwoMmSeeds(ln, pols, oall);\n\t\t} else throw 1;\n\t}\n\t\n\tstatic void zeroMmSeeds(int ln, EList<Seed>&, Constraint&);\n\tstatic void oneMmSeeds (int ln, EList<Seed>&, Constraint&);\n\tstatic void twoMmSeeds (int ln, EList<Seed>&, Constraint&);\n};\n\n/**\n * An instantiated seed is a seed (perhaps modified to fit the read)\n * plus all data needed to conduct a search of the seed.\n */\nstruct InstantiatedSeed {\n\n\tInstantiatedSeed() : steps(AL_CAT), zones(AL_CAT) { }\n\n\t// Steps map.  There are as many steps as there are positions in\n\t// the seed.  The map is a helpful abstraction because we sometimes\n\t// visit seed positions in an irregular order (e.g. inside-out\n\t// search).\n\tEList<int> steps;\n\n\t// Zones map.  For each step, records what constraint to charge an\n\t// edit to. 
 The first entry in each pair gives the constraint for\n\t// non-insert edits and the second entry in each pair gives the\n\t// constraint for insert edits.  If the value stored is negative,\n\t// this indicates that the zone is \"closed out\" after this\n\t// position, so zone acceptability should be checked.\n\tEList<pair<int, int> > zones;\n\n\t// Nucleotide sequence covering the seed, extracted from read\n\tBTDnaString *seq;\n\t\n\t// Quality sequence covering the seed, extracted from read\n\tBTString *qual;\n\t\n\t// Initial constraints governing zones 0, 1, 2.  We precalculate\n\t// the effect of Ns on these.\n\tConstraint cons[3];\n\t\n\t// Overall constraint, tailored to the read length.\n\tConstraint overall;\n\t\n\t// Maximum number of positions that the aligner may advance before\n\t// its first step.  This lets the aligner know whether it can use\n\t// the ftab or not.\n\tint maxjump;\n\t\n\t// Offset of seed from 5' end of read\n\tint seedoff;\n\n\t// Id for seed offset; ids are such that the smallest index is the\n\t// closest to the 5' end and consecutive ids are adjacent (i.e.\n\t// there are no intervening offsets with seeds)\n\tint seedoffidx;\n\t\n\t// Type of seed (left-to-right, etc)\n\tint seedtypeidx;\n\t\n\t// Seed comes from forward-oriented read?\n\tbool fw;\n\t\n\t// Filtered out due to the pattern of Ns present.  
 If true, this\n\t// seed should be ignored by searchAllSeeds().\n\tbool nfiltered;\n\t\n\t// Seed this was instantiated from\n\tSeed s;\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that InstantiatedSeed is internally consistent.\n\t */\n\tbool repOk() const {\n\t\treturn true;\n\t}\n#endif\n};\n\n/**\n * Simple struct for holding an end-to-end alignment of the read with at most\n * 2 edits.\n */\ntemplate <typename index_t>\nstruct EEHit {\n\t\n\tEEHit() { reset(); }\n\t\n\tvoid reset() {\n\t\ttop = bot = 0;\n\t\tfw = false;\n\t\te1.reset();\n\t\te2.reset();\n\t\tscore = MIN_I64;\n\t}\n\t\n\tvoid init(\n\t\tindex_t top_,\n\t\tindex_t bot_,\n\t\tconst Edit* e1_,\n\t\tconst Edit* e2_,\n\t\tbool fw_,\n\t\tint64_t score_)\n\t{\n\t\ttop = top_; bot = bot_;\n\t\tif(e1_ != NULL) {\n\t\t\te1 = *e1_;\n\t\t} else {\n\t\t\te1.reset();\n\t\t}\n\t\tif(e2_ != NULL) {\n\t\t\te2 = *e2_;\n\t\t} else {\n\t\t\te2.reset();\n\t\t}\n\t\tfw = fw_;\n\t\tscore = score_;\n\t}\n\t\n\t/**\n\t * Return number of mismatches in the alignment.\n\t */\n\tint mms() const {\n\t\tif     (e2.inited()) return 2;\n\t\telse if(e1.inited()) return 1;\n\t\telse                 return 0;\n\t}\n\t\n\t/**\n\t * Return the number of Ns involved in the alignment.\n\t */\n\tint ns() const {\n\t\tint ns = 0;\n\t\tif(e1.inited() && e1.hasN()) {\n\t\t\tns++;\n\t\t\tif(e2.inited() && e2.hasN()) {\n\t\t\t\tns++;\n\t\t\t}\n\t\t}\n\t\treturn ns;\n\t}\n\n\t/**\n\t * Return the number of edits whose chr field is 'N'.\n\t */\n\tint refns() const {\n\t\tint ns = 0;\n\t\tif(e1.inited() && e1.chr == 'N') {\n\t\t\tns++;\n\t\t\tif(e2.inited() && e2.chr == 'N') {\n\t\t\t\tns++;\n\t\t\t}\n\t\t}\n\t\treturn ns;\n\t}\n\t\n\t/**\n\t * Return true iff there is no hit.\n\t */\n\tbool empty() const {\n\t\treturn bot <= top;\n\t}\n\t\n\t/**\n\t * Higher score = higher priority.\n\t */\n\tbool operator<(const EEHit& o) const {\n\t\treturn score > o.score;\n\t}\n\t\n\t/**\n\t * Return the size of the alignment's SA range.\n\t */\n\tindex_t 
size() const { return bot - top; }\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that hit is sane w/r/t read.\n\t */\n\tbool repOk(const Read& rd) const {\n\t\tassert_gt(bot, top);\n\t\tif(e1.inited()) {\n\t\t\tassert_lt(e1.pos, rd.length());\n\t\t\tif(e2.inited()) {\n\t\t\t\tassert_lt(e2.pos, rd.length());\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\tindex_t top;\n\tindex_t bot;\n\tEdit     e1;\n\tEdit     e2;\n\tbool     fw;\n\tint64_t  score;\n};\n\n/**\n * Data structure for holding all of the seed hits associated with a read.  All\n * the seed hits for a given read are encapsulated in a single QVal object.  A\n * QVal refers to a range of values in the qlist, where each qlist value is a \n * BW range and a slot to hold the hit's suffix array offset.  QVals are kept\n * in two lists (hitsFw_ and hitsRc_), one for seeds on the forward read strand,\n * one for seeds on the reverse read strand.  The list is indexed by read\n * offset index (e.g. 0=closest-to-5', 1=second-closest, etc).\n *\n * An assumption behind this data structure is that all the seeds are found\n * first, then downstream analyses try to extend them.  In between finding the\n * seed hits and extending them, the sort() member function is called, which\n * ranks QVals according to the order they should be extended.  
Right now the\n * policy is that QVals with fewer elements (hits) should be tried first.\n */\ntemplate <typename index_t>\nclass SeedResults {\n\npublic:\n\tSeedResults() :\n\t\tseqFw_(AL_CAT),\n\t\tseqRc_(AL_CAT),\n\t\tqualFw_(AL_CAT),\n\t\tqualRc_(AL_CAT),\n\t\thitsFw_(AL_CAT),\n\t\thitsRc_(AL_CAT),\n\t\tisFw_(AL_CAT),\n\t\tisRc_(AL_CAT),\n\t\tsortedFw_(AL_CAT),\n\t\tsortedRc_(AL_CAT),\n\t\toffIdx2off_(AL_CAT),\n\t\trankOffs_(AL_CAT),\n\t\trankFws_(AL_CAT),\n\t\tmm1Hit_(AL_CAT)\n\t{\n\t\tclear();\n\t}\n\t\n\t/**\n\t * Set the current read.\n\t */\n\tvoid nextRead(const Read& read) {\n\t\tread_ = &read;\n\t}\n\n\t/**\n\t * Set the appropriate element of either hitsFw_ or hitsRc_ to the given\n\t * QVal.  A QVal encapsulates all the BW ranges for reference substrings \n\t * that are within some distance of the seed string.\n\t */\n\tvoid add(\n\t\tconst   QVal<index_t>& qv,  // range of ranges in cache\n\t\tconst   AlignmentCache<index_t>& ac, // cache\n\t\tindex_t seedIdx,            // seed index (from 5' end)\n\t\tbool    seedFw)             // whether seed is from forward read\n\t{\n\t\tassert(qv.repOk(ac));\n\t\tassert(repOk(&ac));\n\t\tassert_lt(seedIdx, hitsFw_.size());\n\t\tassert_gt(numOffs_, 0); // if this fails, probably failed to call reset\n\t\tif(qv.empty()) return;\n\t\tif(seedFw) {\n\t\t\tassert(!hitsFw_[seedIdx].valid());\n\t\t\thitsFw_[seedIdx] = qv;\n\t\t\tnumEltsFw_ += qv.numElts();\n\t\t\tnumRangesFw_ += qv.numRanges();\n\t\t\tif(qv.numRanges() > 0) nonzFw_++;\n\t\t} else {\n\t\t\tassert(!hitsRc_[seedIdx].valid());\n\t\t\thitsRc_[seedIdx] = qv;\n\t\t\tnumEltsRc_ += qv.numElts();\n\t\t\tnumRangesRc_ += qv.numRanges();\n\t\t\tif(qv.numRanges() > 0) nonzRc_++;\n\t\t}\n\t\tnumElts_ += qv.numElts();\n\t\tnumRanges_ += qv.numRanges();\n\t\tif(qv.numRanges() > 0) {\n\t\t\tnonzTot_++;\n\t\t}\n\t\tassert(repOk(&ac));\n\t}\n\n\t/**\n\t * Clear buffered seed hits and state.  
Set the number of seed\n\t * offsets and the read.\n\t */\n\tvoid reset(\n\t\tconst Read& read,\n\t\tconst EList<index_t>& offIdx2off,\n\t\tsize_t numOffs)\n\t{\n\t\tassert_gt(numOffs, 0);\n\t\tclearSeeds();\n\t\tnumOffs_ = numOffs;\n\t\tseqFw_.resize(numOffs_);\n\t\tseqRc_.resize(numOffs_);\n\t\tqualFw_.resize(numOffs_);\n\t\tqualRc_.resize(numOffs_);\n\t\thitsFw_.resize(numOffs_);\n\t\thitsRc_.resize(numOffs_);\n\t\tisFw_.resize(numOffs_);\n\t\tisRc_.resize(numOffs_);\n\t\tsortedFw_.resize(numOffs_);\n\t\tsortedRc_.resize(numOffs_);\n\t\toffIdx2off_ = offIdx2off;\n\t\tfor(size_t i = 0; i < numOffs_; i++) {\n\t\t\tsortedFw_[i] = sortedRc_[i] = false;\n\t\t\thitsFw_[i].reset();\n\t\t\thitsRc_[i].reset();\n\t\t\tisFw_[i].clear();\n\t\t\tisRc_[i].clear();\n\t\t}\n\t\tread_ = &read;\n\t\tsorted_ = false;\n\t}\n\t\n\t/**\n\t * Clear seed-hit state.\n\t */\n\tvoid clearSeeds() {\n\t\tsortedFw_.clear();\n\t\tsortedRc_.clear();\n\t\trankOffs_.clear();\n\t\trankFws_.clear();\n\t\toffIdx2off_.clear();\n\t\thitsFw_.clear();\n\t\thitsRc_.clear();\n\t\tisFw_.clear();\n\t\tisRc_.clear();\n\t\tseqFw_.clear();\n\t\tseqRc_.clear();\n\t\tnonzTot_ = 0;\n\t\tnonzFw_ = 0;\n\t\tnonzRc_ = 0;\n\t\tnumOffs_ = 0;\n\t\tnumRanges_ = 0;\n\t\tnumElts_ = 0;\n\t\tnumRangesFw_ = 0;\n\t\tnumEltsFw_ = 0;\n\t\tnumRangesRc_ = 0;\n\t\tnumEltsRc_ = 0;\n\t}\n\t\n\t/**\n\t * Clear seed-hit state and end-to-end alignment state.\n\t */\n\tvoid clear() {\n\t\tclearSeeds();\n\t\tread_ = NULL;\n\t\texactFwHit_.reset();\n\t\texactRcHit_.reset();\n\t\tmm1Hit_.clear();\n\t\tmm1Sorted_ = false;\n\t\tmm1Elt_ = 0;\n\t\tassert(empty());\n\t}\n\t\n    /**\n\t * Return average number of hits per seed.\n\t */\n\tfloat averageHitsPerSeed() const {\n\t\treturn (float)numElts_ / (float)nonzTot_;\n\t}\n\t\n\t/**\n\t * Return median of all the non-zero per-seed # hits\n\t */\n\tfloat medianHitsPerSeed() const {\n\t\tEList<size_t>& median = const_cast<EList<size_t>&>(tmpMedian_);\n\t\tmedian.clear();\n\t\tfor(size_t i = 0; i 
 < numOffs_; i++) {\n\t\t\tif(hitsFw_[i].valid() && hitsFw_[i].numElts() > 0) {\n\t\t\t\tmedian.push_back(hitsFw_[i].numElts());\n\t\t\t}\n\t\t\tif(hitsRc_[i].valid() && hitsRc_[i].numElts() > 0) {\n\t\t\t\tmedian.push_back(hitsRc_[i].numElts());\n\t\t\t}\n\t\t}\n\t\tif(tmpMedian_.empty()) {\n\t\t\treturn 0.0f;\n\t\t}\n\t\tmedian.sort();\n\t\tfloat med1 = (float)median[tmpMedian_.size() >> 1];\n\t\tfloat med2 = med1;\n\t\tif((median.size() & 1) == 0) {\n\t\t\tmed2 = (float)median[(tmpMedian_.size() >> 1) - 1];\n\t\t}\n\t\t// Average the two middle values (they are equal when the count is odd)\n\t\treturn (med1 + med2) * 0.5f;\n\t}\n\t\n\t/**\n\t * Return a number that's meant to quantify how hopeful we are that this\n\t * set of seed hits will lead to good alignments.\n\t */\n\tdouble uniquenessFactor() const {\n\t\tdouble result = 0.0;\n\t\tfor(size_t i = 0; i < numOffs_; i++) {\n\t\t\tif(hitsFw_[i].valid()) {\n\t\t\t\tsize_t nelt = hitsFw_[i].numElts();\n\t\t\t\tresult += (1.0 / (double)(nelt * nelt));\n\t\t\t}\n\t\t\tif(hitsRc_[i].valid()) {\n\t\t\t\tsize_t nelt = hitsRc_[i].numElts();\n\t\t\t\tresult += (1.0 / (double)(nelt * nelt));\n\t\t\t}\n\t\t}\n\t\treturn result;\n\t}\n\n\t/**\n\t * Return the number of ranges being held.\n\t */\n\tindex_t numRanges() const { return numRanges_; }\n\n\t/**\n\t * Return the number of elements being held.\n\t */\n\tindex_t numElts() const { return numElts_; }\n\n\t/**\n\t * Return the number of ranges being held for seeds on the forward\n\t * read strand.\n\t */\n\tindex_t numRangesFw() const { return numRangesFw_; }\n\n\t/**\n\t * Return the number of elements being held for seeds on the\n\t * forward read strand.\n\t */\n\tindex_t numEltsFw() const { return numEltsFw_; }\n\n\t/**\n\t * Return the number of ranges being held for seeds on the\n\t * reverse-complement read strand.\n\t */\n\tindex_t numRangesRc() const { return numRangesRc_; }\n\n\t/**\n\t * Return the number of elements being held for seeds on the\n\t * reverse-complement read strand.\n\t */\n\tindex_t numEltsRc() const { return 
numEltsRc_; }\n\t\n\t/**\n\t * Given an offset index, return the offset that has that index.\n\t */\n\tindex_t idx2off(size_t off) const {\n\t\treturn offIdx2off_[off];\n\t}\n\t\n\t/**\n\t * Return true iff there are 0 hits being held.\n\t */\n\tbool empty() const { return numRanges() == 0; }\n\t\n\t/**\n\t * Get the QVal representing all the reference hits for the given\n\t * orientation and seed offset index.\n\t */\n\tconst QVal<index_t>& hitsAtOffIdx(bool fw, size_t seedoffidx) const {\n\t\tassert_lt(seedoffidx, numOffs_);\n\t\tassert(repOk(NULL));\n\t\treturn fw ? hitsFw_[seedoffidx] : hitsRc_[seedoffidx];\n\t}\n\n\t/**\n\t * Get the Instantiated seeds for the given orientation and offset.\n\t */\n\tEList<InstantiatedSeed>& instantiatedSeeds(bool fw, size_t seedoffidx) {\n\t\tassert_lt(seedoffidx, numOffs_);\n\t\tassert(repOk(NULL));\n\t\treturn fw ? isFw_[seedoffidx] : isRc_[seedoffidx];\n\t}\n\t\n\t/**\n\t * Return the number of different seed offsets possible.\n\t */\n\tindex_t numOffs() const { return numOffs_; }\n\t\n\t/**\n\t * Return the read from which seeds were extracted, aligned.\n\t */\n\tconst Read& read() const { return *read_; }\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that this SeedResults is internally consistent.\n\t */\n\tbool repOk(\n\t\tconst AlignmentCache<index_t>* ac,\n\t\tbool requireInited = false) const\n\t{\n\t\tif(requireInited) {\n\t\t\tassert(read_ != NULL);\n\t\t}\n\t\tif(numOffs_ > 0) {\n\t\t\tassert_eq(numOffs_, hitsFw_.size());\n\t\t\tassert_eq(numOffs_, hitsRc_.size());\n\t\t\tassert_leq(numRanges_, numElts_);\n\t\t\tassert_leq(nonzTot_, numRanges_);\n\t\t\tsize_t nonzs = 0;\n\t\t\tfor(int fw = 0; fw <= 1; fw++) {\n\t\t\t\tconst EList<QVal<index_t> >& rrs = (fw ? 
hitsFw_ : hitsRc_);\n\t\t\t\tfor(size_t i = 0; i < numOffs_; i++) {\n\t\t\t\t\tif(rrs[i].valid()) {\n\t\t\t\t\t\tif(rrs[i].numRanges() > 0) nonzs++;\n\t\t\t\t\t\tif(ac != NULL) {\n\t\t\t\t\t\t\tassert(rrs[i].repOk(*ac));\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tassert_eq(nonzs, nonzTot_);\n\t\t\tassert(!sorted_ || nonzTot_ == rankFws_.size());\n\t\t\tassert(!sorted_ || nonzTot_ == rankOffs_.size());\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Populate rankOffs_ and rankFws_ with the list of QVals that need to be\n\t * examined for this SeedResults, in order.  The order is ascending by\n\t * number of elements, so QVals with fewer elements (i.e. seed sequences\n\t * that are more unique) will be tried first and QVals with more elements\n\t * (i.e. seed sequences\n\t */\n\tvoid rankSeedHits(RandomSource& rnd) {\n\t\twhile(rankOffs_.size() < nonzTot_) {\n\t\t\tindex_t minsz = (index_t)0xffffffff;\n\t\t\tindex_t minidx = 0;\n\t\t\tbool minfw = true;\n\t\t\t// Rank seed-hit positions in ascending order by number of elements\n\t\t\t// in all BW ranges\n\t\t\tbool rb = rnd.nextBool();\n\t\t\tassert(rb == 0 || rb == 1);\n\t\t\tfor(int fwi = 0; fwi <= 1; fwi++) {\n\t\t\t\tbool fw = (fwi == (rb ? 1 : 0));\n\t\t\t\tEList<QVal<index_t> >& rrs = (fw ? hitsFw_ : hitsRc_);\n\t\t\t\tEList<bool>& sorted = (fw ? 
sortedFw_ : sortedRc_);\n\t\t\t\tindex_t i = (rnd.nextU32() % (index_t)numOffs_);\n\t\t\t\tfor(index_t ii = 0; ii < numOffs_; ii++) {\n\t\t\t\t\tif(rrs[i].valid() &&         // valid QVal\n\t\t\t\t\t   rrs[i].numElts() > 0 &&   // non-empty\n\t\t\t\t\t   !sorted[i] &&             // not already sorted\n\t\t\t\t\t   rrs[i].numElts() < minsz) // least elts so far?\n\t\t\t\t\t{\n\t\t\t\t\t\tminsz = rrs[i].numElts();\n\t\t\t\t\t\tminidx = i;\n\t\t\t\t\t\tminfw = (fw == 1);\n\t\t\t\t\t}\n\t\t\t\t\tif((++i) == numOffs_) {\n\t\t\t\t\t\ti = 0;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tassert_neq((index_t)0xffffffff, minsz);\n\t\t\tif(minfw) {\n\t\t\t\tsortedFw_[minidx] = true;\n\t\t\t} else {\n\t\t\t\tsortedRc_[minidx] = true;\n\t\t\t}\n\t\t\trankOffs_.push_back(minidx);\n\t\t\trankFws_.push_back(minfw);\n\t\t}\n\t\tassert_eq(rankOffs_.size(), rankFws_.size());\n\t\tsorted_ = true;\n\t}\n\n\t/**\n\t * Return the number of orientation/offsets into the read that have\n\t * at least one seed hit.\n\t */\n\tsize_t nonzeroOffsets() const {\n\t\tassert(!sorted_ || nonzTot_ == rankFws_.size());\n\t\tassert(!sorted_ || nonzTot_ == rankOffs_.size());\n\t\treturn nonzTot_;\n\t}\n\t\n\t/**\n\t * Return true iff all seeds hit for forward read.\n\t */\n\tbool allFwSeedsHit() const {\n\t\treturn nonzFw_ == numOffs();\n\t}\n\n\t/**\n\t * Return true iff all seeds hit for revcomp read.\n\t */\n\tbool allRcSeedsHit() const {\n\t\treturn nonzRc_ == numOffs();\n\t}\n\t\n\t/**\n\t * Return the minimum number of edits that an end-to-end alignment of the\n\t * fw read could have.  Uses knowledge of how many seeds have exact hits\n\t * and how the seeds overlap.\n\t */\n\tindex_t fewestEditsEE(bool fw, int seedlen, int per) const {\n\t\tassert_gt(seedlen, 0);\n\t\tassert_gt(per, 0);\n\t\tindex_t nonz = fw ? 
nonzFw_ : nonzRc_;\n\t\tif(nonz < numOffs()) {\n\t\t\tint maxdepth = (seedlen + per - 1) / per;\n\t\t\tint missing = (int)(numOffs() - nonz);\n\t\t\treturn (missing + maxdepth - 1) / maxdepth;\n\t\t} else {\n\t\t\t// Exact hit is possible (not guaranteed)\n\t\t\treturn 0;\n\t\t}\n\t}\n\n\t/**\n\t * Return the number of offsets into the forward read that have at\n\t * least one seed hit.\n\t */\n\tindex_t nonzeroOffsetsFw() const {\n\t\treturn nonzFw_;\n\t}\n\t\n\t/**\n\t * Return the number of offsets into the reverse-complement read\n\t * that have at least one seed hit.\n\t */\n\tindex_t nonzeroOffsetsRc() const {\n\t\treturn nonzRc_;\n\t}\n\n\t/**\n\t * Return a QVal of seed hits of the given rank 'r'.  'offidx' gets the id\n\t * of the offset from 5' from which it was extracted (0 for the 5-most\n\t * offset, 1 for the next closes to 5', etc).  'off' gets the offset from\n\t * the 5' end.  'fw' gets true iff the seed was extracted from the forward\n\t * read.\n\t */\n\tconst QVal<index_t>& hitsByRank(\n\t\tindex_t  r,       // in\n\t\tindex_t& offidx,  // out\n\t\tindex_t& off,     // out\n\t\tbool&    fw,      // out\n\t\tindex_t& seedlen) // out\n\t{\n\t\tassert(sorted_);\n\t\tassert_lt(r, nonzTot_);\n\t\tif(rankFws_[r]) {\n\t\t\tfw = true;\n\t\t\toffidx = rankOffs_[r];\n\t\t\tassert_lt(offidx, offIdx2off_.size());\n\t\t\toff = offIdx2off_[offidx];\n\t\t\tseedlen = (index_t)seqFw_[rankOffs_[r]].length();\n\t\t\treturn hitsFw_[rankOffs_[r]];\n\t\t} else {\n\t\t\tfw = false;\n\t\t\toffidx = rankOffs_[r];\n\t\t\tassert_lt(offidx, offIdx2off_.size());\n\t\t\toff = offIdx2off_[offidx];\n\t\t\tseedlen = (index_t)seqRc_[rankOffs_[r]].length();\n\t\t\treturn hitsRc_[rankOffs_[r]];\n\t\t}\n\t}\n\n\t/**\n\t * Return an EList of seed hits of the given rank.\n\t */\n\tconst BTDnaString& seqByRank(index_t r) {\n\t\tassert(sorted_);\n\t\tassert_lt(r, nonzTot_);\n\t\treturn rankFws_[r] ? 
seqFw_[rankOffs_[r]] : seqRc_[rankOffs_[r]];\n\t}\n\n\t/**\n\t * Return an EList of seed hits of the given rank.\n\t */\n\tconst BTString& qualByRank(index_t r) {\n\t\tassert(sorted_);\n\t\tassert_lt(r, nonzTot_);\n\t\treturn rankFws_[r] ? qualFw_[rankOffs_[r]] : qualRc_[rankOffs_[r]];\n\t}\n\t\n\t/**\n\t * Return the list of extracted seed sequences for seeds on either\n\t * the forward or reverse strand.\n\t */\n\tEList<BTDnaString>& seqs(bool fw) { return fw ? seqFw_ : seqRc_; }\n\n\t/**\n\t * Return the list of extracted quality sequences for seeds on\n\t * either the forward or reverse strand.\n\t */\n\tEList<BTString>& quals(bool fw) { return fw ? qualFw_ : qualRc_; }\n\n\t/**\n\t * Return exact end-to-end alignment of fw read.\n\t */\n\tEEHit<index_t> exactFwEEHit() const { return exactFwHit_; }\n\n\t/**\n\t * Return exact end-to-end alignment of rc read.\n\t */\n\tEEHit<index_t> exactRcEEHit() const { return exactRcHit_; }\n\t\n\t/**\n\t * Return const ref to list of 1-mismatch end-to-end alignments.\n\t */\n\tconst EList<EEHit<index_t> >& mm1EEHits() const { return mm1Hit_; }\n    \n\t/**\n\t * Sort the end-to-end 1-mismatch alignments, prioritizing by score (higher\n\t * score = higher priority).\n\t */\n\tvoid sort1mmEe(RandomSource& rnd) {\n\t\tassert(!mm1Sorted_);\n\t\tmm1Hit_.sort();\n\t\tsize_t streak = 0;\n\t\tfor(size_t i = 1; i < mm1Hit_.size(); i++) {\n\t\t\tif(mm1Hit_[i].score == mm1Hit_[i-1].score) {\n\t\t\t\tif(streak == 0) { streak = 1; }\n\t\t\t\tstreak++;\n\t\t\t} else {\n\t\t\t\tif(streak > 1) {\n\t\t\t\t\tassert_geq(i, streak);\n\t\t\t\t\tmm1Hit_.shufflePortion(i-streak, streak, rnd);\n\t\t\t\t}\n\t\t\t\tstreak = 0;\n\t\t\t}\n\t\t}\n\t\tif(streak > 1) {\n\t\t\tmm1Hit_.shufflePortion(mm1Hit_.size() - streak, streak, rnd);\n\t\t}\n\t\tmm1Sorted_ = true;\n\t}\n\t\n\t/**\n\t * Add an end-to-end 1-mismatch alignment.\n\t */\n\tvoid add1mmEe(\n\t\tindex_t top,\n\t\tindex_t bot,\n\t\tconst Edit* e1,\n\t\tconst Edit* e2,\n\t\tbool 
fw,\n\t\tint64_t score)\n\t{\n\t\tmm1Hit_.expand();\n\t\tmm1Hit_.back().init(top, bot, e1, e2, fw, score);\n\t\tmm1Elt_ += (bot - top);\n\t}\n\n\t/**\n\t * Add an end-to-end exact alignment.\n\t */\n\tvoid addExactEeFw(\n\t\tindex_t top,\n\t\tindex_t bot,\n\t\tconst Edit* e1,\n\t\tconst Edit* e2,\n\t\tbool fw,\n\t\tint64_t score)\n\t{\n\t\texactFwHit_.init(top, bot, e1, e2, fw, score);\n\t}\n\n\t/**\n\t * Add an end-to-end exact alignment.\n\t */\n\tvoid addExactEeRc(\n\t\tindex_t top,\n\t\tindex_t bot,\n\t\tconst Edit* e1,\n\t\tconst Edit* e2,\n\t\tbool fw,\n\t\tint64_t score)\n\t{\n\t\texactRcHit_.init(top, bot, e1, e2, fw, score);\n\t}\n\t\n\t/**\n\t * Clear out the end-to-end exact alignments.\n\t */\n\tvoid clearExactE2eHits() {\n\t\texactFwHit_.reset();\n\t\texactRcHit_.reset();\n\t}\n\t\n\t/**\n\t * Clear out the end-to-end 1-mismatch alignments.\n\t */\n\tvoid clear1mmE2eHits() {\n\t\tmm1Hit_.clear();     // 1-mismatch end-to-end hits\n\t\tmm1Elt_ = 0;         // number of 1-mismatch hit rows\n\t\tmm1Sorted_ = false;  // true iff we've sorted the mm1Hit_ list\n\t}\n\n\t/**\n\t * Return the number of distinct exact and 1-mismatch end-to-end hits\n\t * found.\n\t */\n\tindex_t numE2eHits() const {\n\t\treturn (index_t)(exactFwHit_.size() + exactRcHit_.size() + mm1Elt_);\n\t}\n\n\t/**\n\t * Return the number of distinct exact end-to-end hits found.\n\t */\n\tindex_t numExactE2eHits() const {\n\t\treturn (index_t)(exactFwHit_.size() + exactRcHit_.size());\n\t}\n\n\t/**\n\t * Return the number of distinct 1-mismatch end-to-end hits found.\n\t */\n\tindex_t num1mmE2eHits() const {\n\t\treturn mm1Elt_;\n\t}\n\t\n\t/**\n\t * Return the length of the read that yielded the seed hits.\n\t */\n\tindex_t readLength() const {\n\t\tassert(read_ != NULL);\n\t\treturn read_->length();\n\t}\n\nprotected:\n\n\t// As seed hits and edits are added they're sorted into these\n\t// containers\n\tEList<BTDnaString>  seqFw_;       // seqs for seeds from forward 
read\n\tEList<BTDnaString>  seqRc_;       // seqs for seeds from revcomp read\n\tEList<BTString>     qualFw_;      // quals for seeds from forward read\n\tEList<BTString>     qualRc_;      // quals for seeds from revcomp read\n\tEList<QVal<index_t> >         hitsFw_;      // hits for forward read\n\tEList<QVal<index_t> >         hitsRc_;      // hits for revcomp read\n\tEList<EList<InstantiatedSeed> > isFw_; // instantiated seeds for forward read\n\tEList<EList<InstantiatedSeed> > isRc_; // instantiated seeds for revcomp read\n\tEList<bool>         sortedFw_;    // true iff fw QVal was sorted/ranked\n\tEList<bool>         sortedRc_;    // true iff rc QVal was sorted/ranked\n\tindex_t             nonzTot_;     // # offsets with non-zero size\n\tindex_t             nonzFw_;      // # offsets into fw read with non-0 size\n\tindex_t             nonzRc_;      // # offsets into rc read with non-0 size\n\tindex_t             numRanges_;   // # ranges added\n\tindex_t             numElts_;     // # elements added\n\tindex_t             numRangesFw_; // # ranges added for fw seeds\n\tindex_t             numEltsFw_;   // # elements added for fw seeds\n\tindex_t             numRangesRc_; // # ranges added for rc seeds\n\tindex_t             numEltsRc_;   // # elements added for rc seeds\n\n\tEList<index_t>      offIdx2off_;// map from offset indexes to offsets from 5' end\n\n\t// When the sort routine is called, the seed hits collected so far\n\t// are sorted into another set of containers that allow easy access\n\t// to hits from the lowest-ranked offset (the one with the fewest\n\t// BW elements) to the greatest-ranked offset.  Offsets with 0 hits\n\t// are ignored.\n\tEList<index_t>      rankOffs_;  // sorted offsets of seeds to try\n\tEList<bool>         rankFws_;   // sorted orientations assoc. 
with rankOffs_\n\tbool                sorted_;    // true if sort() called since last reset\n\t\n\t// These fields set once per read\n\tindex_t             numOffs_;   // # different seed offsets possible\n\tconst Read*         read_;      // read from which seeds were extracted\n\t\n\tEEHit<index_t>      exactFwHit_; // end-to-end exact hit for fw read\n\tEEHit<index_t>      exactRcHit_; // end-to-end exact hit for rc read\n\tEList<EEHit<index_t> > mm1Hit_;     // 1-mismatch end-to-end hits\n\tindex_t             mm1Elt_;     // number of 1-mismatch hit rows\n\tbool                mm1Sorted_;  // true iff we've sorted the mm1Hit_ list\n    \n\tEList<size_t> tmpMedian_; // temporary storage for calculating median\n};\n\n\n// Forward decl\ntemplate <typename index_t> class Ebwt;\ntemplate <typename index_t> struct SideLocus;\n\n/**\n * Encapsulates a summary of what the searchAllSeeds aligner did.\n */\nstruct SeedSearchMetrics {\n\n\tSeedSearchMetrics() : mutex_m() {\n\t    reset();\n\t}\n\n\t/**\n\t * Merge this metrics object with the given object, i.e., sum each\n\t * category.  
This is the only safe way to update a\n\t * SeedSearchMetrics object shared by multiple threads.\n\t */\n\tvoid merge(const SeedSearchMetrics& m, bool getLock = false) {\n        ThreadSafe ts(&mutex_m, getLock);\n\t\tseedsearch   += m.seedsearch;\n\t\tpossearch    += m.possearch;\n\t\tintrahit     += m.intrahit;\n\t\tinterhit     += m.interhit;\n\t\tfilteredseed += m.filteredseed;\n\t\tooms         += m.ooms;\n\t\tbwops        += m.bwops;\n\t\tbweds        += m.bweds;\n\t\tbestmin0     += m.bestmin0;\n\t\tbestmin1     += m.bestmin1;\n\t\tbestmin2     += m.bestmin2;\n\t}\n\t\n\t/**\n\t * Set all counters to 0.\n\t */\n\tvoid reset() {\n\t\tseedsearch =\n\t\tpossearch =\n\t\tintrahit =\n\t\tinterhit =\n\t\tfilteredseed =\n\t\tooms =\n\t\tbwops =\n\t\tbweds =\n\t\tbestmin0 =\n\t\tbestmin1 =\n\t\tbestmin2 = 0;\n\t}\n\n\tuint64_t seedsearch;   // # times we executed strategy in InstantiatedSeed\n\tuint64_t possearch;    // # offsets where aligner executed >= 1 strategy\n\tuint64_t intrahit;     // # offsets where current-read cache gave answer\n\tuint64_t interhit;     // # offsets where across-read cache gave answer\n\tuint64_t filteredseed; // # seed instantiations skipped due to Ns\n\tuint64_t ooms;         // out-of-memory errors\n\tuint64_t bwops;        // Burrows-Wheeler operations\n\tuint64_t bweds;        // Burrows-Wheeler edits\n\tuint64_t bestmin0;     // # times the best min # edits was 0\n\tuint64_t bestmin1;     // # times the best min # edits was 1\n\tuint64_t bestmin2;     // # times the best min # edits was 2\n\tMUTEX_T  mutex_m;\n};\n\n/**\n * Given an index and a seeding scheme, searches for seed hits.\n */\ntemplate <typename index_t>\nclass SeedAligner {\n\npublic:\n\t\n\t/**\n\t * Initialize with index.\n\t */\n\tSeedAligner() : edits_(AL_CAT), offIdx2off_(AL_CAT) { }\n\n\t/**\n\t * Given a read and a few coordinates that describe a substring of the\n\t * read (or its reverse complement), fill in 'seq' and 'qual' objects\n\t * with the seed 
sequence and qualities.\n\t */\n\tvoid instantiateSeq(\n\t\tconst Read& read, // input read\n\t\tBTDnaString& seq, // output sequence\n\t\tBTString& qual,   // output qualities\n\t\tint len,          // seed length\n\t\tint depth,        // seed's 0-based offset from 5' end\n\t\tbool fw) const;   // seed's orientation\n\n\t/**\n\t * Iterate through the seeds that cover the read and initiate a\n\t * search for each seed.\n\t */\n\tstd::pair<int, int> instantiateSeeds(\n\t\tconst EList<Seed>& seeds,   // search seeds\n\t\tindex_t off,                // offset into read to start extracting\n\t\tint per,                    // interval between seeds\n\t\tconst Read& read,           // read to align\n\t\tconst Scoring& pens,        // scoring scheme\n\t\tbool nofw,                  // don't align forward read\n\t\tbool norc,                  // don't align revcomp read\n\t\tAlignmentCacheIface<index_t>& cache, // holds some seed hits from previous reads\n\t\tSeedResults<index_t>& sr,   // holds all the seed hits\n\t\tSeedSearchMetrics& met);    // metrics\n\n\t/**\n\t * Iterate through the seeds that cover the read and initiate a\n\t * search for each seed.\n\t */\n\tvoid searchAllSeeds(\n\t\tconst EList<Seed>& seeds,     // search seeds\n\t\tconst Ebwt<index_t>* ebwtFw,  // BWT index\n\t\tconst Ebwt<index_t>* ebwtBw,  // BWT' index\n\t\tconst Read& read,             // read to align\n\t\tconst Scoring& pens,          // scoring scheme\n\t\tAlignmentCacheIface<index_t>& cache,   // local seed alignment cache\n\t\tSeedResults<index_t>& hits,   // holds all the seed hits\n\t\tSeedSearchMetrics& met,       // metrics\n\t\tPerReadMetrics& prm);         // per-read metrics\n\n\t/**\n\t * Sanity-check a partial alignment produced during oneMmSearch.\n\t */\n\tbool sanityPartial(\n\t\tconst Ebwt<index_t>* ebwtFw, // BWT index\n\t\tconst Ebwt<index_t>* ebwtBw, // BWT' index\n\t\tconst BTDnaString&   seq,\n\t\tindex_t              dep,\n\t\tindex_t              len,\n\t\tbool     
            do1mm,\n\t\tindex_t              topfw,\n\t\tindex_t              botfw,\n\t\tindex_t              topbw,\n\t\tindex_t              botbw);\n\n\t/**\n\t * Do an exact-matching sweep to establish a lower bound on the number of edits\n\t * and to find exact alignments.\n\t */\n\tsize_t exactSweep(\n\t\tconst Ebwt<index_t>&  ebwt,    // BWT index\n\t\tconst Read&           read,    // read to align\n\t\tconst Scoring&        sc,      // scoring scheme\n\t\tbool                  nofw,    // don't align forward read\n\t\tbool                  norc,    // don't align revcomp read\n\t\tsize_t                mineMax, // don't care about edit bounds > this\n\t\tsize_t&               mineFw,  // minimum # edits for forward read\n\t\tsize_t&               mineRc,  // minimum # edits for revcomp read\n\t\tbool                  repex,   // report 0mm hits?\n\t\tSeedResults<index_t>& hits,    // holds all the seed hits (and exact hit)\n\t\tSeedSearchMetrics&    met);    // metrics\n\n\t/**\n\t * Search for end-to-end alignments with up to 1 mismatch.\n\t */\n\tbool oneMmSearch(\n\t\tconst Ebwt<index_t>*  ebwtFw, // BWT index\n\t\tconst Ebwt<index_t>*  ebwtBw, // BWT' index\n\t\tconst Read&           read,   // read to align\n\t\tconst Scoring&        sc,     // scoring\n\t\tint64_t               minsc,  // minimum score\n\t\tbool                  nofw,   // don't align forward read\n\t\tbool                  norc,   // don't align revcomp read\n\t\tbool                  local,  // 1mm hits must be legal local alignments\n\t\tbool                  repex,  // report 0mm hits?\n\t\tbool                  rep1mm, // report 1mm hits?\n\t\tSeedResults<index_t>& hits,   // holds all the seed hits (and exact hit)\n\t\tSeedSearchMetrics&    met);   // metrics\n    \nprotected:\n\n\t/**\n\t * Report a seed hit found by searchSeedBi(), but first try to extend it out in\n\t * either direction as far as possible without hitting any edits.  
This will\n\t * allow us to prioritize the seed hits better later on.  Call reportHit() when\n\t * we're done, which actually adds the hit to the cache.  Returns result from\n\t * calling reportHit().\n\t */\n\tbool extendAndReportHit(\n\t\tindex_t topf,                      // top in BWT\n\t\tindex_t botf,                      // bot in BWT\n\t\tindex_t topb,                      // top in BWT'\n\t\tindex_t botb,                      // bot in BWT'\n\t\tindex_t len,                       // length of hit\n\t\tDoublyLinkedList<Edit> *prevEdit); // previous edit\n\n\t/**\n\t * Report a seed hit found by searchSeedBi() by adding it to the cache.  Return\n\t * false if the hit could not be reported because of, e.g., cache exhaustion.\n\t */\n\tbool reportHit(\n\t\tindex_t topf,         // top in BWT\n\t\tindex_t botf,         // bot in BWT\n\t\tindex_t topb,         // top in BWT'\n\t\tindex_t botb,         // bot in BWT'\n\t\tindex_t len,          // length of hit\n\t\tDoublyLinkedList<Edit> *prevEdit);  // previous edit\n\t\n\t/**\n\t * Given an instantiated seed (in s_ and other fields), search for seed hits.\n\t */\n\tbool searchSeedBi();\n\t\n\t/**\n\t * Main, recursive implementation of the seed search.\n\t */\n\tbool searchSeedBi(\n\t\tint step,                // depth into steps_[] array\n\t\tint depth,               // recursion depth\n\t\tindex_t topf,            // top in BWT\n\t\tindex_t botf,            // bot in BWT\n\t\tindex_t topb,            // top in BWT'\n\t\tindex_t botb,            // bot in BWT'\n\t\tSideLocus<index_t> tloc, // locus for top (perhaps uninitialized)\n\t\tSideLocus<index_t> bloc, // locus for bot (perhaps uninitialized)\n\t\tConstraint c0,           // constraints to enforce in seed zone 0\n\t\tConstraint c1,           // constraints to enforce in seed zone 1\n\t\tConstraint c2,           // constraints to enforce in seed zone 2\n\t\tConstraint overall,      // overall constraints\n\t\tDoublyLinkedList<Edit> *prevEdit);  // previous 
edit\n\t\n\t/**\n\t * Get tloc and bloc ready for the next step.\n\t */\n\tinline void nextLocsBi(\n\t\tSideLocus<index_t>& tloc,  // top locus\n\t\tSideLocus<index_t>& bloc,  // bot locus\n\t\tindex_t topf,              // top in BWT\n\t\tindex_t botf,              // bot in BWT\n\t\tindex_t topb,              // top in BWT'\n\t\tindex_t botb,              // bot in BWT'\n\t\tint step);                 // step to get ready for\n\t\n\t// Following are set in searchAllSeeds then used by searchSeed()\n\t// and other protected members.\n\tconst Ebwt<index_t>* ebwtFw_;       // forward index (BWT)\n\tconst Ebwt<index_t>* ebwtBw_;       // backward/mirror index (BWT')\n\tconst Scoring* sc_;                 // scoring scheme\n\tconst InstantiatedSeed* s_;         // current instantiated seed\n\t\n\tconst Read* read_;                  // read whose seeds are currently being aligned\n\t\n\t// The following are set just before a call to searchSeedBi()\n\tconst BTDnaString* seq_;            // sequence of current seed\n\tconst BTString* qual_;              // quality string for current seed\n\tindex_t off_;                       // offset of seed currently being searched\n\tbool fw_;                           // orientation of seed currently being searched\n\t\n\tEList<Edit> edits_;                 // temporary place to sort edits\n\tAlignmentCacheIface<index_t> *ca_;  // local alignment cache for seed alignments\n\tEList<index_t> offIdx2off_;         // offset idx to read offset map, set up instantiateSeeds()\n\tuint64_t bwops_;                    // Burrows-Wheeler operations\n\tuint64_t bwedits_;                  // Burrows-Wheeler edits\n\tBTDnaString tmprfdnastr_;           // used in reportHit\n\t\n\tASSERT_ONLY(ESet<BTDnaString> hits_); // Ref hits so far for seed being aligned\n\tBTDnaString tmpdnastr_;\n};\n\n#define INIT_LOCS(top, bot, tloc, bloc, e) { \\\n\tif(bot - top == 1) { \\\n\t\ttloc.initFromRow(top, (e).eh(), (e).ebwt()); \\\n\t\tbloc.invalidate(); \\\n\t} 
else { \\\n\t\tSideLocus<index_t>::initFromTopBot(top, bot, (e).eh(), (e).ebwt(), tloc, bloc); \\\n\t\tassert(bloc.valid()); \\\n\t} \\\n}\n\n#define SANITY_CHECK_4TUP(t, b, tp, bp) { \\\n\tASSERT_ONLY(index_t tot = (b[0]-t[0])+(b[1]-t[1])+(b[2]-t[2])+(b[3]-t[3])); \\\n\tASSERT_ONLY(index_t totp = (bp[0]-tp[0])+(bp[1]-tp[1])+(bp[2]-tp[2])+(bp[3]-tp[3])); \\\n\tassert_eq(tot, totp); \\\n}\n\n/**\n * Given a read and a few coordinates that describe a substring of the read (or\n * its reverse complement), fill in 'seq' and 'qual' objects with the seed\n * sequence and qualities.\n *\n * The seq field is filled with the sequence as it would align to the Watson\n * reference strand.  I.e. if fw is false, then the sequence that appears in\n * 'seq' is the reverse complement of the raw read substring.\n */\ntemplate <typename index_t>\nvoid SeedAligner<index_t>::instantiateSeq(\n\t\t\t\t\t\t\t\t\t\t  const Read& read, // input read\n\t\t\t\t\t\t\t\t\t\t  BTDnaString& seq, // output sequence\n\t\t\t\t\t\t\t\t\t\t  BTString& qual,   // output qualities\n\t\t\t\t\t\t\t\t\t\t  int len,          // seed length\n\t\t\t\t\t\t\t\t\t\t  int depth,        // seed's 0-based offset from 5' end\n\t\t\t\t\t\t\t\t\t\t  bool fw) const    // seed's orientation\n{\n\t// Fill in 'seq' and 'qual'\n\tint seedlen = len;\n\tif((int)read.length() < seedlen) seedlen = (int)read.length();\n\tseq.resize(len);\n\tqual.resize(len);\n\t// If fw is false, we take characters starting at the 3' end of the\n\t// reverse complement of the read.\n\tfor(int i = 0; i < len; i++) {\n\t\tseq.set(read.patFw.windowGetDna(i, fw, read.color, depth, len), i);\n\t\tqual.set(read.qual.windowGet(i, fw, depth, len), i);\n\t}\n}\n\n/**\n * We assume that all seeds are the same length.\n *\n * For each seed, instantiate the seed, retracting if necessary.\n */\ntemplate <typename index_t>\npair<int, int> SeedAligner<index_t>::instantiateSeeds(\n\t\t\t\t\t\t\t\t\t\t\t\t\t  const EList<Seed>& seeds,  // search 
seeds\n\t\t\t\t\t\t\t\t\t\t\t\t\t  index_t off,                // offset into read to start extracting\n\t\t\t\t\t\t\t\t\t\t\t\t\t  int per,                   // interval between seeds\n\t\t\t\t\t\t\t\t\t\t\t\t\t  const Read& read,          // read to align\n\t\t\t\t\t\t\t\t\t\t\t\t\t  const Scoring& pens,       // scoring scheme\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool nofw,                 // don't align forward read\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool norc,                 // don't align revcomp read\n\t\t\t\t\t\t\t\t\t\t\t\t\t  AlignmentCacheIface<index_t>& cache,// holds some seed hits from previous reads\n\t\t\t\t\t\t\t\t\t\t\t\t\t  SeedResults<index_t>& sr,  // holds all the seed hits\n\t\t\t\t\t\t\t\t\t\t\t\t\t  SeedSearchMetrics& met)    // metrics\n{\n\tassert(!seeds.empty());\n\tassert_gt(read.length(), 0);\n\t// Check whether read has too many Ns\n\toffIdx2off_.clear();\n\tint len = seeds[0].len; // assume they're all the same length\n#ifndef NDEBUG\n\tfor(size_t i = 1; i < seeds.size(); i++) {\n\t\tassert_eq(len, seeds[i].len);\n\t}\n#endif\n\t// Calc # seeds within read interval\n\tint nseeds = 1;\n\tif((int)read.length() - (int)off > len) {\n\t\tnseeds += ((int)read.length() - (int)off - len) / per;\n\t}\n\tfor(int i = 0; i < nseeds; i++) {\n\t\toffIdx2off_.push_back(per * i + (int)off);\n\t}\n\tpair<int, int> ret;\n\tret.first = 0;  // # seeds that require alignment\n\tret.second = 0; // # seeds that hit in cache with non-empty results\n\tsr.reset(read, offIdx2off_, nseeds);\n\tassert(sr.repOk(&cache.current(), true)); // require that SeedResult be initialized\n\t// For each seed position\n\tfor(int fwi = 0; fwi < 2; fwi++) {\n\t\tbool fw = (fwi == 0);\n\t\tif((fw && nofw) || (!fw && norc)) {\n\t\t\t// Skip this orientation b/c user specified --nofw or --norc\n\t\t\tcontinue;\n\t\t}\n\t\t// For each seed position\n\t\tfor(int i = 0; i < nseeds; i++) {\n\t\t\tint depth = i * per + (int)off;\n\t\t\tint seedlen = seeds[0].len;\n\t\t\t// Extract the seed 
sequence at this offset\n\t\t\t// If fw == true, we extract the characters from i*per to\n\t\t\t// i*(per-1) (exclusive).  If fw == false, \n\t\t\tinstantiateSeq(\n\t\t\t\t\t\t   read,\n\t\t\t\t\t\t   sr.seqs(fw)[i],\n\t\t\t\t\t\t   sr.quals(fw)[i],\n\t\t\t\t\t\t   std::min<int>((int)seedlen, (int)read.length()),\n\t\t\t\t\t\t   depth,\n\t\t\t\t\t\t   fw);\n\t\t\t//QKey qk(sr.seqs(fw)[i] ASSERT_ONLY(, tmpdnastr_));\n\t\t\t// For each search strategy\n\t\t\tEList<InstantiatedSeed>& iss = sr.instantiatedSeeds(fw, i);\n\t\t\tfor(int j = 0; j < (int)seeds.size(); j++) {\n\t\t\t\tiss.expand();\n\t\t\t\tassert_eq(seedlen, seeds[j].len);\n\t\t\t\tInstantiatedSeed* is = &iss.back();\n\t\t\t\tif(seeds[j].instantiate(\n\t\t\t\t\t\t\t\t\t\tread,\n\t\t\t\t\t\t\t\t\t\tsr.seqs(fw)[i],\n\t\t\t\t\t\t\t\t\t\tsr.quals(fw)[i],\n\t\t\t\t\t\t\t\t\t\tpens,\n\t\t\t\t\t\t\t\t\t\tdepth,\n\t\t\t\t\t\t\t\t\t\ti,\n\t\t\t\t\t\t\t\t\t\tj,\n\t\t\t\t\t\t\t\t\t\tfw,\n\t\t\t\t\t\t\t\t\t\t*is))\n\t\t\t\t{\n\t\t\t\t\t// Can we fill this seed hit in from the cache?\n\t\t\t\t\tret.first++;\n\t\t\t\t} else {\n\t\t\t\t\t// Seed may fail to instantiate if there are Ns\n\t\t\t\t\t// that prevent it from matching\n\t\t\t\t\tmet.filteredseed++;\n\t\t\t\t\tiss.pop_back();\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\treturn ret;\n}\n\n/**\n * We assume that all seeds are the same length.\n *\n * For each seed:\n *\n * 1. Instantiate all seeds, retracting them if necessary.\n * 2. 
Calculate zone boundaries for each seed\n */\ntemplate <typename index_t>\nvoid SeedAligner<index_t>::searchAllSeeds(\n\t\t\t\t\t\t\t\t\t\t  const EList<Seed>& seeds,    // search seeds\n\t\t\t\t\t\t\t\t\t\t  const Ebwt<index_t>* ebwtFw, // BWT index\n\t\t\t\t\t\t\t\t\t\t  const Ebwt<index_t>* ebwtBw, // BWT' index\n\t\t\t\t\t\t\t\t\t\t  const Read& read,            // read to align\n\t\t\t\t\t\t\t\t\t\t  const Scoring& pens,         // scoring scheme\n\t\t\t\t\t\t\t\t\t\t  AlignmentCacheIface<index_t>& cache,  // local cache for seed alignments\n\t\t\t\t\t\t\t\t\t\t  SeedResults<index_t>& sr,    // holds all the seed hits\n\t\t\t\t\t\t\t\t\t\t  SeedSearchMetrics& met,      // metrics\n\t\t\t\t\t\t\t\t\t\t  PerReadMetrics& prm)         // per-read metrics\n{\n\tassert(!seeds.empty());\n\tassert(ebwtFw != NULL);\n\tassert(ebwtFw->isInMemory());\n\tassert(sr.repOk(&cache.current()));\n\tebwtFw_ = ebwtFw;\n\tebwtBw_ = ebwtBw;\n\tsc_ = &pens;\n\tread_ = &read;\n\tca_ = &cache;\n\tbwops_ = bwedits_ = 0;\n\tuint64_t possearches = 0, seedsearches = 0, intrahits = 0, interhits = 0, ooms = 0;\n\t// For each instantiated seed\n\tfor(int i = 0; i < (int)sr.numOffs(); i++) {\n\t\tsize_t off = sr.idx2off(i);\n\t\tfor(int fwi = 0; fwi < 2; fwi++) {\n\t\t\tbool fw = (fwi == 0);\n\t\t\tassert(sr.repOk(&cache.current()));\n\t\t\tEList<InstantiatedSeed>& iss = sr.instantiatedSeeds(fw, i);\n\t\t\tif(iss.empty()) {\n\t\t\t\t// Cache hit in an across-read cache\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tQVal<index_t> qv;\n\t\t\tseq_  = &sr.seqs(fw)[i];  // seed sequence\n\t\t\tqual_ = &sr.quals(fw)[i]; // seed qualities\n\t\t\toff_  = off;              // seed offset (from 5')\n\t\t\tfw_   = fw;               // seed orientation\n\t\t\t// Tell the cache that we've started aligning, so the cache can\n\t\t\t// expect a series of on-the-fly updates\n\t\t\tint ret = cache.beginAlign(*seq_, *qual_, qv);\n\t\t\tASSERT_ONLY(hits_.clear());\n\t\t\tif(ret == -1) {\n\t\t\t\t// Out of memory when we 
tried to add key to map\n\t\t\t\tooms++;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tbool abort = false;\n\t\t\tif(ret == 0) {\n\t\t\t\t// Not already in cache\n\t\t\t\tassert(cache.aligning());\n\t\t\t\tpossearches++;\n\t\t\t\tfor(size_t j = 0; j < iss.size(); j++) {\n\t\t\t\t\t// Set seq_ and qual_ appropriately, using the seed sequences\n\t\t\t\t\t// and qualities already installed in SeedResults\n\t\t\t\t\tassert_eq(fw, iss[j].fw);\n\t\t\t\t\tassert_eq(i, (int)iss[j].seedoffidx);\n\t\t\t\t\ts_ = &iss[j];\n\t\t\t\t\t// Do the search with respect to seq_, qual_ and s_.\n\t\t\t\t\tif(!searchSeedBi()) {\n\t\t\t\t\t\t// Memory exhausted during search\n\t\t\t\t\t\tooms++;\n\t\t\t\t\t\tabort = true;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tseedsearches++;\n\t\t\t\t\tassert(cache.aligning());\n\t\t\t\t}\n\t\t\t\tif(!abort) {\n\t\t\t\t\tqv = cache.finishAlign();\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// Already in cache\n\t\t\t\tassert_eq(1, ret);\n\t\t\t\tassert(qv.valid());\n\t\t\t\tintrahits++;\n\t\t\t}\n\t\t\tassert(abort || !cache.aligning());\n\t\t\tif(qv.valid()) {\n\t\t\t\tsr.add(\n\t\t\t\t\t   qv,    // range of ranges in cache\n\t\t\t\t\t   cache.current(), // cache\n\t\t\t\t\t   i,     // seed index (from 5' end)\n\t\t\t\t\t   fw);   // whether seed is from forward read\n\t\t\t}\n\t\t}\n\t}\n\tprm.nSeedRanges = sr.numRanges();\n\tprm.nSeedElts = sr.numElts();\n\tprm.nSeedRangesFw = sr.numRangesFw();\n\tprm.nSeedRangesRc = sr.numRangesRc();\n\tprm.nSeedEltsFw = sr.numEltsFw();\n\tprm.nSeedEltsRc = sr.numEltsRc();\n\tprm.seedMedian = (uint64_t)(sr.medianHitsPerSeed() + 0.5);\n\tprm.seedMean = (uint64_t)sr.averageHitsPerSeed();\n\t\n\tprm.nSdFmops += bwops_;\n\tmet.seedsearch += seedsearches;\n\tmet.possearch += possearches;\n\tmet.intrahit += intrahits;\n\tmet.interhit += interhits;\n\tmet.ooms += ooms;\n\tmet.bwops += bwops_;\n\tmet.bweds += bwedits_;\n}\n\ntemplate <typename index_t>\nbool SeedAligner<index_t>::sanityPartial(\n\t\t\t\t\t\t\t\t\t\t const Ebwt<index_t>* 
       ebwtFw, // BWT index\n\t\t\t\t\t\t\t\t\t\t const Ebwt<index_t>*        ebwtBw, // BWT' index\n\t\t\t\t\t\t\t\t\t\t const BTDnaString& seq,\n\t\t\t\t\t\t\t\t\t\t index_t dep,\n\t\t\t\t\t\t\t\t\t\t index_t len,\n\t\t\t\t\t\t\t\t\t\t bool do1mm,\n\t\t\t\t\t\t\t\t\t\t index_t topfw,\n\t\t\t\t\t\t\t\t\t\t index_t botfw,\n\t\t\t\t\t\t\t\t\t\t index_t topbw,\n\t\t\t\t\t\t\t\t\t\t index_t botbw)\n{\n\ttmpdnastr_.clear();\n\tfor(size_t i = dep; i < len; i++) {\n\t\ttmpdnastr_.append(seq[i]);\n\t}\n\tindex_t top_fw = 0, bot_fw = 0;\n\tebwtFw->contains(tmpdnastr_, &top_fw, &bot_fw);\n\tassert_eq(top_fw, topfw);\n\tassert_eq(bot_fw, botfw);\n\tif(do1mm && ebwtBw != NULL) {\n\t\ttmpdnastr_.reverse();\n\t\tindex_t top_bw = 0, bot_bw = 0;\n\t\tebwtBw->contains(tmpdnastr_, &top_bw, &bot_bw);\n\t\tassert_eq(top_bw, topbw);\n\t\tassert_eq(bot_bw, botbw);\n\t}\n\treturn true;\n}\n\n/**\n * Sweep right-to-left and left-to-right using exact matching.  Remember all\n * the SA ranges encountered along the way.  Report exact matches if there are\n * any.  
Calculate a lower bound on the number of edits in an end-to-end\n * alignment.\n */\ntemplate <typename index_t>\nsize_t SeedAligner<index_t>::exactSweep(\n\t\t\t\t\t\t\t\t\t\tconst Ebwt<index_t>&  ebwt,    // BWT index\n\t\t\t\t\t\t\t\t\t\tconst Read&           read,    // read to align\n\t\t\t\t\t\t\t\t\t\tconst Scoring&        sc,      // scoring scheme\n\t\t\t\t\t\t\t\t\t\tbool                  nofw,    // don't align forward read\n\t\t\t\t\t\t\t\t\t\tbool                  norc,    // don't align revcomp read\n\t\t\t\t\t\t\t\t\t\tsize_t                mineMax, // don't care about edit bounds > this\n\t\t\t\t\t\t\t\t\t\tsize_t&               mineFw,  // minimum # edits for forward read\n\t\t\t\t\t\t\t\t\t\tsize_t&               mineRc,  // minimum # edits for revcomp read\n\t\t\t\t\t\t\t\t\t\tbool                  repex,   // report 0mm hits?\n\t\t\t\t\t\t\t\t\t\tSeedResults<index_t>& hits,    // holds all the seed hits (and exact hit)\n\t\t\t\t\t\t\t\t\t\tSeedSearchMetrics&    met)     // metrics\n{\n\tassert_gt(mineMax, 0);\n\tindex_t top = 0, bot = 0;\n\tSideLocus<index_t> tloc, bloc;\n\tconst size_t len = read.length();\n\tsize_t nelt = 0;\n\tfor(int fwi = 0; fwi < 2; fwi++) {\n\t\tbool fw = (fwi == 0);\n\t\tif( fw && nofw) continue;\n\t\tif(!fw && norc) continue;\n\t\tconst BTDnaString& seq = fw ? 
read.patFw : read.patRc;\n\t\tassert(!seq.empty());\n\t\tint ftabLen = ebwt.eh().ftabChars();\n\t\tsize_t dep = 0;\n\t\tsize_t nedit = 0;\n\t\tbool done = false;\n\t\twhile(dep < len && !done) {\n\t\t\ttop = bot = 0;\n\t\t\tsize_t left = len - dep;\n\t\t\tassert_gt(left, 0);\n\t\t\tbool doFtab = ftabLen > 1 && left >= (size_t)ftabLen;\n\t\t\tif(doFtab) {\n\t\t\t\t// Does N interfere with use of Ftab?\n\t\t\t\tfor(size_t i = 0; i < (size_t)ftabLen; i++) {\n\t\t\t\t\tint c = seq[len-dep-1-i];\n\t\t\t\t\tif(c > 3) {\n\t\t\t\t\t\tdoFtab = false;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(doFtab) {\n\t\t\t\t// Use ftab\n\t\t\t\tebwt.ftabLoHi(seq, len - dep - ftabLen, false, top, bot);\n\t\t\t\tdep += (size_t)ftabLen;\n\t\t\t} else {\n\t\t\t\t// Use fchr\n\t\t\t\tint c = seq[len-dep-1];\n\t\t\t\tif(c < 4) {\n\t\t\t\t\ttop = ebwt.fchr()[c];\n\t\t\t\t\tbot = ebwt.fchr()[c+1];\n\t\t\t\t}\n\t\t\t\tdep++;\n\t\t\t}\n\t\t\tif(bot <= top) {\n\t\t\t\tnedit++;\n\t\t\t\tif(nedit >= mineMax) {\n\t\t\t\t\tif(fw) { mineFw = nedit; } else { mineRc = nedit; }\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tINIT_LOCS(top, bot, tloc, bloc, ebwt);\n\t\t\t// Keep going\n\t\t\twhile(dep < len) {\n\t\t\t\tint c = seq[len-dep-1];\n\t\t\t\tif(c > 3) {\n\t\t\t\t\ttop = bot = 0;\n\t\t\t\t} else {\n\t\t\t\t\tif(bloc.valid()) {\n\t\t\t\t\t\tbwops_ += 2;\n\t\t\t\t\t\ttop = ebwt.mapLF(tloc, c);\n\t\t\t\t\t\tbot = ebwt.mapLF(bloc, c);\n\t\t\t\t\t} else {\n\t\t\t\t\t\tbwops_++;\n\t\t\t\t\t\ttop = ebwt.mapLF1(top, tloc, c);\n\t\t\t\t\t\tif(top == (index_t)OFF_MASK) {\n\t\t\t\t\t\t\ttop = bot = 0;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tbot = top+1;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(bot <= top) {\n\t\t\t\t\tnedit++;\n\t\t\t\t\tif(nedit >= mineMax) {\n\t\t\t\t\t\tif(fw) { mineFw = nedit; } else { mineRc = nedit; }\n\t\t\t\t\t\tdone = true;\n\t\t\t\t\t}\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tINIT_LOCS(top, bot, tloc, bloc, 
ebwt);\n\t\t\t\tdep++;\n\t\t\t}\n\t\t\tif(done) {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tif(dep == len) {\n\t\t\t\t// Set the minimum # edits\n\t\t\t\tif(fw) { mineFw = nedit; } else { mineRc = nedit; }\n\t\t\t\t// Done\n\t\t\t\tif(nedit == 0 && bot > top) {\n\t\t\t\t\tif(repex) {\n\t\t\t\t\t\t// This is an exact hit\n\t\t\t\t\t\tint64_t score = len * sc.match();\n\t\t\t\t\t\tif(fw) {\n\t\t\t\t\t\t\thits.addExactEeFw(top, bot, NULL, NULL, fw, score);\n\t\t\t\t\t\t\tassert(ebwt.contains(seq, NULL, NULL));\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\thits.addExactEeRc(top, bot, NULL, NULL, fw, score);\n\t\t\t\t\t\t\tassert(ebwt.contains(seq, NULL, NULL));\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tnelt += (bot - top);\n\t\t\t\t}\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tdep++;\n\t\t}\n\t}\n\treturn nelt;\n}\n\n/**\n * Search for end-to-end alignments of the read with up to 1 mismatch.  Return\n * true iff one is found.\n */\ntemplate <typename index_t>\nbool SeedAligner<index_t>::oneMmSearch(\n\t\t\t\t\t\t\t\t\t   const Ebwt<index_t>*  ebwtFw, // BWT index\n\t\t\t\t\t\t\t\t\t   const Ebwt<index_t>*  ebwtBw, // BWT' index\n\t\t\t\t\t\t\t\t\t   const Read&           read,   // read to align\n\t\t\t\t\t\t\t\t\t   const Scoring&        sc,     // scoring\n\t\t\t\t\t\t\t\t\t   int64_t               minsc,  // minimum score\n\t\t\t\t\t\t\t\t\t   bool                  nofw,   // don't align forward read\n\t\t\t\t\t\t\t\t\t   bool                  norc,   // don't align revcomp read\n\t\t\t\t\t\t\t\t\t   bool                  local,  // 1mm hits must be legal local alignments\n\t\t\t\t\t\t\t\t\t   bool                  repex,  // report 0mm hits?\n\t\t\t\t\t\t\t\t\t   bool                  rep1mm, // report 1mm hits?\n\t\t\t\t\t\t\t\t\t   SeedResults<index_t>& hits,   // holds all the seed hits (and exact hit)\n\t\t\t\t\t\t\t\t\t   SeedSearchMetrics&    met)    // metrics\n{\n\tassert(!rep1mm || ebwtBw != NULL);\n\tconst size_t len = read.length();\n\tint nceil = sc.nCeil.f<int>((double)len);\n\tsize_t ns = read.ns();\n\tif(ns > 1) 
{\n\t\t// Can't align this with <= 1 mismatches\n\t\treturn false;\n\t} else if(ns == 1 && !rep1mm) {\n\t\t// Can't align this with 0 mismatches\n\t\treturn false;\n\t}\n\tassert_geq(len, 2);\n\tassert(!rep1mm || ebwtBw->eh().ftabChars() == ebwtFw->eh().ftabChars());\n#ifndef NDEBUG\n\tif(ebwtBw != NULL) {\n\t\tfor(int i = 0; i < 4; i++) {\n\t\t\tassert_eq(ebwtBw->fchr()[i], ebwtFw->fchr()[i]);\n\t\t}\n\t}\n#endif\n\tsize_t halfFw = len >> 1;\n\tsize_t halfBw = len >> 1;\n\tif((len & 1) != 0) {\n\t\thalfBw++;\n\t}\n\tassert_geq(halfFw, 1);\n\tassert_geq(halfBw, 1);\n\tSideLocus<index_t> tloc, bloc;\n\tindex_t t[4], b[4];   // dest BW ranges for BWT\n\tt[0] = t[1] = t[2] = t[3] = 0;\n\tb[0] = b[1] = b[2] = b[3] = 0;\n\tindex_t tp[4], bp[4]; // dest BW ranges for BWT'\n\ttp[0] = tp[1] = tp[2] = tp[3] = 0;\n\tbp[0] = bp[1] = bp[2] = bp[3] = 0;\n\tindex_t top = 0, bot = 0, topp = 0, botp = 0;\n\t// Align fw read / rc read\n\tbool results = false;\n\tfor(int fwi = 0; fwi < 2; fwi++) {\n\t\tbool fw = (fwi == 0);\n\t\tif( fw && nofw) continue;\n\t\tif(!fw && norc) continue;\n\t\t// Align going right-to-left, left-to-right\n\t\tint lim = rep1mm ? 2 : 1;\n\t\tfor(int ebwtfwi = 0; ebwtfwi < lim; ebwtfwi++) {\n\t\t\tbool ebwtfw = (ebwtfwi == 0);\n\t\t\tconst Ebwt<index_t>* ebwt  = (ebwtfw ? ebwtFw : ebwtBw);\n\t\t\tconst Ebwt<index_t>* ebwtp = (ebwtfw ? ebwtBw : ebwtFw);\n\t\t\tassert(rep1mm || ebwt->fw());\n\t\t\tconst BTDnaString& seq =\n\t\t\t(fw ? (ebwtfw ? read.patFw : read.patFwRev) :\n\t\t\t (ebwtfw ? read.patRc : read.patRcRev));\n\t\t\tassert(!seq.empty());\n\t\t\tconst BTString& qual =\n\t\t\t(fw ? (ebwtfw ? read.qual    : read.qualRev) :\n\t\t\t (ebwtfw ? read.qualRev : read.qual));\n\t\t\tint ftabLen = ebwt->eh().ftabChars();\n\t\t\tsize_t nea = ebwtfw ? 
halfFw : halfBw;\n\t\t\t// Check if there's an N in the near portion\n\t\t\tbool skip = false;\n\t\t\tfor(size_t dep = 0; dep < nea; dep++) {\n\t\t\t\tif(seq[len-dep-1] > 3) {\n\t\t\t\t\tskip = true;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(skip) {\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tsize_t dep = 0;\n\t\t\t// Align near half\n\t\t\tif(ftabLen > 1 && (size_t)ftabLen <= nea) {\n\t\t\t\t// Use ftab to jump partway into near half\n\t\t\t\tbool rev = !ebwtfw;\n\t\t\t\tebwt->ftabLoHi(seq, len - ftabLen, rev, top, bot);\n\t\t\t\tif(rep1mm) {\n\t\t\t\t\tebwtp->ftabLoHi(seq, len - ftabLen, rev, topp, botp);\n\t\t\t\t\tassert_eq(bot - top, botp - topp);\n\t\t\t\t}\n\t\t\t\tif(bot - top == 0) {\n\t\t\t\t\tcontinue;\n\t\t\t\t}\n\t\t\t\tint c = seq[len - ftabLen];\n\t\t\t\tt[c] = top; b[c] = bot;\n\t\t\t\ttp[c] = topp; bp[c] = botp;\n\t\t\t\tdep = ftabLen;\n\t\t\t\t// initialize tloc, bloc??\n\t\t\t} else {\n\t\t\t\t// Use fchr to jump in by 1 pos\n\t\t\t\tint c = seq[len-1];\n\t\t\t\tassert_range(0, 3, c);\n\t\t\t\ttop = topp = tp[c] = ebwt->fchr()[c];\n\t\t\t\tbot = botp = bp[c] = ebwt->fchr()[c+1];\n\t\t\t\tif(bot - top == 0) {\n\t\t\t\t\tcontinue;\n\t\t\t\t}\n\t\t\t\tdep = 1;\n\t\t\t\t// initialize tloc, bloc??\n\t\t\t}\n\t\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\t\tassert(sanityPartial(ebwt, ebwtp, seq, len-dep, len, rep1mm, top, bot, topp, botp));\n\t\t\tbool do_continue = false;\n\t\t\tfor(; dep < nea; dep++) {\n\t\t\t\tassert_lt(dep, len);\n\t\t\t\tint rdc = seq[len - dep - 1];\n\t\t\t\ttp[0] = tp[1] = tp[2] = tp[3] = topp;\n\t\t\t\tbp[0] = bp[1] = bp[2] = bp[3] = botp;\n\t\t\t\tif(bloc.valid()) {\n\t\t\t\t\tbwops_++;\n\t\t\t\t\tt[0] = t[1] = t[2] = t[3] = b[0] = b[1] = b[2] = b[3] = 0;\n\t\t\t\t\tebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);\n\t\t\t\t\tSANITY_CHECK_4TUP(t, b, tp, bp);\n\t\t\t\t\ttop = t[rdc]; bot = b[rdc];\n\t\t\t\t\tif(bot <= top) {\n\t\t\t\t\t\tdo_continue = true;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\ttopp = tp[rdc]; botp = 
bp[rdc];\n\t\t\t\t\tassert(!rep1mm || bot - top == botp - topp);\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(bot, top+1);\n\t\t\t\t\tassert(!rep1mm || botp == topp+1);\n\t\t\t\t\tbwops_++;\n\t\t\t\t\ttop = ebwt->mapLF1(top, tloc, rdc);\n\t\t\t\t\tif(top == (index_t)OFF_MASK) {\n\t\t\t\t\t\tdo_continue = true;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tbot = top + 1;\n\t\t\t\t\tt[rdc] = top; b[rdc] = bot;\n\t\t\t\t\ttp[rdc] = topp; bp[rdc] = botp;\n\t\t\t\t\tassert(!rep1mm || b[rdc] - t[rdc] == bp[rdc] - tp[rdc]);\n\t\t\t\t\t// topp/botp stay the same\n\t\t\t\t}\n\t\t\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\t\t\tassert(sanityPartial(ebwt, ebwtp, seq, len - dep - 1, len, rep1mm, top, bot, topp, botp));\n\t\t\t}\n\t\t\tif(do_continue) {\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\t// Align far half\n\t\t\tfor(; dep < len; dep++) {\n\t\t\t\tint rdc = seq[len-dep-1];\n\t\t\t\tint quc = qual[len-dep-1];\n\t\t\t\tif(rdc > 3 && nceil == 0) {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\ttp[0] = tp[1] = tp[2] = tp[3] = topp;\n\t\t\t\tbp[0] = bp[1] = bp[2] = bp[3] = botp;\n\t\t\t\tint clo = 0, chi = 3;\n\t\t\t\tbool match = true;\n\t\t\t\tif(bloc.valid()) {\n\t\t\t\t\tbwops_++;\n\t\t\t\t\tt[0] = t[1] = t[2] = t[3] = b[0] = b[1] = b[2] = b[3] = 0;\n\t\t\t\t\tebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);\n\t\t\t\t\tSANITY_CHECK_4TUP(t, b, tp, bp);\n\t\t\t\t\tmatch = rdc < 4;\n\t\t\t\t\ttop = t[rdc]; bot = b[rdc];\n\t\t\t\t\ttopp = tp[rdc]; botp = bp[rdc];\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(bot, top+1);\n\t\t\t\t\tassert(!rep1mm || botp == topp+1);\n\t\t\t\t\tbwops_++;\n\t\t\t\t\tclo = ebwt->mapLF1(top, tloc);\n\t\t\t\t\tmatch = (clo == rdc);\n\t\t\t\t\tassert_range(-1, 3, clo);\n\t\t\t\t\tif(clo < 0) {\n\t\t\t\t\t\tbreak; // Hit the $\n\t\t\t\t\t} else {\n\t\t\t\t\t\tt[clo] = top;\n\t\t\t\t\t\tb[clo] = bot = top + 1;\n\t\t\t\t\t}\n\t\t\t\t\tbp[clo] = botp;\n\t\t\t\t\ttp[clo] = topp;\n\t\t\t\t\tassert(!rep1mm || bot - top == botp - topp);\n\t\t\t\t\tassert(!rep1mm || b[clo] - t[clo] == 
bp[clo] - tp[clo]);\n\t\t\t\t\tchi = clo;\n\t\t\t\t}\n\t\t\t\t//assert(sanityPartial(ebwt, ebwtp, seq, len - dep - 1, len, rep1mm, top, bot, topp, botp));\n\t\t\t\tif(rep1mm && (ns == 0 || rdc > 3)) {\n\t\t\t\t\tfor(int j = clo; j <= chi; j++) {\n\t\t\t\t\t\tif(j == rdc || b[j] == t[j]) {\n\t\t\t\t\t\t\t// Either matches read or isn't a possibility\n\t\t\t\t\t\t\tcontinue;\n\t\t\t\t\t\t}\n\t\t\t\t\t\t// Potential mismatch - next, try\n\t\t\t\t\t\tsize_t depm = dep + 1;\n\t\t\t\t\t\tindex_t topm = t[j], botm = b[j];\n\t\t\t\t\t\tindex_t topmp = tp[j], botmp = bp[j];\n\t\t\t\t\t\tassert_eq(botm - topm, botmp - topmp);\n\t\t\t\t\t\tindex_t tm[4], bm[4];   // dest BW ranges for BWT\n\t\t\t\t\t\ttm[0] = t[0]; tm[1] = t[1];\n\t\t\t\t\t\ttm[2] = t[2]; tm[3] = t[3];\n\t\t\t\t\t\tbm[0] = b[0]; bm[1] = t[1];\n\t\t\t\t\t\tbm[2] = b[2]; bm[3] = t[3];\n\t\t\t\t\t\tindex_t tmp[4], bmp[4]; // dest BW ranges for BWT'\n\t\t\t\t\t\ttmp[0] = tp[0]; tmp[1] = tp[1];\n\t\t\t\t\t\ttmp[2] = tp[2]; tmp[3] = tp[3];\n\t\t\t\t\t\tbmp[0] = bp[0]; bmp[1] = tp[1];\n\t\t\t\t\t\tbmp[2] = bp[2]; bmp[3] = tp[3];\n\t\t\t\t\t\tSideLocus<index_t> tlocm, blocm;\n\t\t\t\t\t\tINIT_LOCS(topm, botm, tlocm, blocm, *ebwt);\n\t\t\t\t\t\tfor(; depm < len; depm++) {\n\t\t\t\t\t\t\tint rdcm = seq[len - depm - 1];\n\t\t\t\t\t\t\ttmp[0] = tmp[1] = tmp[2] = tmp[3] = topmp;\n\t\t\t\t\t\t\tbmp[0] = bmp[1] = bmp[2] = bmp[3] = botmp;\n\t\t\t\t\t\t\tif(blocm.valid()) {\n\t\t\t\t\t\t\t\tbwops_++;\n\t\t\t\t\t\t\t\ttm[0] = tm[1] = tm[2] = tm[3] =\n\t\t\t\t\t\t\t\tbm[0] = bm[1] = bm[2] = bm[3] = 0;\n\t\t\t\t\t\t\t\tebwt->mapBiLFEx(tlocm, blocm, tm, bm, tmp, bmp);\n\t\t\t\t\t\t\t\tSANITY_CHECK_4TUP(tm, bm, tmp, bmp);\n\t\t\t\t\t\t\t\ttopm = tm[rdcm]; botm = bm[rdcm];\n\t\t\t\t\t\t\t\ttopmp = tmp[rdcm]; botmp = bmp[rdcm];\n\t\t\t\t\t\t\t\tif(botm <= topm) {\n\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\tassert_eq(botm, topm+1);\n\t\t\t\t\t\t\t\tassert_eq(botmp, 
topmp+1);\n\t\t\t\t\t\t\t\tbwops_++;\n\t\t\t\t\t\t\t\ttopm = ebwt->mapLF1(topm, tlocm, rdcm);\n\t\t\t\t\t\t\t\tif(topm == (index_t)0xffffffff) {\n\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tbotm = topm + 1;\n\t\t\t\t\t\t\t\t// topp/botp stay the same\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tINIT_LOCS(topm, botm, tlocm, blocm, *ebwt);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(depm == len) {\n\t\t\t\t\t\t\t// Success; this is a 1MM hit\n\t\t\t\t\t\t\tsize_t off5p = dep;  // offset from 5' end of read\n\t\t\t\t\t\t\tsize_t offstr = dep; // offset into patFw/patRc\n\t\t\t\t\t\t\tif(fw == ebwtfw) {\n\t\t\t\t\t\t\t\toff5p = len - off5p - 1;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif(!ebwtfw) {\n\t\t\t\t\t\t\t\toffstr = len - offstr - 1;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tEdit e((uint32_t)off5p, j, rdc, EDIT_TYPE_MM, false);\n\t\t\t\t\t\t\tresults = true;\n\t\t\t\t\t\t\tint64_t score = (len - 1) * sc.match();\n\t\t\t\t\t\t\t// In --local mode, need to double-check that\n\t\t\t\t\t\t\t// end-to-end alignment doesn't violate  local\n\t\t\t\t\t\t\t// alignment principles.  
Specifically, it\n\t\t\t\t\t\t\t// shouldn't to or below 0 anywhere in the middle.\n\t\t\t\t\t\t\tint pen = sc.score(rdc, (int)(1 << j), quc - 33);\n\t\t\t\t\t\t\tscore += pen;\n\t\t\t\t\t\t\tbool valid = true;\n\t\t\t\t\t\t\tif(local) {\n\t\t\t\t\t\t\t\tint64_t locscore_fw = 0, locscore_bw = 0;\n\t\t\t\t\t\t\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\t\t\t\t\t\t\tif(i == dep) {\n\t\t\t\t\t\t\t\t\t\tif(locscore_fw + pen <= 0) {\n\t\t\t\t\t\t\t\t\t\t\tvalid = false;\n\t\t\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t\tlocscore_fw += pen;\n\t\t\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\t\t\tlocscore_fw += sc.match();\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\tif(len-i-1 == dep) {\n\t\t\t\t\t\t\t\t\t\tif(locscore_bw + pen <= 0) {\n\t\t\t\t\t\t\t\t\t\t\tvalid = false;\n\t\t\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t\t\tlocscore_bw += pen;\n\t\t\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\t\t\tlocscore_bw += sc.match();\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif(valid) {\n\t\t\t\t\t\t\t\tvalid = score >= minsc;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif(valid) {\n#ifndef NDEBUG\n\t\t\t\t\t\t\t\tBTDnaString& rf = tmprfdnastr_;\n\t\t\t\t\t\t\t\trf.clear();\n\t\t\t\t\t\t\t\tedits_.clear();\n\t\t\t\t\t\t\t\tedits_.push_back(e);\n\t\t\t\t\t\t\t\tif(!fw) Edit::invertPoss(edits_, len, false);\n\t\t\t\t\t\t\t\tEdit::toRef(fw ? read.patFw : read.patRc, edits_, rf);\n\t\t\t\t\t\t\t\tif(!fw) Edit::invertPoss(edits_, len, false);\n\t\t\t\t\t\t\t\tassert_eq(len, rf.length());\n\t\t\t\t\t\t\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\t\t\t\t\t\t\tassert_lt((int)rf[i], 4);\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tASSERT_ONLY(index_t toptmp = 0);\n\t\t\t\t\t\t\t\tASSERT_ONLY(index_t bottmp = 0);\n\t\t\t\t\t\t\t\tassert(ebwtFw->contains(rf, &toptmp, &bottmp));\n#endif\n\t\t\t\t\t\t\t\tindex_t toprep = ebwtfw ? topm : topmp;\n\t\t\t\t\t\t\t\tindex_t botrep = ebwtfw ? 
botm : botmp;\n\t\t\t\t\t\t\t\tassert_eq(toprep, toptmp);\n\t\t\t\t\t\t\t\tassert_eq(botrep, bottmp);\n\t\t\t\t\t\t\t\thits.add1mmEe(toprep, botrep, &e, NULL, fw, score);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(bot > top && match) {\n\t\t\t\t\tassert_lt(rdc, 4);\n\t\t\t\t\tif(dep == len-1) {\n\t\t\t\t\t\t// Success; this is an exact hit\n\t\t\t\t\t\tif(ebwtfw && repex) {\n\t\t\t\t\t\t\tif(fw) {\n\t\t\t\t\t\t\t\tresults = true;\n\t\t\t\t\t\t\t\tint64_t score = len * sc.match();\n\t\t\t\t\t\t\t\thits.addExactEeFw(\n\t\t\t\t\t\t\t\t\t\t\t\t  ebwtfw ? top : topp,\n\t\t\t\t\t\t\t\t\t\t\t\t  ebwtfw ? bot : botp,\n\t\t\t\t\t\t\t\t\t\t\t\t  NULL, NULL, fw, score);\n\t\t\t\t\t\t\t\tassert(ebwtFw->contains(seq, NULL, NULL));\n\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\tresults = true;\n\t\t\t\t\t\t\t\tint64_t score = len * sc.match();\n\t\t\t\t\t\t\t\thits.addExactEeRc(\n\t\t\t\t\t\t\t\t\t\t\t\t  ebwtfw ? top : topp,\n\t\t\t\t\t\t\t\t\t\t\t\t  ebwtfw ? bot : botp,\n\t\t\t\t\t\t\t\t\t\t\t\t  NULL, NULL, fw, score);\n\t\t\t\t\t\t\t\tassert(ebwtFw->contains(seq, NULL, NULL));\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t\tbreak; // End of far loop\n\t\t\t\t\t} else {\n\t\t\t\t\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\t\t\t\t\tassert(sanityPartial(ebwt, ebwtp, seq, len - dep - 1, len, rep1mm, top, bot, topp, botp));\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tbreak; // End of far loop\n\t\t\t\t}\n\t\t\t} // for(; dep < len; dep++)\n\t\t} // for(int ebwtfw = 0; ebwtfw < 2; ebwtfw++)\n\t} // for(int fw = 0; fw < 2; fw++)\n\treturn results;\n}\n\n/**\n * Wrapper for initial invcation of searchSeed.\n */\ntemplate <typename index_t>\nbool SeedAligner<index_t>::searchSeedBi() {\n\treturn searchSeedBi(\n\t\t\t\t\t\t0, 0,\n\t\t\t\t\t\t0, 0, 0, 0,\n\t\t\t\t\t\tSideLocus<index_t>(), SideLocus<index_t>(),\n\t\t\t\t\t\ts_->cons[0], s_->cons[1], s_->cons[2], s_->overall,\n\t\t\t\t\t\tNULL);\n}\n\n/**\n * Get tloc, bloc ready for the next step.  
If the new range is under\n * the ceiling.\n */\ntemplate <typename index_t>\ninline void SeedAligner<index_t>::nextLocsBi(\n\t\t\t\t\t\t\t\t\t\t\t SideLocus<index_t>& tloc, // top locus\n\t\t\t\t\t\t\t\t\t\t\t SideLocus<index_t>& bloc, // bot locus\n\t\t\t\t\t\t\t\t\t\t\t index_t topf,             // top in BWT\n\t\t\t\t\t\t\t\t\t\t\t index_t botf,             // bot in BWT\n\t\t\t\t\t\t\t\t\t\t\t index_t topb,             // top in BWT'\n\t\t\t\t\t\t\t\t\t\t\t index_t botb,             // bot in BWT'\n\t\t\t\t\t\t\t\t\t\t\t int step                  // step to get ready for\n#if 0\n\t\t\t\t\t\t\t\t\t\t\t , const SABWOffTrack* prevOt, // previous tracker\n\t\t\t\t\t\t\t\t\t\t\t SABWOffTrack& ot            // current tracker\n#endif\n\t\t\t\t\t\t\t\t\t\t\t )\n{\n\tassert_gt(botf, 0);\n\tassert(ebwtBw_ == NULL || botb > 0);\n\tassert_geq(step, 0); // next step can't be first one\n\tassert(ebwtBw_ == NULL || botf-topf == botb-topb);\n\tif(step == (int)s_->steps.size()) return; // no more steps!\n\t// Which direction are we going in next?\n\tif(s_->steps[step] > 0) {\n\t\t// Left to right; use BWT'\n\t\tif(botb - topb == 1) {\n\t\t\t// Already down to 1 row; just init top locus\n\t\t\ttloc.initFromRow(topb, ebwtBw_->eh(), ebwtBw_->ebwt());\n\t\t\tbloc.invalidate();\n\t\t} else {\n\t\t\tSideLocus<index_t>::initFromTopBot(\n\t\t\t\t\t\t\t\t\t\t\t   topb, botb, ebwtBw_->eh(), ebwtBw_->ebwt(), tloc, bloc);\n\t\t\tassert(bloc.valid());\n\t\t}\n\t} else {\n\t\t// Right to left; use BWT\n\t\tif(botf - topf == 1) {\n\t\t\t// Already down to 1 row; just init top locus\n\t\t\ttloc.initFromRow(topf, ebwtFw_->eh(), ebwtFw_->ebwt());\n\t\t\tbloc.invalidate();\n\t\t} else {\n\t\t\tSideLocus<index_t>::initFromTopBot(\n\t\t\t\t\t\t\t\t\t\t\t   topf, botf, ebwtFw_->eh(), ebwtFw_->ebwt(), tloc, bloc);\n\t\t\tassert(bloc.valid());\n\t\t}\n\t}\n\t// Check if we should update the tracker with this refinement\n#if 0\n\tif(botf-topf <= BW_OFF_TRACK_CEIL) {\n\t\tif(ot.size() == 0 && prevOt 
!= NULL && prevOt->size() > 0) {\n\t\t\t// Inherit state from the predecessor\n\t\t\tot = *prevOt;\n\t\t}\n\t\tbool ltr = s_->steps[step-1] > 0;\n\t\tint adj = abs(s_->steps[step-1])-1;\n\t\tconst Ebwt<index_t>* ebwt = ltr ? ebwtBw_ : ebwtFw_;\n\t\tot.update(\n\t\t\t\t  ltr ? topb : topf,    // top\n\t\t\t\t  ltr ? botb : botf,    // bot\n\t\t\t\t  adj,                  // adj (to be subtracted from offset)\n\t\t\t\t  ebwt->offs(),         // offs array\n\t\t\t\t  ebwt->eh().offRate(), // offrate (sample = every 1 << offrate elts)\n\t\t\t\t  NULL                  // dead\n\t\t\t\t  );\n\t\tassert_gt(ot.size(), 0);\n\t}\n#endif\n\tassert(botf - topf == 1 ||  bloc.valid());\n\tassert(botf - topf > 1  || !bloc.valid());\n}\n\n/**\n * Report a seed hit found by searchSeedBi(), but first try to extend it out in\n * either direction as far as possible without hitting any edits.  This will\n * allow us to prioritize the seed hits better later on.  Call reportHit() when\n * we're done, which actually adds the hit to the cache.  Returns result from\n * calling reportHit().\n */\ntemplate <typename index_t>\nbool SeedAligner<index_t>::extendAndReportHit(\n\t\t\t\t\t\t\t\t\t\t\t  index_t topf,                      // top in BWT\n\t\t\t\t\t\t\t\t\t\t\t  index_t botf,                      // bot in BWT\n\t\t\t\t\t\t\t\t\t\t\t  index_t topb,                      // top in BWT'\n\t\t\t\t\t\t\t\t\t\t\t  index_t botb,                      // bot in BWT'\n\t\t\t\t\t\t\t\t\t\t\t  index_t len,                       // length of hit\n\t\t\t\t\t\t\t\t\t\t\t  DoublyLinkedList<Edit> *prevEdit)  // previous edit\n{\n\tindex_t nlex = 0, nrex = 0;\n\tindex_t t[4], b[4];\n\tindex_t tp[4], bp[4];\n\tSideLocus<index_t> tloc, bloc;\n\tif(off_ > 0) {\n\t\tconst Ebwt<index_t> *ebwt = ebwtFw_;\n\t\tassert(ebwt != NULL);\n\t\t// Extend left using forward index\n\t\tconst BTDnaString& seq = fw_ ? 
read_->patFw : read_->patRc;\n\t\t// See what we get by extending left\n\t\tindex_t top = topf, bot = botf;\n\t\tt[0] = t[1] = t[2] = t[3] = 0;\n\t\tb[0] = b[1] = b[2] = b[3] = 0;\n\t\ttp[0] = tp[1] = tp[2] = tp[3] = topb;\n\t\tbp[0] = bp[1] = bp[2] = bp[3] = botb;\n\t\tSideLocus<index_t> tloc, bloc;\n\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\tfor(size_t ii = off_; ii > 0; ii--) {\n\t\t\tsize_t i = ii-1;\n\t\t\t// Get char from read\n\t\t\tint rdc = seq.get(i);\n\t\t\t// See what we get by extending left\n\t\t\tif(bloc.valid()) {\n\t\t\t\tbwops_++;\n\t\t\t\tt[0] = t[1] = t[2] = t[3] =\n\t\t\t\tb[0] = b[1] = b[2] = b[3] = 0;\n\t\t\t\tebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);\n\t\t\t\tSANITY_CHECK_4TUP(t, b, tp, bp);\n\t\t\t\tint nonz = -1;\n\t\t\t\tbool abort = false;\n\t\t\t\t// Index the 4-element range arrays by character j, not by read offset i\n\t\t\t\tfor(int j = 0; j < 4; j++) {\n\t\t\t\t\tif(b[j] > t[j]) {\n\t\t\t\t\t\tif(nonz >= 0) {\n\t\t\t\t\t\t\tabort = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tnonz = j;\n\t\t\t\t\t\ttop = t[j]; bot = b[j];\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(abort || nonz != rdc) {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tassert_eq(bot, top+1);\n\t\t\t\tbwops_++;\n\t\t\t\tint c = ebwt->mapLF1(top, tloc);\n\t\t\t\tif(c != rdc) {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tbot = top + 1;\n\t\t\t}\n\t\t\tif(++nlex == 255) {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\t}\n\t}\n\tsize_t rdlen = read_->length();\n\tsize_t nright = rdlen - off_ - len;\n\tif(nright > 0 && ebwtBw_ != NULL) {\n\t\tconst Ebwt<index_t> *ebwt = ebwtBw_;\n\t\tassert(ebwt != NULL);\n\t\t// Extend right using backward index\n\t\tconst BTDnaString& seq = fw_ ? 
read_->patFw : read_->patRc;\n\t\t// See what we get by extending right\n\t\tindex_t top = topb, bot = botb;\n\t\tt[0] = t[1] = t[2] = t[3] = 0;\n\t\tb[0] = b[1] = b[2] = b[3] = 0;\n\t\ttp[0] = tp[1] = tp[2] = tp[3] = topb;\n\t\tbp[0] = bp[1] = bp[2] = bp[3] = botb;\n\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\tfor(size_t i = off_ + len; i < rdlen; i++) {\n\t\t\t// Get char from read\n\t\t\tint rdc = seq.get(i);\n\t\t\t// See what we get by extending right\n\t\t\tif(bloc.valid()) {\n\t\t\t\tbwops_++;\n\t\t\t\tt[0] = t[1] = t[2] = t[3] =\n\t\t\t\tb[0] = b[1] = b[2] = b[3] = 0;\n\t\t\t\tebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);\n\t\t\t\tSANITY_CHECK_4TUP(t, b, tp, bp);\n\t\t\t\tint nonz = -1;\n\t\t\t\tbool abort = false;\n\t\t\t\t// Index the 4-element range arrays by character j, not by read offset i\n\t\t\t\tfor(int j = 0; j < 4; j++) {\n\t\t\t\t\tif(b[j] > t[j]) {\n\t\t\t\t\t\tif(nonz >= 0) {\n\t\t\t\t\t\t\tabort = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tnonz = j;\n\t\t\t\t\t\ttop = t[j]; bot = b[j];\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(abort || nonz != rdc) {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tassert_eq(bot, top+1);\n\t\t\t\tbwops_++;\n\t\t\t\tint c = ebwt->mapLF1(top, tloc);\n\t\t\t\tif(c != rdc) {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tbot = top + 1;\n\t\t\t}\n\t\t\tif(++nrex == 255) {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tINIT_LOCS(top, bot, tloc, bloc, *ebwt);\n\t\t}\n\t}\n\tassert_lt(nlex, rdlen);\n\tassert_leq(nlex, off_);\n\tassert_lt(nrex, rdlen);\n\treturn reportHit(topf, botf, topb, botb, len, prevEdit);\n}\n\n/**\n * Report a seed hit found by searchSeedBi() by adding it to the cache.  
Return\n * false if the hit could not be reported because of, e.g., cache exhaustion.\n */\ntemplate <typename index_t>\nbool SeedAligner<index_t>::reportHit(\n\t\t\t\t\t\t\t\t\t index_t topf,                      // top in BWT\n\t\t\t\t\t\t\t\t\t index_t botf,                      // bot in BWT\n\t\t\t\t\t\t\t\t\t index_t topb,                      // top in BWT'\n\t\t\t\t\t\t\t\t\t index_t botb,                      // bot in BWT'\n\t\t\t\t\t\t\t\t\t index_t len,                       // length of hit\n\t\t\t\t\t\t\t\t\t DoublyLinkedList<Edit> *prevEdit)  // previous edit\n{\n\t// Add information about the seed hit to AlignmentCache.  This\n\t// information eventually makes its way back to the SeedResults\n\t// object when we call finishAlign(...).\n\tBTDnaString& rf = tmprfdnastr_;\n\trf.clear();\n\tedits_.clear();\n\tif(prevEdit != NULL) {\n\t\tprevEdit->toList(edits_);\n\t\tEdit::sort(edits_);\n\t\tassert(Edit::repOk(edits_, *seq_));\n\t\tEdit::toRef(*seq_, edits_, rf);\n\t} else {\n\t\trf = *seq_;\n\t}\n\t// Sanity check: shouldn't add the same hit twice.  
If this\n\t// happens, it may be because our zone Constraints are not set up\n\t// properly and erroneously return true from acceptable() when they\n\t// should return false in some cases.\n\tassert_eq(hits_.size(), ca_->curNumRanges());\n\tassert(hits_.insert(rf));\n\tif(!ca_->addOnTheFly(rf, topf, botf, topb, botb)) {\n\t\treturn false;\n\t}\n\tassert_eq(hits_.size(), ca_->curNumRanges());\n#ifndef NDEBUG\n\t// Sanity check that the topf/botf and topb/botb ranges really\n\t// correspond to the reference sequence aligned to\n\t{\n\t\tBTDnaString rfr;\n\t\tindex_t tpf, btf, tpb, btb;\n\t\ttpf = btf = tpb = btb = 0;\n\t\tassert(ebwtFw_->contains(rf, &tpf, &btf));\n\t\tif(ebwtBw_ != NULL) {\n\t\t\trfr = rf;\n\t\t\trfr.reverse();\n\t\t\tassert(ebwtBw_->contains(rfr, &tpb, &btb));\n\t\t\tassert_eq(tpf, topf);\n\t\t\tassert_eq(btf, botf);\n\t\t\tassert_eq(tpb, topb);\n\t\t\tassert_eq(btb, botb);\n\t\t}\n\t}\n#endif\n\treturn true;\n}\n\n/**\n * Given a seed, search.  Assumes zone 0 = no backtracking.\n *\n * Return a list of Seed hits.\n * 1. Edits\n * 2. 
Bidirectional BWT range(s) on either end\n */\ntemplate <typename index_t>\nbool SeedAligner<index_t>::searchSeedBi(\n\t\t\t\t\t\t\t\t\t\tint step,                // depth into steps_[] array\n\t\t\t\t\t\t\t\t\t\tint depth,               // recursion depth\n\t\t\t\t\t\t\t\t\t\tindex_t topf,            // top in BWT\n\t\t\t\t\t\t\t\t\t\tindex_t botf,            // bot in BWT\n\t\t\t\t\t\t\t\t\t\tindex_t topb,            // top in BWT'\n\t\t\t\t\t\t\t\t\t\tindex_t botb,            // bot in BWT'\n\t\t\t\t\t\t\t\t\t\tSideLocus<index_t> tloc, // locus for top (perhaps unititialized)\n\t\t\t\t\t\t\t\t\t\tSideLocus<index_t> bloc, // locus for bot (perhaps unititialized)\n\t\t\t\t\t\t\t\t\t\tConstraint c0,           // constraints to enforce in seed zone 0\n\t\t\t\t\t\t\t\t\t\tConstraint c1,           // constraints to enforce in seed zone 1\n\t\t\t\t\t\t\t\t\t\tConstraint c2,           // constraints to enforce in seed zone 2\n\t\t\t\t\t\t\t\t\t\tConstraint overall,      // overall constraints to enforce\n\t\t\t\t\t\t\t\t\t\tDoublyLinkedList<Edit> *prevEdit  // previous edit\n#if 0\n\t\t\t\t\t\t\t\t\t\t, const SABWOffTrack* prevOt // prev off tracker (if tracking started)\n#endif\n\t\t\t\t\t\t\t\t\t\t)\n{\n\tassert(s_ != NULL);\n\tconst InstantiatedSeed& s = *s_;\n\tassert_gt(s.steps.size(), 0);\n\tassert(ebwtBw_ == NULL || ebwtBw_->eh().ftabChars() == ebwtFw_->eh().ftabChars());\n#ifndef NDEBUG\n\tfor(int i = 0; i < 4; i++) {\n\t\tassert(ebwtBw_ == NULL || ebwtBw_->fchr()[i] == ebwtFw_->fchr()[i]);\n\t}\n#endif\n\tif(step == (int)s.steps.size()) {\n\t\t// Finished aligning seed\n\t\tassert(c0.acceptable());\n\t\tassert(c1.acceptable());\n\t\tassert(c2.acceptable());\n\t\tif(!reportHit(topf, botf, topb, botb, seq_->length(), prevEdit)) {\n\t\t\treturn false; // Memory exhausted\n\t\t}\n\t\treturn true;\n\t}\n#ifndef NDEBUG\n\tif(depth > 0) {\n\t\tassert(botf - topf == 1 ||  bloc.valid());\n\t\tassert(botf - topf > 1  || !bloc.valid());\n\t}\n#endif\n\tint off;\n\tindex_t 
tp[4], bp[4]; // dest BW ranges for \"prime\" index\n\tif(step == 0) {\n\t\t// Just starting\n\t\tassert(prevEdit == NULL);\n\t\tassert(!tloc.valid());\n\t\tassert(!bloc.valid());\n\t\toff = s.steps[0];\n\t\tbool ltr = off > 0;\n\t\toff = abs(off)-1;\n\t\t// Check whether/how far we can jump using ftab or fchr\n\t\tint ftabLen = ebwtFw_->eh().ftabChars();\n\t\tif(ftabLen > 1 && ftabLen <= s.maxjump) {\n\t\t\tif(!ltr) {\n\t\t\t\tassert_geq(off+1, ftabLen-1);\n\t\t\t\toff = off - ftabLen + 1;\n\t\t\t}\n\t\t\tebwtFw_->ftabLoHi(*seq_, off, false, topf, botf);\n#ifdef NDEBUG\n\t\t\tif(botf - topf == 0) return true;\n#endif\n#ifdef NDEBUG\n\t\t\tif(ebwtBw_ != NULL) {\n\t\t\t\ttopb = ebwtBw_->ftabHi(*seq_, off);\n\t\t\t\tbotb = topb + (botf-topf);\n\t\t\t}\n#else\n\t\t\tif(ebwtBw_ != NULL) {\n\t\t\t\tebwtBw_->ftabLoHi(*seq_, off, false, topb, botb);\n\t\t\t\tassert_eq(botf-topf, botb-topb);\n\t\t\t}\n\t\t\tif(botf - topf == 0) return true;\n#endif\n\t\t\tstep += ftabLen;\n\t\t} else if(s.maxjump > 0) {\n\t\t\t// Use fchr\n\t\t\tint c = (*seq_)[off];\n\t\t\tassert_range(0, 3, c);\n\t\t\ttopf = topb = ebwtFw_->fchr()[c];\n\t\t\tbotf = botb = ebwtFw_->fchr()[c+1];\n\t\t\tif(botf - topf == 0) return true;\n\t\t\tstep++;\n\t\t} else {\n\t\t\tassert_eq(0, s.maxjump);\n\t\t\ttopf = topb = 0;\n\t\t\tbotf = botb = ebwtFw_->fchr()[4];\n\t\t}\n\t\tif(step == (int)s.steps.size()) {\n\t\t\t// Finished aligning seed\n\t\t\tassert(c0.acceptable());\n\t\t\tassert(c1.acceptable());\n\t\t\tassert(c2.acceptable());\n\t\t\tif(!reportHit(topf, botf, topb, botb, seq_->length(), prevEdit)) {\n\t\t\t\treturn false; // Memory exhausted\n\t\t\t}\n\t\t\treturn true;\n\t\t}\n\t\tnextLocsBi(tloc, bloc, topf, botf, topb, botb, step);\n\t\tassert(tloc.valid());\n\t} else assert(prevEdit != NULL);\n\tassert(tloc.valid());\n\tassert(botf - topf == 1 ||  bloc.valid());\n\tassert(botf - topf > 1  || !bloc.valid());\n\tassert_geq(step, 0);\n\tindex_t t[4], b[4]; // dest BW ranges\n\tConstraint* zones[3] = { 
&c0, &c1, &c2 };\n\tASSERT_ONLY(index_t lasttot = botf - topf);\n\tfor(int i = step; i < (int)s.steps.size(); i++) {\n\t\tassert_gt(botf, topf);\n\t\tassert(botf - topf == 1 ||  bloc.valid());\n\t\tassert(botf - topf > 1  || !bloc.valid());\n\t\tassert(ebwtBw_ == NULL || botf-topf == botb-topb);\n\t\tassert(tloc.valid());\n\t\toff = s.steps[i];\n\t\tbool ltr = off > 0;\n\t\tconst Ebwt<index_t>* ebwt = ltr ? ebwtBw_ : ebwtFw_;\n\t\tassert(ebwt != NULL);\n\t\tif(ltr) {\n\t\t\ttp[0] = tp[1] = tp[2] = tp[3] = topf;\n\t\t\tbp[0] = bp[1] = bp[2] = bp[3] = botf;\n\t\t} else {\n\t\t\ttp[0] = tp[1] = tp[2] = tp[3] = topb;\n\t\t\tbp[0] = bp[1] = bp[2] = bp[3] = botb;\n\t\t}\n\t\tt[0] = t[1] = t[2] = t[3] = b[0] = b[1] = b[2] = b[3] = 0;\n\t\tif(bloc.valid()) {\n\t\t\t// Range delimited by tloc/bloc has size >1.  If size == 1,\n\t\t\t// we use a simpler query (see if(!bloc.valid()) blocks below)\n\t\t\tbwops_++;\n\t\t\tebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);\n\t\t\tASSERT_ONLY(index_t tot = (b[0]-t[0])+(b[1]-t[1])+(b[2]-t[2])+(b[3]-t[3]));\n\t\t\tASSERT_ONLY(index_t totp = (bp[0]-tp[0])+(bp[1]-tp[1])+(bp[2]-tp[2])+(bp[3]-tp[3]));\n\t\t\tassert_eq(tot, totp);\n\t\t\tassert_leq(tot, lasttot);\n\t\t\tASSERT_ONLY(lasttot = tot);\n\t\t}\n\t\tindex_t *tf = ltr ? tp : t, *tb = ltr ? t : tp;\n\t\tindex_t *bf = ltr ? bp : b, *bb = ltr ? b : bp;\n\t\toff = abs(off)-1;\n\t\t//\n\t\tbool leaveZone = s.zones[i].first < 0;\n\t\t//bool leaveZoneIns = zones_[i].second < 0;\n\t\tConstraint& cons    = *zones[abs(s.zones[i].first)];\n\t\tConstraint& insCons = *zones[abs(s.zones[i].second)];\n\t\tint c = (*seq_)[off];  assert_range(0, 4, c);\n\t\tint q = (*qual_)[off];\n\t\t// Is it legal for us to advance on characters other than 'c'?\n\t\tif(!(cons.mustMatch() && !overall.mustMatch()) || c == 4) {\n\t\t\t// There may be legal edits\n\t\t\tbool bail = false;\n\t\t\tif(!bloc.valid()) {\n\t\t\t\t// Range delimited by tloc/bloc has size 1\n\t\t\t\tindex_t ntop = ltr ? 
topb : topf;\n\t\t\t\tbwops_++;\n\t\t\t\tint cc = ebwt->mapLF1(ntop, tloc);\n\t\t\t\tassert_range(-1, 3, cc);\n\t\t\t\tif(cc < 0) bail = true;\n\t\t\t\telse { t[cc] = ntop; b[cc] = ntop+1; }\n\t\t\t}\n\t\t\tif(!bail) {\n\t\t\t\tif((cons.canMismatch(q, *sc_) && overall.canMismatch(q, *sc_)) || c == 4) {\n\t\t\t\t\tConstraint oldCons = cons, oldOvCons = overall;\n\t\t\t\t\tSideLocus<index_t> oldTloc = tloc, oldBloc = bloc;\n\t\t\t\t\tif(c != 4) {\n\t\t\t\t\t\tcons.chargeMismatch(q, *sc_);\n\t\t\t\t\t\toverall.chargeMismatch(q, *sc_);\n\t\t\t\t\t}\n\t\t\t\t\t// Can leave the zone as-is\n\t\t\t\t\tif(!leaveZone || (cons.acceptable() && overall.acceptable())) {\n\t\t\t\t\t\tfor(int j = 0; j < 4; j++) {\n\t\t\t\t\t\t\tif(j == c || b[j] == t[j]) continue;\n\t\t\t\t\t\t\t// Potential mismatch\n\t\t\t\t\t\t\tnextLocsBi(tloc, bloc, tf[j], bf[j], tb[j], bb[j], i+1);\n\t\t\t\t\t\t\tint loff = off;\n\t\t\t\t\t\t\tif(!ltr) loff = (int)(s.steps.size() - loff - 1);\n\t\t\t\t\t\t\tassert(prevEdit == NULL || prevEdit->next == NULL);\n\t\t\t\t\t\t\tEdit edit(off, j, c, EDIT_TYPE_MM, false);\n\t\t\t\t\t\t\tDoublyLinkedList<Edit> editl;\n\t\t\t\t\t\t\teditl.payload = edit;\n\t\t\t\t\t\t\tif(prevEdit != NULL) {\n\t\t\t\t\t\t\t\tprevEdit->next = &editl;\n\t\t\t\t\t\t\t\teditl.prev = prevEdit;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tassert(editl.next == NULL);\n\t\t\t\t\t\t\tbwedits_++;\n\t\t\t\t\t\t\tif(!searchSeedBi(\n\t\t\t\t\t\t\t\t\t\t\t i+1,     // depth into steps_[] array\n\t\t\t\t\t\t\t\t\t\t\t depth+1, // recursion depth\n\t\t\t\t\t\t\t\t\t\t\t tf[j],   // top in BWT\n\t\t\t\t\t\t\t\t\t\t\t bf[j],   // bot in BWT\n\t\t\t\t\t\t\t\t\t\t\t tb[j],   // top in BWT'\n\t\t\t\t\t\t\t\t\t\t\t bb[j],   // bot in BWT'\n\t\t\t\t\t\t\t\t\t\t\t tloc,    // locus for top (perhaps unititialized)\n\t\t\t\t\t\t\t\t\t\t\t bloc,    // locus for bot (perhaps unititialized)\n\t\t\t\t\t\t\t\t\t\t\t c0,      // constraints to enforce in seed zone 0\n\t\t\t\t\t\t\t\t\t\t\t c1,      // constraints to enforce in 
seed zone 1\n\t\t\t\t\t\t\t\t\t\t\t c2,      // constraints to enforce in seed zone 2\n\t\t\t\t\t\t\t\t\t\t\t overall, // overall constraints to enforce\n\t\t\t\t\t\t\t\t\t\t\t &editl))  // latest edit\n\t\t\t\t\t\t\t{\n\t\t\t\t\t\t\t\treturn false;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif(prevEdit != NULL) prevEdit->next = NULL;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\t// Not enough edits to make this path\n\t\t\t\t\t\t// non-redundant with other seeds\n\t\t\t\t\t}\n\t\t\t\t\tcons = oldCons;\n\t\t\t\t\toverall = oldOvCons;\n\t\t\t\t\ttloc = oldTloc;\n\t\t\t\t\tbloc = oldBloc;\n\t\t\t\t}\n\t\t\t\tif(cons.canGap() && overall.canGap()) {\n\t\t\t\t\tthrow 1; // TODO\n\t\t\t\t\tint delEx = 0;\n\t\t\t\t\tif(cons.canDelete(delEx, *sc_) && overall.canDelete(delEx, *sc_)) {\n\t\t\t\t\t\t// Try delete\n\t\t\t\t\t}\n\t\t\t\t\tint insEx = 0;\n\t\t\t\t\tif(insCons.canInsert(insEx, *sc_) && overall.canInsert(insEx, *sc_)) {\n\t\t\t\t\t\t// Try insert\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t} // if(!bail)\n\t\t}\n\t\tif(c == 4) {\n\t\t\treturn true; // couldn't handle the N\n\t\t}\n\t\tif(leaveZone && (!cons.acceptable() || !overall.acceptable())) {\n\t\t\t// Not enough edits to make this path non-redundant with\n\t\t\t// other seeds\n\t\t\treturn true;\n\t\t}\n\t\tif(!bloc.valid()) {\n\t\t\tassert(ebwtBw_ == NULL || bp[c] == tp[c]+1);\n\t\t\t// Range delimited by tloc/bloc has size 1\n\t\t\tindex_t top = ltr ? 
topb : topf;\n\t\t\tbwops_++;\n\t\t\tt[c] = ebwt->mapLF1(top, tloc, c);\n\t\t\tif(t[c] == (index_t)OFF_MASK) {\n\t\t\t\treturn true;\n\t\t\t}\n\t\t\tassert_geq(t[c], ebwt->fchr()[c]);\n\t\t\tassert_lt(t[c],  ebwt->fchr()[c+1]);\n\t\t\tb[c] = t[c]+1;\n\t\t\tassert_gt(b[c], 0);\n\t\t}\n\t\tassert(ebwtBw_ == NULL || bf[c]-tf[c] == bb[c]-tb[c]);\n\t\tassert_leq(bf[c]-tf[c], lasttot);\n\t\tASSERT_ONLY(lasttot = bf[c]-tf[c]);\n\t\tif(b[c] == t[c]) {\n\t\t\treturn true;\n\t\t}\n\t\ttopf = tf[c]; botf = bf[c];\n\t\ttopb = tb[c]; botb = bb[c];\n\t\tif(i+1 == (int)s.steps.size()) {\n\t\t\t// Finished aligning seed\n\t\t\tassert(c0.acceptable());\n\t\t\tassert(c1.acceptable());\n\t\t\tassert(c2.acceptable());\n\t\t\tif(!reportHit(topf, botf, topb, botb, seq_->length(), prevEdit)) {\n\t\t\t\treturn false; // Memory exhausted\n\t\t\t}\n\t\t\treturn true;\n\t\t}\n\t\tnextLocsBi(tloc, bloc, tf[c], bf[c], tb[c], bb[c], i+1);\n\t}\n\treturn true;\n}\n\n#endif /*ALIGNER_SEED_H_*/\n"
  },
  {
    "path": "aligner_seed_policy.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <string>\n#include <iostream>\n#include <sstream>\n#include <limits>\n#include \"ds.h\"\n#include \"aligner_seed_policy.h\"\n#include \"mem_ids.h\"\n\nusing namespace std;\n\nstatic int parseFuncType(const std::string& otype) {\n\tstring type = otype;\n\tif(type == \"C\" || type == \"Constant\") {\n\t\treturn SIMPLE_FUNC_CONST;\n\t} else if(type == \"L\" || type == \"Linear\") {\n\t\treturn SIMPLE_FUNC_LINEAR;\n\t} else if(type == \"S\" || type == \"Sqrt\") {\n\t\treturn SIMPLE_FUNC_SQRT;\n\t} else if(type == \"G\" || type == \"Log\") {\n\t\treturn SIMPLE_FUNC_LOG;\n\t}\n\tstd::cerr << \"Error: Bad function type '\" << otype.c_str()\n\t          << \"'.  
Should be C (constant), L (linear), \"\n\t          << \"S (square root) or G (natural log).\" << std::endl;\n\tthrow 1;\n}\n\n#define PARSE_FUNC(fv) { \\\n\tif(ctoks.size() >= 1) { \\\n\t\tfv.setType(parseFuncType(ctoks[0])); \\\n\t} \\\n\tif(ctoks.size() >= 2) { \\\n\t\tdouble co; \\\n\t\tistringstream tmpss(ctoks[1]); \\\n\t\ttmpss >> co; \\\n\t\tfv.setConst(co); \\\n\t} \\\n\tif(ctoks.size() >= 3) { \\\n\t\tdouble ce; \\\n\t\tistringstream tmpss(ctoks[2]); \\\n\t\ttmpss >> ce; \\\n\t\tfv.setCoeff(ce); \\\n\t} \\\n\tif(ctoks.size() >= 4) { \\\n\t\tdouble mn; \\\n\t\tistringstream tmpss(ctoks[3]); \\\n\t\ttmpss >> mn; \\\n\t\tfv.setMin(mn); \\\n\t} \\\n\tif(ctoks.size() >= 5) { \\\n\t\tdouble mx; \\\n\t\tistringstream tmpss(ctoks[4]); \\\n\t\ttmpss >> mx; \\\n\t\tfv.setMax(mx); \\\n\t} \\\n}\n\n/**\n * Parse alignment policy when provided in this format:\n * <lab>=<val>;<lab>=<val>;<lab>=<val>...\n *\n * The label=value possibilities are:\n *\n * Bonus for a match\n * -----------------\n *\n * MA=xx (default: MA=0, or MA=2 if --local is set)\n *\n *    xx = Each position where equal read and reference characters match up\n *         in the alignment contributes this amount to the total score.\n *\n * Penalty for a mismatch\n * ----------------------\n *\n * MMP={Cxx|Q|RQ} (default: MMP=C6)\n *\n *   Cxx = Each mismatch costs xx.  If MMP=Cxx is specified, quality\n *         values are ignored when assessing penalties for mismatches.\n *   Q   = Each mismatch incurs a penalty equal to the mismatched base's\n *         quality value.\n *   RQ  = Each mismatch incurs a penalty equal to the mismatched base's\n *         rounded quality value.  
Qualities are rounded off to the\n *         nearest 10, and qualities greater than 30 are rounded to 30.\n *\n * Penalty for position with N (in either read or reference)\n * ---------------------------------------------------------\n *\n * NP={Cxx|Q|RQ} (default: NP=C1)\n *\n *   Cxx = Each alignment position with an N in either the read or the\n *         reference costs xx.  If NP=Cxx is specified, quality values are\n *         ignored when assessing penalties for Ns.\n *   Q   = Each alignment position with an N in either the read or the\n *         reference incurs a penalty equal to the read base's quality\n *         value.\n *   RQ  = Each alignment position with an N in either the read or the\n *         reference incurs a penalty equal to the read base's rounded\n *         quality value.  Qualities are rounded off to the nearest 10,\n *         and qualities greater than 30 are rounded to 30.\n *\n * Penalty for a read gap\n * ----------------------\n *\n * RDG=xx,yy (default: RDG=5,3)\n *\n *   xx    = Read gap open penalty.\n *   yy    = Read gap extension penalty.\n *\n * Total cost incurred by a read gap = xx + (yy * gap length)\n *\n * Penalty for a reference gap\n * ---------------------------\n *\n * RFG=xx,yy (default: RFG=5,3)\n *\n *   xx    = Reference gap open penalty.\n *   yy    = Reference gap extension penalty.\n *\n * Total cost incurred by a reference gap = xx + (yy * gap length)\n *\n * Minimum score for valid alignment\n * ---------------------------------\n *\n * MIN=xx,yy (defaults: MIN=-0.6,-0.6, or MIN=0.0,0.66 if --local is set)\n *\n *   xx,yy = For a read of length N, the total score must be at least\n *           xx + (read length * yy) for the alignment to be valid.  The\n *           total score is the sum of all negative penalties (from\n *           mismatches and gaps) and all positive bonuses.  
The minimum\n *           can be negative (and is by default in global alignment mode).\n *\n * Score floor for local alignment\n * -------------------------------\n *\n * FL=xx,yy (defaults: FL=-Infinity,0.0, or FL=0.0,0.0 if --local is set)\n *\n *   xx,yy = If a cell in the dynamic programming table has a score less\n *           than xx + (read length * yy), then no valid alignment can go\n *           through it.  Defaults are highly recommended.\n *\n * N ceiling\n * ---------\n *\n * NCEIL=xx,yy (default: NCEIL=0.0,0.15)\n *\n *   xx,yy = For a read of length N, the number of alignment\n *           positions with an N in either the read or the\n *           reference cannot exceed\n *           ceiling = xx + (read length * yy).  If the ceiling is\n *           exceeded, the alignment is considered invalid.\n *\n * Seeds\n * -----\n *\n * SEED=mm,len,ival (default: SEED=0,22)\n *\n *   mm   = Maximum number of mismatches allowed within a seed.\n *          Must be >= 0 and <= 2.  Note that 2-mismatch mode is\n *          not fully sensitive; i.e. some 2-mismatch seed\n *          alignments may be missed.\n *   len  = Length of seed.\n *   ival = Interval between seeds.  If not specified, seed\n *          interval is determined by IVAL.\n *\n * Seed interval\n * -------------\n *\n * IVAL={L|S|C},xx,yy (default: IVAL=S,1.0,0.0)\n *\n *   L  = let interval between seeds be a linear function of the\n *        read length.  xx and yy are the constant and linear\n *        coefficients respectively.  In other words, the interval\n *        equals xx + yy * len, where len is the read length.\n *        Intervals less than 1 are rounded up to 1.\n *   S  = let interval between seeds be a function of the square\n *        root of the read length.  xx and yy are the\n *        coefficients.  
In other words, the interval equals\n *        xx + yy * sqrt(len), where len is the read length.\n *        Intervals less than 1 are rounded up to 1.\n *   C  = Like S but uses cube root of length instead of square\n *        root.\n *\n * Example 1:\n *\n *  SEED=1,10,5 and read sequence is TGCTATCGTACGATCGTACA:\n *\n *  The following seeds are extracted from the forward\n *  representation of the read and aligned to the reference\n *  allowing up to 1 mismatch:\n *\n *  Read:    TGCTATCGTACGATCGTACA\n *\n *  Seed 1+: TGCTATCGTA\n *  Seed 2+:      TCGTACGATC\n *  Seed 3+:           CGATCGTACA\n *\n *  ...and the following are extracted from the reverse-complement\n *  representation of the read and aligned to the reference allowing\n *  up to 1 mismatch:\n *\n *  Seed 1-: TACGATAGCA\n *  Seed 2-:      GATCGTACGA\n *  Seed 3-:           TGTACGATCG\n *\n * Example 2:\n *\n *  SEED=1,20,20 and read sequence is TGCTATCGTACGATC.  The seed\n *  length is 20 but the read is only 15 characters long.  In this\n *  case, Bowtie2 automatically shrinks the seed length to be equal\n *  to the read length.\n *\n *  Read:    TGCTATCGTACGATC\n *\n *  Seed 1+: TGCTATCGTACGATC\n *  Seed 1-: GATCGTACGATAGCA\n *\n * Example 3:\n *\n *  SEED=1,10,10 and read sequence is TGCTATCGTACGATC.  Only one seed\n *  fits on the read; a second seed would overhang the end of the read\n *  by 5 positions.  
In this case, Bowtie2 extracts one seed.\n *\n *  Read:    TGCTATCGTACGATC\n *\n *  Seed 1+: TGCTATCGTA\n *  Seed 1-: TACGATAGCA\n */\nvoid SeedAlignmentPolicy::parseString(\n\tconst       std::string& s,\n\tbool        local,\n\tbool        noisyHpolymer,\n\tbool        ignoreQuals,\n\tint&        bonusMatchType,\n\tint&        bonusMatch,\n\tint&        penMmcType,\n\tint&        penMmcMax,\n\tint&        penMmcMin,\n\tint&        penNType,\n\tint&        penN,\n\tint&        penRdExConst,\n\tint&        penRfExConst,\n\tint&        penRdExLinear,\n\tint&        penRfExLinear,\n\tSimpleFunc& costMin,\n\tSimpleFunc& nCeil,\n\tbool&       nCatPair,\n\tint&        multiseedMms,\n\tint&        multiseedLen,\n\tSimpleFunc& multiseedIval,\n\tsize_t&     failStreak,\n\tsize_t&     seedRounds,\n    SimpleFunc* penCanSplice,\n    SimpleFunc* penNoncanSplice,\n    SimpleFunc* penIntronLen)\n{\n\n\tbonusMatchType    = local ? DEFAULT_MATCH_BONUS_TYPE_LOCAL : DEFAULT_MATCH_BONUS_TYPE;\n\tbonusMatch        = local ? DEFAULT_MATCH_BONUS_LOCAL : DEFAULT_MATCH_BONUS;\n\tpenMmcType        = ignoreQuals ? DEFAULT_MM_PENALTY_TYPE_IGNORE_QUALS :\n\t                                  DEFAULT_MM_PENALTY_TYPE;\n\tpenMmcMax         = DEFAULT_MM_PENALTY_MAX;\n\tpenMmcMin         = DEFAULT_MM_PENALTY_MIN;\n\tpenNType          = DEFAULT_N_PENALTY_TYPE;\n\tpenN              = DEFAULT_N_PENALTY;\n\t\n\tconst double DMAX = std::numeric_limits<double>::max();\n    /*\n\tcostMin.init(\n\t\tlocal ? SIMPLE_FUNC_LOG : SIMPLE_FUNC_LINEAR,\n\t\tlocal ? DEFAULT_MIN_CONST_LOCAL  : DEFAULT_MIN_CONST,\n\t\tlocal ? DEFAULT_MIN_LINEAR_LOCAL : DEFAULT_MIN_LINEAR);\n    */\n    costMin.init(\n                 local ? SIMPLE_FUNC_LOG : SIMPLE_FUNC_CONST,\n                 local ? DEFAULT_MIN_CONST_LOCAL  : -18,\n                 local ? 
DEFAULT_MIN_LINEAR_LOCAL : 0);\n\tnCeil.init(\n\t\tSIMPLE_FUNC_LINEAR, 0.0f, DMAX,\n\t\tDEFAULT_N_CEIL_CONST, DEFAULT_N_CEIL_LINEAR);\n\tmultiseedIval.init(\n\t\tDEFAULT_IVAL, 1.0f, DMAX,\n\t\tDEFAULT_IVAL_B, DEFAULT_IVAL_A);\n\tnCatPair          = DEFAULT_N_CAT_PAIR;\n\n\tif(!noisyHpolymer) {\n\t\tpenRdExConst  = DEFAULT_READ_GAP_CONST;\n\t\tpenRdExLinear = DEFAULT_READ_GAP_LINEAR;\n\t\tpenRfExConst  = DEFAULT_REF_GAP_CONST;\n\t\tpenRfExLinear = DEFAULT_REF_GAP_LINEAR;\n\t} else {\n\t\tpenRdExConst  = DEFAULT_READ_GAP_CONST_BADHPOLY;\n\t\tpenRdExLinear = DEFAULT_READ_GAP_LINEAR_BADHPOLY;\n\t\tpenRfExConst  = DEFAULT_REF_GAP_CONST_BADHPOLY;\n\t\tpenRfExLinear = DEFAULT_REF_GAP_LINEAR_BADHPOLY;\n\t}\n\t\n\tmultiseedMms      = DEFAULT_SEEDMMS;\n\tmultiseedLen      = DEFAULT_SEEDLEN;\n\t\n\tEList<string> toks(MISC_CAT);\n\tstring tok;\n\tistringstream ss(s);\n\tint setting = 0;\n\t// Get each ;-separated token\n\twhile(getline(ss, tok, ';')) {\n\t\tsetting++;\n\t\tEList<string> etoks(MISC_CAT);\n\t\tstring etok;\n\t\t// Divide into tokens on either side of =\n\t\tistringstream ess(tok);\n\t\twhile(getline(ess, etok, '=')) {\n\t\t\tetoks.push_back(etok);\n\t\t}\n\t\t// Must be exactly 1 =\n\t\tif(etoks.size() != 2) {\n\t\t\tcerr << \"Error parsing alignment policy setting \" << setting\n\t\t\t     << \"; must be bisected by = sign\" << endl\n\t\t\t\t << \"Policy: \" << s.c_str() << endl;\n\t\t\tassert(false); throw 1;\n\t\t}\n\t\t// LHS is tag, RHS value\n\t\tstring tag = etoks[0], val = etoks[1];\n\t\t// Separate value into comma-separated tokens\n\t\tEList<string> ctoks(MISC_CAT);\n\t\tstring ctok;\n\t\tistringstream css(val);\n\t\twhile(getline(css, ctok, ',')) {\n\t\t\tctoks.push_back(ctok);\n\t\t}\n\t\tif(ctoks.size() == 0) {\n\t\t\tcerr << \"Error parsing alignment policy setting \" << setting\n\t\t\t     << \"; RHS must have at least 1 token\" << endl\n\t\t\t\t << \"Policy: \" << s.c_str() << endl;\n\t\t\tassert(false); throw 1;\n\t\t}\n\t\tfor(size_t i = 0; i < 
ctoks.size(); i++) {\n\t\t\tif(ctoks[i].length() == 0) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \" << setting\n\t\t\t\t     << \"; token \" << i+1 << \" on RHS had length=0\" << endl\n\t\t\t\t\t << \"Policy: \" << s.c_str() << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t}\n\t\t// Bonus for a match\n\t\t// MA=xx (default: MA=0, or MA=10 if --local is set)\n\t\tif(tag == \"MA\") {\n\t\t\tif(ctoks.size() != 1) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \" << setting\n\t\t\t\t     << \"; RHS must have 1 token\" << endl\n\t\t\t\t\t << \"Policy: \" << s.c_str() << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tstring tmp = ctoks[0];\n\t\t\tistringstream tmpss(tmp);\n\t\t\ttmpss >> bonusMatch;\n\t\t}\n\t\t// Scoring for mismatches\n\t\t// MMP={Cxx|Q|RQ}\n\t\t//        Cxx = constant, where constant is integer xx\n\t\t//        Qxx = equal to quality, scaled\n\t\t//        R   = equal to maq-rounded quality value (rounded to nearest\n\t\t//              10, can't be greater than 30)\n\t\telse if(tag == \"MMP\") {\n\t\t\tif(ctoks.size() > 3) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'\"\n\t\t\t\t     << \"; RHS must have at most 3 tokens\" << endl\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tif(ctoks[0][0] == 'C') {\n\t\t\t\tstring tmp = ctoks[0].substr(1);\n\t\t\t\t// Parse constant penalty\n\t\t\t\tistringstream tmpss(tmp);\n\t\t\t\ttmpss >> penMmcMax;\n\t\t\t\tpenMmcMin = penMmcMax;\n\t\t\t\t// Parse constant penalty\n\t\t\t\tpenMmcType = COST_MODEL_CONSTANT;\n\t\t\t} else if(ctoks[0][0] == 'Q') {\n\t\t\t\tif(ctoks.size() >= 2) {\n\t\t\t\t\tstring tmp = ctoks[1];\n\t\t\t\t\tistringstream tmpss(tmp);\n\t\t\t\t\ttmpss >> penMmcMax;\n\t\t\t\t} else {\n\t\t\t\t\tpenMmcMax = DEFAULT_MM_PENALTY_MAX;\n\t\t\t\t}\n\t\t\t\tif(ctoks.size() >= 3) {\n\t\t\t\t\tstring tmp = 
ctoks[2];\n\t\t\t\t\tistringstream tmpss(tmp);\n\t\t\t\t\ttmpss >> penMmcMin;\n\t\t\t\t} else {\n\t\t\t\t\tpenMmcMin = DEFAULT_MM_PENALTY_MIN;\n\t\t\t\t}\n\t\t\t\tif(penMmcMin > penMmcMax) {\n\t\t\t\t\tcerr << \"Error: Maximum mismatch penalty (\" << penMmcMax\n\t\t\t\t\t     << \") is less than minimum penalty (\" << penMmcMin\n\t\t\t\t\t\t << \")\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\t// Set type to quality\n\t\t\t\tpenMmcType = COST_MODEL_QUAL;\n\t\t\t} else if(ctoks[0][0] == 'R') {\n\t\t\t\t// Set type to Maq-rounded quality\n\t\t\t\tpenMmcType = COST_MODEL_ROUNDED_QUAL;\n\t\t\t} else {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'\"\n\t\t\t\t     << \"; RHS must start with C, Q or R\" << endl\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t}\n\t\t// Scoring for mismatches where read char=N\n\t\t// NP={Cxx|Q|RQ}\n\t\t//        Cxx = constant, where constant is integer xx\n\t\t//        Q   = equal to quality\n\t\t//        R   = equal to maq-rounded quality value (rounded to nearest\n\t\t//              10, can't be greater than 30)\n\t\telse if(tag == \"NP\") {\n\t\t\tif(ctoks.size() != 1) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'\"\n\t\t\t\t     << \"; RHS must have 1 token\" << endl\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tif(ctoks[0][0] == 'C') {\n\t\t\t\tstring tmp = ctoks[0].substr(1);\n\t\t\t\t// Parse constant penalty\n\t\t\t\tistringstream tmpss(tmp);\n\t\t\t\ttmpss >> penN;\n\t\t\t\t// Set type to constant\n\t\t\t\tpenNType = COST_MODEL_CONSTANT;\n\t\t\t} else if(ctoks[0][0] == 'Q') {\n\t\t\t\t// Set type to quality\n\t\t\t\tpenNType = COST_MODEL_QUAL;\n\t\t\t} else if(ctoks[0][0] == 'R') {\n\t\t\t\t// Set type to Maq-rounded quality\n\t\t\t\tpenNType = COST_MODEL_ROUNDED_QUAL;\n\t\t\t} else {\n\t\t\t\tcerr << 
\"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'\"\n\t\t\t\t     << \"; RHS must start with C, Q or R\" << endl\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t}\n\t\t// Scoring for read gaps\n\t\t// RDG=xx,yy,zz\n\t\t//        xx = read gap open penalty\n\t\t//        yy = read gap extension penalty constant coefficient\n\t\t//             (defaults to open penalty)\n\t\t//        zz = read gap extension penalty linear coefficient\n\t\t//             (defaults to 0)\n\t\telse if(tag == \"RDG\") {\n\t\t\tif(ctoks.size() >= 1) {\n\t\t\t\tistringstream tmpss(ctoks[0]);\n\t\t\t\ttmpss >> penRdExConst;\n\t\t\t} else {\n\t\t\t\tpenRdExConst = noisyHpolymer ?\n\t\t\t\t\tDEFAULT_READ_GAP_CONST_BADHPOLY :\n\t\t\t\t\tDEFAULT_READ_GAP_CONST;\n\t\t\t}\n\t\t\tif(ctoks.size() >= 2) {\n\t\t\t\tistringstream tmpss(ctoks[1]);\n\t\t\t\ttmpss >> penRdExLinear;\n\t\t\t} else {\n\t\t\t\tpenRdExLinear = noisyHpolymer ?\n\t\t\t\t\tDEFAULT_READ_GAP_LINEAR_BADHPOLY :\n\t\t\t\t\tDEFAULT_READ_GAP_LINEAR;\n\t\t\t}\n\t\t}\n\t\t// Scoring for reference gaps\n\t\t// RFG=xx,yy,zz\n\t\t//        xx = ref gap open penalty\n\t\t//        yy = ref gap extension penalty constant coefficient\n\t\t//             (defaults to open penalty)\n\t\t//        zz = ref gap extension penalty linear coefficient\n\t\t//             (defaults to 0)\n\t\telse if(tag == \"RFG\") {\n\t\t\tif(ctoks.size() >= 1) {\n\t\t\t\tistringstream tmpss(ctoks[0]);\n\t\t\t\ttmpss >> penRfExConst;\n\t\t\t} else {\n\t\t\t\tpenRfExConst = noisyHpolymer ?\n\t\t\t\t\tDEFAULT_REF_GAP_CONST_BADHPOLY :\n\t\t\t\t\tDEFAULT_REF_GAP_CONST;\n\t\t\t}\n\t\t\tif(ctoks.size() >= 2) {\n\t\t\t\tistringstream tmpss(ctoks[1]);\n\t\t\t\ttmpss >> penRfExLinear;\n\t\t\t} else {\n\t\t\t\tpenRfExLinear = noisyHpolymer ?\n\t\t\t\t\tDEFAULT_REF_GAP_LINEAR_BADHPOLY :\n\t\t\t\t\tDEFAULT_REF_GAP_LINEAR;\n\t\t\t}\n\t\t}\n\t\t// Minimum score as a function of read 
length\n\t\t// MIN=xx,yy\n\t\t//        xx = constant coefficient\n\t\t//        yy = linear coefficient\n\t\telse if(tag == \"MIN\") {\n\t\t\tPARSE_FUNC(costMin);\n\t\t}\n\t\t// Per-read N ceiling as a function of read length\n\t\t// NCEIL=xx,yy\n\t\t//        xx = N ceiling constant coefficient\n\t\t//        yy = N ceiling linear coefficient (set to 0 if unspecified)\n\t\telse if(tag == \"NCEIL\") {\n\t\t\tPARSE_FUNC(nCeil);\n\t\t}\n\t\t/*\n\t\t * Seeds\n\t\t * -----\n\t\t *\n\t\t * SEED=mm,len,ival (default: SEED=0,22)\n\t\t *\n\t\t *   mm   = Maximum number of mismatches allowed within a seed.\n\t\t *          Must be >= 0 and <= 2.  Note that 2-mismatch mode is\n\t\t *          not fully sensitive; i.e. some 2-mismatch seed\n\t\t *          alignments may be missed.\n\t\t *   len  = Length of seed.\n\t\t *   ival = Interval between seeds.  If not specified, seed\n\t\t *          interval is determined by IVAL.\n\t\t */\n\t\telse if(tag == \"SEED\") {\n\t\t\tif(ctoks.size() > 2) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'; RHS must have 1 or 2 tokens, \"\n\t\t\t\t\t << \"had \" << ctoks.size() << \".  
\"\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tif(ctoks.size() >= 1) {\n\t\t\t\tistringstream tmpss(ctoks[0]);\n\t\t\t\ttmpss >> multiseedMms;\n\t\t\t\tif(multiseedMms > 1) {\n\t\t\t\t\tcerr << \"Error: -N was set to \" << multiseedMms << \", but cannot be set greater than 1\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tif(multiseedMms < 0) {\n\t\t\t\t\tcerr << \"Error: -N was set to a number less than 0 (\" << multiseedMms << \")\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(ctoks.size() >= 2) {\n\t\t\t\tistringstream tmpss(ctoks[1]);\n\t\t\t\ttmpss >> multiseedLen;\n\t\t\t} else {\n\t\t\t\tmultiseedLen = DEFAULT_SEEDLEN;\n\t\t\t}\n\t\t}\n\t\telse if(tag == \"SEEDLEN\") {\n\t\t\tif(ctoks.size() > 1) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'; RHS must have 1 token, \"\n\t\t\t\t\t << \"had \" << ctoks.size() << \".  \"\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tif(ctoks.size() >= 1) {\n\t\t\t\tistringstream tmpss(ctoks[0]);\n\t\t\t\ttmpss >> multiseedLen;\n\t\t\t}\n\t\t}\n\t\telse if(tag == \"DPS\") {\n\t\t\tif(ctoks.size() > 1) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'; RHS must have 1 token, \"\n\t\t\t\t\t << \"had \" << ctoks.size() << \".  \"\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tif(ctoks.size() >= 1) {\n\t\t\t\tistringstream tmpss(ctoks[0]);\n\t\t\t\ttmpss >> failStreak;\n\t\t\t}\n\t\t}\n\t\telse if(tag == \"ROUNDS\") {\n\t\t\tif(ctoks.size() > 1) {\n\t\t\t\tcerr << \"Error parsing alignment policy setting \"\n\t\t\t\t     << \"'\" << tag.c_str() << \"'; RHS must have 1 token, \"\n\t\t\t\t\t << \"had \" << ctoks.size() << \".  
\"\n\t\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\t\tassert(false); throw 1;\n\t\t\t}\n\t\t\tif(ctoks.size() >= 1) {\n\t\t\t\tistringstream tmpss(ctoks[0]);\n\t\t\t\ttmpss >> seedRounds;\n\t\t\t}\n\t\t}\n\t\t/*\n\t\t * Seed interval\n\t\t * -------------\n\t\t *\n\t\t * IVAL={L|S|C},a,b (default: IVAL=S,1.0,0.0)\n\t\t *\n\t\t *   L  = let interval between seeds be a linear function of the\n\t\t *        read length.  xx and yy are the constant and linear\n\t\t *        coefficients respectively.  In other words, the interval\n\t\t *        equals a * len + b, where len is the read length.\n\t\t *        Intervals less than 1 are rounded up to 1.\n\t\t *   S  = let interval between seeds be a function of the sqaure\n\t\t *        root of the  read length.  xx and yy are the\n\t\t *        coefficients.  In other words, the interval equals\n\t\t *        a * sqrt(len) + b, where len is the read length.\n\t\t *        Intervals less than 1 are rounded up to 1.\n\t\t *   C  = Like S but uses cube root of length instead of square\n\t\t *        root.\n\t\t */\n\t\telse if(tag == \"IVAL\") {\n\t\t\tPARSE_FUNC(multiseedIval);\n\t\t}\n        else if(tag == \"INTRONLEN\") {\n            assert(penIntronLen != NULL);\n\t\t\tPARSE_FUNC((*penIntronLen));\n\t\t}\n\t\telse {\n\t\t\t// Unknown tag\n\t\t\tcerr << \"Unexpected alignment policy setting \"\n\t\t\t\t << \"'\" << tag.c_str() << \"'\" << endl\n\t\t\t\t << \"Policy: '\" << s.c_str() << \"'\" << endl;\n\t\t\tassert(false); throw 1;\n\t\t}\n\t}\n}\n\n#ifdef ALIGNER_SEED_POLICY_MAIN\nint main() {\n\n\tint bonusMatchType;\n\tint bonusMatch;\n\tint penMmcType;\n\tint penMmc;\n\tint penNType;\n\tint penN;\n\tint penRdExConst;\n\tint penRfExConst;\n\tint penRdExLinear;\n\tint penRfExLinear;\n\tSimpleFunc costMin;\n\tSimpleFunc costFloor;\n\tSimpleFunc nCeil;\n\tbool nCatPair;\n\tint multiseedMms;\n\tint multiseedLen;\n\tSimpleFunc msIval;\n\tSimpleFunc posfrac;\n\tSimpleFunc rowmult;\n\tuint32_t 
mhits;\n\n\t{\n\t\tcout << \"Case 1: Defaults 1 ... \";\n\t\tconst char *pol = \"\";\n\t\tSeedAlignmentPolicy::parseString(\n\t\t\tstring(pol),\n\t\t\tfalse,              // --local?\n\t\t\tfalse,              // noisy homopolymers a la 454?\n\t\t\tfalse,              // ignore qualities?\n\t\t\tbonusMatchType,\n\t\t\tbonusMatch,\n\t\t\tpenMmcType,\n\t\t\tpenMmc,\n\t\t\tpenNType,\n\t\t\tpenN,\n\t\t\tpenRdExConst,\n\t\t\tpenRfExConst,\n\t\t\tpenRdExLinear,\n\t\t\tpenRfExLinear,\n\t\t\tcostMin,\n\t\t\tcostFloor,\n\t\t\tnCeil,\n\t\t\tnCatPair,\n\t\t\tmultiseedMms,\n\t\t\tmultiseedLen,\n\t\t\tmsIval,\n\t\t\tmhits);\n\t\t\n\t\tassert_eq(DEFAULT_MATCH_BONUS_TYPE,   bonusMatchType);\n\t\tassert_eq(DEFAULT_MATCH_BONUS,        bonusMatch);\n\t\tassert_eq(DEFAULT_MM_PENALTY_TYPE,    penMmcType);\n\t\tassert_eq(DEFAULT_MM_PENALTY_MAX,     penMmcMax);\n\t\tassert_eq(DEFAULT_MM_PENALTY_MIN,     penMmcMin);\n\t\tassert_eq(DEFAULT_N_PENALTY_TYPE,     penNType);\n\t\tassert_eq(DEFAULT_N_PENALTY,          penN);\n\t\tassert_eq(DEFAULT_MIN_CONST,          costMin.getConst());\n\t\tassert_eq(DEFAULT_MIN_LINEAR,         costMin.getCoeff());\n\t\tassert_eq(DEFAULT_FLOOR_CONST,        costFloor.getConst());\n\t\tassert_eq(DEFAULT_FLOOR_LINEAR,       costFloor.getCoeff());\n\t\tassert_eq(DEFAULT_N_CEIL_CONST,       nCeil.getConst());\n\t\tassert_eq(DEFAULT_N_CAT_PAIR,         nCatPair);\n\n\t\tassert_eq(DEFAULT_READ_GAP_CONST,     penRdExConst);\n\t\tassert_eq(DEFAULT_READ_GAP_LINEAR,    penRdExLinear);\n\t\tassert_eq(DEFAULT_REF_GAP_CONST,      penRfExConst);\n\t\tassert_eq(DEFAULT_REF_GAP_LINEAR,     penRfExLinear);\n\t\tassert_eq(DEFAULT_SEEDMMS,            multiseedMms);\n\t\tassert_eq(DEFAULT_SEEDLEN,            multiseedLen);\n\t\tassert_eq(DEFAULT_IVAL,               msIval.getType());\n\t\tassert_eq(DEFAULT_IVAL_A,             msIval.getCoeff());\n\t\tassert_eq(DEFAULT_IVAL_B,             msIval.getConst());\n\t\t\n\t\tcout << \"PASSED\" << endl;\n\t}\n\n\t{\n\t\tcout << \"Case 
2: Defaults 2 ... \";\n\t\tconst char *pol = \"\";\n\t\tSeedAlignmentPolicy::parseString(\n\t\t\tstring(pol),\n\t\t\tfalse,              // --local?\n\t\t\ttrue,               // noisy homopolymers a la 454?\n\t\t\tfalse,              // ignore qualities?\n\t\t\tbonusMatchType,\n\t\t\tbonusMatch,\n\t\t\tpenMmcType,\n\t\t\tpenMmc,\n\t\t\tpenNType,\n\t\t\tpenN,\n\t\t\tpenRdExConst,\n\t\t\tpenRfExConst,\n\t\t\tpenRdExLinear,\n\t\t\tpenRfExLinear,\n\t\t\tcostMin,\n\t\t\tcostFloor,\n\t\t\tnCeil,\n\t\t\tnCatPair,\n\t\t\tmultiseedMms,\n\t\t\tmultiseedLen,\n\t\t\tmsIval,\n\t\t\tmhits);\n\t\t\n\t\tassert_eq(DEFAULT_MATCH_BONUS_TYPE,   bonusMatchType);\n\t\tassert_eq(DEFAULT_MATCH_BONUS,        bonusMatch);\n\t\tassert_eq(DEFAULT_MM_PENALTY_TYPE,    penMmcType);\n\t\tassert_eq(DEFAULT_MM_PENALTY_MAX,     penMmc);\n\t\tassert_eq(DEFAULT_MM_PENALTY_MIN,     penMmc);\n\t\tassert_eq(DEFAULT_N_PENALTY_TYPE,     penNType);\n\t\tassert_eq(DEFAULT_N_PENALTY,          penN);\n\t\tassert_eq(DEFAULT_MIN_CONST,          costMin.getConst());\n\t\tassert_eq(DEFAULT_MIN_LINEAR,         costMin.getCoeff());\n\t\tassert_eq(DEFAULT_FLOOR_CONST,        costFloor.getConst());\n\t\tassert_eq(DEFAULT_FLOOR_LINEAR,       costFloor.getCoeff());\n\t\tassert_eq(DEFAULT_N_CEIL_CONST,       nCeil.getConst());\n\t\tassert_eq(DEFAULT_N_CAT_PAIR,         nCatPair);\n\n\t\tassert_eq(DEFAULT_READ_GAP_CONST_BADHPOLY,  penRdExConst);\n\t\tassert_eq(DEFAULT_READ_GAP_LINEAR_BADHPOLY, penRdExLinear);\n\t\tassert_eq(DEFAULT_REF_GAP_CONST_BADHPOLY,   penRfExConst);\n\t\tassert_eq(DEFAULT_REF_GAP_LINEAR_BADHPOLY,  penRfExLinear);\n\t\tassert_eq(DEFAULT_SEEDMMS,            multiseedMms);\n\t\tassert_eq(DEFAULT_SEEDLEN,            multiseedLen);\n\t\tassert_eq(DEFAULT_IVAL,               msIval.getType());\n\t\tassert_eq(DEFAULT_IVAL_A,             msIval.getCoeff());\n\t\tassert_eq(DEFAULT_IVAL_B,             msIval.getConst());\n\t\t\n\t\tcout << \"PASSED\" << endl;\n\t}\n\n\t{\n\t\tcout << \"Case 3: Defaults 3 ... 
\";\n\t\tconst char *pol = \"\";\n\t\tSeedAlignmentPolicy::parseString(\n\t\t\tstring(pol),\n\t\t\ttrue,               // --local?\n\t\t\tfalse,              // noisy homopolymers a la 454?\n\t\t\tfalse,              // ignore qualities?\n\t\t\tbonusMatchType,\n\t\t\tbonusMatch,\n\t\t\tpenMmcType,\n\t\t\tpenMmc,\n\t\t\tpenNType,\n\t\t\tpenN,\n\t\t\tpenRdExConst,\n\t\t\tpenRfExConst,\n\t\t\tpenRdExLinear,\n\t\t\tpenRfExLinear,\n\t\t\tcostMin,\n\t\t\tcostFloor,\n\t\t\tnCeil,\n\t\t\tnCatPair,\n\t\t\tmultiseedMms,\n\t\t\tmultiseedLen,\n\t\t\tmsIval,\n\t\t\tmhits);\n\t\t\n\t\tassert_eq(DEFAULT_MATCH_BONUS_TYPE_LOCAL,   bonusMatchType);\n\t\tassert_eq(DEFAULT_MATCH_BONUS_LOCAL,        bonusMatch);\n\t\tassert_eq(DEFAULT_MM_PENALTY_TYPE,    penMmcType);\n\t\tassert_eq(DEFAULT_MM_PENALTY_MAX,     penMmcMax);\n\t\tassert_eq(DEFAULT_MM_PENALTY_MIN,     penMmcMin);\n\t\tassert_eq(DEFAULT_N_PENALTY_TYPE,     penNType);\n\t\tassert_eq(DEFAULT_N_PENALTY,          penN);\n\t\tassert_eq(DEFAULT_MIN_CONST_LOCAL,    costMin.getConst());\n\t\tassert_eq(DEFAULT_MIN_LINEAR_LOCAL,   costMin.getCoeff());\n\t\tassert_eq(DEFAULT_FLOOR_CONST_LOCAL,  costFloor.getConst());\n\t\tassert_eq(DEFAULT_FLOOR_LINEAR_LOCAL, costFloor.getCoeff());\n\t\tassert_eq(DEFAULT_N_CEIL_CONST,       nCeil.getConst());\n\t\tassert_eq(DEFAULT_N_CEIL_LINEAR,      nCeil.getCoeff());\n\t\tassert_eq(DEFAULT_N_CAT_PAIR,         nCatPair);\n\n\t\tassert_eq(DEFAULT_READ_GAP_CONST,     penRdExConst);\n\t\tassert_eq(DEFAULT_READ_GAP_LINEAR,    penRdExLinear);\n\t\tassert_eq(DEFAULT_REF_GAP_CONST,      penRfExConst);\n\t\tassert_eq(DEFAULT_REF_GAP_LINEAR,     penRfExLinear);\n\t\tassert_eq(DEFAULT_SEEDMMS,            multiseedMms);\n\t\tassert_eq(DEFAULT_SEEDLEN,            multiseedLen);\n\t\tassert_eq(DEFAULT_IVAL,               msIval.getType());\n\t\tassert_eq(DEFAULT_IVAL_A,             msIval.getCoeff());\n\t\tassert_eq(DEFAULT_IVAL_B,             msIval.getConst());\n\n\t\tcout << \"PASSED\" << 
endl;\n\t}\n\n\t{\n\t\tcout << \"Case 4: Simple string 1 ... \";\n\t\tconst char *pol = \"MMP=C44;MA=4;RFG=24,12;FL=C,8;RDG=2;NP=C4;MIN=C,7\";\n\t\tSeedAlignmentPolicy::parseString(\n\t\t\tstring(pol),\n\t\t\ttrue,               // --local?\n\t\t\tfalse,              // noisy homopolymers a la 454?\n\t\t\tfalse,              // ignore qualities?\n\t\t\tbonusMatchType,\n\t\t\tbonusMatch,\n\t\t\tpenMmcType,\n\t\t\tpenMmc,\n\t\t\tpenNType,\n\t\t\tpenN,\n\t\t\tpenRdExConst,\n\t\t\tpenRfExConst,\n\t\t\tpenRdExLinear,\n\t\t\tpenRfExLinear,\n\t\t\tcostMin,\n\t\t\tcostFloor,\n\t\t\tnCeil,\n\t\t\tnCatPair,\n\t\t\tmultiseedMms,\n\t\t\tmultiseedLen,\n\t\t\tmsIval,\n\t\t\tmhits);\n\t\t\n\t\tassert_eq(COST_MODEL_CONSTANT,        bonusMatchType);\n\t\tassert_eq(4,                          bonusMatch);\n\t\tassert_eq(COST_MODEL_CONSTANT,        penMmcType);\n\t\tassert_eq(44,                         penMmc);\n\t\tassert_eq(COST_MODEL_CONSTANT,        penNType);\n\t\tassert_eq(4.0f,                       penN);\n\t\tassert_eq(7,                          costMin.getConst());\n\t\tassert_eq(DEFAULT_MIN_LINEAR_LOCAL,   costMin.getCoeff());\n\t\tassert_eq(8,                          costFloor.getConst());\n\t\tassert_eq(DEFAULT_FLOOR_LINEAR_LOCAL, costFloor.getCoeff());\n\t\tassert_eq(DEFAULT_N_CEIL_CONST,       nCeil.getConst());\n\t\tassert_eq(DEFAULT_N_CEIL_LINEAR,      nCeil.getCoeff());\n\t\tassert_eq(DEFAULT_N_CAT_PAIR,         nCatPair);\n\n\t\tassert_eq(2.0f,                       penRdExConst);\n\t\tassert_eq(DEFAULT_READ_GAP_LINEAR,    penRdExLinear);\n\t\tassert_eq(24.0f,                      penRfExConst);\n\t\tassert_eq(12.0f,                      penRfExLinear);\n\t\tassert_eq(DEFAULT_SEEDMMS,            multiseedMms);\n\t\tassert_eq(DEFAULT_SEEDLEN,            multiseedLen);\n\t\tassert_eq(DEFAULT_IVAL,               msIval.getType());\n\t\tassert_eq(DEFAULT_IVAL_A,             msIval.getCoeff());\n\t\tassert_eq(DEFAULT_IVAL_B,             
msIval.getConst());\n\n\t\tcout << \"PASSED\" << endl;\n\t}\n}\n#endif /*def ALIGNER_SEED_POLICY_MAIN*/\n"
  },
  {
    "path": "aligner_seed_policy.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_SEED_POLICY_H_\n#define ALIGNER_SEED_POLICY_H_\n\n#include \"scoring.h\"\n#include \"simple_func.h\"\n\n#define DEFAULT_SEEDMMS 0\n#define DEFAULT_SEEDLEN 22\n\n#define DEFAULT_IVAL SIMPLE_FUNC_SQRT\n#define DEFAULT_IVAL_A 1.15f\n#define DEFAULT_IVAL_B 0.0f\n\n#define DEFAULT_UNGAPPED_HITS 6\n\n/**\n * Encapsulates the set of all parameters that affect what the\n * SeedAligner does with reads.\n */\nclass SeedAlignmentPolicy {\n\npublic:\n\n\t/**\n\t * Parse alignment policy when provided in this format:\n\t * <lab>=<val>;<lab>=<val>;<lab>=<val>...\n\t *\n\t * And label=value possibilities are:\n\t *\n\t * Bonus for a match\n\t * -----------------\n\t *\n\t * MA=xx (default: MA=0, or MA=2 if --local is set)\n\t *\n\t *    xx = Each position where equal read and reference characters match up\n\t *         in the alignment contriubtes this amount to the total score.\n\t *\n\t * Penalty for a mismatch\n\t * ----------------------\n\t *\n\t * MMP={Cxx|Q|RQ} (default: MMP=C6)\n\t *\n\t *   Cxx = Each mismatch costs xx.  
If MMP=Cxx is specified, quality\n\t *         values are ignored when assessing penalties for mismatches.\n\t *   Q   = Each mismatch incurs a penalty equal to the mismatched base's\n\t *         quality value.\n\t *   R   = Each mismatch incurs a penalty equal to the mismatched base's\n\t *         rounded quality value.  Qualities are rounded off to the\n\t *         nearest 10, and qualities greater than 30 are rounded to 30.\n\t *\n\t * Penalty for position with N (in either read or reference)\n\t * ---------------------------------------------------------\n\t *\n\t * NP={Cxx|Q|RQ} (default: NP=C1)\n\t *\n\t *   Cxx = Each alignment position with an N in either the read or the\n\t *         reference costs xx.  If NP=Cxx is specified, quality values are\n\t *         ignored when assessing penalties for Ns.\n\t *   Q   = Each alignment position with an N in either the read or the\n\t *         reference incurs a penalty equal to the read base's quality\n\t *         value.\n\t *   R   = Each alignment position with an N in either the read or the\n\t *         reference incurs a penalty equal to the read base's rounded\n\t *         quality value.  
Qualities are rounded off to the nearest 10,\n\t *         and qualities greater than 30 are rounded to 30.\n\t *\n\t * Penalty for a read gap\n\t * ----------------------\n\t *\n\t * RDG=xx,yy (default: RDG=5,3)\n\t *\n\t *   xx    = Read gap open penalty.\n\t *   yy    = Read gap extension penalty.\n\t *\n\t * Total cost incurred by a read gap = xx + (yy * gap length)\n\t *\n\t * Penalty for a reference gap\n\t * ---------------------------\n\t *\n\t * RFG=xx,yy (default: RFG=5,3)\n\t *\n\t *   xx    = Reference gap open penalty.\n\t *   yy    = Reference gap extension penalty.\n\t *\n\t * Total cost incurred by a reference gap = xx + (yy * gap length)\n\t *\n\t * Minimum score for valid alignment\n\t * ---------------------------------\n\t *\n\t * MIN=xx,yy (defaults: MIN=-0.6,-0.6, or MIN=0.0,0.66 if --local is set)\n\t *\n\t *   xx,yy = For a read of length N, the total score must be at least\n\t *           xx + (read length * yy) for the alignment to be valid.  The\n\t *           total score is the sum of all negative penalties (from\n\t *           mismatches and gaps) and all positive bonuses.  The minimum\n\t *           can be negative (and is by default in global alignment mode).\n\t *\n\t * N ceiling\n\t * ---------\n\t *\n\t * NCEIL=xx,yy (default: NCEIL=0.0,0.15)\n\t *\n\t *   xx,yy = For a read of length N, the number of alignment\n\t *           positions with an N in either the read or the\n\t *           reference cannot exceed\n\t *           ceiling = xx + (read length * yy).  If the ceiling is\n\t *           exceeded, the alignment is considered invalid.\n\t *\n\t * Seeds\n\t * -----\n\t *\n\t * SEED=mm,len,ival (default: SEED=0,22)\n\t *\n\t *   mm   = Maximum number of mismatches allowed within a seed.\n\t *          Must be >= 0 and <= 2.  Note that 2-mismatch mode is\n\t *          not fully sensitive; i.e. some 2-mismatch seed\n\t *          alignments may be missed.\n\t *   len  = Length of seed.\n\t *   ival = Interval between seeds.  
If not specified, seed\n\t *          interval is determined by IVAL.\n\t *\n\t * Seed interval\n\t * -------------\n\t *\n\t * IVAL={L|S|C},xx,yy (default: IVAL=S,1.0,0.0)\n\t *\n\t *   L  = let interval between seeds be a linear function of the\n\t *        read length.  xx and yy are the constant and linear\n\t *        coefficients respectively.  In other words, the interval\n\t *        equals xx + yy * len, where len is the read length.\n\t *        Intervals less than 1 are rounded up to 1.\n\t *   S  = let interval between seeds be a function of the square\n\t *        root of the read length.  xx and yy are the\n\t *        coefficients.  In other words, the interval equals\n\t *        xx + yy * sqrt(len), where len is the read length.\n\t *        Intervals less than 1 are rounded up to 1.\n\t *   C  = Like S but uses cube root of length instead of square\n\t *        root.\n\t *\n\t * Example 1:\n\t *\n\t *  SEED=1,10,5 and read sequence is TGCTATCGTACGATCGTACA:\n\t *\n\t *  The following seeds are extracted from the forward\n\t *  representation of the read and aligned to the reference\n\t *  allowing up to 1 mismatch:\n\t *\n\t *  Read:    TGCTATCGTACGATCGTACA\n\t *\n\t *  Seed 1+: TGCTATCGTA\n\t *  Seed 2+:      TCGTACGATC\n\t *  Seed 3+:           CGATCGTACA\n\t *\n\t *  ...and the following are extracted from the reverse-complement\n\t *  representation of the read and aligned to the reference allowing\n\t *  up to 1 mismatch:\n\t *\n\t *  Seed 1-: TACGATAGCA\n\t *  Seed 2-:      GATCGTACGA\n\t *  Seed 3-:           TGTACGATCG\n\t *\n\t * Example 2:\n\t *\n\t *  SEED=1,20,20 and read sequence is TGCTATCGTACGATC.  The seed\n\t *  length is 20 but the read is only 15 characters long.  
In this\n\t *  case, Bowtie2 automatically shrinks the seed length to be equal\n\t *  to the read length.\n\t *\n\t *  Read:    TGCTATCGTACGATC\n\t *\n\t *  Seed 1+: TGCTATCGTACGATC\n\t *  Seed 1-: GATCGTACGATAGCA\n\t *\n\t * Example 3:\n\t *\n\t *  SEED=1,10,10 and read sequence is TGCTATCGTACGATC.  Only one seed\n\t *  fits on the read; a second seed would overhang the end of the read\n\t *  by 5 positions.  In this case, Bowtie2 extracts one seed.\n\t *\n\t *  Read:    TGCTATCGTACGATC\n\t *\n\t *  Seed 1+: TGCTATCGTA\n\t *  Seed 1-: TACGATAGCA\n\t */\n\tstatic void parseString(\n\t\tconst       std::string& s,\n\t\tbool        local,\n\t\tbool        noisyHpolymer,\n\t\tbool        ignoreQuals,\n\t\tint&        bonusMatchType,\n\t\tint&        bonusMatch,\n\t\tint&        penMmcType,\n\t\tint&        penMmcMax,\n\t\tint&        penMmcMin,\n\t\tint&        penNType,\n\t\tint&        penN,\n\t\tint&        penRdExConst,\n\t\tint&        penRfExConst,\n\t\tint&        penRdExLinear,\n\t\tint&        penRfExLinear,\n\t\tSimpleFunc& costMin,\n\t\tSimpleFunc& nCeil,\n\t\tbool&       nCatPair,\n\t\tint&        multiseedMms,\n\t\tint&        multiseedLen,\n\t\tSimpleFunc& multiseedIval,\n\t\tsize_t&     failStreak,\n\t\tsize_t&     seedRounds,\n        SimpleFunc* penCanSplice = NULL,\n        SimpleFunc* penNoncanSplice = NULL,\n        SimpleFunc* penIntronLen = NULL);\n};\n\n#endif /*ndef ALIGNER_SEED_POLICY_H_*/\n"
  },
  {
    "path": "aligner_sw.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <limits>\n// -- BTL remove --\n//#include <stdlib.h>\n//#include <sys/time.h>\n// -- --\n#include \"aligner_sw.h\"\n#include \"aligner_result.h\"\n#include \"search_globals.h\"\n#include \"scoring.h\"\n#include \"mask.h\"\n\n/**\n * Initialize with a new read.\n */\nvoid SwAligner::initRead(\n\tconst BTDnaString& rdfw, // forward read sequence\n\tconst BTDnaString& rdrc, // revcomp read sequence\n\tconst BTString& qufw,    // forward read qualities\n\tconst BTString& qurc,    // reverse read qualities\n\tsize_t rdi,              // offset of first read char to align\n\tsize_t rdf,              // offset of last read char to align\n\tconst Scoring& sc)       // scoring scheme\n{\n\tassert_gt(rdf, rdi);\n\tint nceil = sc.nCeil.f<int>((double)rdfw.length());\n\trdfw_    = &rdfw;      // read sequence\n\trdrc_    = &rdrc;      // read sequence\n\tqufw_    = &qufw;      // read qualities\n\tqurc_    = &qurc;      // read qualities\n\trdi_     = rdi;        // offset of first read char to align\n\trdf_     = rdf;        // offset of last read char to align\n\tsc_      = &sc;        // scoring scheme\n\tnceil_   = nceil;      // max # Ns allowed in ref portion of aln\n\treadSse16_ = false;    // true -> 
sse16 from now on for this read\n\tinitedRead_ = true;\n#ifndef NO_SSE\n\tsseU8fwBuilt_  = false;  // built fw query profile, 8-bit score\n\tsseU8rcBuilt_  = false;  // built rc query profile, 8-bit score\n\tsseI16fwBuilt_ = false;  // built fw query profile, 16-bit score\n\tsseI16rcBuilt_ = false;  // built rc query profile, 16-bit score\n#endif\n}\n\n/**\n * Initialize with a new alignment problem.\n */\nvoid SwAligner::initRef(\n\tbool fw,               // whether the forward or revcomp read is being aligned\n\tTRefId refidx,         // id of reference aligned against\n\tconst DPRect& rect,    // DP rectangle\n\tchar *rf,              // reference sequence\n\tsize_t rfi,            // offset of first reference char to align to\n\tsize_t rff,            // offset of last reference char to align to\n\tTRefOff reflen,        // length of reference sequence\n\tconst Scoring& sc,     // scoring scheme\n\tTAlScore minsc,        // minimum score\n\tbool enable8,          // use 8-bit SSE if possible?\n\tsize_t cminlen,        // minimum length for using checkpointing scheme\n\tsize_t cpow2,          // interval b/t checkpointed diags; 1 << this\n\tbool doTri,            // triangular mini-fills?\n\tbool extend)           // is this a seed extension?\n{\n\tsize_t readGaps = sc.maxReadGaps(minsc, rdfw_->length());\n\tsize_t refGaps  = sc.maxRefGaps(minsc, rdfw_->length());\n\tassert_geq(readGaps, 0);\n\tassert_geq(refGaps, 0);\n\tassert_gt(rff, rfi);\n\trdgap_       = readGaps;  // max # gaps in read\n\trfgap_       = refGaps;   // max # gaps in reference\n\tstate_       = STATE_INITED;\n\tfw_          = fw;       // orientation\n\trd_          = fw ? rdfw_ : rdrc_; // read sequence\n\tqu_          = fw ? 
qufw_ : qurc_; // quality sequence\n\trefidx_      = refidx;   // id of reference aligned against\n\trf_          = rf;       // reference sequence\n\trfi_         = rfi;      // offset of first reference char to align to\n\trff_         = rff;      // offset of last reference char to align to\n\treflen_      = reflen;   // length of entire reference sequence\n\trect_        = &rect;    // DP rectangle\n\tminsc_       = minsc;    // minimum score\n\tcural_       = 0;        // idx of next alignment to give out\n\tinitedRef_   = true;     // indicate we've initialized the ref portion\n\tenable8_     = enable8;  // use 8-bit SSE if possible?\n\textend_      = extend;   // true iff this is a seed extension\n\tcperMinlen_  = cminlen;  // reads shorter than this won't use checkpointer\n\tcperPerPow2_ = cpow2;    // interval b/t checkpointed diags; 1 << this\n\tcperEf_      = true;     // whether to checkpoint H, E, and F\n\tcperTri_     = doTri;    // triangular mini-fills?\n\tbter_.initRef(\n\t\tfw_ ? rdfw_->buf() : // in: read sequence\n\t\t\t  rdrc_->buf(), \n\t\tfw_ ? 
qufw_->buf() : // in: quality sequence\n\t\t\t  qurc_->buf(),\n                  // daehwan\n\t\t// rd_->length(),       // in: read sequence length\n                  rdf_ - rdi_,\n\t\trf_ + rfi_,          // in: reference sequence\n\t\trff_ - rfi_,         // in: in-rectangle reference sequence length\n\t\treflen,              // in: total reference sequence length\n\t\trefidx_,             // in: reference id\n\t\trfi_,                // in: reference offset\n\t\tfw_,                 // in: orientation\n\t\trect_,               // in: DP rectangle\n\t\t&cper_,              // in: checkpointer\n\t\t*sc_,                // in: scoring scheme\n\t\tnceil_);             // in: N ceiling\n}\n\t\n/**\n * Given a read, an alignment orientation, a range of characters in a reference\n * sequence, and a bit-encoded version of the reference, set up and execute the\n * corresponding dynamic programming problem.\n *\n * The caller has already narrowed down the relevant portion of the reference\n * using, e.g., the location of a seed hit, or the range of possible fragment\n * lengths if we're searching for the opposite mate in a pair.\n */\nvoid SwAligner::initRef(\n\tbool fw,               // whether the forward or revcomp read is being aligned\n\tTRefId refidx,         // reference aligned against\n\tconst DPRect& rect,    // DP rectangle\n\tconst BitPairReference& refs, // Reference strings\n\tTRefOff reflen,        // length of reference sequence\n\tconst Scoring& sc,     // scoring scheme\n\tTAlScore minsc,        // minimum score\n\tbool enable8,          // use 8-bit SSE if possible?\n\tsize_t cminlen,        // minimum length for using checkpointing scheme\n\tsize_t cpow2,          // interval b/t checkpointed diags; 1 << this\n\tbool doTri,            // triangular mini-fills?\n\tbool extend,           // true iff this is a seed extension\n\tsize_t  upto,          // count the number of Ns up to this offset\n\tsize_t& nsUpto)        // output: the number of Ns up to 
'upto'\n{\n\tTRefOff rfi = rect.refl;\n\tTRefOff rff = rect.refr + 1;\n\tassert_gt(rff, rfi);\n\t// Capture an extra reference character outside the rectangle so that we\n\t// can check matches in the next column over to the right\n\trff++;\n\t// rflen = full length of the reference substring to consider, including\n\t// overhang off the boundaries of the reference sequence\n\tconst size_t rflen = (size_t)(rff - rfi);\n\t// Figure the number of Ns we're going to add to either side\n\tsize_t leftNs  =\n\t\t(rfi >= 0               ? 0 : (size_t)std::abs(static_cast<long>(rfi)));\n\tleftNs = min(leftNs, rflen);\n\tsize_t rightNs =\n\t\t(rff <= (TRefOff)reflen ? 0 : (size_t)std::abs(static_cast<long>(rff - reflen)));\n\trightNs = min(rightNs, rflen);\n\t// rflenInner = length of just the portion that doesn't overhang ref ends\n\tassert_geq(rflen, leftNs + rightNs);\n\tconst size_t rflenInner = rflen - (leftNs + rightNs);\n#ifndef NDEBUG\n\tbool haveRfbuf2 = false;\n\tEList<char> rfbuf2(rflen);\n\t// This is really slow, so only do it some of the time\n\tif((rand() % 10) == 0) {\n\t\tTRefOff rfii = rfi;\n\t\tfor(size_t i = 0; i < rflen; i++) {\n\t\t\tif(rfii < 0 || (TRefOff)rfii >= reflen) {\n\t\t\t\trfbuf2.push_back(4);\n\t\t\t} else {\n\t\t\t\trfbuf2.push_back(refs.getBase(refidx, (uint32_t)rfii));\n\t\t\t}\n\t\t\trfii++;\n\t\t}\n\t\thaveRfbuf2 = true;\n\t}\n#endif\n\t// rfbuf_ = uint32_t list large enough to accommodate both the reference\n\t// sequence and any Ns we might add to either side.\n\trfwbuf_.resize((rflen + 16) / 4);\n\tint offset = refs.getStretch(\n\t\trfwbuf_.ptr(),               // buffer to store words in\n\t\trefidx,                      // which reference\n\t\t(rfi < 0) ? 
0 : (size_t)rfi, // starting offset (can't be < 0)\n\t\trflenInner                   // length to grab (exclude overhang)\n\t\tASSERT_ONLY(, tmp_destU32_));// for BitPairReference::getStretch()\n\tassert_leq(offset, 16);\n\trf_ = (char*)rfwbuf_.ptr() + offset;\n\t// Shift ref chars away from 0 so we can stick Ns at the beginning\n\tif(leftNs > 0) {\n\t\t// Slide everyone down\n\t\tfor(size_t i = rflenInner; i > 0; i--) {\n\t\t\trf_[i+leftNs-1] = rf_[i-1];\n\t\t}\n\t\t// Add Ns\n\t\tfor(size_t i = 0; i < leftNs; i++) {\n\t\t\trf_[i] = 4;\n\t\t}\n\t}\n\tif(rightNs > 0) {\n\t\t// Add Ns to the end\n\t\tfor(size_t i = 0; i < rightNs; i++) {\n\t\t\trf_[i + leftNs + rflenInner] = 4;\n\t\t}\n\t}\n#ifndef NDEBUG\n\t// Sanity check reference characters\n\tfor(size_t i = 0; i < rflen; i++) {\n\t\tassert(!haveRfbuf2 || rf_[i] == rfbuf2[i]);\n\t\tassert_range(0, 4, (int)rf_[i]);\n\t}\n#endif\n\t// Count Ns and convert reference characters into A/C/G/T masks.  Ambiguous\n\t// nucleotides (IUPAC codes) have more than one mask bit set.  
If a\n\t// reference scanner was provided, use it to opportunistically resolve seed\n\t// hits.\n\tnsUpto = 0;\n\tfor(size_t i = 0; i < rflen; i++) {\n\t\t// rf_[i] gets mask version of reference char, with N=16\n\t\tif(i < upto && rf_[i] > 3) {\n\t\t\tnsUpto++;\n\t\t}\n\t\trf_[i] = (1 << rf_[i]);\n\t}\n\t// Correct for having captured an extra reference character\n\trff--;\n\tinitRef(\n\t\tfw,          // whether the forward or revcomp read is being aligned\n\t\trefidx,      // id of reference aligned against\n\t\trect,        // DP rectangle\n\t\trf_,         // reference sequence, wrapped up in BTString object\n\t\t0,           // use the whole thing\n\t\t(size_t)(rff - rfi), // ditto\n\t\treflen,      // reference length\n\t\tsc,          // scoring scheme\n\t\tminsc,       // minimum score\n\t\tenable8,     // use 8-bit SSE if possible?\n\t\tcminlen,     // minimum length for using checkpointing scheme\n\t\tcpow2,       // interval b/t checkpointed diags; 1 << this\n\t\tdoTri,       // triangular mini-fills?\n\t\textend);     // true iff this is a seed extension\n}\n\n/**\n * Align read 'rd' to reference using read & reference information given\n * last time init() was called.\n */\nbool SwAligner::align(\n                      RandomSource& rnd, // source of pseudo-randoms\n                      TAlScore& best)    // best alignment score observed in DP matrix\n{\n    assert(initedRef() && initedRead());\n    assert_eq(STATE_INITED, state_);\n    state_ = STATE_ALIGNED;\n    // Reset solutions lists\n    btncand_.clear();\n    btncanddone_.clear();\n    btncanddoneSucc_ = btncanddoneFail_ = 0;\n    best = std::numeric_limits<TAlScore>::min();\n    sse8succ_ = sse16succ_ = false;\n    int flag = 0;\n    size_t rdlen = rdf_ - rdi_;\n    bool checkpointed = rdlen >= cperMinlen_;\n    bool gathered = false; // Did gathering happen along with alignment?\n    if(sc_->monotone) {\n        // End-to-end\n        if(enable8_ && !readSse16_ && minsc_ >= -254) {\n            // 
8-bit end-to-end\n            if(checkpointed) {\n                best = alignGatherEE8(flag, false);\n                if(flag == 0) {\n                    gathered = true;\n                }\n            } else {\n                best = alignNucleotidesEnd2EndSseU8(flag, false);\n#ifndef NDEBUG\n                int flagtmp = 0;\n                TAlScore besttmp = alignGatherEE8(flagtmp, true); // debug\n                assert_eq(flagtmp, flag);\n                assert_eq(besttmp, best);\n#endif\n            }\n            sse8succ_ = (flag == 0);\n#ifndef NDEBUG\n            {\n                int flag2 = 0;\n                TAlScore best2 = alignNucleotidesEnd2EndSseI16(flag2, true);\n                {\n                    int flagtmp = 0;\n                    TAlScore besttmp = alignGatherEE16(flagtmp, true);\n                    assert_eq(flagtmp, flag2);\n                    assert(flag2 != 0 || best2 == besttmp);\n                }\n                assert(flag < 0 || best == best2);\n                sse16succ_ = (flag2 == 0);\n            }\n#endif /*ndef NDEBUG*/\n        } else {\n            // 16-bit end-to-end\n            if(checkpointed) {\n                best = alignGatherEE16(flag, false);\n                if(flag == 0) {\n                    gathered = true;\n                }\n            } else {\n                best = alignNucleotidesEnd2EndSseI16(flag, false);\n#ifndef NDEBUG\n                int flagtmp = 0;\n                TAlScore besttmp = alignGatherEE16(flagtmp, true);\n                assert_eq(flagtmp, flag);\n                assert_eq(besttmp, best);\n#endif\n            }\n            sse16succ_ = (flag == 0);\n        }\n    } else {\n        // Local\n        flag = -2;\n        if(enable8_ && !readSse16_) {\n            // 8-bit local\n            if(checkpointed) {\n                best = alignGatherLoc8(flag, false);\n                if(flag == 0) {\n                    gathered = true;\n                }\n            } else 
{\n                best = alignNucleotidesLocalSseU8(flag, false);\n#ifndef NDEBUG\n                int flagtmp = 0;\n                TAlScore besttmp = alignGatherLoc8(flagtmp, true);\n                assert_eq(flag, flagtmp);\n                assert_eq(best, besttmp);\n#endif\n            }\n        }\n        if(flag == -2) {\n            // 16-bit local\n            flag = 0;\n            if(checkpointed) {\n                best = alignNucleotidesLocalSseI16(flag, false);\n                best = alignGatherLoc16(flag, false);\n                if(flag == 0) {\n                    gathered = true;\n                }\n            } else {\n                best = alignNucleotidesLocalSseI16(flag, false);\n#ifndef NDEBUG\n                int flagtmp = 0;\n                TAlScore besttmp = alignGatherLoc16(flagtmp, true);\n                assert_eq(flag, flagtmp);\n                assert_eq(best, besttmp);\n#endif\n            }\n            sse16succ_ = (flag == 0);\n        } else {\n            sse8succ_ = (flag == 0);\n#ifndef NDEBUG\n            int flag2 = 0;\n            TAlScore best2 = alignNucleotidesLocalSseI16(flag2, true);\n            {\n                int flagtmp = 0;\n                TAlScore besttmp = alignGatherLoc16(flagtmp, true);\n                assert_eq(flag2, flagtmp);\n                assert(flag2 != 0 || best2 == besttmp);\n            }\n            assert(flag2 < 0 || best == best2);\n            sse16succ_ = (flag2 == 0);\n#endif /*ndef NDEBUG*/\n        }\n    }\n#ifndef NDEBUG\n    if(!checkpointed && (rand() & 15) == 0 && sse8succ_ && sse16succ_) {\n        SSEData& d8  = fw_ ? sseU8fw_  : sseU8rc_;\n        SSEData& d16 = fw_ ? 
sseI16fw_ : sseI16rc_;\n        assert_eq(d8.mat_.nrow(), d16.mat_.nrow());\n        assert_eq(d8.mat_.ncol(), d16.mat_.ncol());\n        for(size_t i = 0; i < d8.mat_.nrow(); i++) {\n            for(size_t j = 0; j < colstop_; j++) {\n                int h8  = d8.mat_.helt(i, j);\n                int h16 = d16.mat_.helt(i, j);\n                int e8  = d8.mat_.eelt(i, j);\n                int e16 = d16.mat_.eelt(i, j);\n                int f8  = d8.mat_.felt(i, j);\n                int f16 = d16.mat_.felt(i, j);\n                TAlScore h8s  =\n                (sc_->monotone ? (h8  - 0xff  ) : h8);\n                TAlScore h16s =\n                (sc_->monotone ? (h16 - 0x7fff) : (h16 + 0x8000));\n                TAlScore e8s  =\n                (sc_->monotone ? (e8  - 0xff  ) : e8);\n                TAlScore e16s =\n                (sc_->monotone ? (e16 - 0x7fff) : (e16 + 0x8000));\n                TAlScore f8s  =\n                (sc_->monotone ? (f8  - 0xff  ) : f8);\n                TAlScore f16s =\n                (sc_->monotone ? 
(f16 - 0x7fff) : (f16 + 0x8000));\n                if(h8s < minsc_) {\n                    h8s = minsc_ - 1;\n                }\n                if(h16s < minsc_) {\n                    h16s = minsc_ - 1;\n                }\n                if(e8s < minsc_) {\n                    e8s = minsc_ - 1;\n                }\n                if(e16s < minsc_) {\n                    e16s = minsc_ - 1;\n                }\n                if(f8s < minsc_) {\n                    f8s = minsc_ - 1;\n                }\n                if(f16s < minsc_) {\n                    f16s = minsc_ - 1;\n                }\n                if((h8 != 0 || (int16_t)h16 != (int16_t)0x8000) && h8 > 0) {\n                    assert_eq(h8s, h16s);\n                }\n                if((e8 != 0 || (int16_t)e16 != (int16_t)0x8000) && e8 > 0) {\n                    assert_eq(e8s, e16s);\n                }\n                if((f8 != 0 || (int16_t)f16 != (int16_t)0x8000) && f8 > 0) {\n                    assert_eq(f8s, f16s);\n                }\n            }\n        }\n    }\n#endif\n    assert(repOk());\n    cural_ = 0;\n    if(best == MIN_I64 || best < minsc_) {\n        return false;\n    }\n    if(!gathered) {\n        // Look for solutions using SSE matrix\n        assert(sse8succ_ || sse16succ_);\n        if(sc_->monotone) {\n            if(sse8succ_) {\n                gatherCellsNucleotidesEnd2EndSseU8(best);\n#ifndef NDEBUG\n                if(sse16succ_) {\n                    cand_tmp_ = btncand_;\n                    gatherCellsNucleotidesEnd2EndSseI16(best);\n                    cand_tmp_.sort();\n                    btncand_.sort();\n                    assert(cand_tmp_ == btncand_);\n                }\n#endif /*ndef NDEBUG*/\n            } else {\n                gatherCellsNucleotidesEnd2EndSseI16(best);\n            }\n        } else {\n            if(sse8succ_) {\n                gatherCellsNucleotidesLocalSseU8(best);\n#ifndef NDEBUG\n                if(sse16succ_) {\n             
       cand_tmp_ = btncand_;\n                    gatherCellsNucleotidesLocalSseI16(best);\n                    cand_tmp_.sort();\n                    btncand_.sort();\n                    assert(cand_tmp_ == btncand_);\n                }\n#endif /*ndef NDEBUG*/\n            } else {\n                gatherCellsNucleotidesLocalSseI16(best);\n            }\n        }\n    }\n    if(!btncand_.empty()) {\n        btncand_.sort();\n    }\n    return !btncand_.empty();\n}\n\n/**\n * Populate the given SwResult with information about the \"next best\"\n * alignment if there is one.  If there isn't one, false is returned.  Note\n * that false might be returned even though a call to done() would have\n * returned false.\n */\nbool SwAligner::nextAlignment(\n\tSwResult& res,\n\tTAlScore minsc,\n\tRandomSource& rnd)\n{\n\tassert(initedRead() && initedRef());\n\tassert_eq(STATE_ALIGNED, state_);\n\tassert(repOk());\n\tif(done()) {\n\t\tres.reset();\n\t\treturn false;\n\t}\n\tassert(!done());\n\tsize_t off = 0, nbts = 0;\n\tassert_lt(cural_, btncand_.size());\n\tassert(res.repOk());\n\t// For each candidate cell that we should try to backtrack from...\n\tconst size_t candsz = btncand_.size();\n\tsize_t SQ = dpRows() >> 4;\n\tif(SQ == 0) SQ = 1;\n\tsize_t rdlen = rdf_ - rdi_;\n\tbool checkpointed = rdlen >= cperMinlen_;\n\twhile(cural_ < candsz) {\n\t\t// Doing 'continue' anywhere in here simply causes us to move on to the\n\t\t// next candidate\n\t\tif(btncand_[cural_].score < minsc) {\n\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_FILT_SCORE;\n\t\t\tnbtfiltsc_++; cural_++; continue;\n\t\t}\n\t\tnbts = 0;\n\t\tassert(sse8succ_ || sse16succ_);\n\t\tsize_t row = btncand_[cural_].row;\n\t\tsize_t col = btncand_[cural_].col;\n\t\tassert_lt(row, dpRows());\n\t\tassert_lt((TRefOff)col, rff_-rfi_);\n\t\tif(sse16succ_) {\n\t\t\tSSEData& d = fw_ ? 
sseI16fw_ : sseI16rc_;\n\t\t\tif(!checkpointed && d.mat_.reset_[row] && d.mat_.reportedThrough(row, col)) {\n\t\t\t\t// Skipping this candidate because a previous candidate already\n\t\t\t\t// moved through this cell\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_FILT_START;\n\t\t\t\t//cerr << \"  skipped because starting cell was covered\" << endl;\n\t\t\t\tnbtfiltst_++; cural_++; continue;\n\t\t\t}\n\t\t} else if(sse8succ_) {\n\t\t\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\t\t\tif(!checkpointed && d.mat_.reset_[row] && d.mat_.reportedThrough(row, col)) {\n\t\t\t\t// Skipping this candidate because a previous candidate already\n\t\t\t\t// moved through this cell\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_FILT_START;\n\t\t\t\t//cerr << \"  skipped because starting cell was covered\" << endl;\n\t\t\t\tnbtfiltst_++; cural_++; continue;\n\t\t\t}\n\t\t}\n\t\tif(sc_->monotone) {\n\t\t\tbool ret = false;\n\t\t\tif(sse8succ_) {\n\t\t\t\tuint32_t reseed = rnd.nextU32() + 1;\n\t\t\t\trnd.init(reseed);\n\t\t\t\tres.reset();\n\t\t\t\tif(checkpointed) {\n\t\t\t\t\tsize_t maxiter = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter = 0;\n\t\t\t\t\tret = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres,      // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter,  // max # extensions to try\n\t\t\t\t\t\tniter,    // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t} else {\n\t\t\t\t\tret = backtraceNucleotidesEnd2EndSseU8(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\tres,    // out: store results (edits and scores) here\n\t\t\t\t\t\toff,    // out: store diagonal projection of origin\n\t\t\t\t\t\tnbts,   
// out: # backtracks\n\t\t\t\t\t\trow,    // start in this rectangle row\n\t\t\t\t\t\tcol,    // start in this rectangle column\n\t\t\t\t\t\trnd);   // random gen, to choose among equal paths\n\t\t\t\t}\n#ifndef NDEBUG\n\t\t\t\t// if(...) statement here should check not whether the primary\n\t\t\t\t// alignment was checkpointed, but whether a checkpointed\n\t\t\t\t// alignment was done at all.\n\t\t\t\tif(!checkpointed) {\n\t\t\t\t\tSwResult res2;\n\t\t\t\t\tsize_t maxiter2 = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter2 = 0;\n\t\t\t\t\tbool ret2 = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres2,     // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter2, // max # extensions to try\n\t\t\t\t\t\tniter2,   // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t\t// After the first alignment, there's no guarantee we'll\n\t\t\t\t\t// get the same answer from both backtrackers because of\n\t\t\t\t\t// differences in how they handle marking cells as\n\t\t\t\t\t// reported-through.\n\t\t\t\t\tassert(cural_ > 0 || !ret || ret == ret2);\n\t\t\t\t}\n\t\t\t\tif(sse16succ_ && !checkpointed) {\n\t\t\t\t\tSwResult res2;\n\t\t\t\t\tsize_t off2, nbts2 = 0;\n\t\t\t\t\trnd.init(reseed);\n\t\t\t\t\tbool ret2 = backtraceNucleotidesEnd2EndSseI16(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\tres2,   // out: store results (edits and scores) here\n\t\t\t\t\t\toff2,   // out: store diagonal projection of origin\n\t\t\t\t\t\tnbts2,  // out: # backtracks\n\t\t\t\t\t\trow,    // start in this rectangle row\n\t\t\t\t\t\tcol,    // start in this rectangle column\n\t\t\t\t\t\trnd);   // random gen, to choose among equal 
paths\n\t\t\t\t\tassert_eq(ret, ret2);\n\t\t\t\t\tassert_eq(nbts, nbts2);\n\t\t\t\t\tassert(!ret || res2.alres.score() == res.alres.score());\n#if 0\n\t\t\t\t\tif(!checkpointed && (rand() & 15) == 0) {\n\t\t\t\t\t\t// Check that same cells are reported through\n\t\t\t\t\t\tSSEData& d8  = fw_ ? sseU8fw_  : sseU8rc_;\n\t\t\t\t\t\tSSEData& d16 = fw_ ? sseI16fw_ : sseI16rc_;\n\t\t\t\t\t\tfor(size_t i = d8.mat_.nrow(); i > 0; i--) {\n\t\t\t\t\t\t\tfor(size_t j = 0; j < d8.mat_.ncol(); j++) {\n\t\t\t\t\t\t\t\tassert_eq(d8.mat_.reportedThrough(i-1, j),\n\t\t\t\t\t\t\t\t\t\t  d16.mat_.reportedThrough(i-1, j));\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n#endif\n\t\t\t\t}\n#endif\n\t\t\t\trnd.init(reseed+1); // debug/release pseudo-randoms in lock step\n\t\t\t} else if(sse16succ_) {\n\t\t\t\tuint32_t reseed = rnd.nextU32() + 1;\n\t\t\t\tres.reset();\n\t\t\t\tif(checkpointed) {\n\t\t\t\t\tsize_t maxiter = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter = 0;\n\t\t\t\t\tret = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres,      // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter,  // max # extensions to try\n\t\t\t\t\t\tniter,    // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t} else {\n\t\t\t\t\tret = backtraceNucleotidesEnd2EndSseI16(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\tres,    // out: store results (edits and scores) here\n\t\t\t\t\t\toff,    // out: store diagonal projection of origin\n\t\t\t\t\t\tnbts,   // out: # backtracks\n\t\t\t\t\t\trow,    // start in this rectangle row\n\t\t\t\t\t\tcol,    // start in this rectangle column\n\t\t\t\t\t\trnd);   // random gen, to choose 
among equal paths\n\t\t\t\t}\n#ifndef NDEBUG\n\t\t\t\t// if(...) statement here should check not whether the primary\n\t\t\t\t// alignment was checkpointed, but whether a checkpointed\n\t\t\t\t// alignment was done at all.\n\t\t\t\tif(!checkpointed) {\n\t\t\t\t\tSwResult res2;\n\t\t\t\t\tsize_t maxiter2 = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter2 = 0;\n\t\t\t\t\tbool ret2 = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres2,     // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter2, // max # extensions to try\n\t\t\t\t\t\tniter2,   // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t\t// After the first alignment, there's no guarantee we'll\n\t\t\t\t\t// get the same answer from both backtrackers because of\n\t\t\t\t\t// differences in how they handle marking cells as\n\t\t\t\t\t// reported-through.\n\t\t\t\t\tassert(cural_ > 0 || !ret || ret == ret2);\n\t\t\t\t}\n#endif\n\t\t\t\trnd.init(reseed); // debug/release pseudo-randoms in lock step\n\t\t\t}\n\t\t\tif(ret) {\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_SUCCEEDED;\n\t\t\t\tbreak;\n\t\t\t} else {\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_FAILED;\n\t\t\t}\n\t\t} else {\n\t\t\t// Local alignment\n\t\t\t// Check if this solution is \"dominated\" by a prior one.\n\t\t\t// Domination is a heuristic designed to eliminate the vast\n\t\t\t// majority of valid-but-redundant candidates lying in the\n\t\t\t// \"penumbra\" of a high-scoring alignment.\n\t\t\tbool dom = false;\n\t\t\t{\n\t\t\t\tsize_t donesz = btncanddone_.size();\n\t\t\t\tconst size_t col = btncand_[cural_].col;\n\t\t\t\tconst size_t row = btncand_[cural_].row;\n\t\t\t\tfor(size_t i = 0; i < 
donesz; i++) {\n\t\t\t\t\tassert_gt(btncanddone_[i].fate, 0);\n\t\t\t\t\tsize_t colhi = col, rowhi = row;\n\t\t\t\t\tsize_t rowlo = btncanddone_[i].row;\n\t\t\t\t\tsize_t collo = btncanddone_[i].col;\n\t\t\t\t\tif(colhi < collo) swap(colhi, collo);\n\t\t\t\t\tif(rowhi < rowlo) swap(rowhi, rowlo);\n\t\t\t\t\tif(colhi - collo <= SQ && rowhi - rowlo <= SQ) {\n\t\t\t\t\t\t// Skipping this candidate because it's \"dominated\" by\n\t\t\t\t\t\t// a previous candidate\n\t\t\t\t\t\tdom = true;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(dom) {\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_FILT_DOMINATED;\n\t\t\t\tnbtfiltdo_++;\n\t\t\t\tcural_++;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tbool ret = false;\n\t\t\tif(sse8succ_) {\n\t\t\t\tuint32_t reseed = rnd.nextU32() + 1;\n\t\t\t\tres.reset();\n\t\t\t\trnd.init(reseed);\n\t\t\t\tif(checkpointed) {\n\t\t\t\t\tsize_t maxiter = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter = 0;\n\t\t\t\t\tret = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres,      // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter,  // max # extensions to try\n\t\t\t\t\t\tniter,    // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t} else {\n\t\t\t\t\tret = backtraceNucleotidesLocalSseU8(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\tres,    // out: store results (edits and scores) here\n\t\t\t\t\t\toff,    // out: store diagonal projection of origin\n\t\t\t\t\t\tnbts,   // out: # backtracks\n\t\t\t\t\t\trow,    // start in this rectangle row\n\t\t\t\t\t\tcol,    // start in this rectangle column\n\t\t\t\t\t\trnd);   // random gen, to choose among equal 
paths\n\t\t\t\t}\n#ifndef NDEBUG\n\t\t\t\t// if(...) statement here should check not whether the primary\n\t\t\t\t// alignment was checkpointed, but whether a checkpointed\n\t\t\t\t// alignment was done at all.\n\t\t\t\tif(!checkpointed) {\n\t\t\t\t\tSwResult res2;\n\t\t\t\t\tsize_t maxiter2 = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter2 = 0;\n\t\t\t\t\tbool ret2 = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres2,     // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter2, // max # extensions to try\n\t\t\t\t\t\tniter2,   // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t\t// After the first alignment, there's no guarantee we'll\n\t\t\t\t\t// get the same answer from both backtrackers because of\n\t\t\t\t\t// differences in how they handle marking cells as\n\t\t\t\t\t// reported-through.\n\t\t\t\t\tassert(cural_ > 0 || !ret || ret == ret2);\n\t\t\t\t}\n\t\t\t\tif(!checkpointed && sse16succ_) {\n\t\t\t\t\tSwResult res2;\n\t\t\t\t\tsize_t off2, nbts2 = 0;\n\t\t\t\t\trnd.init(reseed); // same b/t backtrace calls\n\t\t\t\t\tbool ret2 = backtraceNucleotidesLocalSseI16(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\tres2,   // out: store results (edits and scores) here\n\t\t\t\t\t\toff2,   // out: store diagonal projection of origin\n\t\t\t\t\t\tnbts2,  // out: # backtracks\n\t\t\t\t\t\trow,    // start in this rectangle row\n\t\t\t\t\t\tcol,    // start in this rectangle column\n\t\t\t\t\t\trnd);   // random gen, to choose among equal paths\n\t\t\t\t\tassert_eq(ret, ret2);\n\t\t\t\t\tassert_eq(nbts, nbts2);\n\t\t\t\t\tassert(!ret || res2.alres.score() == res.alres.score());\n#if 
0\n\t\t\t\t\tif(!checkpointed && (rand() & 15) == 0) {\n\t\t\t\t\t\t// Check that same cells are reported through\n\t\t\t\t\t\tSSEData& d8  = fw_ ? sseU8fw_  : sseU8rc_;\n\t\t\t\t\t\tSSEData& d16 = fw_ ? sseI16fw_ : sseI16rc_;\n\t\t\t\t\t\tfor(size_t i = d8.mat_.nrow(); i > 0; i--) {\n\t\t\t\t\t\t\tfor(size_t j = 0; j < d8.mat_.ncol(); j++) {\n\t\t\t\t\t\t\t\tassert_eq(d8.mat_.reportedThrough(i-1, j),\n\t\t\t\t\t\t\t\t\t\t  d16.mat_.reportedThrough(i-1, j));\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n#endif\n\t\t\t\t}\n#endif\n\t\t\t\trnd.init(reseed+1); // debug/release pseudo-randoms in lock step\n\t\t\t} else if(sse16succ_) {\n\t\t\t\tuint32_t reseed = rnd.nextU32() + 1;\n\t\t\t\tres.reset();\n\t\t\t\tif(checkpointed) {\n\t\t\t\t\tsize_t maxiter = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter = 0;\n\t\t\t\t\tret = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres,      // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter,  // max # extensions to try\n\t\t\t\t\t\tniter,    // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t} else {\n\t\t\t\t\tret = backtraceNucleotidesLocalSseI16(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\tres,    // out: store results (edits and scores) here\n\t\t\t\t\t\toff,    // out: store diagonal projection of origin\n\t\t\t\t\t\tnbts,   // out: # backtracks\n\t\t\t\t\t\trow,    // start in this rectangle row\n\t\t\t\t\t\tcol,    // start in this rectangle column\n\t\t\t\t\t\trnd);   // random gen, to choose among equal paths\n\t\t\t\t}\n#ifndef NDEBUG\n\t\t\t\t// if(...) 
statement here should check not whether the primary\n\t\t\t\t// alignment was checkpointed, but whether a checkpointed\n\t\t\t\t// alignment was done at all.\n\t\t\t\tif(!checkpointed) {\n\t\t\t\t\tSwResult res2;\n\t\t\t\t\tsize_t maxiter2 = MAX_SIZE_T;\n\t\t\t\t\tsize_t niter2 = 0;\n\t\t\t\t\tbool ret2 = backtrace(\n\t\t\t\t\t\tbtncand_[cural_].score, // in: expected score\n\t\t\t\t\t\ttrue,     // in: use mini-fill?\n\t\t\t\t\t\ttrue,     // in: use checkpoints?\n\t\t\t\t\t\tres2,     // out: store results (edits and scores) here\n\t\t\t\t\t\toff,      // out: store diagonal projection of origin\n\t\t\t\t\t\trow,      // start in this rectangle row\n\t\t\t\t\t\tcol,      // start in this rectangle column\n\t\t\t\t\t\tmaxiter2, // max # extensions to try\n\t\t\t\t\t\tniter2,   // # extensions tried\n\t\t\t\t\t\trnd);     // random gen, to choose among equal paths\n\t\t\t\t\t// After the first alignment, there's no guarantee we'll\n\t\t\t\t\t// get the same answer from both backtrackers because of\n\t\t\t\t\t// differences in how they handle marking cells as\n\t\t\t\t\t// reported-through.\n\t\t\t\t\tassert(cural_ > 0 || !ret || ret == ret2);\n\t\t\t\t}\n#endif\n\t\t\t\trnd.init(reseed); // same b/t backtrace calls\n\t\t\t}\n\t\t\tif(ret) {\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_SUCCEEDED;\n\t\t\t\tbtncanddone_.push_back(btncand_[cural_]);\n\t\t\t\tbtncanddoneSucc_++;\n\t\t\t\tassert(res.repOk());\n\t\t\t\tbreak;\n\t\t\t} else {\n\t\t\t\tbtncand_[cural_].fate = BT_CAND_FATE_FAILED;\n\t\t\t\tbtncanddone_.push_back(btncand_[cural_]);\n\t\t\t\tbtncanddoneFail_++;\n\t\t\t}\n\t\t}\n\t\tcural_++;\n\t} // while(cural_ < btncand_.size())\n\tif(cural_ == btncand_.size()) {\n\t\tassert(res.repOk());\n\t\treturn false;\n\t}\n\t// assert(!res.alres.empty());\n\tassert(res.repOk());\n\tif(!fw_) {\n\t\t// All edits are currently w/r/t upstream end; if read aligned\n\t\t// to Crick strand, we need to invert them so that they're\n\t\t// w/r/t the read's 5' end 
instead.\n\t\t// res.alres.invertEdits();\n\t}\n\tcural_++;\n\tassert(res.repOk());\n\treturn true;\n}\n\n#ifdef MAIN_ALIGNER_SW\n\n#include <sstream>\n#include <utility>\n#include <getopt.h>\n#include \"scoring.h\"\n#include \"aligner_seed_policy.h\"\n\nint  gGapBarrier;\nint  gSnpPhred;\nstatic int bonusMatchType;   // how to reward matches\nstatic int bonusMatch;       // constant if match bonus is a constant\nstatic int penMmcType;       // how to penalize mismatches\nstatic int penMmc;           // constant if mm penalty is a constant\nstatic int penNType;         // how to penalize Ns in the read\nstatic int penN;             // constant if N penalty is a constant\nstatic bool nPairCat;        // true -> concatenate mates before N filter\nstatic int penRdExConst;     // constant coeff for cost of gap in read\nstatic int penRfExConst;     // constant coeff for cost of gap in ref\nstatic int penRdExLinear;    // linear coeff for cost of gap in read\nstatic int penRfExLinear;    // linear coeff for cost of gap in ref\nstatic float costMinConst;   // constant coeff for min score w/r/t read len\nstatic float costMinLinear;  // linear coeff for min score w/r/t read len\nstatic float costFloorConst; // constant coeff for score floor w/r/t read len\nstatic float costFloorLinear;// linear coeff for score floor w/r/t read len\nstatic float nCeilConst;     // constant coeff for N ceiling w/r/t read len\nstatic float nCeilLinear;    // linear coeff for N ceiling w/r/t read len\nstatic bool  nCatPair;       // concat mates before applying N filter?\nstatic int multiseedMms;     // mismatches permitted in a multiseed seed\nstatic int multiseedLen;     // length of multiseed seeds\nstatic int multiseedIvalType;\nstatic float multiseedIvalA;\nstatic float multiseedIvalB;\nstatic float posmin;\nstatic float posfrac;\nstatic float rowmult;\n\nenum {\n\tARG_TESTS = 256\n};\n\nstatic const char *short_opts = \"s:m:r:d:i:\";\nstatic struct option long_opts[] = 
{\n\t{(char*)\"snppen\",       required_argument, 0, 's'},\n\t{(char*)\"misspen\",      required_argument, 0, 'm'},\n\t{(char*)\"seed\",         required_argument, 0, 'r'},\n\t{(char*)\"align-policy\", no_argument,       0, 'A'},\n\t{(char*)\"test\",         no_argument,       0, ARG_TESTS},\n\t{(char*)0,              0,                 0, 0} // getopt_long() requires an all-zero terminating entry\n};\n\nstatic void printUsage(ostream& os) {\n\tos << \"Usage: aligner_sw <read-seq> <ref-nuc-seq> [options]*\" << endl;\n\tos << \"Options:\" << endl;\n\tos << \"  -s/--snppen <int>   penalty incurred by SNP; used for decoding\"\n\t   << endl;\n\tos << \"  -m/--misspen <int>  quality to use for read chars\" << endl;\n\tos << \"  -r/--seed <int>     seed for pseudo-random generator\" << endl;\n}\n\n/**\n * Parse a T from a string 's'\n */\ntemplate<typename T>\nT parse(const char *s) {\n\tT tmp;\n\tstringstream ss(s);\n\tss >> tmp;\n\treturn tmp;\n}\n\nstatic EList<bool> stbuf, enbuf;\nstatic BTDnaString btread;\nstatic BTString btqual;\nstatic BTString btref;\nstatic BTString btref2;\n\nstatic BTDnaString readrc;\nstatic BTString qualrc;\n\n/**\n * Helper function for running a case consisting of a read (sequence\n * and quality), a reference string, and an offset that anchors the 0th\n * character of the read to a reference position.\n */\nstatic void doTestCase(\n\tSwAligner&         al,\n\tconst BTDnaString& read,\n\tconst BTString&    qual,\n\tconst BTString&    refin,\n\tTRefOff            off,\n\tEList<bool>       *en,\n\tconst Scoring&     sc,\n\tTAlScore           minsc,\n\tTAlScore           floorsc,\n\tSwResult&          res,\n\tbool               nsInclusive,\n\tbool               filterns,\n\tuint32_t           seed)\n{\n\tRandomSource rnd(seed);\n\tbtref2 = refin;\n\tassert_eq(read.length(), qual.length());\n\tsize_t nrow = read.length();\n\tTRefOff rfi, rff;\n\t// Calculate the largest possible number of read and reference gaps given\n\t// 'minsc' and 'sc'\n\tsize_t maxgaps;\n\tsize_t padi, padf;\n\t{\n\t\tint readGaps = sc.maxReadGaps(minsc, read.length());\n\t\tint refGaps = 
sc.maxRefGaps(minsc, read.length());\n\t\tassert_geq(readGaps, 0);\n\t\tassert_geq(refGaps, 0);\n\t\tint maxGaps = max(readGaps, refGaps);\n\t\tpadi = 2 * maxGaps;\n\t\tpadf = maxGaps;\n\t\tmaxgaps = (size_t)maxGaps;\n\t}\n\tsize_t nceil = (size_t)sc.nCeil.f((double)read.length());\n\tsize_t width = 1 + padi + padf;\n\trfi = off;\n\toff = 0;\n\t// Pad the beginning of the reference with Ns if necessary\n\tif(rfi < padi) {\n\t\tsize_t beginpad = (size_t)(padi - rfi);\n\t\tfor(size_t i = 0; i < beginpad; i++) {\n\t\t\tbtref2.insert('N', 0);\n\t\t\toff--;\n\t\t}\n\t\trfi = 0;\n\t} else {\n\t\trfi -= padi;\n\t}\n\tassert_geq(rfi, 0);\n\t// Pad the end of the reference with Ns if necessary\n\twhile(rfi + nrow + padi + padf > btref2.length()) {\n\t\tbtref2.append('N');\n\t}\n\trff = rfi + nrow + padi + padf;\n\t// Convert reference string to masks\n\tfor(size_t i = 0; i < btref2.length(); i++) {\n\t\tif(toupper(btref2[i]) == 'N' && !nsInclusive) {\n\t\t\tbtref2.set(16, i);\n\t\t} else {\n\t\t\tint num = 0;\n\t\t\tint alts[] = {4, 4, 4, 4};\n\t\t\tdecodeNuc(toupper(btref2[i]), num, alts);\n\t\t\tassert_leq(num, 4);\n\t\t\tassert_gt(num, 0);\n\t\t\tbtref2.set(0, i);\n\t\t\tfor(int j = 0; j < num; j++) {\n\t\t\t\tbtref2.set(btref2[i] | (1 << alts[j]), i);\n\t\t\t}\n\t\t}\n\t}\n\tbool fw = true;\n\tuint32_t refidx = 0;\n\tsize_t solwidth = width;\n\tif(maxgaps >= solwidth) {\n\t\tsolwidth = 0;\n\t} else {\n\t\tsolwidth -= maxgaps;\n\t}\n\tif(en == NULL) {\n\t\tenbuf.resize(solwidth);\n\t\tenbuf.fill(true);\n\t\ten = &enbuf;\n\t}\n\tassert_geq(rfi, 0);\n\tassert_gt(rff, rfi);\n\treadrc = read;\n\tqualrc = qual;\n\tal.initRead(\n\t\tread,          // read sequence\n\t\treadrc,\n\t\tqual,          // read qualities\n\t\tqualrc,\n\t\t0,             // offset of first character within 'read' to consider\n\t\tread.length(), // offset of last char (exclusive) in 'read' to consider\n\t\tfloorsc);      // local-alignment score floor\n\tal.initRef(\n\t\tfw,            // 'read' is 
forward version of read?\n\t\trefidx,        // id of reference aligned to\n\t\toff,           // offset of upstream ref char aligned against\n\t\tbtref2.wbuf(), // reference sequence (masks)\n\t\trfi,           // offset of first char in 'ref' to consider\n\t\trff,           // offset of last char (exclusive) in 'ref' to consider\n\t\twidth,         // # bands to do (width of parallelogram)\n\t\tsolwidth,      // # rightmost cols where solns can end\n\t\tsc,            // scoring scheme\n\t\tminsc,         // minimum score for valid alignment\n\t\tmaxgaps,       // max of max # read gaps, ref gaps\n\t\t0,             // amount to truncate on left-hand side\n\t\ten);           // mask indicating which columns we can end in\n\tif(filterns) {\n\t\tal.filter((int)nceil);\n\t}\n\tal.align(rnd);\n}\n\n/**\n * Another interface for running a case.\n */\nstatic void doTestCase2(\n\tSwAligner&         al,\n\tconst char        *read,\n\tconst char        *qual,\n\tconst char        *refin,\n\tTRefOff            off,\n\tconst Scoring&     sc,\n\tfloat              costMinConst,\n\tfloat              costMinLinear,\n\tSwResult&          res,\n\tbool               nsInclusive = false,\n\tbool               filterns = false,\n\tuint32_t           seed = 0)\n{\n\tbtread.install(read, true);\n\tTAlScore minsc = (TAlScore)(Scoring::linearFunc(\n\t\tbtread.length(),\n\t\tcostMinConst,\n\t\tcostMinLinear));\n\tTAlScore floorsc = (TAlScore)(Scoring::linearFunc(\n\t\tbtread.length(),\n\t\tcostFloorConst,\n\t\tcostFloorLinear));\n\tbtqual.install(qual);\n\tbtref.install(refin);\n\tdoTestCase(\n\t\tal,\n\t\tbtread,\n\t\tbtqual,\n\t\tbtref,\n\t\toff,\n\t\tNULL,\n\t\tsc,  \n\t\tminsc,\n\t\tfloorsc,\n\t\tres,\n\t\tnsInclusive,\n\t\tfilterns,\n\t\tseed\n\t);\n}\n\n/**\n * Another interface for running a case.\n */\nstatic void doTestCase3(\n\tSwAligner&         al,\n\tconst char        *read,\n\tconst char        *qual,\n\tconst char        *refin,\n\tTRefOff            off,\n\tScoring&     
      sc,\n\tfloat              costMinConst,\n\tfloat              costMinLinear,\n\tfloat              nCeilConst,\n\tfloat              nCeilLinear,\n\tSwResult&          res,\n\tbool               nsInclusive = false,\n\tbool               filterns = false,\n\tuint32_t           seed = 0)\n{\n\tbtread.install(read, true);\n\t// Calculate the penalty ceiling for the read\n\tTAlScore minsc = (TAlScore)(Scoring::linearFunc(\n\t\tbtread.length(),\n\t\tcostMinConst,\n\t\tcostMinLinear));\n\tTAlScore floorsc = (TAlScore)(Scoring::linearFunc(\n\t\tbtread.length(),\n\t\tcostFloorConst,\n\t\tcostFloorLinear));\n\tbtqual.install(qual);\n\tbtref.install(refin);\n\tsc.nCeil.setType(SIMPLE_FUNC_LINEAR);\n\tsc.nCeil.setConst(costMinConst);\n\tsc.nCeil.setCoeff(costMinLinear);\n\tdoTestCase(\n\t\tal,\n\t\tbtread,\n\t\tbtqual,\n\t\tbtref,\n\t\toff,\n\t\tNULL,\n\t\tsc,  \n\t\tminsc,\n\t\tfloorsc,\n\t\tres,\n\t\tnsInclusive,\n\t\tfilterns,\n\t\tseed\n\t);\n}\n\n/**\n * Another interface for running a case.  
Like doTestCase3 but caller specifies\n * st_ and en_ lists.\n */\nstatic void doTestCase4(\n\tSwAligner&         al,\n\tconst char        *read,\n\tconst char        *qual,\n\tconst char        *refin,\n\tTRefOff            off,\n\tEList<bool>&       en,\n\tScoring&           sc,\n\tfloat              costMinConst,\n\tfloat              costMinLinear,\n\tfloat              nCeilConst,\n\tfloat              nCeilLinear,\n\tSwResult&          res,\n\tbool               nsInclusive = false,\n\tbool               filterns = false,\n\tuint32_t           seed = 0)\n{\n\tbtread.install(read, true);\n\t// Calculate the penalty ceiling for the read\n\tTAlScore minsc = (TAlScore)(Scoring::linearFunc(\n\t\tbtread.length(),\n\t\tcostMinConst,\n\t\tcostMinLinear));\n\tTAlScore floorsc = (TAlScore)(Scoring::linearFunc(\n\t\tbtread.length(),\n\t\tcostFloorConst,\n\t\tcostFloorLinear));\n\tbtqual.install(qual);\n\tbtref.install(refin);\n\tsc.nCeil.setType(SIMPLE_FUNC_LINEAR);\n\tsc.nCeil.setConst(costMinConst);\n\tsc.nCeil.setCoeff(costMinLinear);\n\tdoTestCase(\n\t\tal,\n\t\tbtread,\n\t\tbtqual,\n\t\tbtref,\n\t\toff,\n\t\t&en,\n\t\tsc,  \n\t\tminsc,\n\t\tfloorsc,\n\t\tres,\n\t\tnsInclusive,\n\t\tfilterns,\n\t\tseed\n\t);\n}\n\n/**\n * Do a set of unit tests.\n */\nstatic void doTests() {\n\tbonusMatchType  = DEFAULT_MATCH_BONUS_TYPE;\n\tbonusMatch      = DEFAULT_MATCH_BONUS;\n\tpenMmcType      = DEFAULT_MM_PENALTY_TYPE;\n\tpenMmc          = DEFAULT_MM_PENALTY;\n\tpenSnp          = DEFAULT_SNP_PENALTY;\n\tpenNType        = DEFAULT_N_PENALTY_TYPE;\n\tpenN            = DEFAULT_N_PENALTY;\n\tnPairCat        = DEFAULT_N_CAT_PAIR;\n\tpenRdExConst    = DEFAULT_READ_GAP_CONST;\n\tpenRfExConst    = DEFAULT_REF_GAP_CONST;\n\tpenRdExLinear   = DEFAULT_READ_GAP_LINEAR;\n\tpenRfExLinear   = DEFAULT_REF_GAP_LINEAR;\n\tcostMinConst    = DEFAULT_MIN_CONST;\n\tcostMinLinear   = DEFAULT_MIN_LINEAR;\n\tcostFloorConst  = DEFAULT_FLOOR_CONST;\n\tcostFloorLinear = DEFAULT_FLOOR_LINEAR;\n\tnCeilConst  
    = 1.0f; // constant factor in N ceil w/r/t read len\n\tnCeilLinear     = 0.1f; // coeff of linear term in N ceil w/r/t read len\n\tmultiseedMms    = DEFAULT_SEEDMMS;\n\tmultiseedLen    = DEFAULT_SEEDLEN;\n\t// Set up penalties\n\tScoring sc(\n\t\tbonusMatch,\n\t\tpenMmcType,    // how to penalize mismatches\n\t\t30,        // constant if mm penalty is a constant\n\t\t30,        // penalty for decoded SNP\n\t\tcostMinConst,  // constant factor in min score w/r/t read length\n\t\tcostMinLinear, // coeff of linear term in min score w/r/t read length\n\t\tcostFloorConst,  // constant factor in score floor w/r/t read length\n\t\tcostFloorLinear, // coeff of linear term in score floor w/r/t read length\n\t\tnCeilConst,    // constant factor in N ceiling w/r/t read length\n\t\tnCeilLinear,   // coeff of linear term in N ceiling w/r/t read length\n\t\tpenNType,      // how to penalize Ns in the read\n\t\tpenN,          // constant if N penalty is a constant\n\t\tnPairCat,      // true -> concatenate mates before N filtering\n\t\t25,  // constant coeff for cost of gap in read\n\t\t25,  // constant coeff for cost of gap in ref\n\t\t15, // linear coeff for cost of gap in read\n\t\t15, // linear coeff for cost of gap in ref\n\t\t1,             // # rows at top/bot can only be entered diagonally\n\t\t-1,            // min row idx to backtrace from; -1 = no limit\n\t\tfalse          // sort results first by row then by score?\n\t);\n\t// Set up alternative penalties\n\tScoring sc2(\n\t\tbonusMatch,\n\t\tCOST_MODEL_QUAL, // how to penalize mismatches\n\t\t30,          // constant if mm penalty is a constant\n\t\t30,          // penalty for decoded SNP\n\t\tcostMinConst,  // constant factor in min score w/r/t read length\n\t\tcostMinLinear, // coeff of linear term in min score w/r/t read length\n\t\tcostFloorConst,  // constant factor in score floor w/r/t read length\n\t\tcostFloorLinear, // coeff of linear term in score floor w/r/t read length\n\t\t1.0f,            // constant 
factor in N ceiling w/r/t read length\n\t\t1.0f,            // coeff of linear term in N ceiling w/r/t read length\n\t\tpenNType,        // how to penalize Ns in the read\n\t\tpenN,            // constant if N penalty is a constant\n\t\tnPairCat,        // true -> concatenate mates before N filtering\n\t\t25,    // constant coeff for cost of gap in read\n\t\t25,    // constant coeff for cost of gap in ref\n\t\t15,   // linear coeff for cost of gap in read\n\t\t15,   // linear coeff for cost of gap in ref\n\t\t1,               // # rows at top/bot can only be entered diagonally\n\t\t-1,              // min row idx to backtrace from; -1 = no limit\n\t\tfalse            // sort results first by row then by score?\n\t);\n\tSwResult res;\n\t\n\t//\n\t// Basic nucleotide-space tests\n\t//\n\tcerr << \"Running tests...\" << endl;\n\tint tests = 1;\n\tbool nIncl = false;\n\tbool nfilter = false;\n\n\tSwAligner al;\n\tRandomSource rnd(73);\n\tfor(int i = 0; i < 3; i++) {\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \"\n\t\t     << (i*4) << \", exact)...\";\n\t\tsc.rdGapConst = 40;\n\t\tsc.rfGapConst = 40;\n\t\tsc.rdGapLinear = 15;\n\t\tsc.rfGapLinear = 15;\n\t//        A           C           G           T           A           C           G           T\n\t//    H   E   F   H   E   F   H   E   F   H   E   F   H   E   F   H   E   F   H   E   F   H   E   F\n\t// A  0   lo  lo -30  lo  lo -30  lo  lo -30 lo lo 0 lo lo -30 lo lo-30 lo lo-30 lo lo\n\t// C -30  lo -55  0  -85 -85 -55 -55 -85\n\t// G -30  lo -70 -55 -85 -55  0 -100-100\n\t// T -30  lo -85 -60 -85 -70 -55-100 -55\n\t// A  0   lo -85 -55 -55 -85 -70 -70 -70\n\t// C -30  lo -55  0  -85-100 -55 -55 -85\n\t// G -30  lo -70 -55 -85 -55  0 -100-100\n\t// T -30  lo -85 -60 -85 -70 -55-100 -55\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTACGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring 
scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 0);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1mm allowed by minsc)...\";\n\t\tsc.setMmPen(COST_MODEL_CONSTANT, 30);\n\t\t//sc.setMatchBonus(10);\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTTCGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -30);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 
1mm allowed by minsc, check qual 1)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTTCGT\",         // read\n\t\t\t\"ABCDEFGH\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc2,                // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tsize_t lo, hi;\n\t\tif(i == 0) {\n\t\t\tlo = 0; hi = 1;\n\t\t} else if(i == 1) {\n\t\t\tlo = 1; hi = 2;\n\t\t} else {\n\t\t\tlo = 2; hi = 3;\n\t\t}\n\t\tfor(size_t j = lo; j < hi; j++) {\n\t\t\tal.nextAlignment(res, rnd);\n\t\t\tassert(!res.empty());\n\t\t\tassert_eq(j*4, res.alres.refoff());\n\t\t\tassert_eq(8, res.alres.refExtent());\n\t\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\t\tassert_eq(res.alres.score().score(), -36);\n\t\t\tassert_eq(res.alres.score().ns(), 0);\n\t\t\tres.reset();\n\t\t}\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1mm allowed by minsc, check qual 2)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGAACGT\",         // read\n\t\t\t\"ABCDEFGH\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc2,                // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 
0);\n\t\tassert_eq(res.alres.score().score(), -35);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tassert(res.empty());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1mm allowed by minsc, check qual )...\";\n\t\tassert(res.empty());\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"TCGTACGT\",         // read\n\t\t\t\"ABCDEFGH\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc2,                // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -32);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tassert(res.empty());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1mm at the beginning, allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"CCGTACGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter 
Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -30);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert_eq(1, res.alres.ned().size());\n\t\tassert_eq(0, res.alres.aed().size());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 n in read, allowed)...\";\n\t\tdoTestCase3(\n\t\t\tal,\n\t\t\t\"ACGTNCGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t1.0f,               // allow 1 N\n\t\t\t0.0f,               // allow 1 N\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -1);\n\t\tassert_eq(res.alres.score().ns(), 1);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 2 n in read, allowed)...\";\n\t\tdoTestCase3(\n\t\t\tal,\n\t\t\t\"ACGNNCGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,        
     // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t2.0f,               // const coeff for N ceiling\n\t\t\t0.0f,               // linear coeff for N ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -2);\n\t\tassert_eq(res.alres.score().ns(), 2);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 2 n in read, 1 at beginning, allowed)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"NCGTNCGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -2);\n\t\tassert_eq(res.alres.score().ns(), 2);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 n in ref, allowed)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTACGT\",  
       // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTNCGTACGTANGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), -1);\n\t\tassert_eq(res.alres.score().ns(), 1);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1mm disallowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTTCGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-10.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\t// Read gap with equal read and ref gap penalties\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", read gap allowed by minsc)...\";\n\t\tassert(res.empty());\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 15;\n\t\tsc.rdGapLinear = 15;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",          // read\n\t\t\t\"IIIIIII\",          // 
qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", read gap disallowed by minsc)...\";\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 15;\n\t\tsc.rdGapLinear = 15;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",          // read\n\t\t\t\"IIIIIII\",          // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\n\t\tcerr << \"PASSED\" << endl;\n\t\t// Ref gap with equal read and ref gap penalties\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", ref gap allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",        // read\n\t\t\t\"IIIIIIIII\",        // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                
// off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", read gap disallowed by gap barrier)...\";\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 15;\n\t\tsc.rdGapLinear = 15;\n\t\tsc.gapbar = 4;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",          // read\n\t\t\t\"IIIIIII\",          // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tsc.gapbar = 1;\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\n\t\tcerr << \"PASSED\" << endl;\n\t\t// Ref gap with equal read and ref gap penalties\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", ref gap allowed by minsc, gapbar=3)...\";\n\t\tsc.gapbar = 3;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",        // read\n\t\t\t\"IIIIIIIII\",        // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,            
    // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tsc.gapbar = 1;\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\t// Ref gap with equal read and ref gap penalties\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", ref gap allowed by minsc, gapbar=4)...\";\n\t\tsc.gapbar = 4;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",        // read\n\t\t\t\"IIIIIIIII\",        // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tsc.gapbar = 1;\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 
ref gap disallowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",        // read\n\t\t\t\"IIIIIIIII\",        // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tassert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", ref gap disallowed by gap barrier)...\";\n\t\tsc.gapbar = 5;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",        // read\n\t\t\t\"IIIIIIIII\",        // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tsc.gapbar = 1;\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tassert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\t// Read gap with one read gap and zero ref gaps allowed\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 read gap, ref gaps disallowed by minsc)...\";\n\t\tsc.rfGapConst = 35;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 20;\n\t\tsc.rdGapLinear = 10;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",          // read\n\t\t\t\"IIIIIII\",          // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff 
for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -35);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", gaps disallowed by minsc)...\";\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 10;\n\t\tsc.rdGapLinear = 10;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",          // read\n\t\t\t\"IIIIIII\",          // qual \n\t\t\t\"ACGTACGTACGTACGT\", // ref \n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tassert(res.empty());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\t// Ref gap with one ref gap and zero read gaps allowed\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 35;\n\t\tsc.rfGapLinear = 12;\n\t\tsc.rdGapLinear = 22;\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 ref gap, read gaps disallowed by minsc)...\";\n\t\tassert(res.empty());\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",\n\t\t\t\"IIIIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring 
scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -37);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t\t<< \", gaps disallowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",\n\t\t\t\"IIIIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\t// Read gap with one read gap and two ref gaps allowed\n\t\tsc.rfGapConst = 20;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 10;\n\t\tsc.rdGapLinear = 15;\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 read gap, 2 ref gaps allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",\n\t\t\t\"IIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,               
 // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", gaps disallowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",\n\t\t\t\"IIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\n\t\t// Ref gap with one ref gap and two read gaps allowed\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 11;  // if this were 10, we'd have ties\n\t\tsc.rfGapLinear = 15;\n\t\tsc.rdGapLinear = 10;\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 ref gap, 2 read gaps allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",\n\t\t\t\"IIIIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter 
Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4) << \", gaps disallowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",\n\t\t\t\"IIIIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tres.reset();\n\t\tassert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\t// Read gap with two read gaps and two ref gaps allowed\n\t\tsc.rfGapConst = 15;\n\t\tsc.rdGapConst = 15;\n\t\tsc.rfGapLinear = 10;\n\t\tsc.rdGapLinear = 10;\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 2 ref gaps, 2 read gaps allowed by minsc)...\";\n\t\tdoTestCase3(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",\n\t\t\t\"IIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-40.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t1.0,                // const coeff for N ceiling\n\t\t\t0.0,                // linear coeff for N ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\ttrue);              // filter 
Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tif(!res.empty()) {\n\t\t\t//al.printResultStacked(res, cerr); cerr << endl;\n\t\t}\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -25);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\t// The following alignment is possible when i == 2:\n\t\t//   ACGTACGTACGTACGTN\n\t\t// A             x\n\t\t// C              x\n\t\t// G               x\n\t\t// T                x\n\t\t// C                x\n\t\t// G                x\n\t\t// T                 x\n\t\tassert(i == 2 || res.empty());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tsc.rfGapConst = 10;\n\t\tsc.rdGapConst = 10;\n\t\tsc.rfGapLinear = 10;\n\t\tsc.rdGapLinear = 10;\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 ref gap, 1 read gap allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTCGT\",\n\t\t\t\"IIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -20);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\t// Ref gap with two ref gaps and zero read gaps allowed\n\t\tsc.rfGapConst = 15;\n\t\tsc.rdGapConst = 15;\n\t\tsc.rfGapLinear = 5;\n\t\tsc.rdGapLinear = 5;\n\t\tcerr << \"  Test 
\" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 2 ref gaps, 2 read gaps allowed by minsc)...\";\n\t\t// Careful: it might be possible for the read to align with overhang\n\t\t// instead of with a gap\n\t\tdoTestCase3(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",\n\t\t\t\"IIIIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-35.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t1.0f,               // needed to avoid overhang alignments\n\t\t\t0.0f,               // needed to avoid overhang alignments\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\ttrue);              // filter Ns\n\t\tif(i == 0) {\n\t\t\tlo = 0; hi = 1;\n\t\t} else if(i == 1) {\n\t\t\tlo = 1; hi = 2;\n\t\t} else {\n\t\t\tlo = 2; hi = 3;\n\t\t}\n\t\tfor(size_t j = lo; j < hi; j++) {\n\t\t\tal.nextAlignment(res, rnd);\n\t\t\tassert(!res.empty());\n\t\t\t//al.printResultStacked(res, cerr); cerr << endl;\n\t\t\tassert(res.alres.refoff() == 0 ||\n\t\t\t       res.alres.refoff() == 4 ||\n\t\t\t\t   res.alres.refoff() == 8);\n\t\t\tassert_eq(8, res.alres.refExtent());\n\t\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\t\tassert_eq(res.alres.score().score(), -20);\n\t\t\tassert_eq(res.alres.score().ns(), 0);\n\t\t\tres.reset();\n\t\t}\n\t\tal.nextAlignment(res, rnd);\n\t\t//assert(res.empty());\n\t\t//res.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tsc.rfGapConst = 25;\n\t\tsc.rdGapConst = 25;\n\t\tsc.rfGapLinear = 4;\n\t\tsc.rdGapLinear = 4;\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1 ref gap, 1 read gap allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTAACGT\",\n\t\t\t\"IIIIIIIII\",\n\t\t\t\"ACGTACGTACGTACGT\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for 
cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 1);\n\t\tassert_eq(res.alres.score().score(), -29);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", short read)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"A\",\n\t\t\t\"I\",\n\t\t\t\"AAAAAAAAAAAA\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 0);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tif(i == 0) {\n\t\t\tcerr << \"  Test \" << tests++\n\t\t\t     << \" (nuc space, offset 0, short read & ref)...\";\n\t\t\tdoTestCase2(\n\t\t\t\tal,\n\t\t\t\t\"A\",\n\t\t\t\t\"I\",\n\t\t\t\t\"A\",\n\t\t\t\t0,                  // off\n\t\t\t\tsc,                 // scoring scheme\n\t\t\t\t-30.0f,             // const coeff for cost ceiling\n\t\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t\tres,                // result\n\t\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\t\tnfilter);           // filter Ns\n\t\t\tal.nextAlignment(res, rnd);\n\t\t\tassert(!res.empty());\n\t\t\tassert_eq(res.alres.score().gaps(), 
0);\n\t\t\tassert_eq(res.alres.score().score(), 0);\n\t\t\tassert_eq(res.alres.score().ns(), 0);\n\t\t\tres.reset();\n\t\t\tcerr << \"PASSED\" << endl;\n\t\t}\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", short read, many allowed gaps)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"A\",\n\t\t\t\"I\",\n\t\t\t\"AAAAAAAAAAAA\",\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t-150.0f,            // const coeff for cost ceiling\n\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 0);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tif(i == 0) {\n\t\t\tcerr << \"  Test \" << tests++\n\t\t\t     << \" (nuc space, offset 0, short read & ref, \"\n\t\t\t\t << \"many allowed gaps)...\";\n\t\t\tdoTestCase2(\n\t\t\t\tal,\n\t\t\t\t\"A\",\n\t\t\t\t\"I\",\n\t\t\t\t\"A\",\n\t\t\t\t0,                  // off\n\t\t\t\tsc,                 // scoring scheme\n\t\t\t\t-150.0f,            // const coeff for cost ceiling\n\t\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t\tres,                // result\n\t\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\t\tnfilter);           // filter Ns\n\t\t\tal.nextAlignment(res, rnd);\n\t\t\tassert(!res.empty());\n\t\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\t\tassert_eq(res.alres.score().score(), 0);\n\t\t\tassert_eq(res.alres.score().ns(), 0);\n\t\t\tres.reset();\n\t\t\tcerr << \"PASSED\" << endl;\n\t\t}\n\t}\n\n\t// A test case where a valid alignment with a worse score should be\n\t// accepted over a valid alignment with a better score but too many\n\t// Ns\n\tcerr << \"  Test \" << tests++ << 
\" (N ceiling 1)...\";\n\tsc.mmcostType = COST_MODEL_CONSTANT;\n\tsc.mmcost = 10;\n\tsc.snp = 30;\n\tsc.nCeilConst  = 0.0f;\n\tsc.nCeilLinear = 0.0f;\n\tsc.rfGapConst  = 10;\n\tsc.rdGapLinear = 10;\n\tsc.rfGapConst  = 10;\n\tsc.rfGapLinear = 10;\n\tsc.setNPen(COST_MODEL_CONSTANT, 2);\n\tsc.gapbar = 1;\n\t// No Ns allowed, so this hit should be filtered\n\tdoTestCase2(\n\t\tal,\n\t\t\"ACGTACGT\", // read seq\n\t\t\"IIIIIIII\", // read quals\n\t\t\"NCGTACGT\", // ref seq\n\t\t0,          // offset\n\t\tsc,         // scoring scheme\n\t\t-25.0f,     // const coeff for cost ceiling\n\t\t0.0f,       // linear coeff for cost ceiling\n\t\tres,        // result\n\t\tfalse,      // ns are in inclusive\n\t\ttrue,       // nfilter\n\t\t0);\n\tal.nextAlignment(res, rnd);\n\tassert(res.empty());\n\tcerr << \"PASSED\" << endl;\n\tres.reset();\n\n\t// 1 N allowed, so this hit should stand\n\tcerr << \"  Test \" << tests++ << \" (N ceiling 2)...\";\n\tdoTestCase3(\n\t\tal,\n\t\t\"ACGTACGT\", // read seq\n\t\t\"IIIIIIII\", // read quals\n\t\t\"NCGTACGT\", // ref seq\n\t\t0,          // offset\n\t\tsc,         // scoring scheme\n\t\t-25.0f,     // const coeff for cost ceiling\n\t\t0.0f,       // linear coeff for cost ceiling\n\t\t1.0f,       // constant coefficient for # Ns allowed\n\t\t0.0f,       // linear coefficient for # Ns allowed\n\t\tres,        // result\n\t\tfalse,      // ns are in inclusive\n\t\tfalse,      // nfilter - NOTE: FILTER OFF\n\t\t0);\n\tal.nextAlignment(res, rnd);\n\tassert(!res.empty());\n\tassert_eq(0,  res.alres.score().gaps());\n\tassert_eq(-2, res.alres.score().score());\n\tassert_eq(1,  res.alres.score().ns());\n\tcerr << \"PASSED\" << endl;\n\tres.reset();\n\n\t// 1 N allowed, but we set st_ such that this hit should not stand\n\tfor(size_t i = 0; i < 2; i++) {\n\t\tcerr << \"  Test \" << tests++ << \" (N ceiling 2 with st_ override)...\";\n\t\tEList<bool> en;\n\t\ten.resize(3); en.fill(true);\n\t\tif(i == 1) {\n\t\t\ten[1] = 
false;\n\t\t}\n\t\tsc.rfGapConst  = 10;\n\t\tsc.rdGapLinear = 10;\n\t\tsc.rfGapConst  = 10;\n\t\tsc.rfGapLinear = 10;\n\t\tdoTestCase4(\n\t\t\tal,\n\t\t\t\"ACGTACGT\", // read seq\n\t\t\t\"IIIIIIII\", // read quals\n\t\t\t\"NCGTACGT\", // ref seq\n\t\t\t0,          // offset\n\t\t\ten,         // rectangle columns where solution can end\n\t\t\tsc,         // scoring scheme\n\t\t\t-25.0f,     // const coeff for cost ceiling\n\t\t\t0.0f,       // linear coeff for cost ceiling\n\t\t\t1.0f,       // constant coefficient for # Ns allowed\n\t\t\t0.0f,       // linear coefficient for # Ns allowed\n\t\t\tres,        // result\n\t\t\tfalse,      // ns are in inclusive\n\t\t\tfalse,      // nfilter - NOTE: FILTER OFF\n\t\t\t0);\n\t\tal.nextAlignment(res, rnd);\n\t\tif(i > 0) {\n\t\t\tassert(res.empty());\n\t\t} else {\n\t\t\tassert(!res.empty());\n\t\t}\n\t\tcerr << \"PASSED\" << endl;\n\t\tres.reset();\n\t}\n\n\t// 1 N allowed (nCeilConst = 1), so this hit should stand even with the N filter on\n\tcerr << \"  Test \" << tests++ << \" (N ceiling 3)...\";\n\tsc.nCeilConst = 1.0f;\n\tsc.nCeilLinear = 0.0f;\n\tdoTestCase2(\n\t\tal,\n\t\t\"ACGTACGT\", // read seq\n\t\t\"IIIIIIII\", // read quals\n\t\t\"NCGTACGT\", // ref seq\n\t\t0,          // offset\n\t\tsc,         // scoring scheme\n\t\t-25.0f,     // const coeff for cost ceiling\n\t\t0.0f,       // linear coeff for cost ceiling\n\t\tres,        // result\n\t\tfalse,      // ns are in inclusive\n\t\ttrue,       // nfilter - NOTE: FILTER ON\n\t\t0);\n\tal.nextAlignment(res, rnd);\n\tassert(!res.empty());\n\tassert_eq(0,  res.alres.score().gaps());\n\tassert_eq(-2, res.alres.score().score());\n\tassert_eq(1,  res.alres.score().ns());\n\tcerr << \"PASSED\" << endl;\n\tres.reset();\n\n\t// Exact match at offset 8; exactly one alignment should be reported\n\tcerr << \"  Test \" << tests++ << \" (redundant alignment elimination 1)...\";\n\tsc.nCeilConst = 1.0f;\n\tsc.nCeilLinear = 0.0f;\n\tsc.rfGapConst  = 25;\n\tsc.rdGapLinear = 15;\n\tsc.rfGapConst  = 25;\n\tsc.rfGapLinear = 
15;\n\tdoTestCase2(\n\t\tal,\n\t\t//                   1         2         3         4\n\t\t//         01234567890123456789012345678901234567890123456\n\t\t          \"AGGCTATGCCTCTGACGCGATATCGGCGCCCACTTCAGAGCTAACCG\",\n\t\t          \"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\",\n\t\t  \"TTTTTTTTAGGCTATGCCTCTGACGCGATATCGGCGCCCACTTCAGAGCTAACCGTTTTTTT\",\n\t\t// 01234567890123456789012345678901234567890123456789012345678901\n\t\t//           1         2         3         4         5         6\n\t\t8,          // offset\n\t\tsc,         // scoring scheme\n\t\t-25.0f,     // const coeff for cost ceiling\n\t\t-5.0f,      // linear coeff for cost ceiling\n\t\tres,        // result\n\t\tfalse,      // ns are in inclusive\n\t\ttrue,       // nfilter - NOTE: FILTER ON\n\t\t0);\n\tal.nextAlignment(res, rnd);\n\tassert(!res.empty());\n\tassert_eq(8, res.alres.refoff());\n\tassert_eq(47, res.alres.refExtent());\n\tassert_eq(0, res.alres.score().gaps());\n\tassert_eq(0, res.alres.score().score());\n\tassert_eq(0, res.alres.score().ns());\n\tres.reset();\n\tal.nextAlignment(res, rnd);\n\tassert(res.empty());\n\tassert(al.done());\n\tcerr << \"PASSED\" << endl;\n\tres.reset();\n\t\n}\n\n/**\n * Do a set of unit tests for local alignment.\n */\nstatic void doLocalTests() {\n\tbonusMatchType  = DEFAULT_MATCH_BONUS_TYPE;\n\tbonusMatch      = DEFAULT_MATCH_BONUS_LOCAL;\n\tpenMmcType      = DEFAULT_MM_PENALTY_TYPE;\n\tpenMmc          = DEFAULT_MM_PENALTY;\n\tpenSnp          = DEFAULT_SNP_PENALTY;\n\tpenNType        = DEFAULT_N_PENALTY_TYPE;\n\tpenN            = DEFAULT_N_PENALTY;\n\tnPairCat        = DEFAULT_N_CAT_PAIR;\n\tpenRdExConst    = DEFAULT_READ_GAP_CONST;\n\tpenRfExConst    = DEFAULT_REF_GAP_CONST;\n\tpenRdExLinear   = DEFAULT_READ_GAP_LINEAR;\n\tpenRfExLinear   = DEFAULT_REF_GAP_LINEAR;\n\tcostMinConst    = DEFAULT_MIN_CONST_LOCAL;\n\tcostMinLinear   = DEFAULT_MIN_LINEAR_LOCAL;\n\tcostFloorConst  = DEFAULT_FLOOR_CONST_LOCAL;\n\tcostFloorLinear = 
DEFAULT_FLOOR_LINEAR_LOCAL;\n\tnCeilConst      = 1.0f; // constant factor in N ceil w/r/t read len\n\tnCeilLinear     = 0.1f; // coeff of linear term in N ceil w/r/t read len\n\tmultiseedMms    = DEFAULT_SEEDMMS;\n\tmultiseedLen    = DEFAULT_SEEDLEN;\n\t// Set up penalties\n\tScoring sc(\n\t\t10,\n\t\tpenMmcType,    // how to penalize mismatches\n\t\t30,            // constant if mm penalty is a constant\n\t\tpenSnp,        // penalty for decoded SNP\n\t\tcostMinConst,  // constant factor in min score w/r/t read length\n\t\tcostMinLinear, // coeff of linear term in min score w/r/t read length\n\t\tcostFloorConst,  // constant factor in score floor w/r/t read length\n\t\tcostFloorLinear, // coeff of linear term in score floor w/r/t read length\n\t\tnCeilConst,    // constant factor in N ceiling w/r/t read length\n\t\tnCeilLinear,   // coeff of linear term in N ceiling w/r/t read length\n\t\tpenNType,      // how to penalize Ns in the read\n\t\tpenN,          // constant if N penalty is a constant\n\t\tnPairCat,      // true -> concatenate mates before N filtering\n\t\t25,            // constant coeff for cost of gap in read\n\t\t25,            // constant coeff for cost of gap in ref\n\t\t15,            // linear coeff for cost of gap in read\n\t\t15,            // linear coeff for cost of gap in ref\n\t\t1,             // # rows at top/bot can only be entered diagonally\n\t\t-1,            // min row idx to backtrace from; -1 = no limit\n\t\tfalse          // sort results first by row then by score?\n\t);\n\tSwResult res;\n\t\n\t//\n\t// Basic nucleotide-space tests\n\t//\n\tcerr << \"Running local tests...\" << endl;\n\tint tests = 1;\n\tbool nIncl = false;\n\tbool nfilter = false;\n\n\tSwAligner al;\n\tRandomSource rnd(73);\n\tfor(int i = 0; i < 3; i++) {\n\t\tcerr << \"  Test \" << tests++ << \" (short nuc space, offset \"\n\t\t     << (i*4) << \", exact)...\";\n\t\tsc.rdGapConst = 40;\n\t\tsc.rfGapConst = 40;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGT\",           
  // read\n\t\t\t\"IIII\",             // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t8.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(4, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\t//     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5\n\t\t//     A C G T A C G T A C G T A C G T\n\t\t// 0 C\n\t\t// 1 C   x\n\t\t// 2 G     x\n\t\t// 3 T       x\n\n\t\tcerr << \"  Test \" << tests++ << \" (short nuc space, offset \"\n\t\t     << (i*4) << \", 1mm)...\";\n\t\tsc.rdGapConst = 40;\n\t\tsc.rfGapConst = 40;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"CCGT\",             // read\n\t\t\t\"IIII\",             // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t7.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4+1, res.alres.refoff());\n\t\tassert_eq(3, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 
0);\n\t\tassert_eq(res.alres.score().score(), 30);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (short nuc space, offset \"\n\t\t     << (i*4) << \", 1mm)...\";\n\t\tsc.rdGapConst = 40;\n\t\tsc.rfGapConst = 40;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGA\",             // read\n\t\t\t\"IIII\",             // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t7.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(3, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 30);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tif(i == 0) {\n\t\t\tcerr << \"  Test \" << tests++ << \" (short nuc space, offset \"\n\t\t\t\t << (i*4) << \", 1mm, match bonus=20)...\";\n\t\t\tsc.rdGapConst = 40;\n\t\t\tsc.rfGapConst = 40;\n\t\t\tsc.setMatchBonus(20);\n\t\t\tdoTestCase2(\n\t\t\t\tal,\n\t\t\t\t\"TTGT\",             // read\n\t\t\t\t\"IIII\",             // qual\n\t\t\t\t\"TTGA\",             // ref in\n\t\t\t\ti*4,                // off\n\t\t\t\tsc,                 // scoring scheme\n\t\t\t\t25.0f,               // const coeff 
for cost ceiling\n\t\t\t\t0.0f,               // linear coeff for cost ceiling\n\t\t\t\tres,                // result\n\t\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\t\tnfilter);           // filter Ns\n\t\t\tassert(!al.done());\n\t\t\tal.nextAlignment(res, rnd);\n\t\t\tassert(!res.empty());\n\t\t\tassert_eq(i*4, res.alres.refoff());\n\t\t\tassert_eq(3, res.alres.refExtent());\n\t\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\t\tassert_eq(res.alres.score().score(), 60);\n\t\t\tassert_eq(res.alres.score().ns(), 0);\n\t\t\tassert(res.alres.ned().empty());\n\t\t\tassert(res.alres.aed().empty());\n\t\t\tres.reset();\n\t\t\tal.nextAlignment(res, rnd);\n\t\t\tassert(res.empty());\n\t\t\tassert(al.done());\n\t\t\tres.reset();\n\t\t\tsc.setMatchBonus(10);\n\t\t\tcerr << \"PASSED\" << endl;\n\t\t}\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \"\n\t\t     << (i*4) << \", exact)...\";\n\t\tsc.rdGapConst = 40;\n\t\tsc.rfGapConst = 40;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTACGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t8.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 80);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(res.empty());\n\t\tassert(al.done());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t\t\n\t\tcerr 
<< \"  Test \" << tests++ << \" (long nuc space, offset \"\n\t\t     << (i*8) << \", exact)...\";\n\t\tsc.rdGapConst = 40;\n\t\tsc.rfGapConst = 40;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTACGTACGTACGTACGTA\", // read\n\t\t\t\"IIIIIIIIIIIIIIIIIIIII\",  // qual\n\t\t\t\"ACGTACGTACGTACGTACGTACGTACGTACGTACGTA\", // ref in\n\t\t//   ACGTACGTACGTACGTACGT\n\t\t//           ACGTACGTACGTACGTACGT\n\t\t//                   ACGTACGTACGTACGTACGT\n\t\t\ti*8,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t8.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*8, res.alres.refoff());\n\t\tassert_eq(21, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 210);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\t//assert(res.empty());\n\t\t//assert(al.done());\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (nuc space, offset \" << (i*4)\n\t\t     << \", 1mm allowed by minsc)...\";\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTTCGT\",         // read\n\t\t\t\"IIIIIIII\",         // qual\n\t\t\t\"ACGTACGTACGTACGT\", // ref in\n\t\t\ti*4,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t5.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, 
rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*4, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 40);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\t//assert(res.empty());\n\t\t//assert(al.done());\n\t\tcerr << \"PASSED\" << endl;\n\n\t\tcerr << \"  Test \" << tests++ << \" (long nuc space, offset \"\n\t\t     << (i*8) << \", 6mm allowed by minsc)...\";\n\t\tsc.rdGapConst = 50;\n\t\tsc.rfGapConst = 50;\n\t\tsc.rdGapLinear = 45;\n\t\tsc.rfGapLinear = 45;\n\t\tdoTestCase2(\n\t\t\tal,\n\t\t\t\"ACGTACGATGCATCGTACGTA\", // read\n\t\t\t\"IIIIIIIIIIIIIIIIIIIII\",  // qual\n\t\t\t\"ACGTACGTACGTACGTACGTACGTACGTACGTACGTA\", // ref in\n\t\t//   ACGTACGTACGTACGTACGT\n\t\t//           ACGTACGTACGTACGTACGT\n\t\t//                   ACGTACGTACGTACGTACGT\n\t\t\ti*8,                // off\n\t\t\tsc,                 // scoring scheme\n\t\t\t0.0f,               // const coeff for cost ceiling\n\t\t\t1.0f,               // linear coeff for cost ceiling\n\t\t\tres,                // result\n\t\t\tnIncl,              // Ns inclusive (not mismatches)\n\t\t\tnfilter);           // filter Ns\n\t\tassert(!al.done());\n\t\tal.nextAlignment(res, rnd);\n\t\tassert(!res.empty());\n\t\tassert_eq(i*8 + 13, res.alres.refoff());\n\t\tassert_eq(8, res.alres.refExtent());\n\t\tassert_eq(res.alres.score().gaps(), 0);\n\t\tassert_eq(res.alres.score().score(), 80);\n\t\tassert_eq(res.alres.score().ns(), 0);\n\t\tassert(res.alres.ned().empty());\n\t\tassert(res.alres.aed().empty());\n\t\tres.reset();\n\t\tal.nextAlignment(res, rnd);\n\t\tres.reset();\n\t\tcerr << \"PASSED\" << endl;\n\t}\n}\n\nint main(int argc, char **argv) {\n\tint option_index = 0;\n\tint next_option;\n\tunsigned seed = 0;\n\tgGapBarrier = 1;\n\tgSnpPhred = 30;\n\tbool nsInclusive = false;\n\tbool nfilter = false;\n\tbonusMatchType  = DEFAULT_MATCH_BONUS_TYPE;\n\tbonusMatch      
= DEFAULT_MATCH_BONUS;\n\tpenMmcType      = DEFAULT_MM_PENALTY_TYPE;\n\tpenMmc          = DEFAULT_MM_PENALTY;\n\tpenSnp          = DEFAULT_SNP_PENALTY;\n\tpenNType        = DEFAULT_N_PENALTY_TYPE;\n\tpenN            = DEFAULT_N_PENALTY;\n\tpenRdExConst    = DEFAULT_READ_GAP_CONST;\n\tpenRfExConst    = DEFAULT_REF_GAP_CONST;\n\tpenRdExLinear   = DEFAULT_READ_GAP_LINEAR;\n\tpenRfExLinear   = DEFAULT_REF_GAP_LINEAR;\n\tcostMinConst    = DEFAULT_MIN_CONST;\n\tcostMinLinear   = DEFAULT_MIN_LINEAR;\n\tcostFloorConst  = DEFAULT_FLOOR_CONST;\n\tcostFloorLinear = DEFAULT_FLOOR_LINEAR;\n\tnCeilConst      = 1.0f; // constant factor in N ceiling w/r/t read length\n\tnCeilLinear     = 1.0f; // coeff of linear term in N ceiling w/r/t read length\n\tnCatPair        = false;\n\tmultiseedMms    = DEFAULT_SEEDMMS;\n\tmultiseedLen    = DEFAULT_SEEDLEN;\n\tmultiseedIvalType = DEFAULT_IVAL;\n\tmultiseedIvalA    = DEFAULT_IVAL_A;\n\tmultiseedIvalB    = DEFAULT_IVAL_B;\n\tmhits           = 1;\n\tdo {\n\t\tnext_option = getopt_long(argc, argv, short_opts, long_opts, &option_index);\n\t\tswitch (next_option) {\n\t\t\tcase 's': gSnpPhred  = parse<int>(optarg); break;\n\t\t\tcase 'r': seed       = parse<unsigned>(optarg); break;\n\t\t\tcase ARG_TESTS: {\n\t\t\t\tdoTests();\n\t\t\t\tcout << \"PASSED end-to-ends\" << endl;\n\t\t\t\tdoLocalTests();\n\t\t\t\tcout << \"PASSED locals\" << endl;\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tcase 'A': {\n\t\t\t\tbool localAlign = false;\n\t\t\t\tbool noisyHpolymer = false;\n\t\t\t\tbool ignoreQuals = 
false;\n\t\t\t\tSeedAlignmentPolicy::parseString(\n\t\t\t\t\toptarg,\n\t\t\t\t\tlocalAlign,\n\t\t\t\t\tnoisyHpolymer,\n\t\t\t\t\tignoreQuals,\n\t\t\t\t\tbonusMatchType,\n\t\t\t\t\tbonusMatch,\n\t\t\t\t\tpenMmcType,\n\t\t\t\t\tpenMmc,\n\t\t\t\t\tpenNType,\n\t\t\t\t\tpenN,\n\t\t\t\t\tpenRdExConst,\n\t\t\t\t\tpenRfExConst,\n\t\t\t\t\tpenRdExLinear,\n\t\t\t\t\tpenRfExLinear,\n\t\t\t\t\tcostMinConst,\n\t\t\t\t\tcostMinLinear,\n\t\t\t\t\tcostFloorConst,\n\t\t\t\t\tcostFloorLinear,\n\t\t\t\t\tnCeilConst,\n\t\t\t\t\tnCeilLinear,\n\t\t\t\t\tnCatPair,\n\t\t\t\t\tmultiseedMms,\n\t\t\t\t\tmultiseedLen,\n\t\t\t\t\tmultiseedIvalType,\n\t\t\t\t\tmultiseedIvalA,\n\t\t\t\t\tmultiseedIvalB,\n\t\t\t\t\tposmin);\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase -1: break;\n\t\t\tdefault: {\n\t\t\t\tcerr << \"Unknown option: \" << (char)next_option << endl;\n\t\t\t\tprintUsage(cerr);\n\t\t\t\texit(1);\n\t\t\t}\n\t\t}\n\t} while(next_option != -1);\n\tsrand(seed);\n\tif(argc - optind < 4) {\n\t\tcerr << \"Need at least 4 arguments\" << endl;\n\t\tprintUsage(cerr);\n\t\texit(1);\n\t}\n\tBTDnaString read;\n\tBTString ref, qual;\n\t// Get read\n\tread.installChars(argv[optind]);\n\t// Get qualities\n\tqual.install(argv[optind+1]);\n\tassert_eq(read.length(), qual.length());\n\t// Get reference\n\tref.install(argv[optind+2]);\n\t// Get reference offset\n\tsize_t off = parse<size_t>(argv[optind+3]);\n\t// Set up penalties\n\tScoring sc(\n\t\tfalse,         // local alignment?\n\t\tfalse,         // bad homopolymer?\n\t\tbonusMatchType,\n\t\tbonusMatch,\n\t\tpenMmcType,    // how to penalize mismatches\n\t\tpenMmc,        // constant if mm penalty is a constant\n\t\tcostMinConst,\n\t\tcostMinLinear,\n\t\tcostFloorConst,\n\t\tcostFloorLinear,\n\t\tnCeilConst,    // N ceiling constant coefficient\n\t\tnCeilLinear,   // N ceiling linear coefficient\n\t\tpenNType,      // how to penalize Ns in the read\n\t\tpenN,          // constant if N penalty is a constant\n\t\tnCatPair,      // true -> concatenate mates 
before N filtering\n\t\tpenRdExConst,  // constant cost of extending a gap in the read\n\t\tpenRfExConst,  // constant cost of extending a gap in the reference\n\t\tpenRdExLinear, // coeff of linear term for cost of gap extension in read\n\t\tpenRfExLinear  // coeff of linear term for cost of gap extension in ref\n\t);\n\t// Calculate the penalty ceiling for the read\n\tTAlScore minsc = Scoring::linearFunc(\n\t\tread.length(),\n\t\tcostMinConst,\n\t\tcostMinLinear);\n\tTAlScore floorsc = Scoring::linearFunc(\n\t\tread.length(),\n\t\tcostFloorConst,\n\t\tcostFloorLinear);\n\tSwResult res;\n\tSwAligner al;\n\tdoTestCase(\n\t\tal,\n\t\tread,\n\t\tqual,\n\t\tref,\n\t\toff,\n\t\tNULL,\n\t\tsc,  \n\t\tminsc,\n\t\tres,\n\t\tnsInclusive,\n\t\tnfilter,\n\t\tseed);\n}\n#endif /*MAIN_ALIGNER_SW*/\n"
  },
  {
    "path": "aligner_sw.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/*\n * aligner_sw.h\n *\n * Classes and routines for solving dynamic programming problems in aid of read\n * alignment.  Goals include the ability to handle:\n *\n * - Both read alignment, where the query must align end-to-end, and local\n *   alignment, where we seek a high-scoring alignment that need not involve\n *   the entire query.\n * - Situations where: (a) we've found a seed hit and are trying to extend it\n *   into a larger hit, (b) we've found an alignment for one mate of a pair and\n *   are trying to find a nearby alignment for the other mate, (c) we're\n *   aligning against an entire reference sequence.\n * - Caller-specified indicators for what columns of the dynamic programming\n *   matrix we are allowed to start in or end in.\n *\n * TODO:\n *\n * - A slicker way to filter out alignments that violate a ceiling placed on\n *   the number of Ns permitted in the reference portion of the alignment.\n *   Right now we accomplish this by masking out ending columns that correspond\n *   to *ungapped* alignments with too many Ns.  This results in false\n *   positives and false negatives for gapped alignments.  
The margin of error\n *   (# of Ns by which we might miscount) is bounded by the number of gaps.\n */\n\n/**\n *  |-maxgaps-|\n *  ***********oooooooooooooooooooooo    -\n *   ***********ooooooooooooooooooooo    |\n *    ***********oooooooooooooooooooo    |\n *     ***********ooooooooooooooooooo    |\n *      ***********oooooooooooooooooo    |\n *       ***********ooooooooooooooooo read len\n *        ***********oooooooooooooooo    |\n *         ***********ooooooooooooooo    |\n *          ***********oooooooooooooo    |\n *           ***********ooooooooooooo    |\n *            ***********oooooooooooo    -\n *            |-maxgaps-|\n *  |-readlen-|\n *  |-------skip--------|\n */\n\n#ifndef ALIGNER_SW_H_\n#define ALIGNER_SW_H_\n\n#define INLINE_CUPS\n\n#include <stdint.h>\n#include <iostream>\n#include <limits>\n#include \"threading.h\"\n#include <emmintrin.h>\n#include \"aligner_sw_common.h\"\n#include \"aligner_sw_nuc.h\"\n#include \"ds.h\"\n#include \"aligner_seed.h\"\n#include \"reference.h\"\n#include \"random_source.h\"\n#include \"mem_ids.h\"\n#include \"aligner_result.h\"\n#include \"mask.h\"\n#include \"dp_framer.h\"\n#include \"aligner_swsse.h\"\n#include \"aligner_bt.h\"\n\n#define QUAL2(d, f) sc_->mm((int)(*rd_)[rdi_ + d], \\\n\t\t\t\t\t\t\t(int)  rf_ [rfi_ + f], \\\n\t\t\t\t\t\t\t(int)(*qu_)[rdi_ + d] - 33)\n#define QUAL(d)     sc_->mm((int)(*rd_)[rdi_ + d], \\\n\t\t\t\t\t\t\t(int)(*qu_)[rdi_ + d] - 33)\n#define N_SNP_PEN(c) (((int)rf_[rfi_ + c] > 15) ? sc_->n(30) : sc_->penSnp)\n\n/**\n * SwAligner\n * =========\n *\n * Encapsulates facilities for alignment using dynamic programming.  Handles\n * alignment of nucleotide reads against known reference nucleotides.\n *\n * The class is stateful.  First the user must call initRead() and initRef() to\n * initialize the object with details regarding the dynamic programming problem\n * to be solved.  Next, the user calls align() to fill the dynamic programming\n * matrix and calculate summaries describing the solutions.  
Finally the user calls \n * nextAlignment(...), perhaps repeatedly, to populate the SwResult object with\n * the next result.  Results are dispensed in best-to-worst, left-to-right\n * order.\n *\n * The class expects the read string, quality string, and reference string\n * provided by the caller to live at least until the user is finished aligning\n * and obtaining alignments from this object.\n *\n * There is a design tradeoff between hiding/exposing details of the genome and\n * its strands to the SwAligner.  In a sense, a better design is to hide\n * details such as the id of the reference sequence aligned to, or whether\n * we're aligning the read in its original forward orientation or its reverse\n * complement.  But this means that any alignment results returned by SwAligner\n * have to be extended to include those details before they're useful to the\n * caller.  We opt for messy but expedient - the reference id and orientation\n * of the read are given to SwAligner, remembered, and used to populate\n * SwResults.\n *\n * LOCAL VS GLOBAL\n *\n * The dynamic programming aligner supports both local and global alignment,\n * and one option in between.  To implement these modes, the aligner can (a)\n * allow negative scores (i.e. not necessarily clamp them up to 0), (b) check\n * in rows other than the last row for acceptable solutions, and (c)\n * optionally add a bonus to the score for matches.\n * \n * For global alignment, we:\n *\n * (a) Allow negative scores\n * (b) Check only in the last row\n * (c) Either add a bonus for matches or not (doesn't matter)\n *\n * For local alignment, we:\n *\n * (a) Clamp scores to 0\n * (b) Check in any row for a sufficiently high score\n * (c) Add a bonus for matches\n *\n * An in-between solution is to allow alignments to be curtailed on the\n * right-hand side if a better score can be achieved thereby, but not on the\n * left.  
For this, we:\n *\n * (a) Allow negative scores\n * (b) Check in any row for a sufficiently high score\n * (c) Either add a bonus for matches or not (doesn't matter)\n *\n * REDUNDANT ALIGNMENTS\n *\n * When are two alignments distinct and when are they redundant (not distinct)?\n * At one extreme, we might say the best alignment from any given dynamic\n * programming problem is redundant with all other alignments from that\n * problem.  At the other extreme, we might say that any two alignments with\n * distinct starting points and edits are distinct.  The former is probably too\n * conservative for mate-finding DP problems.  The latter is certainly too\n * permissive, since two alignments that differ only in how gaps are arranged\n * should not be considered distinct.\n *\n * Some in-between solutions are:\n *\n * (a) If two alignments share an end point on either end, they are redundant.\n *     Otherwise, they are distinct.\n * (b) If two alignments share *both* end points, they are redundant.\n * (c) If two alignments share any cells in the DP table, they are redundant.\n * (d) Two alignments are redundant if either of their ends are within N\n *     positions of each other\n * (e) Like (d) but both instead of either\n * (f, g) Like (d), (e), but where N is tied to maxgaps somehow\n *\n * Why not (a)?  One reason is that it's possible for two alignments to have\n * different start & end positions but share many cells.  Consider alignments 1\n * and 2 below; their end-points are labeled.\n *\n *  1 2\n *  \\ \\\n *    -\\\n *      \\\n *       \\\n *        \\\n *        -\\\n *        \\ \\\n *        1 2\n *\n * 1 and 2 are distinct according to (a) but they share many cells in common.\n *\n * Why not (f, g)?  
It fixes the problem with (a) above by forcing the\n * alignments to be spread so far that they can't possibly share diagonal cells\n * in common\n */\nclass SwAligner {\n\n\ttypedef std::pair<size_t, size_t> SizeTPair;\n\n\t// States that the aligner can be in\n\tenum {\n\t\tSTATE_UNINIT,  // init() hasn't been called yet\n\t\tSTATE_INITED,  // init() has been called, but not align()\n\t\tSTATE_ALIGNED, // align() has been called\n\t};\n\t\n\tconst static size_t ALPHA_SIZE = 5;\n\npublic:\n\n\texplicit SwAligner() :\n\t\tsseU8fw_(DP_CAT),\n\t\tsseU8rc_(DP_CAT),\n\t\tsseI16fw_(DP_CAT),\n\t\tsseI16rc_(DP_CAT),\n\t\tstate_(STATE_UNINIT),\n\t\tinitedRead_(false),\n\t\treadSse16_(false),\n\t\tinitedRef_(false),\n\t\trfwbuf_(DP_CAT),\n\t\tbtnstack_(DP_CAT),\n\t\tbtcells_(DP_CAT),\n\t\tbtdiag_(),\n\t\tbtncand_(DP_CAT),\n\t\tbtncanddone_(DP_CAT),\n\t\tbtncanddoneSucc_(0),\n\t\tbtncanddoneFail_(0),\n\t\tcper_(),\n\t\tcperMinlen_(),\n\t\tcperPerPow2_(),\n\t\tcperEf_(),\n\t\tcperTri_(),\n\t\tcolstop_(0),\n\t\tlastsolcol_(0),\n\t\tcural_(0)\n\t\tASSERT_ONLY(, cand_tmp_(DP_CAT))\n\t{ }\n\n\t/**\n\t * Prepare the dynamic programming driver with a new read and a new scoring\n\t * scheme.\n\t */\n\tvoid initRead(\n\t\tconst BTDnaString& rdfw, // read sequence for fw read\n\t\tconst BTDnaString& rdrc, // read sequence for rc read\n\t\tconst BTString& qufw,    // read qualities for fw read\n\t\tconst BTString& qurc,    // read qualities for rc read\n\t\tsize_t rdi,              // offset of first read char to align\n\t\tsize_t rdf,              // offset of last read char to align\n\t\tconst Scoring& sc);      // scoring scheme\n\t\n\t/**\n\t * Initialize with a new alignment problem.\n\t */\n\tvoid initRef(\n\t\tbool fw,               // whether to forward or revcomp read is aligning\n\t\tTRefId refidx,         // id of reference aligned against\n\t\tconst DPRect& rect,    // DP rectangle\n\t\tchar *rf,              // reference sequence\n\t\tsize_t rfi,            // offset of 
first reference char to align to\n\t\tsize_t rff,            // offset of last reference char to align to\n\t\tTRefOff reflen,        // length of reference sequence\n\t\tconst Scoring& sc,     // scoring scheme\n\t\tTAlScore minsc,        // minimum score\n\t\tbool enable8,          // use 8-bit SSE if possible?\n\t\tsize_t cminlen,        // minimum length for using checkpointing scheme\n\t\tsize_t cpow2,          // interval b/t checkpointed diags; 1 << this\n\t\tbool doTri,            // triangular mini-fills?\n\t\tbool extend);          // true iff this is a seed extension\n\n\t/**\n\t * Given a read, an alignment orientation, a range of characters in a\n\t * reference sequence, and a bit-encoded version of the reference,\n\t * execute the corresponding dynamic programming problem.\n\t *\n\t * Here we expect that the caller has already narrowed down the relevant\n\t * portion of the reference (e.g. using a seed hit) and all we do is\n\t * banded dynamic programming in the vicinity of that portion.  
This is not\n\t * the function to call if we are trying to solve the whole alignment\n\t * problem with dynamic programming (that is TODO).\n\t */\n\tvoid initRef(\n\t\tbool fw,               // whether the forward or revcomp read is aligned\n\t\tTRefId refidx,         // reference aligned against\n\t\tconst DPRect& rect,    // DP rectangle\n\t\tconst BitPairReference& refs, // Reference strings\n\t\tTRefOff reflen,        // length of reference sequence\n\t\tconst Scoring& sc,     // scoring scheme\n\t\tTAlScore minsc,        // minimum alignment score\n\t\tbool enable8,          // use 8-bit SSE if possible?\n\t\tsize_t cminlen,        // minimum length for using checkpointing scheme\n\t\tsize_t cpow2,          // interval b/t checkpointed diags; 1 << this\n\t\tbool doTri,            // triangular mini-fills?\n\t\tbool extend,           // true iff this is a seed extension\n\t\tsize_t  upto,          // count the number of Ns up to this offset\n\t\tsize_t& nsUpto);       // output: the number of Ns up to 'upto'\n\n\t/**\n\t * Given a read, an alignment orientation, a range of characters in a\n\t * reference sequence, and a bit-encoded version of the reference, set up\n\t * and execute the corresponding ungapped alignment problem.  
There can\n\t * only be one solution.\n\t *\n\t * The caller has already narrowed down the relevant portion of the\n\t * reference using, e.g., the location of a seed hit, or the range of\n\t * possible fragment lengths if we're searching for the opposite mate in a\n\t * pair.\n\t */\n\tint ungappedAlign(\n\t\tconst BTDnaString&      rd,     // read sequence (could be RC)\n\t\tconst BTString&         qu,     // qual sequence (could be rev)\n\t\tconst Coord&            coord,  // coordinate aligned to\n\t\tconst BitPairReference& refs,   // Reference strings\n\t\tsize_t                  reflen, // length of reference sequence\n\t\tconst Scoring&          sc,     // scoring scheme\n\t\tbool                    ohang,  // allow overhang?\n\t\tTAlScore                minsc,  // minimum score\n\t\tSwResult&               res);   // put alignment result here\n\n\t/**\n\t * Align read 'rd' to reference using read & reference information given\n\t * last time init() was called.  Uses dynamic programming.\n\t */\n\tbool align(RandomSource& rnd, TAlScore& best);\n\t\n\t/**\n\t * Populate the given SwResult with information about the \"next best\"\n\t * alignment if there is one.  If there isn't one, false is returned.  
Note\n\t * that false might be returned even though a call to done() would have\n\t * returned false.\n\t */\n\tbool nextAlignment(\n\t\tSwResult& res,\n\t\tTAlScore minsc,\n\t\tRandomSource& rnd);\n\t\n\t/**\n\t * Print out an alignment result as an ASCII DP table.\n\t */\n\tvoid printResultStacked(\n\t\tconst SwResult& res,\n\t\tstd::ostream& os)\n\t{\n\t\t// res.alres.printStacked(*rd_, os);\n\t}\n\t\n\t/**\n\t * Return true iff there are no more solution cells to backtrace from.\n\t * Note that this may return false in situations where there are actually\n\t * no more solutions, but that hasn't been discovered yet.\n\t */\n\tbool done() const {\n\t\tassert(initedRead() && initedRef());\n\t\treturn cural_ == btncand_.size();\n\t}\n\n\t/**\n\t * Return true iff this SwAligner has been initialized with a reference to\n\t * align against.\n\t */\n\tinline bool initedRef() const { return initedRef_; }\n\n\t/**\n\t * Return true iff this SwAligner has been initialized with a read to align.\n\t */\n\tinline bool initedRead() const { return initedRead_; }\n\t\n\t/**\n\t * Reset, signaling that we're done with this dynamic programming problem\n\t * and won't be asking for any more alignments.\n\t */\n\tinline void reset() { initedRef_ = initedRead_ = false; }\n\n#ifndef NDEBUG\n\t/**\n\t * Check that aligner is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert_gt(dpRows(), 0);\n\t\t// Check btncand_\n\t\tfor(size_t i = 0; i < btncand_.size(); i++) {\n\t\t\tassert(btncand_[i].repOk());\n\t\t\tassert_geq(btncand_[i].score, minsc_);\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Return the number of alignments given out so far by nextAlignment().\n\t */\n\tsize_t numAlignmentsReported() const { return cural_; }\n\n\t/**\n\t * Merge tallies in the counters related to filling the DP table.\n\t */\n\tvoid merge(\n\t\tSSEMetrics& sseU8ExtendMet,\n\t\tSSEMetrics& sseU8MateMet,\n\t\tSSEMetrics& sseI16ExtendMet,\n\t\tSSEMetrics& 
sseI16MateMet,\n\t\tuint64_t&   nbtfiltst,\n\t\tuint64_t&   nbtfiltsc,\n\t\tuint64_t&   nbtfiltdo)\n\t{\n\t\tsseU8ExtendMet.merge(sseU8ExtendMet_);\n\t\tsseU8MateMet.merge(sseU8MateMet_);\n\t\tsseI16ExtendMet.merge(sseI16ExtendMet_);\n\t\tsseI16MateMet.merge(sseI16MateMet_);\n\t\tnbtfiltst += nbtfiltst_;\n\t\tnbtfiltsc += nbtfiltsc_;\n\t\tnbtfiltdo += nbtfiltdo_;\n\t}\n\t\n\t/**\n\t * Reset all the counters related to filling in the DP table to 0.\n\t */\n\tvoid resetCounters() {\n\t\tsseU8ExtendMet_.reset();\n\t\tsseU8MateMet_.reset();\n\t\tsseI16ExtendMet_.reset();\n\t\tsseI16MateMet_.reset();\n\t\tnbtfiltst_ = nbtfiltsc_ = nbtfiltdo_ = 0;\n\t}\n\t\n\t/**\n\t * Return the size of the DP problem.\n\t */\n\tsize_t size() const {\n\t\treturn dpRows() * (rff_ - rfi_);\n\t}\n\nprotected:\n\t\n\t/**\n\t * Return the number of rows that will be in the dynamic programming table.\n\t */\n\tinline size_t dpRows() const {\n\t\tassert(initedRead_);\n\t\treturn rdf_ - rdi_;\n\t}\n\n\t/**\n\t * Align nucleotides from read 'rd' to the reference string 'rf' using\n\t * vector instructions.  Return the score of the best alignment found, or\n\t * the minimum integer if an alignment could not be found.  Flag is set to\n\t * 0 if an alignment is found, -1 if no valid alignment is found, or -2 if\n\t * the score saturated at any point during alignment.\n\t */\n\tTAlScore alignNucleotidesEnd2EndSseU8(  // unsigned 8-bit elements\n\t\tint& flag, bool debug);\n\tTAlScore alignNucleotidesLocalSseU8(    // unsigned 8-bit elements\n\t\tint& flag, bool debug);\n\tTAlScore alignNucleotidesEnd2EndSseI16( // signed 16-bit elements\n\t\tint& flag, bool debug);\n\tTAlScore alignNucleotidesLocalSseI16(   // signed 16-bit elements\n\t\tint& flag, bool debug);\n\t\n\t/**\n\t * Aligns by filling a dynamic programming matrix with the SSE-accelerated,\n\t * banded DP approach of Farrar.  
As it goes, it determines which cells we\n\t * might backtrace from and tallies the best (highest-scoring) N backtrace\n\t * candidate cells per diagonal.  Also returns the alignment score of the best\n\t * alignment in the matrix.\n\t *\n\t * This routine does *not* maintain a matrix holding the entire matrix worth of\n\t * scores, nor does it maintain any other dense O(mn) data structure, as this\n\t * would quickly exhaust memory for queries longer than about 10,000 kb.\n\t * Instead, in the fill stage it maintains two columns worth of scores at a\n\t * time (current/previous, or right/left) - these take O(m) space.  When\n\t * finished with the current column, it determines which cells from the\n\t * previous column, if any, are candidates we might backtrace from to find a\n\t * full alignment.  A candidate cell has a score that rises above the threshold\n\t * and isn't improved upon by a match in the next column.  The best N\n\t * candidates per diagonal are stored in a O(m + n) data structure.\n\t */\n\tTAlScore alignGatherEE8(                // unsigned 8-bit elements\n\t\tint& flag, bool debug);\n\tTAlScore alignGatherLoc8(               // unsigned 8-bit elements\n\t\tint& flag, bool debug);\n\tTAlScore alignGatherEE16(               // signed 16-bit elements\n\t\tint& flag, bool debug);\n\tTAlScore alignGatherLoc16(              // signed 16-bit elements\n\t\tint& flag, bool debug);\n\t\n\t/**\n\t * Build query profile look up tables for the read.  The query profile look\n\t * up table is organized as a 1D array indexed by [i][j] where i is the\n\t * reference character in the current DP column (0=A, 1=C, etc), and j is\n\t * the segment of the query we're currently working on.\n\t */\n\tvoid buildQueryProfileEnd2EndSseU8(bool fw);\n\tvoid buildQueryProfileLocalSseU8(bool fw);\n\n\t/**\n\t * Build query profile look up tables for the read.  
The query profile look\n\t * up table is organized as a 1D array indexed by [i][j] where i is the\n\t * reference character in the current DP column (0=A, 1=C, etc), and j is\n\t * the segment of the query we're currently working on.\n\t */\n\tvoid buildQueryProfileEnd2EndSseI16(bool fw);\n\tvoid buildQueryProfileLocalSseI16(bool fw);\n\t\n\tbool gatherCellsNucleotidesLocalSseU8(TAlScore best);\n\tbool gatherCellsNucleotidesEnd2EndSseU8(TAlScore best);\n\n\tbool gatherCellsNucleotidesLocalSseI16(TAlScore best);\n\tbool gatherCellsNucleotidesEnd2EndSseI16(TAlScore best);\n\n\tbool backtraceNucleotidesLocalSseU8(\n\t\tTAlScore       escore, // in: expected score\n\t\tSwResult&      res,    // out: store results (edits and scores) here\n\t\tsize_t&        off,    // out: store diagonal projection of origin\n\t\tsize_t&        nbts,   // out: # backtracks\n\t\tsize_t         row,    // start in this rectangle row\n\t\tsize_t         col,    // start in this rectangle column\n\t\tRandomSource&  rand);  // random gen, to choose among equal paths\n\n\tbool backtraceNucleotidesLocalSseI16(\n\t\tTAlScore       escore, // in: expected score\n\t\tSwResult&      res,    // out: store results (edits and scores) here\n\t\tsize_t&        off,    // out: store diagonal projection of origin\n\t\tsize_t&        nbts,   // out: # backtracks\n\t\tsize_t         row,    // start in this rectangle row\n\t\tsize_t         col,    // start in this rectangle column\n\t\tRandomSource&  rand);  // random gen, to choose among equal paths\n\n\tbool backtraceNucleotidesEnd2EndSseU8(\n\t\tTAlScore       escore, // in: expected score\n\t\tSwResult&      res,    // out: store results (edits and scores) here\n\t\tsize_t&        off,    // out: store diagonal projection of origin\n\t\tsize_t&        nbts,   // out: # backtracks\n\t\tsize_t         row,    // start in this rectangle row\n\t\tsize_t         col,    // start in this rectangle column\n\t\tRandomSource&  rand);  // random gen, to choose 
among equal paths\n\n\tbool backtraceNucleotidesEnd2EndSseI16(\n\t\tTAlScore       escore, // in: expected score\n\t\tSwResult&      res,    // out: store results (edits and scores) here\n\t\tsize_t&        off,    // out: store diagonal projection of origin\n\t\tsize_t&        nbts,   // out: # backtracks\n\t\tsize_t         row,    // start in this rectangle row\n\t\tsize_t         col,    // start in this rectangle column\n\t\tRandomSource&  rand);  // random gen, to choose among equal paths\n\n\tbool backtrace(\n\t\tTAlScore       escore, // in: expected score\n\t\tbool           fill,   // in: use mini-fill?\n\t\tbool           usecp,  // in: use checkpoints?\n\t\tSwResult&      res,    // out: store results (edits and scores) here\n\t\tsize_t&        off,    // out: store diagonal projection of origin\n\t\tsize_t         row,    // start in this rectangle row\n\t\tsize_t         col,    // start in this rectangle column\n\t\tsize_t         maxiter,// max # extensions to try\n\t\tsize_t&        niter,  // # extensions tried\n\t\tRandomSource&  rnd)    // random gen, to choose among equal paths\n\t{\n\t\tbter_.initBt(\n\t\t\tescore,              // in: alignment score\n\t\t\trow,                 // in: start in this row\n\t\t\tcol,                 // in: start in this column\n\t\t\tfill,                // in: use mini-fill?\n\t\t\tusecp,               // in: use checkpoints?\n\t\t\tcperTri_,            // in: triangle-shaped mini-fills?\n\t\t\trnd);                // in: random gen, to choose among equal paths\n\t\tassert(bter_.inited());\n\t\tsize_t nrej = 0;\n\t\tif(bter_.emptySolution()) {\n\t\t\treturn false;\n\t\t} else {\n\t\t\treturn bter_.nextAlignment(maxiter, res, off, nrej, niter, rnd);\n\t\t}\n\t}\n\n\tconst BTDnaString  *rd_;     // read sequence\n\tconst BTString     *qu_;     // read qualities\n\tconst BTDnaString  *rdfw_;   // read sequence for fw read\n\tconst BTDnaString  *rdrc_;   // read sequence for rc read\n\tconst BTString     *qufw_;   
// read qualities for fw read\n\tconst BTString     *qurc_;   // read qualities for rc read\n\tTReadOff            rdi_;    // offset of first read char to align\n\tTReadOff            rdf_;    // offset of last read char to align\n\tbool                fw_;     // true iff read sequence is original fw read\n\tTRefId              refidx_; // id of reference aligned against\n\tTRefOff             reflen_; // length of entire reference sequence\n\tconst DPRect*       rect_;   // DP rectangle\n\tchar               *rf_;     // reference sequence\n\tTRefOff             rfi_;    // offset of first ref char to align to\n\tTRefOff             rff_;    // offset of last ref char to align to (excl)\n\tsize_t              rdgap_;  // max # gaps in read\n\tsize_t              rfgap_;  // max # gaps in reference\n\tbool                enable8_;// enable 8-bit sse\n\tbool                extend_; // true iff this is a seed-extend problem\n\tconst Scoring      *sc_;     // penalties for edit types\n\tTAlScore            minsc_;  // penalty ceiling for valid alignments\n\tint                 nceil_;  // max # Ns allowed in ref portion of aln\n\n\tbool                sse8succ_;  // whether 8-bit worked\n\tbool                sse16succ_; // whether 16-bit worked\n\tSSEData             sseU8fw_;   // buf for fw query, 8-bit score\n\tSSEData             sseU8rc_;   // buf for rc query, 8-bit score\n\tSSEData             sseI16fw_;  // buf for fw query, 16-bit score\n\tSSEData             sseI16rc_;  // buf for rc query, 16-bit score\n\tbool                sseU8fwBuilt_;   // built fw query profile, 8-bit score\n\tbool                sseU8rcBuilt_;   // built rc query profile, 8-bit score\n\tbool                sseI16fwBuilt_;  // built fw query profile, 16-bit score\n\tbool                sseI16rcBuilt_;  // built rc query profile, 16-bit 
score\n\n\tSSEMetrics\t\t\tsseU8ExtendMet_;\n\tSSEMetrics\t\t\tsseU8MateMet_;\n\tSSEMetrics\t\t\tsseI16ExtendMet_;\n\tSSEMetrics\t\t\tsseI16MateMet_;\n\n\tint                 state_;        // state\n\tbool                initedRead_;   // true iff initialized with initRead\n\tbool                readSse16_;    // true -> sse16 from now on for read\n\tbool                initedRef_;    // true iff initialized with initRef\n\tEList<uint32_t>     rfwbuf_;       // buffer for wordized ref stretches\n\t\n\tEList<DpNucFrame>    btnstack_;    // backtrace stack for nucleotides\n\tEList<SizeTPair>     btcells_;     // cells involved in current backtrace\n\n\tNBest<DpBtCandidate> btdiag_;      // per-diagonal backtrace candidates\n\tEList<DpBtCandidate> btncand_;     // cells we might backtrace from\n\tEList<DpBtCandidate> btncanddone_; // candidates that we investigated\n\tsize_t              btncanddoneSucc_; // # investigated and succeeded\n\tsize_t              btncanddoneFail_; // # investigated and failed\n\t\n\tBtBranchTracer       bter_;        // backtracer\n\t\n\tCheckpointer         cper_;        // structure for saving checkpoint cells\n\tsize_t               cperMinlen_;  // minimum length for using checkpointer\n\tsize_t               cperPerPow2_; // checkpoint every 1 << perpow2 diags (& next)\n\tbool                 cperEf_;      // store E and F in addition to H?\n\tbool                 cperTri_;     // checkpoint for triangular mini-fills?\n\t\n\tsize_t              colstop_;      // bailed on DP loop after this many cols\n\tsize_t              lastsolcol_;   // last DP col with valid cell\n\tsize_t              cural_;        // index of next alignment to be given\n\t\n\tuint64_t nbtfiltst_; // # candidates filtered b/c starting cell was seen\n\tuint64_t nbtfiltsc_; // # candidates filtered b/c score uninteresting\n\tuint64_t nbtfiltdo_; // # candidates filtered b/c dominated by other cell\n\t\n\tASSERT_ONLY(SStringExpandable<uint32_t> 
tmp_destU32_);\n\tASSERT_ONLY(BTDnaString tmp_editstr_, tmp_refstr_);\n\tASSERT_ONLY(EList<DpBtCandidate> cand_tmp_);\n};\n\n#endif /*ALIGNER_SW_H_*/\n"
  },
  {
    "path": "aligner_sw_common.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_SW_COMMON_H_\n#define ALIGNER_SW_COMMON_H_\n\n#include \"aligner_result.h\"\n\n/**\n * Encapsulates the result of a dynamic programming alignment, including\n * colorspace alignments.  In our case, the result is a combination of:\n *\n * 1. All the nucleotide edits\n * 2. All the \"edits\" where an ambiguous reference char is resolved to\n *    an unambiguous char.\n * 3. All the color edits (if applicable)\n * 4. All the color miscalls (if applicable).  This is a subset of 3.\n * 5. The score of the best alignment\n * 6. 
The score of the second-best alignment\n *\n * Having scores for the best and second-best alignments gives us an\n * idea of where gaps may make reassembly beneficial.\n */\nstruct SwResult {\n\n\tSwResult() :\n\t\talres(),\n\t\tsws(0),\n\t\tswcups(0),\n\t\tswrows(0),\n\t\tswskiprows(0),\n\t\tswskip(0),\n\t\tswsucc(0),\n\t\tswfail(0),\n\t\tswbts(0)\n\t{ }\n\n\t/**\n\t * Clear all contents.\n\t */\n\tvoid reset() {\n\t\tsws = swcups = swrows = swskiprows = swskip = swsucc =\n\t\tswfail = swbts = 0;\n\t\talres.reset();\n\t}\n\t\n\t/**\n\t * Reverse all edit lists.\n\t */\n\tvoid reverse() {\n\t}\n\t\n\t/**\n\t * Return true iff no result has been installed.\n\t */\n\tbool empty() const {\n        return false;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that result is internally consistent.\n\t */\n\tbool repOk() const {\n\t\t//assert(alres.repOk());\n\t\treturn true;\n\t}\n\n\t/**\n\t * Check that result is internally consistent w/r/t read.\n\t */\n\tbool repOk(const Read& rd) const {\n\t\t//assert(alres.repOk(rd));\n\t\treturn true;\n\t}\n#endif\n\n\tAlnRes alres;\n\tuint64_t sws;    // # DP problems solved\n\tuint64_t swcups; // # DP cell updates\n\tuint64_t swrows; // # DP row updates\n\tuint64_t swskiprows; // # skipped DP row updates (b/c no valid alignments can go thru row)\n\tuint64_t swskip; // # DP problems skipped by sse filter\n\tuint64_t swsucc; // # DP problems resulting in alignment\n\tuint64_t swfail; // # DP problems not resulting in alignment\n\tuint64_t swbts;  // # DP backtrace steps\n\t\n\tint nup;         // upstream decoded nucleotide; for colorspace reads\n\tint ndn;         // downstream decoded nucleotide; for colorspace reads\n};\n\n/**\n * Encapsulates counters that measure how much work has been done by\n * the dynamic programming driver and aligner.\n */\nstruct SwMetrics {\n\n\tSwMetrics() : mutex_m() {\n\t    reset();\n\t}\n\n\tvoid reset() {\n\t\tsws = swcups = swrows = swskiprows = swskip = swsucc = swfail = swbts =\n\t\tsws10 = sws5 
= sws3 =\n\t\trshit = ungapsucc = ungapfail = ungapnodec = 0;\n\t\texatts = exranges = exrows = exsucc = exooms = 0;\n\t\tmm1atts = mm1ranges = mm1rows = mm1succ = mm1ooms = 0;\n\t\tsdatts = sdranges = sdrows = sdsucc = sdooms = 0;\n\t}\n\t\n\tvoid init(\n\t\tuint64_t sws_,\n\t\tuint64_t sws10_,\n\t\tuint64_t sws5_,\n\t\tuint64_t sws3_,\n\t\tuint64_t swcups_,\n\t\tuint64_t swrows_,\n\t\tuint64_t swskiprows_,\n\t\tuint64_t swskip_,\n\t\tuint64_t swsucc_,\n\t\tuint64_t swfail_,\n\t\tuint64_t swbts_,\n\t\tuint64_t rshit_,\n\t\tuint64_t ungapsucc_,\n\t\tuint64_t ungapfail_,\n\t\tuint64_t ungapnodec_,\n\t\tuint64_t exatts_,\n\t\tuint64_t exranges_,\n\t\tuint64_t exrows_,\n\t\tuint64_t exsucc_,\n\t\tuint64_t exooms_,\n\t\tuint64_t mm1atts_,\n\t\tuint64_t mm1ranges_,\n\t\tuint64_t mm1rows_,\n\t\tuint64_t mm1succ_,\n\t\tuint64_t mm1ooms_,\n\t\tuint64_t sdatts_,\n\t\tuint64_t sdranges_,\n\t\tuint64_t sdrows_,\n\t\tuint64_t sdsucc_,\n\t\tuint64_t sdooms_)\n\t{\n\t\tsws        = sws_;\n\t\tsws10      = sws10_;\n\t\tsws5       = sws5_;\n\t\tsws3       = sws3_;\n\t\tswcups     = swcups_;\n\t\tswrows     = swrows_;\n\t\tswskiprows = swskiprows_;\n\t\tswskip     = swskip_;\n\t\tswsucc     = swsucc_;\n\t\tswfail     = swfail_;\n\t\tswbts      = swbts_;\n\t\trshit      = rshit_;\n\t\tungapsucc  = ungapsucc_;\n\t\tungapfail  = ungapfail_;\n\t\tungapnodec = ungapnodec_;\n\t\t\n\t\t// Exact end-to-end attempts\n\t\texatts     = exatts_;\n\t\texranges   = exranges_;\n\t\texrows     = exrows_;\n\t\texsucc     = exsucc_;\n\t\texooms     = exooms_;\n\n\t\t// 1-mismatch end-to-end attempts\n\t\tmm1atts    = mm1atts_;\n\t\tmm1ranges  = mm1ranges_;\n\t\tmm1rows    = mm1rows_;\n\t\tmm1succ    = mm1succ_;\n\t\tmm1ooms    = mm1ooms_;\n\t\t\n\t\t// Seed attempts\n\t\tsdatts     = sdatts_;\n\t\tsdranges   = sdranges_;\n\t\tsdrows     = sdrows_;\n\t\tsdsucc     = sdsucc_;\n\t\tsdooms     = sdooms_;\n\t}\n\t\n\t/**\n\t * Merge (add) the counters in the given SwResult object into this\n\t * SwMetrics object.\n\t */\n\tvoid 
update(const SwResult& r) {\n\t\tsws        += r.sws;\n\t\tswcups     += r.swcups;\n\t\tswrows     += r.swrows;\n\t\tswskiprows += r.swskiprows;\n\t\tswskip     += r.swskip;\n\t\tswsucc     += r.swsucc;\n\t\tswfail     += r.swfail;\n\t\tswbts      += r.swbts;\n\t}\n\t\n\t/**\n\t * Merge (add) the counters in the given SwMetrics object into this\n\t * object.  This is the only safe way to update a SwMetrics shared\n\t * by multiple threads.\n\t */\n\tvoid merge(const SwMetrics& r, bool getLock = false) {\n        ThreadSafe ts(&mutex_m, getLock);\n\t\tsws        += r.sws;\n\t\tsws10      += r.sws10;\n\t\tsws5       += r.sws5;\n\t\tsws3       += r.sws3;\n\t\tswcups     += r.swcups;\n\t\tswrows     += r.swrows;\n\t\tswskiprows += r.swskiprows;\n\t\tswskip     += r.swskip;\n\t\tswsucc     += r.swsucc;\n\t\tswfail     += r.swfail;\n\t\tswbts      += r.swbts;\n\t\trshit      += r.rshit;\n\t\tungapsucc  += r.ungapsucc;\n\t\tungapfail  += r.ungapfail;\n\t\tungapnodec += r.ungapnodec;\n\t\texatts     += r.exatts;\n\t\texranges   += r.exranges;\n\t\texrows     += r.exrows;\n\t\texsucc     += r.exsucc;\n\t\texooms     += r.exooms;\n\t\tmm1atts    += r.mm1atts;\n\t\tmm1ranges  += r.mm1ranges;\n\t\tmm1rows    += r.mm1rows;\n\t\tmm1succ    += r.mm1succ;\n\t\tmm1ooms    += r.mm1ooms;\n\t\tsdatts     += r.sdatts;\n\t\tsdranges   += r.sdranges;\n\t\tsdrows     += r.sdrows;\n\t\tsdsucc     += r.sdsucc;\n\t\tsdooms     += r.sdooms;\n\t}\n\t\n\tvoid tallyGappedDp(size_t readGaps, size_t refGaps) {\n\t\tsize_t mx = max(readGaps, refGaps);\n\t\tif(mx < 10) sws10++;\n\t\tif(mx < 5)  sws5++;\n\t\tif(mx < 3)  sws3++;\n\t}\n\n\tuint64_t sws;        // # DP problems solved\n\tuint64_t sws10;      // # DP problems solved where max gaps < 10\n\tuint64_t sws5;       // # DP problems solved where max gaps < 5\n\tuint64_t sws3;       // # DP problems solved where max gaps < 3\n\tuint64_t swcups;     // # DP cell updates\n\tuint64_t swrows;     // # DP row updates\n\tuint64_t swskiprows; // # 
skipped DP rows (b/c no valid alns go thru row)\n\tuint64_t swskip;     // # DP problems skipped by sse filter\n\tuint64_t swsucc;     // # DP problems resulting in alignment\n\tuint64_t swfail;     // # DP problems not resulting in alignment\n\tuint64_t swbts;      // # DP backtrace steps\n\tuint64_t rshit;      // # DP problems avoided b/c seed hit was redundant\n\tuint64_t ungapsucc;  // # ungapped alignment attempts that succeeded\n\tuint64_t ungapfail;  // # ungapped alignment attempts that failed\n\tuint64_t ungapnodec; // # ungapped alignment attempts with no decision\n\n\tuint64_t exatts;     // total # attempts at exact-hit end-to-end aln\n\tuint64_t exranges;   // total # ranges returned by exact-hit queries\n\tuint64_t exrows;     // total # rows returned by exact-hit queries\n\tuint64_t exsucc;     // exact-hit yielded non-empty result\n\tuint64_t exooms;     // exact-hit offset memory exhausted\n\t\n\tuint64_t mm1atts;    // total # attempts at 1mm end-to-end aln\n\tuint64_t mm1ranges;  // total # ranges returned by 1mm-hit queries\n\tuint64_t mm1rows;    // total # rows returned by 1mm-hit queries\n\tuint64_t mm1succ;    // 1mm-hit yielded non-empty result\n\tuint64_t mm1ooms;    // 1mm-hit offset memory exhausted\n\n\tuint64_t sdatts;     // total # attempts to find seed alignments\n\tuint64_t sdranges;   // total # seed-alignment ranges found\n\tuint64_t sdrows;     // total # seed-alignment rows found\n\tuint64_t sdsucc;     // # times seed alignment yielded >= 1 hit\n\tuint64_t sdooms;     // # times an OOM occurred during seed alignment\n\n\tMUTEX_T mutex_m;\n};\n\n// The various ways that one might backtrack from a later cell (either oall,\n// rdgap or rfgap) to an earlier cell\nenum {\n\tSW_BT_OALL_DIAG,         // from oall cell to oall cell\n\tSW_BT_OALL_REF_OPEN,     // from oall cell to oall cell\n\tSW_BT_OALL_READ_OPEN,    // from oall cell to oall cell\n\tSW_BT_RDGAP_EXTEND,      // from rdgap cell to rdgap 
cell\n\tSW_BT_RFGAP_EXTEND       // from rfgap cell to rfgap cell\n};\n\n#endif /*def ALIGNER_SW_COMMON_H_*/\n"
  },
  {
    "path": "aligner_sw_nuc.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_SW_NUC_H_\n#define ALIGNER_SW_NUC_H_\n\n#include <stdint.h>\n#include \"aligner_sw_common.h\"\n#include \"aligner_result.h\"\n\n/**\n * Encapsulates a backtrace stack frame.  Includes enough information that we\n * can \"pop\" back up to this frame and choose to make a different backtracking\n * decision.  The information included is:\n *\n * 1. The mask at the decision point.  When we first move through the mask and\n *    when we backtrack to it, we're careful to mask out the bit corresponding\n *    to the path we're taking.  When we move through it after removing the\n *    last bit from the mask, we're careful to pop it from the stack.\n * 2. The sizes of the edit lists.  
When we backtrack, we resize the lists back\n *    down to these sizes to get rid of any edits introduced since the branch\n *    point.\n */\nstruct DpNucFrame {\n\n\t/**\n\t * Initialize a new DpNucFrame stack frame.\n\t */\n\tvoid init(\n\t\tsize_t   nedsz_,\n\t\tsize_t   aedsz_,\n\t\tsize_t   celsz_,\n\t\tsize_t   row_,\n\t\tsize_t   col_,\n\t\tsize_t   gaps_,\n\t\tsize_t   readGaps_,\n\t\tsize_t   refGaps_,\n\t\tAlnScore score_,\n\t\tint      ct_)\n\t{\n\t\tnedsz    = nedsz_;\n\t\taedsz    = aedsz_;\n\t\tcelsz    = celsz_;\n\t\trow      = row_;\n\t\tcol      = col_;\n\t\tgaps     = gaps_;\n\t\treadGaps = readGaps_;\n\t\trefGaps  = refGaps_;\n\t\tscore    = score_;\n\t\tct       = ct_;\n\t}\n\n\tsize_t   nedsz;    // size of the nucleotide edit list at branch (before\n\t                   // adding the branch edit)\n\tsize_t   aedsz;    // size of ambiguous nucleotide edit list at branch\n\tsize_t   celsz;    // size of cell-traversed list at branch\n\tsize_t   row;      // row of cell where branch occurred\n\tsize_t   col;      // column of cell where branch occurred\n\tsize_t   gaps;     // number of gaps before branch occurred\n\tsize_t   readGaps; // number of read gaps before branch occurred\n\tsize_t   refGaps;  // number of ref gaps before branch occurred\n\tAlnScore score;    // score where branch occurred\n\tint      ct;       // table type (oall, rdgap or rfgap)\n};\n\nenum {\n\tBT_CAND_FATE_SUCCEEDED = 1,\n\tBT_CAND_FATE_FAILED,\n\tBT_CAND_FATE_FILT_START,     // skipped b/c starting cell already explored\n\tBT_CAND_FATE_FILT_DOMINATED, // skipped b/c it was dominated\n\tBT_CAND_FATE_FILT_SCORE      // skipped b/c score not interesting anymore\n};\n\n/**\n * Encapsulates a cell that we might want to backtrace from.\n */\nstruct DpBtCandidate {\n\n\tDpBtCandidate() { reset(); }\n\t\n\tDpBtCandidate(size_t row_, size_t col_, TAlScore score_) {\n\t\tinit(row_, col_, score_);\n\t}\n\t\n\tvoid reset() { init(0, 0, 0); }\n\t\n\tvoid init(size_t row_, 
size_t col_, TAlScore score_) {\n\t\trow = row_;\n\t\tcol = col_;\n\t\tscore = score_;\n\t\t// 0 = invalid; this should be set later according to what happens\n\t\t// before / during the backtrace\n\t\tfate = 0; \n\t}\n\t\n\t/** \n\t * Return true iff this candidate is (heuristically) dominated by the given\n\t * candidate.  We say that candidate A dominates candidate B if (a) B is\n\t * somewhere in the N x N square that extends up and to the left of A,\n\t * where N is an arbitrary number (SQ, currently 40), and (b) B's score is\n\t * <= A's.  Note: the check below tests only proximity (row and column\n\t * distance both <= SQ, in either direction); it does not compare scores.\n\t */\n\tinline bool dominatedBy(const DpBtCandidate& o) {\n\t\tconst size_t SQ = 40;\n\t\tsize_t rowhi = row;\n\t\tsize_t rowlo = o.row;\n\t\tif(rowhi < rowlo) swap(rowhi, rowlo);\n\t\tsize_t colhi = col;\n\t\tsize_t collo = o.col;\n\t\tif(colhi < collo) swap(colhi, collo);\n\t\treturn (colhi - collo) <= SQ &&\n\t\t       (rowhi - rowlo) <= SQ;\n\t}\n\n\t/**\n\t * Return true if this candidate is \"greater than\" (should be considered\n\t * later than) the given candidate.\n\t */\n\tbool operator>(const DpBtCandidate& o) const {\n\t\tif(score < o.score) return true;\n\t\tif(score > o.score) return false;\n\t\tif(row   < o.row  ) return true;\n\t\tif(row   > o.row  ) return false;\n\t\tif(col   < o.col  ) return true;\n\t\tif(col   > o.col  ) return false;\n\t\treturn false;\n\t}\n\n\t/**\n\t * Return true if this candidate is \"less than\" (should be considered\n\t * sooner than) the given candidate.\n\t */\n\tbool operator<(const DpBtCandidate& o) const {\n\t\tif(score > o.score) return true;\n\t\tif(score < o.score) return false;\n\t\tif(row   > o.row  ) return true;\n\t\tif(row   < o.row  ) return false;\n\t\tif(col   > o.col  ) return true;\n\t\tif(col   < o.col  ) return false;\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return true if this candidate equals the given candidate.\n\t */\n\tbool operator==(const DpBtCandidate& o) const {\n\t\treturn row   == o.row &&\n\t\t       col   == o.col &&\n\t\t\t   score == 
o.score;\n\t}\n\tbool operator>=(const DpBtCandidate& o) const { return !((*this) < o); }\n\tbool operator<=(const DpBtCandidate& o) const { return !((*this) > o); }\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check internal consistency.\n\t */\n\tbool repOk() const {\n\t\tassert(VALID_SCORE(score));\n\t\treturn true;\n\t}\n#endif\n\n\tsize_t   row;   // cell row\n\tsize_t   col;   // cell column w/r/t LHS of rectangle\n\tTAlScore score; // score of alignment\n\tint      fate;  // flag indicating whether we succeeded, failed, skipped\n};\n\ntemplate <typename T>\nclass NBest {\n\npublic:\n\n\tNBest() { nelt_ = nbest_ = n_ = 0; }\n\t\n\tbool inited() const { return nelt_ > 0; }\n\t\n\tvoid init(size_t nelt, size_t nbest) {\n\t\tnelt_ = nelt;\n\t\tnbest_ = nbest;\n\t\telts_.resize(nelt * nbest);\n\t\tncur_.resize(nelt);\n\t\tncur_.fill(0);\n\t\tn_ = 0;\n\t}\n\t\n\t/**\n\t * Add a new result to bin 'elt'.  Where it gets prioritized in the list of\n\t * results in that bin depends on the result of operator>.\n\t */\n\tbool add(size_t elt, const T& o) {\n\t\tassert_lt(elt, nelt_);\n\t\tconst size_t ncur = ncur_[elt];\n\t\tassert_leq(ncur, nbest_);\n\t\tn_++;\n\t\tfor(size_t i = 0; i < nbest_ && i <= ncur; i++) {\n\t\t\tif(o > elts_[nbest_ * elt + i] || i >= ncur) {\n\t\t\t\t// Insert it here\n\t\t\t\t// Move everyone from here on down by one slot\n\t\t\t\tfor(int j = (int)ncur; j > (int)i; j--) {\n\t\t\t\t\tif(j < (int)nbest_) {\n\t\t\t\t\t\telts_[nbest_ * elt + j] = elts_[nbest_ * elt + j - 1];\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\telts_[nbest_ * elt + i] = o;\n\t\t\t\tif(ncur < nbest_) {\n\t\t\t\t\tncur_[elt]++;\n\t\t\t\t}\n\t\t\t\treturn true;\n\t\t\t}\n\t\t}\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return true iff there are no solutions.\n\t */\n\tbool empty() const {\n\t\treturn n_ == 0;\n\t}\n\t\n\t/**\n\t * Dump all the items in our payload into the given EList.\n\t */\n\ttemplate<typename TList>\n\tvoid dump(TList& l) const {\n\t\tif(empty()) return;\n\t\tfor(size_t i = 0; i 
< nelt_; i++) {\n\t\t\tassert_leq(ncur_[i], nbest_);\n\t\t\tfor(size_t j = 0; j < ncur_[i]; j++) {\n\t\t\t\tl.push_back(elts_[i * nbest_ + j]);\n\t\t\t}\n\t\t}\n\t}\n\nprotected:\n\n\tsize_t        nelt_;\n\tsize_t        nbest_;\n\tEList<T>      elts_;\n\tEList<size_t> ncur_;\n\tsize_t        n_;     // total # results added\n};\n\n#endif /*def ALIGNER_SW_NUC_H_*/\n"
  },
  {
    "path": "aligner_swsse.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <string.h>\n#include \"aligner_sw_common.h\"\n#include \"aligner_swsse.h\"\n\n/**\n * Given a number of rows (nrow), a number of columns (ncol), and the\n * number of words to fit inside a single __m128i vector, initialize the\n * matrix buffer to accommodate the needed configuration of vectors.\n */\nvoid SSEMatrix::init(\n\tsize_t nrow,\n\tsize_t ncol,\n\tsize_t wperv)\n{\n\tnrow_ = nrow;\n\tncol_ = ncol;\n\twperv_ = wperv;\n\tnvecPerCol_ = (nrow + (wperv-1)) / wperv;\n\t// The +1 is so that we don't have to special-case the final column;\n\t// instead, we just write off the end of the useful part of the table\n\t// with pvEStore.\n\ttry {\n\t\tmatbuf_.resizeNoCopy((ncol+1) * nvecPerCell_ * nvecPerCol_);\n\t} catch(exception& e) {\n\t\tcerr << \"Tried to allocate DP matrix with \" << (ncol+1)\n\t\t     << \" columns, \" << nvecPerCol_\n\t\t\t << \" vectors per column, and \" << nvecPerCell_\n\t\t\t << \" vectors per cell\" << endl;\n\t\t// Rethrow the original exception without slicing it\n\t\tthrow;\n\t}\n\tassert(wperv_ == 8 || wperv_ == 16);\n\tvecshift_ = (wperv_ == 8) ? 
3 : 4;\n\tnvecrow_ = (nrow + (wperv_-1)) >> vecshift_;\n\tnveccol_ = ncol;\n\tcolstride_ = nvecPerCol_ * nvecPerCell_;\n\trowstride_ = nvecPerCell_;\n\tinited_ = true;\n}\n\n/**\n * Initialize the matrix of masks and backtracking flags.\n */\nvoid SSEMatrix::initMasks() {\n\tassert_gt(nrow_, 0);\n\tassert_gt(ncol_, 0);\n\tmasks_.resize(nrow_);\n\treset_.resizeNoCopy(nrow_);\n\treset_.fill(false);\n}\n\n/**\n * Given a row, col and matrix (i.e. E, F or H), return the corresponding\n * element.\n */\nint SSEMatrix::eltSlow(size_t row, size_t col, size_t mat) const {\n\tassert_lt(row, nrow_);\n\tassert_lt(col, ncol_);\n\tassert_leq(mat, 3);\n\t// Move to beginning of column/row\n\tsize_t rowelt = row / nvecrow_;\n\tsize_t rowvec = row % nvecrow_;\n\tsize_t eltvec = (col * colstride_) + (rowvec * rowstride_) + mat;\n\tif(wperv_ == 16) {\n\t\treturn (int)((uint8_t*)(matbuf_.ptr() + eltvec))[rowelt];\n\t} else {\n\t\tassert_eq(8, wperv_);\n\t\treturn (int)((int16_t*)(matbuf_.ptr() + eltvec))[rowelt];\n\t}\n}\n"
  },
  {
    "path": "aligner_swsse.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALIGNER_SWSSE_H_\n#define ALIGNER_SWSSE_H_\n\n#include \"ds.h\"\n#include \"mem_ids.h\"\n#include \"random_source.h\"\n#include \"scoring.h\"\n#include \"mask.h\"\n#include \"sse_util.h\"\n#include <strings.h>\n\n\nstruct SSEMetrics {\n\t\n\tSSEMetrics():mutex_m() { reset(); }\n\n\tvoid clear() { reset(); }\n\tvoid reset() {\n\t\tdp = dpsat = dpfail = dpsucc = \n\t\tcol = cell = inner = fixup =\n\t\tgathsol = bt = btfail = btsucc = btcell =\n\t\tcorerej = nrej = 0;\n\t}\n\n\tvoid merge(const SSEMetrics& o, bool getLock = false) {\n        ThreadSafe ts(&mutex_m, getLock);\n\t\tdp       += o.dp;\n\t\tdpsat    += o.dpsat;\n\t\tdpfail   += o.dpfail;\n\t\tdpsucc   += o.dpsucc;\n\t\tcol      += o.col;\n\t\tcell     += o.cell;\n\t\tinner    += o.inner;\n\t\tfixup    += o.fixup;\n\t\tgathsol  += o.gathsol;\n\t\tbt       += o.bt;\n\t\tbtfail   += o.btfail;\n\t\tbtsucc   += o.btsucc;\n\t\tbtcell   += o.btcell;\n\t\tcorerej  += o.corerej;\n\t\tnrej     += o.nrej;\n\t}\n\n\tuint64_t dp;       // DPs tried\n\tuint64_t dpsat;    // DPs saturated\n\tuint64_t dpfail;   // DPs failed\n\tuint64_t dpsucc;   // DPs succeeded\n\tuint64_t col;      // DP columns\n\tuint64_t cell;     // DP cells\n\tuint64_t inner;    // DP 
inner loop iters\n\tuint64_t fixup;    // DP fixup loop iters\n\tuint64_t gathsol;  // DP gather solution cells found\n\tuint64_t bt;       // DP backtraces\n\tuint64_t btfail;   // DP backtraces failed\n\tuint64_t btsucc;   // DP backtraces succeeded\n\tuint64_t btcell;   // DP backtrace cells traversed\n\tuint64_t corerej;  // DP backtrace core rejections\n\tuint64_t nrej;     // DP backtrace N rejections\n\tMUTEX_T  mutex_m;\n};\n\n/**\n * Encapsulates matrix information calculated by the SSE aligner.\n *\n * Matrix memory is laid out as follows:\n *\n * - Elements (individual cell scores) are packed into __m128i vectors\n * - Vectors are packed into quartets, quartet elements correspond to: a vector\n *   from E, one from F, one from H, and one that's \"reserved\"\n * - Quartets are packed into columns, where the number of quartets is\n *   determined by the number of query characters divided by the number of\n *   elements per vector\n *\n * Regarding the \"reserved\" element of the vector quartet: we use it for two\n * things.  First, we use the first column of reserved vectors to stage the\n * initial column of H vectors.  Second, we use the \"reserved\" vectors during\n * the backtrace procedure to store information about (a) which cells have been\n * traversed, (b) whether the cell is \"terminal\" (in local mode), etc.\n */\nstruct SSEMatrix {\n\n\t// Each matrix element is a quartet of vectors.  These constants are used\n\t// to identify members of the quartet.\n\tconst static size_t E   = 0;\n\tconst static size_t F   = 1;\n\tconst static size_t H   = 2;\n\tconst static size_t TMP = 3;\n\n\tSSEMatrix(int cat = 0) : nvecPerCell_(4), matbuf_(cat) { }\n\n\t/**\n\t * Return a pointer to the matrix buffer.\n\t */\n\tinline __m128i *ptr() {\n\t\tassert(inited_);\n\t\treturn matbuf_.ptr();\n\t}\n\t\n\t/**\n\t * Return a pointer to the E vector at the given row and column.  
Note:\n\t * here row refers to rows of vectors, not rows of elements.\n\t */\n\tinline __m128i* evec(size_t row, size_t col) {\n\t\tassert_lt(row, nvecrow_);\n\t\tassert_lt(col, nveccol_);\n\t\tsize_t elt = row * rowstride() + col * colstride() + E;\n\t\tassert_lt(elt, matbuf_.size());\n\t\treturn ptr() + elt;\n\t}\n\n\t/**\n\t * Like evec, but it's allowed to ask for a pointer to one column after the\n\t * final one.\n\t */\n\tinline __m128i* evecUnsafe(size_t row, size_t col) {\n\t\tassert_lt(row, nvecrow_);\n\t\tassert_leq(col, nveccol_);\n\t\tsize_t elt = row * rowstride() + col * colstride() + E;\n\t\tassert_lt(elt, matbuf_.size());\n\t\treturn ptr() + elt;\n\t}\n\n\t/**\n\t * Return a pointer to the F vector at the given row and column.  Note:\n\t * here row refers to rows of vectors, not rows of elements.\n\t */\n\tinline __m128i* fvec(size_t row, size_t col) {\n\t\tassert_lt(row, nvecrow_);\n\t\tassert_lt(col, nveccol_);\n\t\tsize_t elt = row * rowstride() + col * colstride() + F;\n\t\tassert_lt(elt, matbuf_.size());\n\t\treturn ptr() + elt;\n\t}\n\n\t/**\n\t * Return a pointer to the H vector at the given row and column.  Note:\n\t * here row refers to rows of vectors, not rows of elements.\n\t */\n\tinline __m128i* hvec(size_t row, size_t col) {\n\t\tassert_lt(row, nvecrow_);\n\t\tassert_lt(col, nveccol_);\n\t\tsize_t elt = row * rowstride() + col * colstride() + H;\n\t\tassert_lt(elt, matbuf_.size());\n\t\treturn ptr() + elt;\n\t}\n\n\t/**\n\t * Return a pointer to the TMP vector at the given row and column.  
Note:\n\t * here row refers to rows of vectors, not rows of elements.\n\t */\n\tinline __m128i* tmpvec(size_t row, size_t col) {\n\t\tassert_lt(row, nvecrow_);\n\t\tassert_lt(col, nveccol_);\n\t\tsize_t elt = row * rowstride() + col * colstride() + TMP;\n\t\tassert_lt(elt, matbuf_.size());\n\t\treturn ptr() + elt;\n\t}\n\n\t/**\n\t * Like tmpvec, but it's allowed to ask for a pointer to one column after\n\t * the final one.\n\t */\n\tinline __m128i* tmpvecUnsafe(size_t row, size_t col) {\n\t\tassert_lt(row, nvecrow_);\n\t\tassert_leq(col, nveccol_);\n\t\tsize_t elt = row * rowstride() + col * colstride() + TMP;\n\t\tassert_lt(elt, matbuf_.size());\n\t\treturn ptr() + elt;\n\t}\n\t\n\t/**\n\t * Given a number of rows (nrow), a number of columns (ncol), and the\n\t * number of words to fit inside a single __m128i vector, initialize the\n\t * matrix buffer to accommodate the needed configuration of vectors.\n\t */\n\tvoid init(\n\t\tsize_t nrow,\n\t\tsize_t ncol,\n\t\tsize_t wperv);\n\t\n\t/**\n\t * Return the number of __m128i's you need to skip over to get from one\n\t * cell to the cell one column over from it.\n\t */\n\tinline size_t colstride() const { return colstride_; }\n\n\t/**\n\t * Return the number of __m128i's you need to skip over to get from one\n\t * cell to the cell one row down from it.\n\t */\n\tinline size_t rowstride() const { return rowstride_; }\n\n\t/**\n\t * Given a row, col and matrix (i.e. E, F or H), return the corresponding\n\t * element.\n\t */\n\tint eltSlow(size_t row, size_t col, size_t mat) const;\n\t\n\t/**\n\t * Given a row, col and matrix (i.e. 
E, F or H), return the corresponding\n\t * element.\n\t */\n\tinline int elt(size_t row, size_t col, size_t mat) const {\n\t\tassert(inited_);\n\t\tassert_lt(row, nrow_);\n\t\tassert_lt(col, ncol_);\n\t\tassert_lt(mat, 3);\n\t\t// Move to beginning of column/row\n\t\tsize_t rowelt = row / nvecrow_;\n\t\tsize_t rowvec = row % nvecrow_;\n\t\tsize_t eltvec = (col * colstride_) + (rowvec * rowstride_) + mat;\n\t\tassert_lt(eltvec, matbuf_.size());\n\t\tif(wperv_ == 16) {\n\t\t\treturn (int)((uint8_t*)(matbuf_.ptr() + eltvec))[rowelt];\n\t\t} else {\n\t\t\tassert_eq(8, wperv_);\n\t\t\treturn (int)((int16_t*)(matbuf_.ptr() + eltvec))[rowelt];\n\t\t}\n\t}\n\n\t/**\n\t * Return the element in the E matrix at element row, col.\n\t */\n\tinline int eelt(size_t row, size_t col) const {\n\t\treturn elt(row, col, E);\n\t}\n\n\t/**\n\t * Return the element in the F matrix at element row, col.\n\t */\n\tinline int felt(size_t row, size_t col) const {\n\t\treturn elt(row, col, F);\n\t}\n\n\t/**\n\t * Return the element in the H matrix at element row, col.\n\t */\n\tinline int helt(size_t row, size_t col) const {\n\t\treturn elt(row, col, H);\n\t}\n\t\n\t/**\n\t * Return true iff the given cell has its reportedThru bit set.\n\t */\n\tinline bool reportedThrough(\n\t\tsize_t row,          // current row\n\t\tsize_t col) const    // current column\n\t{\n\t\treturn (masks_[row][col] & (1 << 0)) != 0;\n\t}\n\n\t/**\n\t * Set the given cell's reportedThru bit.\n\t */\n\tinline void setReportedThrough(\n\t\tsize_t row,          // current row\n\t\tsize_t col)          // current column\n\t{\n\t\tmasks_[row][col] |= (1 << 0);\n\t}\n\n\t/**\n\t * Return true iff the H mask has been set with a previous call to hMaskSet.\n\t */\n\tbool isHMaskSet(\n\t\tsize_t row,          // current row\n\t\tsize_t col) const;   // current column\n\n\t/**\n\t * Set the given cell's H mask.  This is the mask of remaining legal ways to\n\t * backtrack from the H cell at this coordinate.  
It's 5 bits long and has\n\t * offset=2 into the 16-bit field.\n\t */\n\tvoid hMaskSet(\n\t\tsize_t row,          // current row\n\t\tsize_t col,          // current column\n\t\tint mask);\n\n\t/**\n\t * Return true iff the E mask has been set with a previous call to eMaskSet.\n\t */\n\tbool isEMaskSet(\n\t\tsize_t row,          // current row\n\t\tsize_t col) const;   // current column\n\n\t/**\n\t * Set the given cell's E mask.  This is the mask of remaining legal ways to\n\t * backtrack from the E cell at this coordinate.  It's 2 bits long and has\n\t * offset=8 into the 16-bit field.\n\t */\n\tvoid eMaskSet(\n\t\tsize_t row,          // current row\n\t\tsize_t col,          // current column\n\t\tint mask);\n\t\n\t/**\n\t * Return true iff the F mask has been set with a previous call to fMaskSet.\n\t */\n\tbool isFMaskSet(\n\t\tsize_t row,          // current row\n\t\tsize_t col) const;   // current column\n\n\t/**\n\t * Set the given cell's F mask.  This is the mask of remaining legal ways to\n\t * backtrack from the F cell at this coordinate.  It's 2 bits long and has\n\t * offset=11 into the 16-bit field.\n\t */\n\tvoid fMaskSet(\n\t\tsize_t row,          // current row\n\t\tsize_t col,          // current column\n\t\tint mask);\n\n\t/**\n\t * Analyze a cell in the SSE-filled dynamic programming matrix.  Determine &\n\t * memorize ways that we can backtrack from the cell.  If there is at least one\n\t * way to backtrack, select one at random and return the selection.\n\t *\n\t * There are a few subtleties to keep in mind regarding which cells can be at\n\t * the end of a backtrace.  First of all: cells from which we can backtrack\n\t * should not be at the end of a backtrace.  
But have to distinguish between\n\t * cells whose masks eventually become 0 (we shouldn't end at those), from\n\t * those whose masks were 0 all along (we can end at those).\n\t */\n\tvoid analyzeCell(\n\t\tsize_t row,          // current row\n\t\tsize_t col,          // current column\n\t\tsize_t ct,           // current cell type: E/F/H\n\t\tint refc,\n\t\tint readc,\n\t\tint readq,\n\t\tconst Scoring& sc,   // scoring scheme\n\t\tint64_t offsetsc,    // offset to add to each score\n\t\tRandomSource& rand,  // rand gen for choosing among equal options\n\t\tbool& empty,         // out: =true iff no way to backtrace\n\t\tint& cur,            // out: =type of transition\n\t\tbool& branch,        // out: =true iff we chose among >1 options\n\t\tbool& canMoveThru,   // out: =true iff ...\n\t\tbool& reportedThru); // out: =true iff ...\n\n\t/**\n\t * Initialize the matrix of masks and backtracking flags.\n\t */\n\tvoid initMasks();\n\n\t/**\n\t * Return the number of rows in the dynamic programming matrix.\n\t */\n\tsize_t nrow() const {\n\t\treturn nrow_;\n\t}\n\n\t/**\n\t * Return the number of columns in the dynamic programming matrix.\n\t */\n\tsize_t ncol() const {\n\t\treturn ncol_;\n\t}\n\t\n\t/**\n\t * Prepare a row so we can use it to store masks.\n\t */\n\tvoid resetRow(size_t i) {\n\t\tassert(!reset_[i]);\n\t\tmasks_[i].resizeNoCopy(ncol_);\n\t\tmasks_[i].fillZero();\n\t\treset_[i] = true;\n\t}\n\n\tbool             inited_;      // initialized?\n\tsize_t           nrow_;        // # rows\n\tsize_t           ncol_;        // # columns\n\tsize_t           nvecrow_;     // # vector rows (<= nrow_)\n\tsize_t           nveccol_;     // # vector columns (<= ncol_)\n\tsize_t           wperv_;       // # words per vector\n\tsize_t           vecshift_;    // # bits to shift to divide by words per vec\n\tsize_t           nvecPerCol_;  // # vectors per column\n\tsize_t           nvecPerCell_; // # vectors per matrix cell (4)\n\tsize_t           colstride_;   // # 
vectors b/t adjacent cells in same row\n\tsize_t           rowstride_;   // # vectors b/t adjacent cells in same col\n\tEList_m128i      matbuf_;      // buffer for holding vectors\n\tELList<uint16_t> masks_;       // buffer for masks/backtracking flags\n\tEList<bool>      reset_;       // true iff row in masks_ has been reset\n};\n\n/**\n * All the data associated with the query profile and other data needed for SSE\n * alignment of a query.\n */\nstruct SSEData {\n\tSSEData(int cat = 0) : profbuf_(cat), mat_(cat) { }\n\tEList_m128i    profbuf_;     // buffer for query profile & temp vecs\n\tEList_m128i    vecbuf_;      // buffer for 2 column vectors (not using mat_)\n\tsize_t         qprofStride_; // stride for query profile\n\tsize_t         gbarStride_;  // gap barrier for query profile\n\tSSEMatrix      mat_;         // SSE matrix for holding all E, F, H vectors\n\tsize_t         maxPen_;      // biggest penalty of all\n\tsize_t         maxBonus_;    // biggest bonus of all\n\tsize_t         lastIter_;    // which 128-bit striped word has final row?\n\tsize_t         lastWord_;    // which word within 128-word has final row?\n\tint            bias_;        // all scores shifted up by this for unsigned\n};\n\n/**\n * Return true iff the H mask has been set with a previous call to hMaskSet.\n */\ninline bool SSEMatrix::isHMaskSet(\n\tsize_t row,          // current row\n\tsize_t col) const    // current column\n{\n\treturn (masks_[row][col] & (1 << 1)) != 0;\n}\n\n/**\n * Set the given cell's H mask.  This is the mask of remaining legal ways to\n * backtrack from the H cell at this coordinate.  
It's 5 bits long and has\n * offset=2 into the 16-bit field.\n */\ninline void SSEMatrix::hMaskSet(\n\tsize_t row,          // current row\n\tsize_t col,          // current column\n\tint mask)\n{\n\tassert_lt(mask, 32);\n\tmasks_[row][col] &= ~(31 << 1);\n\tmasks_[row][col] |= (1 << 1 | mask << 2);\n}\n\n/**\n * Return true iff the E mask has been set with a previous call to eMaskSet.\n */\ninline bool SSEMatrix::isEMaskSet(\n\tsize_t row,          // current row\n\tsize_t col) const    // current column\n{\n\treturn (masks_[row][col] & (1 << 7)) != 0;\n}\n\n/**\n * Set the given cell's E mask.  This is the mask of remaining legal ways to\n * backtrack from the E cell at this coordinate.  It's 2 bits long and has\n * offset=8 into the 16-bit field.\n */\ninline void SSEMatrix::eMaskSet(\n\tsize_t row,          // current row\n\tsize_t col,          // current column\n\tint mask)\n{\n\tassert_lt(mask, 4);\n\tmasks_[row][col] &= ~(7 << 7);\n\tmasks_[row][col] |=  (1 << 7 | mask << 8);\n}\n\n/**\n * Return true iff the F mask has been set with a previous call to fMaskSet.\n */\ninline bool SSEMatrix::isFMaskSet(\n\tsize_t row,          // current row\n\tsize_t col) const    // current column\n{\n\treturn (masks_[row][col] & (1 << 10)) != 0;\n}\n\n/**\n * Set the given cell's F mask.  This is the mask of remaining legal ways to\n * backtrack from the F cell at this coordinate.  It's 2 bits long and has\n * offset=11 into the 16-bit field.\n */\ninline void SSEMatrix::fMaskSet(\n\tsize_t row,          // current row\n\tsize_t col,          // current column\n\tint mask)\n{\n\tassert_lt(mask, 4);\n\tmasks_[row][col] &= ~(7 << 10);\n\tmasks_[row][col] |=  (1 << 10 | mask << 11);\n}\n\n#define ROWSTRIDE_2COL 4\n#define ROWSTRIDE 4\n\n#endif /*ndef ALIGNER_SWSSE_H_*/\n"
  },
  {
    "path": "aligner_swsse_ee_i16.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/**\n * aligner_sw_sse.cpp\n *\n * Versions of key alignment functions that use vector instructions to\n * accelerate dynamic programming.  Based chiefly on the striped Smith-Waterman\n * paper and implementation by Michael Farrar.  See:\n *\n * Farrar M. Striped Smith-Waterman speeds database searches six times over\n * other SIMD implementations. Bioinformatics. 2007 Jan 15;23(2):156-61.\n * http://sites.google.com/site/farrarmichael/smith-waterman\n *\n * While the paper describes an implementation of Smith-Waterman, we extend it\n * do end-to-end read alignment as well as local alignment.  The change\n * required for this is minor: we simply let vmax be the maximum element in the\n * score domain rather than the minimum.\n *\n * The vectorized dynamic programming implementation lacks some features that\n * make it hard to adapt to solving the entire dynamic-programming alignment\n * problem.  
For instance:\n *\n * - It doesn't respect gap barriers on either end of the read\n * - It just gives a maximum; not enough information to backtrace without\n *   redoing some alignment\n * - It's a little difficult to handle st_ and en_, especially st_.\n * - The query profile mechanism makes handling of ambiguous reference bases a\n *   little tricky (16 cols in query profile lookup table instead of 5)\n *\n * Given the drawbacks, it is tempting to use SSE dynamic programming as a\n * filter rather than as an aligner per se.  Here are a few ideas for how it\n * can be extended to handle more of the alignment problem:\n *\n * - Save calculated scores to a big array as we go.  We return to this array\n *   to find and backtrace from good solutions.\n */\n\n#include <limits>\n#include \"aligner_sw.h\"\n\nstatic const size_t NBYTES_PER_REG  = 16;\nstatic const size_t NWORDS_PER_REG  = 8;\nstatic const size_t NBITS_PER_WORD  = 16;\nstatic const size_t NBYTES_PER_WORD = 2;\n\n// In 16-bit end-to-end mode, we have the option of using signed saturated\n// arithmetic.  Because we have signed arithmetic, there's no need to add/subtract\n// bias when building and applying the query profile.  The lowest value we can\n// use is 0x8000, and the greatest is 0x7fff.\n\ntypedef int16_t TCScore;\n\n/**\n * Build query profile look up tables for the read.  The query profile look\n * up table is organized as a 1D array indexed by [i][j] where i is the\n * reference character in the current DP column (0=A, 1=C, etc), and j is\n * the segment of the query we're currently working on.\n */\nvoid SwAligner::buildQueryProfileEnd2EndSseI16(bool fw) {\n\tbool& done = fw ? sseI16fwBuilt_ : sseI16rcBuilt_;\n\tif(done) {\n\t\treturn;\n\t}\n\tdone = true;\n\tconst BTDnaString* rd = fw ? rdfw_ : rdrc_;\n\tconst BTString* qu = fw ? 
qufw_ : qurc_;\n    // daehwan - allows to align a portion of a read, not the whole\n\t// const size_t len = rd->length();\n    const size_t len = dpRows();\n\tconst size_t seglen = (len + (NWORDS_PER_REG-1)) / NWORDS_PER_REG;\n\t// How many __m128i's are needed\n\tsize_t n128s =\n\t\t64 +                    // slack bytes, for alignment?\n\t\t(seglen * ALPHA_SIZE)   // query profile data\n\t\t* 2;                    // & gap barrier data\n\tassert_gt(n128s, 0);\n\tSSEData& d = fw ? sseI16fw_ : sseI16rc_;\n\td.profbuf_.resizeNoCopy(n128s);\n\tassert(!d.profbuf_.empty());\n\td.maxPen_      = d.maxBonus_ = 0;\n\td.lastIter_    = d.lastWord_ = 0;\n\td.qprofStride_ = d.gbarStride_ = 2;\n\td.bias_ = 0; // no bias when words are signed\n\t// For each reference character A, C, G, T, N ...\n\tfor(size_t refc = 0; refc < ALPHA_SIZE; refc++) {\n\t\t// For each segment ...\n\t\tfor(size_t i = 0; i < seglen; i++) {\n\t\t\tsize_t j = i;\n\t\t\tint16_t *qprofWords =\n\t\t\t\treinterpret_cast<int16_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2));\n\t\t\tint16_t *gbarWords =\n\t\t\t\treinterpret_cast<int16_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2) + 1);\n\t\t\t// For each sub-word (byte) ...\n\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\tint sc = 0;\n\t\t\t\t*gbarWords = 0;\n\t\t\t\tif(j < len) {\n\t\t\t\t\tint readc = (*rd)[j];\n\t\t\t\t\tint readq = (*qu)[j];\n\t\t\t\t\tsc = sc_->score(readc, (int)(1 << refc), readq - 33);\n\t\t\t\t\tsize_t j_from_end = len - j - 1;\n\t\t\t\t\tif(j < (size_t)sc_->gapbar ||\n\t\t\t\t\t   j_from_end < (size_t)sc_->gapbar)\n\t\t\t\t\t{\n\t\t\t\t\t\t// Inside the gap barrier\n\t\t\t\t\t\t*gbarWords = 0x8000; // add this twice\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(refc == 0 && j == len-1) {\n\t\t\t\t\t// Remember which 128-bit word and which smaller word has\n\t\t\t\t\t// the final row\n\t\t\t\t\td.lastIter_ = i;\n\t\t\t\t\td.lastWord_ = k;\n\t\t\t\t}\n\t\t\t\tif(sc < 0) {\n\t\t\t\t\tif((size_t)(-sc) > d.maxPen_) 
{\n\t\t\t\t\t\td.maxPen_ = (size_t)(-sc);\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tif((size_t)sc > d.maxBonus_) {\n\t\t\t\t\t\td.maxBonus_ = (size_t)sc;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t*qprofWords = (int16_t)sc;\n\t\t\t\tgbarWords++;\n\t\t\t\tqprofWords++;\n\t\t\t\tj += seglen; // update offset into query\n\t\t\t}\n\t\t}\n\t}\n}\n\n#ifndef NDEBUG\n/**\n * Return true iff the cell has sane E/F/H values w/r/t its predecessors.\n */\nstatic bool cellOkEnd2EndI16(\n\tSSEData& d,\n\tsize_t row,\n\tsize_t col,\n\tint refc,\n\tint readc,\n\tint readq,\n\tconst Scoring& sc)     // scoring scheme\n{\n\tTCScore floorsc = 0x8000;\n\tTCScore ceilsc = MAX_I64;\n\tTAlScore offsetsc = -0x7fff;\n\tTAlScore sc_h_cur = (TAlScore)d.mat_.helt(row, col);\n\tTAlScore sc_e_cur = (TAlScore)d.mat_.eelt(row, col);\n\tTAlScore sc_f_cur = (TAlScore)d.mat_.felt(row, col);\n\tif(sc_h_cur > floorsc) {\n\t\tsc_h_cur += offsetsc;\n\t}\n\tif(sc_e_cur > floorsc) {\n\t\tsc_e_cur += offsetsc;\n\t}\n\tif(sc_f_cur > floorsc) {\n\t\tsc_f_cur += offsetsc;\n\t}\n\tbool gapsAllowed = true;\n\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\tif(row < (size_t)sc.gapbar || rowFromEnd < (size_t)sc.gapbar) {\n\t\tgapsAllowed = false;\n\t}\n\tbool e_left_trans = false, h_left_trans = false;\n\tbool f_up_trans   = false, h_up_trans = false;\n\tbool h_diag_trans = false;\n\tif(gapsAllowed) {\n\t\tTAlScore sc_h_left = floorsc;\n\t\tTAlScore sc_e_left = floorsc;\n\t\tTAlScore sc_h_up   = floorsc;\n\t\tTAlScore sc_f_up   = floorsc;\n\t\tif(col > 0 && sc_e_cur > floorsc && sc_e_cur <= ceilsc) {\n\t\t\tsc_h_left = d.mat_.helt(row, col-1) + offsetsc;\n\t\t\tsc_e_left = d.mat_.eelt(row, col-1) + offsetsc;\n\t\t\te_left_trans = (sc_e_left > floorsc && sc_e_cur == sc_e_left - sc.readGapExtend());\n\t\t\th_left_trans = (sc_h_left > floorsc && sc_e_cur == sc_h_left - sc.readGapOpen());\n\t\t\tassert(e_left_trans || h_left_trans);\n\t\t}\n\t\tif(row > 0 && sc_f_cur > floorsc && sc_f_cur <= ceilsc) {\n\t\t\tsc_h_up = 
d.mat_.helt(row-1, col) + offsetsc;\n\t\t\tsc_f_up = d.mat_.felt(row-1, col) + offsetsc;\n\t\t\tf_up_trans = (sc_f_up > floorsc && sc_f_cur == sc_f_up - sc.refGapExtend());\n\t\t\th_up_trans = (sc_h_up > floorsc && sc_f_cur == sc_h_up - sc.refGapOpen());\n\t\t\tassert(f_up_trans || h_up_trans);\n\t\t}\n\t} else {\n\t\tassert_geq(floorsc, sc_e_cur);\n\t\tassert_geq(floorsc, sc_f_cur);\n\t}\n\tif(col > 0 && row > 0 && sc_h_cur > floorsc && sc_h_cur <= ceilsc) {\n\t\tTAlScore sc_h_upleft = d.mat_.helt(row-1, col-1) + offsetsc;\n\t\tTAlScore sc_diag = sc.score(readc, (int)refc, readq - 33);\n\t\th_diag_trans = sc_h_cur == sc_h_upleft + sc_diag;\n\t}\n\tassert(\n\t\tsc_h_cur <= floorsc ||\n\t\te_left_trans ||\n\t\th_left_trans ||\n\t\tf_up_trans   ||\n\t\th_up_trans   ||\n\t\th_diag_trans ||\n\t\tsc_h_cur > ceilsc ||\n\t\trow == 0 ||\n\t\tcol == 0);\n\treturn true;\n}\n#endif /*ndef NDEBUG*/\n\n#ifdef NDEBUG\n\n#define assert_all_eq0(x)\n#define assert_all_gt(x, y)\n#define assert_all_gt_lo(x)\n#define assert_all_lt(x, y)\n#define assert_all_lt_hi(x)\n\n#else\n\n#define assert_all_eq0(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpeq_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt(x, y) { \\\n\t__m128i tmp = _mm_cmpgt_epi16(x, y); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt_lo(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpgt_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt(x, y) { \\\n\t__m128i tmp = _mm_cmplt_epi16(x, y); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_leq(x, y) { \\\n\t__m128i tmp = _mm_cmpgt_epi16(x, y); \\\n\tassert_eq(0x0000, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt_hi(x) { \\\n\t__m128i z = 
_mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_cmpeq_epi16(z, z); \\\n\tz = _mm_srli_epi16(z, 1); \\\n\ttmp = _mm_cmplt_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n#endif\n\n/**\n * Aligns by filling a dynamic programming matrix with the SSE-accelerated,\n * banded DP approach of Farrar.  As it goes, it determines which cells we\n * might backtrace from and tallies the best (highest-scoring) N backtrace\n * candidate cells per diagonal.  Also returns the alignment score of the best\n * alignment in the matrix.\n *\n * This routine does *not* maintain a matrix holding the entire matrix worth of\n * scores, nor does it maintain any other dense O(mn) data structure, as this\n * would quickly exhaust memory for queries longer than about 10,000 kb.\n * Instead, in the fill stage it maintains two columns worth of scores at a\n * time (current/previous, or right/left) - these take O(m) space.  When\n * finished with the current column, it determines which cells from the\n * previous column, if any, are candidates we might backtrace from to find a\n * full alignment.  A candidate cell has a score that rises above the threshold\n * and isn't improved upon by a match in the next column.  The best N\n * candidates per diagonal are stored in a O(m + n) data structure.\n */\nTAlScore SwAligner::alignGatherEE16(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? 
sseI16ExtendMet_ : sseI16MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileEnd2EndSseI16(fw_);\n\tassert(!d.profbuf_.empty());\n\n\tassert_eq(0, d.maxBonus_);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\t\n\t// Now set up the score vectors.  We just need two columns worth, which\n\t// we'll call \"left\" and \"right\".\n\td.vecbuf_.resize(4 * 2 * iter);\n\td.vecbuf_.zero();\n\t__m128i *vbuf_l = d.vecbuf_.ptr();\n\t__m128i *vbuf_r = d.vecbuf_.ptr() + (4 * iter);\n\t\n\t// This is the data structure that holds candidate cells per diagonal.\n\tconst size_t ndiags = rff_ - rfi_ + dpRows() - 1;\n\tif(!debug) {\n\t\tbtdiag_.init(ndiags, 2);\n\t}\n\n\t// Data structure that holds checkpointed anti-diagonals\n\tTAlScore perfectScore = sc_->perfectScore(dpRows());\n\tbool checkpoint = true;\n\tbool cpdebug = false;\n#ifndef NDEBUG\n\tcpdebug = dpRows() < 1000;\n#endif\n\tcper_.init(\n\t\tdpRows(),      // # rows\n\t\trff_ - rfi_,   // # columns\n\t\tcperPerPow2_,  // checkpoint every 1 << perpow2 diags (& next)\n\t\tperfectScore,  // perfect score (for sanity checks)\n\t\tfalse,         // matrix cells have 8-bit scores?\n\t\tcperTri_,      // triangular mini-fills?\n\t\tfalse,         // alignment is local?\n\t\tcpdebug);      // save all cells for debugging?\n\t\t\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implmentation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i vhilsw   = _mm_setzero_si128();\n\t__m128i vlolsw   = _mm_setzero_si128();\n\t__m128i ve       = 
_mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n\t__m128i vhd      = _mm_setzero_si128();\n\t__m128i vhdtmp   = _mm_setzero_si128();\n\t__m128i vtmp     = _mm_setzero_si128();\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_I16);\n\trfgapo = _mm_insert_epi16(rfgapo, sc_->refGapOpen(), 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_I16);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\trfgape = _mm_insert_epi16(rfgape, sc_->refGapExtend(), 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_I16);\n\trdgapo = _mm_insert_epi16(rdgapo, sc_->readGapOpen(), 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_I16);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\trdgape = _mm_insert_epi16(rdgape, sc_->readGapExtend(), 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\n\t// Set all elts to 0x8000 (min value for signed 16-bit)\n\tvlo = _mm_cmpeq_epi16(vlo, vlo);             // all elts = 0xffff\n\tvlo = _mm_slli_epi16(vlo, NBITS_PER_WORD-1); // all elts = 0x8000\n\t\n\t// Set all elts to 0x7fff (max value for signed 16-bit)\n\tvhi = _mm_cmpeq_epi16(vhi, vhi);             // all elts = 0xffff\n\tvhi = _mm_srli_epi16(vhi, 1);                // all elts = 0x7fff\n\t\n\t// vlolsw: topmost (least sig) word set to 0x8000, all other words=0\n\tvlolsw = _mm_shuffle_epi32(vlo, 0);\n\tvlolsw = _mm_srli_si128(vlolsw, NBYTES_PER_REG - 
NBYTES_PER_WORD);\n\t\n\t// vhilsw: topmost (least sig) word set to 0x7fff, all other words=0\n\tvhilsw = _mm_shuffle_epi32(vhi, 0);\n\tvhilsw = _mm_srli_si128(vhilsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\tconst size_t colstride = ROWSTRIDE_2COL * iter;\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvELeft = vbuf_l + 0; __m128i *pvERight = vbuf_r + 0;\n\t/* __m128i *pvFLeft = vbuf_l + 1; */ __m128i *pvFRight = vbuf_r + 1;\n\t__m128i *pvHLeft = vbuf_l + 2; __m128i *pvHRight = vbuf_r + 2;\n\t\n\t// Maximum score in final row\n\tbool found = false;\n\tTCScore lrmax = MIN_I16;\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t_mm_store_si128(pvERight, vlo); pvERight += ROWSTRIDE_2COL;\n\t\t// Could initialize Hs to high or low.  If high, cells in the lower\n\t\t// triangle will have somewhat more legitiate scores, but still won't\n\t\t// be exhaustively scored.\n\t\t_mm_store_si128(pvHRight, vlo); pvHRight += ROWSTRIDE_2COL;\n\t}\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  
The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\t// Swap left and right; vbuf_l is the vector on the left, which we\n\t\t// generally load from, and vbuf_r is the vector on the right, which we\n\t\t// generally store to.\n\t\tswap(vbuf_l, vbuf_r);\n\t\tpvELeft = vbuf_l + 0; pvERight = vbuf_r + 0;\n\t\t/* pvFLeft = vbuf_l + 1; */ pvFRight = vbuf_r + 1;\n\t\tpvHLeft = vbuf_l + 2; pvHRight = vbuf_r + 2;\n\t\t\n\t\t// Fetch the appropriate query profile.  
Note that elements of rf_ must\n\t\t// be numbers, not masks.\n\t\tconst int refc = (int)rf_[i];\n\t\t\n\t\t// Fetch the appropriate query profile\n\t\tsize_t off = (size_t)firsts5[refc] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Set all cells to low value\n\t\tvf = _mm_cmpeq_epi16(vf, vf);\n\t\tvf = _mm_slli_epi16(vf, NBITS_PER_WORD-1);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\n\t\t// Load H vector from the final row of the previous column\n\t\tvh = _mm_load_si128(pvHLeft + colstride - ROWSTRIDE_2COL);\n\t\t// Shift 2 bytes down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t// Fill topmost (least sig) cell with high value\n\t\tvh = _mm_or_si128(vh, vhilsw);\n\t\t\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 0; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELeft);\n\t\t\tvhd = _mm_load_si128(pvHLeft);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELeft += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[0]);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvhdtmp = vhd;\n\t\t\tvhd = _mm_subs_epi16(vhd, rdgapo);\n\t\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\t\tve = _mm_max_epi16(ve, vhd);\n\t\t\tvh = _mm_max_epi16(vh, ve);\n\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHRight, 
vh);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\tvtmp = vh;\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\t\n\t\t\t// Load the next h value\n\t\t\tvh = vhdtmp;\n\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvERight, ve);\n\t\t\tpvERight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epi16(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epi16(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFRight -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFRight);\n\t\t\n\t\tpvHRight -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHRight);\n\t\t\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0x0000) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFRight -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tpvHRight -= colstride;\n\t\t\t\tvh = 
_mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\n\t\t\n\t\t// Check in the last row for the maximum so far\n\t\t__m128i *vtmp = vbuf_r + 2 /* H */ + (d.lastIter_ * ROWSTRIDE_2COL);\n\t\t// Note: we may not want to extract from the final row\n\t\tTCScore lr = ((TCScore*)(vtmp))[d.lastWord_];\n\t\tfound = true;\n\t\tif(lr > lrmax) {\n\t\t\tlrmax = lr;\n\t\t}\n\t\t\n\t\t// Now we'd like to know whether the bottommost element of the right\n\t\t// column is a candidate we might backtrace from.  
First question is:\n\t\t// did it exceed the minimum score threshold?\n\t\tTAlScore score = (TAlScore)(lr - 0x7fff);\n\t\tif(lr == MIN_I16) {\n\t\t\tscore = MIN_I64;\n\t\t}\n\t\tif(!debug && score >= minsc_) {\n\t\t\tDpBtCandidate cand(dpRows() - 1, i - rfi_, score);\n\t\t\tbtdiag_.add(i - rfi_, cand);\n\t\t}\n\n\t\t// Save some elements to checkpoints\n\t\tif(checkpoint) {\n\t\t\t\n\t\t\t__m128i *pvE = vbuf_r + 0;\n\t\t\t__m128i *pvF = vbuf_r + 1;\n\t\t\t__m128i *pvH = vbuf_r + 2;\n\t\t\tsize_t coli = i - rfi_;\n\t\t\tif(coli < cper_.locol_) cper_.locol_ = coli;\n\t\t\tif(coli > cper_.hicol_) cper_.hicol_ = coli;\n\t\t\t\n\t\t\tif(cperTri_) {\n\t\t\t\tsize_t rc_mod = coli & cper_.lomask_;\n\t\t\t\tassert_lt(rc_mod, cper_.per_);\n\t\t\t\tint64_t row = -rc_mod-1;\n\t\t\t\tint64_t row_mod = row;\n\t\t\t\tint64_t row_div = 0;\n\t\t\t\tsize_t idx = coli >> cper_.perpow2_;\n\t\t\t\tsize_t idxrow = idx * cper_.nrow_;\n\t\t\t\tassert_eq(4, ROWSTRIDE_2COL);\n\t\t\t\tbool done = false;\n\t\t\t\twhile(true) {\n\t\t\t\t\trow += (cper_.per_ - 2);\n\t\t\t\t\trow_mod += (cper_.per_ - 2);\n\t\t\t\t\tfor(size_t j = 0; j < 2; j++) {\n\t\t\t\t\t\trow++;\n\t\t\t\t\t\trow_mod++;\n\t\t\t\t\t\tif(row >= 0 && (size_t)row < cper_.nrow_) {\n\t\t\t\t\t\t\t// Update row divided by iter_ and mod iter_\n\t\t\t\t\t\t\twhile(row_mod >= (int64_t)iter) {\n\t\t\t\t\t\t\t\trow_mod -= (int64_t)iter;\n\t\t\t\t\t\t\t\trow_div++;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tsize_t delt = idxrow + row;\n\t\t\t\t\t\t\tsize_t vecoff = (row_mod << 5) + row_div;\n\t\t\t\t\t\t\tassert_lt(row_div, 8);\n\t\t\t\t\t\t\tint16_t h_sc = ((int16_t*)pvH)[vecoff];\n\t\t\t\t\t\t\tint16_t e_sc = ((int16_t*)pvE)[vecoff];\n\t\t\t\t\t\t\tint16_t f_sc = ((int16_t*)pvF)[vecoff];\n\t\t\t\t\t\t\tif(h_sc != MIN_I16) h_sc -= 0x7fff;\n\t\t\t\t\t\t\tif(e_sc != MIN_I16) e_sc -= 0x7fff;\n\t\t\t\t\t\t\tif(f_sc != MIN_I16) f_sc -= 0x7fff;\n\t\t\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(e_sc, 
cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\t\t\tCpQuad *qdiags = ((j == 0) ? cper_.qdiag1s_.ptr() : cper_.qdiag2s_.ptr());\n\t\t\t\t\t\t\tqdiags[delt].sc[0] = h_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[1] = e_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[2] = f_sc;\n\t\t\t\t\t\t} // if(row >= 0 && row < nrow_)\n\t\t\t\t\t\telse if(row >= 0 && (size_t)row >= cper_.nrow_) {\n\t\t\t\t\t\t\tdone = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t} // end of loop over anti-diags\n\t\t\t\t\tif(done) {\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tidx++;\n\t\t\t\t\tidxrow += cper_.nrow_;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// If this is the first column, take this opportunity to\n\t\t\t\t// pre-calculate the coordinates of the elements we're going to\n\t\t\t\t// checkpoint.\n\t\t\t\tif(coli == 0) {\n\t\t\t\t\tsize_t cpi    = cper_.per_-1;\n\t\t\t\t\tsize_t cpimod = cper_.per_-1;\n\t\t\t\t\tsize_t cpidiv = 0;\n\t\t\t\t\tcper_.commitMap_.clear();\n\t\t\t\t\twhile(cpi < cper_.nrow_) {\n\t\t\t\t\t\twhile(cpimod >= iter) {\n\t\t\t\t\t\t\tcpimod -= iter;\n\t\t\t\t\t\t\tcpidiv++;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tsize_t vecoff = (cpimod << 5) + cpidiv;\n\t\t\t\t\t\tcper_.commitMap_.push_back(vecoff);\n\t\t\t\t\t\tcpi += cper_.per_;\n\t\t\t\t\t\tcpimod += cper_.per_;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t// Save all the rows\n\t\t\t\tsize_t rowoff = 0;\n\t\t\t\tsize_t sz = cper_.commitMap_.size();\n\t\t\t\tfor(size_t i = 0; i < sz; i++, rowoff += cper_.ncol_) {\n\t\t\t\t\tsize_t vecoff = cper_.commitMap_[i];\n\t\t\t\t\tint16_t h_sc = ((int16_t*)pvH)[vecoff];\n\t\t\t\t\tint16_t e_sc = ((int16_t*)pvE)[vecoff];\n\t\t\t\t\tint16_t f_sc = ((int16_t*)pvF)[vecoff];\n\t\t\t\t\tif(h_sc != MIN_I16) h_sc -= 0x7fff;\n\t\t\t\t\tif(e_sc != MIN_I16) e_sc -= 0x7fff;\n\t\t\t\t\tif(f_sc != MIN_I16) f_sc -= 0x7fff;\n\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\tassert_leq(e_sc, cper_.perf_);\n\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\tCpQuad& dst = cper_.qrows_[rowoff + 
coli];\n\t\t\t\t\tdst.sc[0] = h_sc;\n\t\t\t\t\tdst.sc[1] = e_sc;\n\t\t\t\t\tdst.sc[2] = f_sc;\n\t\t\t\t}\n\t\t\t\t// Is this a column we'd like to checkpoint?\n\t\t\t\tif((coli & cper_.lomask_) == cper_.lomask_) {\n\t\t\t\t\t// Save the column using memcpys\n\t\t\t\t\tassert_gt(coli, 0);\n\t\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\t\tsize_t coloff = (coli >> cper_.perpow2_) * wordspercol;\n\t\t\t\t\t__m128i *dst = cper_.qcols_.ptr() + coloff;\n\t\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(cper_.debug_) {\n\t\t\t\t// Save the column using memcpys\n\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\tsize_t coloff = coli * wordspercol;\n\t\t\t\t__m128i *dst = cper_.qcolsD_.ptr() + coloff;\n\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t}\n\t\t}\n\t}\n\t\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\n\tflag = 0;\n\n\t// Did we find a solution?\n\tTAlScore score = MIN_I64;\n\tif(!found) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn MIN_I64;\n\t} else {\n\t\tscore = (TAlScore)(lrmax - 0x7fff);\n\t\tif(score < minsc_) {\n\t\t\tflag = -1; // no\n\t\t\tif(!debug) met.dpfail++;\n\t\t\treturn score;\n\t\t}\n\t}\n\t\n\t// Could we have saturated?\n\tif(lrmax == MIN_I16) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\t\n\t// Now take all the backtrace candidates in the btdiag_ structure and\n\t// dump them into the btncand_ array.  
They'll be sorted later.\n\tif(!debug) {\n\t\tbtdiag_.dump(btncand_);\n\t\tassert(!btncand_.empty());\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn score;\n}\n\n/**\n * Solve the current alignment problem using SSE instructions that operate on 8\n * signed 16-bit values packed into a single 128-bit register.\n */\nTAlScore SwAligner::alignNucleotidesEnd2EndSseI16(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? sseI16ExtendMet_ : sseI16MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileEnd2EndSseI16(fw_);\n\tassert(!d.profbuf_.empty());\n\n\tassert_eq(0, d.maxBonus_);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i vhilsw   = _mm_setzero_si128();\n\t__m128i vlolsw   = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n#if 0\n\t__m128i vhd      = _mm_setzero_si128();\n\t__m128i vhdtmp   = _mm_setzero_si128();\n#endif\n\t__m128i vtmp     = 
_mm_setzero_si128();\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_I16);\n\trfgapo = _mm_insert_epi16(rfgapo, sc_->refGapOpen(), 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_I16);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\trfgape = _mm_insert_epi16(rfgape, sc_->refGapExtend(), 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_I16);\n\trdgapo = _mm_insert_epi16(rdgapo, sc_->readGapOpen(), 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_I16);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\trdgape = _mm_insert_epi16(rdgape, sc_->readGapExtend(), 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\n\t// Set all elts to 0x8000 (min value for signed 16-bit)\n\tvlo = _mm_cmpeq_epi16(vlo, vlo);             // all elts = 0xffff\n\tvlo = _mm_slli_epi16(vlo, NBITS_PER_WORD-1); // all elts = 0x8000\n\t\n\t// Set all elts to 0x7fff (max value for signed 16-bit)\n\tvhi = _mm_cmpeq_epi16(vhi, vhi);             // all elts = 0xffff\n\tvhi = _mm_srli_epi16(vhi, 1);                // all elts = 0x7fff\n\t\n\t// vlolsw: topmost (least sig) word set to 0x8000, all other words=0\n\tvlolsw = _mm_shuffle_epi32(vlo, 0);\n\tvlolsw = _mm_srli_si128(vlolsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// vhilsw: topmost (least sig) word set to 0x7fff, all other words=0\n\tvhilsw = _mm_shuffle_epi32(vhi, 0);\n\tvhilsw = _mm_srli_si128(vhilsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// Points to 
a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\td.mat_.init(dpRows(), rff_ - rfi_, NWORDS_PER_REG);\n\tconst size_t colstride = d.mat_.colstride();\n\tassert_eq(ROWSTRIDE, colstride / iter);\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvHTmp = d.mat_.tmpvec(0, 0);\n\t__m128i *pvETmp = d.mat_.evec(0, 0);\n\t\n\t// Maximum score in final row\n\tbool found = false;\n\tTCScore lrmax = MIN_I16;\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t_mm_store_si128(pvETmp, vlo);\n\t\t// Could initialize Hs to high or low.  If high, cells in the lower\n\t\t// triangle will have somewhat more legitimate scores, but still won't\n\t\t// be exhaustively scored.\n\t\t_mm_store_si128(pvHTmp, vlo);\n\t\tpvETmp += ROWSTRIDE;\n\t\tpvHTmp += ROWSTRIDE;\n\t}\n\t// These are swapped just before the innermost loop\n\t__m128i *pvHStore = d.mat_.hvec(0, 0);\n\t__m128i *pvHLoad  = d.mat_.tmpvec(0, 0);\n\t__m128i *pvELoad  = d.mat_.evec(0, 0);\n\t__m128i *pvEStore = d.mat_.evecUnsafe(0, 1);\n\t__m128i *pvFStore = d.mat_.fvec(0, 0);\n\t__m128i *pvFTmp   = NULL;\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\t\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  
The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tcolstop_ = rff_ - 1;\n\tlastsolcol_ = 0;\n\t\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert(pvFStore == d.mat_.fvec(0, i - rfi_));\n\t\tassert(pvHStore == d.mat_.hvec(0, i - rfi_));\n\t\t\n\t\t// Fetch the appropriate query profile.  Note that elements of rf_ must\n\t\t// be numbers, not masks.\n\t\tconst int refc = (int)rf_[i];\n\t\tsize_t off = (size_t)firsts5[refc] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Set all cells to low value\n\t\tvf = _mm_cmpeq_epi16(vf, vf);\n\t\tvf = _mm_slli_epi16(vf, NBITS_PER_WORD-1);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\n\t\t// Load H vector from the final row of the previous column\n\t\tvh = _mm_load_si128(pvHLoad + colstride - ROWSTRIDE);\n\t\t// Shift 2 bytes down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t// Fill topmost (least sig) cell with high value\n\t\tvh = _mm_or_si128(vh, vhilsw);\n\t\t\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 0; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELoad);\n#if 0\n\t\t\tvhd = _mm_load_si128(pvHLoad);\n#endif\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, 
pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[0]);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epi16(vh, ve);\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvtmp = vh;\n#if 0\n\t\t\tvhdtmp = vhd;\n\t\t\tvhd = _mm_subs_epi16(vhd, rdgapo);\n\t\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\t\tve = _mm_max_epi16(ve, vhd);\n#else\n\t\t\tvh = _mm_subs_epi16(vh, rdgapo);\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[1]); // veto some read gap opens\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\t\tve = _mm_max_epi16(ve, vh);\n#endif\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\t\n\t\t\t// Load the next h value\n#if 0\n\t\t\tvh = vhdtmp;\n#else\n\t\t\tvh = _mm_load_si128(pvHLoad);\n#endif\n\t\t\tpvHLoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epi16(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epi16(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFTmp = pvFStore;\n\t\tpvFStore -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFStore);\n\t\t\n\t\tpvHStore -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHStore);\n\t\t\n#if 0\n#else\n\t\tpvEStore -= colstride; // reset to start of column\n\t\tve = 
_mm_load_si128(pvEStore);\n#endif\n\t\t\n\t\tpvHLoad = pvHStore;    // new pvHLoad = pvHStore\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0x0000) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update E in case it can be improved using our new vh\n#if 0\n#else\n\t\t\tvh = _mm_subs_epi16(vh, rdgapo);\n\t\t\tvh = _mm_adds_epi16(vh, *pvScore); // veto some read gap opens\n\t\t\tvh = _mm_adds_epi16(vh, *pvScore); // veto some read gap opens\n\t\t\tve = _mm_max_epi16(ve, vh);\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n#endif\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFStore -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tpvHStore -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n#if 0\n#else\n\t\t\t\tpvEStore -= colstride;\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n#endif\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   
// load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n#if 0\n#else\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n#endif\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\n#ifndef NDEBUG\n\t\tif((rand() & 15) == 0) {\n\t\t\t// This is a work-intensive sanity check; each time we finish filling\n\t\t\t// a column, we check that each H, E, and F is sensible.\n\t\t\tfor(size_t k = 0; k < dpRows(); k++) {\n\t\t\t\tassert(cellOkEnd2EndI16(\n\t\t\t\t\td,\n\t\t\t\t\tk,                   // row\n\t\t\t\t\ti - rfi_,            // col\n\t\t\t\t\trefc,                // reference mask\n\t\t\t\t\t(int)(*rd_)[rdi_+k], // read char\n\t\t\t\t\t(int)(*qu_)[rdi_+k], // read quality\n\t\t\t\t\t*sc_));              // scoring scheme\n\t\t\t}\n\t\t}\n#endif\n\t\t\n\t\t__m128i *vtmp = d.mat_.hvec(d.lastIter_, i-rfi_);\n\t\t// Note: we may not want to extract from the final row\n\t\tTCScore lr = ((TCScore*)(vtmp))[d.lastWord_];\n\t\tfound = true;\n\t\tif(lr > lrmax) {\n\t\t\tlrmax = lr;\n\t\t}\n\n\t\t// pvELoad and pvHLoad are already where they need to be\n\t\t\n\t\t// Adjust the load and store vectors here.  
\n\t\tpvHStore = pvHLoad + colstride;\n\t\tpvEStore = pvELoad + colstride;\n\t\tpvFStore = pvFTmp;\n\t}\n\t\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\t\n\tflag = 0;\n\t\n\t// Did we find a solution?\n\tTAlScore score = MIN_I64;\n\tif(!found) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn MIN_I64;\n\t} else {\n\t\tscore = (TAlScore)(lrmax - 0x7fff);\n\t\tif(score < minsc_) {\n\t\t\tflag = -1; // no\n\t\t\tif(!debug) met.dpfail++;\n\t\t\treturn score;\n\t\t}\n\t}\n\t\n\t// Could we have saturated?\n\tif(lrmax == MIN_I16) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn score;\n}\n\n/**\n * Given a filled-in DP table, populate the btncand_ list with candidate cells\n * that might be at the ends of valid alignments.  No need to do this unless\n * the maximum score returned by the align*() func is >= the minimum.\n *\n * Only cells that are exhaustively scored are candidates.  
Those are the\n * cells inside the shape made of o's in this:\n *\n *  |-maxgaps-|\n *  *********************************    -\n *   ********************************    |\n *    *******************************    |\n *     ******************************    |\n *      *****************************    |\n *       **************************** read len\n *        ***************************    |\n *         **************************    |\n *          *************************    |\n *           ************************    |\n *            ***********oooooooooooo    -\n *            |-maxgaps-|\n *  |-readlen-|\n *  |-------skip--------|\n *\n * And it's possible for the shape to be truncated on the left and right sides.\n *\n * \n */\nbool SwAligner::gatherCellsNucleotidesEnd2EndSseI16(TAlScore best) {\n\t// What's the minimum number of rows that can possibly be spanned by an\n\t// alignment that meets the minimum score requirement?\n\tassert(sse16succ_);\n\tconst size_t ncol = rff_ - rfi_;\n\tconst size_t nrow = dpRows();\n\tassert_gt(nrow, 0);\n\tbtncand_.clear();\n\tbtncanddone_.clear();\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? 
sseI16ExtendMet_ : sseI16MateMet_;\n\tassert(!d.profbuf_.empty());\n\tconst size_t colstride = d.mat_.colstride();\n\tASSERT_ONLY(bool sawbest = false);\n\t__m128i *pvH = d.mat_.hvec(d.lastIter_, 0);\n\tfor(size_t j = 0; j < ncol; j++) {\n\t\tTAlScore sc = (TAlScore)(((TCScore*)pvH)[d.lastWord_] - 0x7fff);\n\t\tassert_leq(sc, best);\n\t\tASSERT_ONLY(sawbest = (sawbest || sc == best));\n\t\tif(sc >= minsc_) {\n\t\t\t// Yes, this is legit\n\t\t\tmet.gathsol++;\n\t\t\tbtncand_.expand();\n\t\t\tbtncand_.back().init(nrow-1, j, sc);\n\t\t}\n\t\tpvH += colstride;\n\t}\n\tassert(sawbest);\n\tif(!btncand_.empty()) {\n\t\td.mat_.initMasks();\n\t}\n\treturn !btncand_.empty();\n}\n\n#define MOVE_VEC_PTR_UP(vec, rowvec, rowelt) { \\\n\tif(rowvec == 0) { \\\n\t\trowvec += d.mat_.nvecrow_; \\\n\t\tvec += d.mat_.colstride_; \\\n\t\trowelt--; \\\n\t} \\\n\trowvec--; \\\n\tvec -= ROWSTRIDE; \\\n}\n\n#define MOVE_VEC_PTR_LEFT(vec, rowvec, rowelt) { vec -= d.mat_.colstride_; }\n\n#define MOVE_VEC_PTR_UPLEFT(vec, rowvec, rowelt) { \\\n \tMOVE_VEC_PTR_UP(vec, rowvec, rowelt); \\\n \tMOVE_VEC_PTR_LEFT(vec, rowvec, rowelt); \\\n}\n\n#define MOVE_ALL_LEFT() { \\\n\tMOVE_VEC_PTR_LEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UP() { \\\n\tMOVE_VEC_PTR_UP(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UP(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UP(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UPLEFT() { \\\n\tMOVE_VEC_PTR_UPLEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define NEW_ROW_COL(row, col) { \\\n\trowelt = row / 
d.mat_.nvecrow_; \\\n\trowvec = row % d.mat_.nvecrow_; \\\n\teltvec = (col * d.mat_.colstride_) + (rowvec * ROWSTRIDE); \\\n\tcur_vec = d.mat_.matbuf_.ptr() + eltvec; \\\n\tleft_vec = cur_vec; \\\n\tleft_rowelt = rowelt; \\\n\tleft_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tup_vec = cur_vec; \\\n\tup_rowelt = rowelt; \\\n\tup_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tupleft_vec = up_vec; \\\n\tupleft_rowelt = up_rowelt; \\\n\tupleft_rowvec = up_rowvec; \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n/**\n * Given the dynamic programming table and a cell, trace backwards from the\n * cell and install the edits and score/penalty in the appropriate fields\n * of res.  The RandomSource is used to break ties among equally good ways\n * of tracing back.\n *\n * Whenever we enter a cell, we check whether the read/ref coordinates of\n * that cell correspond to a cell we traversed constructing a previous\n * alignment.  If so, we backtrack to the last decision point, mask out the\n * path that led to the previously observed cell, and continue along a\n * different path; or, if there are no more paths to try, we give up.\n *\n * If an alignment is found, 'off' is set to the alignment's upstream-most\n * reference character's offset into the chromosome and true is returned.\n * Otherwise, false is returned.\n */\nbool SwAligner::backtraceNucleotidesEnd2EndSseI16(\n\tTAlScore       escore, // in: expected score\n\tSwResult&      res,    // out: store results (edits and scores) here\n\tsize_t&        off,    // out: store diagonal projection of origin\n\tsize_t&        nbts,   // out: # backtracks\n\tsize_t         row,    // start in this row\n\tsize_t         col,    // start in this column\n\tRandomSource&  rnd)    // random gen, to choose among equal paths\n{\n\tassert_lt(row, dpRows());\n\tassert_lt(col, (size_t)(rff_ - rfi_));\n\tSSEData& d = fw_ ? 
sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? sseI16ExtendMet_ : sseI16MateMet_;\n\tmet.bt++;\n\tassert(!d.profbuf_.empty());\n\tassert_lt(row, rd_->length());\n\tbtnstack_.clear(); // empty the backtrack stack\n\tbtcells_.clear();  // empty the cells-so-far list\n\tAlnScore score; score.score_ = 0;\n\t// score.gaps_ = score.ns_ = 0;\n\tsize_t origCol = col;\n\tsize_t gaps = 0, readGaps = 0, refGaps = 0;\n\tres.alres.reset();\n    EList<Edit>& ned = res.alres.ned();\n\tassert(ned.empty());\n\tassert_gt(dpRows(), row);\n\tASSERT_ONLY(size_t trimEnd = dpRows() - row - 1);\n\tsize_t trimBeg = 0;\n\tsize_t ct = SSEMatrix::H; // cell type\n\t// Row and col in terms of where they fall in the SSE vector matrix\n\tsize_t rowelt, rowvec, eltvec;\n\tsize_t left_rowelt, up_rowelt, upleft_rowelt;\n\tsize_t left_rowvec, up_rowvec, upleft_rowvec;\n\t__m128i *cur_vec, *left_vec, *up_vec, *upleft_vec;\n\tNEW_ROW_COL(row, col);\n\twhile((int)row >= 0) {\n\t\tmet.btcell++;\n\t\tnbts++;\n\t\tint readc = (*rd_)[rdi_ + row];\n\t\tint refm  = (int)rf_[rfi_ + col];\n\t\tint readq = (*qu_)[row];\n\t\tassert_leq(col, origCol);\n\t\t// Get score in this cell\n\t\tbool empty = false, reportedThru, canMoveThru, branch = false;\n\t\tint cur = SSEMatrix::H;\n\t\tif(!d.mat_.reset_[row]) {\n\t\t\td.mat_.resetRow(row);\n\t\t}\n\t\treportedThru = d.mat_.reportedThrough(row, col);\n\t\tcanMoveThru = true;\n\t\tif(reportedThru) {\n\t\t\tcanMoveThru = false;\n\t\t} else {\n\t\t\tempty = false;\n\t\t\tif(row > 0) {\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\t\t\t\tbool gapsAllowed = true;\n\t\t\t\tif(row < (size_t)sc_->gapbar ||\n\t\t\t\t   rowFromEnd < (size_t)sc_->gapbar)\n\t\t\t\t{\n\t\t\t\t\tgapsAllowed = false;\n\t\t\t\t}\n\t\t\t\tconst TAlScore floorsc = MIN_I64;\n\t\t\t\tconst int offsetsc = -0x7fff;\n\t\t\t\t// Move to beginning of column/row\n\t\t\t\tif(ct == SSEMatrix::E) { // AKA rdgap\n\t\t\t\t\tassert_gt(col, 0);\n\t\t\t\t\tTAlScore sc_cur = 
((TCScore*)(cur_vec + SSEMatrix::E))[rowelt] + offsetsc;\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\t// Currently in the E matrix; incoming transition must come from the\n\t\t\t\t\t// left.  It's either a gap open from the H matrix or a gap extend from\n\t\t\t\t\t// the E matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell to the left\n\t\t\t\t\tTAlScore sc_h_left = ((TCScore*)(left_vec + SSEMatrix::H))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_h_left > floorsc && sc_h_left - sc_->readGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get E score of cell to the left\n\t\t\t\t\tTAlScore sc_e_left = ((TCScore*)(left_vec + SSEMatrix::E))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_e_left > floorsc && sc_e_left - sc_->readGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isEMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 8) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// Pick E -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the E cell\n\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, 
col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else if(ct == SSEMatrix::F) { // AKA rfgap\n\t\t\t\t\tassert_gt(row, 0);\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\tTAlScore sc_h_up = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_cur  = ((TCScore*)(cur_vec + SSEMatrix::F))[rowelt] + offsetsc;\n\t\t\t\t\t// Currently in the F matrix; incoming transition must come from above.\n\t\t\t\t\t// It's either a gap open from the H matrix or a gap extend from the F\n\t\t\t\t\t// matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell above\n\t\t\t\t\tif(sc_h_up > floorsc && sc_h_up - sc_->refGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get F score of cell above\n\t\t\t\t\tif(sc_f_up > floorsc && sc_f_up - sc_->refGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isFMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 11) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\t\tcur = 
SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\t\t\tTAlScore sc_cur      = ((TCScore*)(cur_vec + SSEMatrix::H))[rowelt]    + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up     = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_up     = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::H))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_e_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::E))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_h_upleft = col > 0 ? 
(((TCScore*)(upleft_vec + SSEMatrix::H))[upleft_rowelt] + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_diag     = sc_->score(readc, refm, readq - 33);\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\tif(gapsAllowed) {\n\t\t\t\t\t\tif(sc_h_up     > floorsc && sc_cur == sc_h_up   - sc_->refGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_h_left   > floorsc && sc_cur == sc_h_left - sc_->readGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_f_up     > floorsc && sc_cur == sc_f_up   - sc_->refGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 2);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_e_left   > floorsc && sc_cur == sc_e_left - sc_->readGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 3);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tif(sc_h_upleft > floorsc && sc_cur == sc_h_upleft + sc_diag) {\n\t\t\t\t\t\tmask |= (1 << 4);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isHMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 2) & 31;\n\t\t\t\t\t}\n\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\tint opts = alts5[mask];\n\t\t\t\t\tint select = -1;\n\t\t\t\t\tif(opts == 1) {\n\t\t\t\t\t\tselect = firsts5[mask];\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, 0);\n\t\t\t\t\t} else if(opts > 1) {\n#if 1\n\t\t\t\t\t\tif(       (mask & 16) != 0) {\n\t\t\t\t\t\t\tselect = 4; // H diag\n\t\t\t\t\t\t} else if((mask & 1) != 0) {\n\t\t\t\t\t\t\tselect = 0; // H up\n\t\t\t\t\t\t} else if((mask & 4) != 0) {\n\t\t\t\t\t\t\tselect = 2; // F up\n\t\t\t\t\t\t} else if((mask & 2) != 0) {\n\t\t\t\t\t\t\tselect = 1; // H left\n\t\t\t\t\t\t} else if((mask & 8) != 0) {\n\t\t\t\t\t\t\tselect = 3; // E left\n\t\t\t\t\t\t}\n#else\n\t\t\t\t\t\tselect = randFromMask(rnd, mask);\n#endif\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\tmask &= ~(1 << 
select);\n\t\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, mask);\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else { /* No way to backtrack! */ }\n\t\t\t\t\tif(select != -1) {\n\t\t\t\t\t\tif(select == 4) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_DIAG;\n\t\t\t\t\t\t} else if(select == 0) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t} else if(select == 1) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t} else if(select == 2) {\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tassert_eq(3, select);\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tassert(!empty || !canMoveThru || ct == SSEMatrix::H);\n\t\t\t}\n\t\t}\n\t\td.mat_.setReportedThrough(row, col);\n\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t// Cell was involved in a previously-reported alignment?\n\t\tif(!canMoveThru) {\n\t\t\tif(!btnstack_.empty()) {\n\t\t\t\t// Remove all the cells from list back to and including the\n\t\t\t\t// cell where the branch occurred\n\t\t\t\tbtcells_.resize(btnstack_.back().celsz);\n\t\t\t\t// Pop record off the top of the stack\n\t\t\t\tned.resize(btnstack_.back().nedsz);\n\t\t\t\t//aed.resize(btnstack_.back().aedsz);\n\t\t\t\trow      = btnstack_.back().row;\n\t\t\t\tcol      = btnstack_.back().col;\n\t\t\t\tgaps     = btnstack_.back().gaps;\n\t\t\t\treadGaps = btnstack_.back().readGaps;\n\t\t\t\trefGaps  = btnstack_.back().refGaps;\n\t\t\t\tscore    = btnstack_.back().score;\n\t\t\t\tct       = 
btnstack_.back().ct;\n\t\t\t\tbtnstack_.pop_back();\n\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\tNEW_ROW_COL(row, col);\n\t\t\t\tcontinue;\n\t\t\t} else {\n\t\t\t\t// No branch points to revisit; just give up\n\t\t\t\tres.reset();\n\t\t\t\tmet.btfail++; // DP backtraces failed\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\tassert(!reportedThru);\n\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\tif(empty || row == 0) {\n\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\tbtcells_.expand();\n\t\t\tbtcells_.back().first = row;\n\t\t\tbtcells_.back().second = col;\n\t\t\t// This cell is at the end of a legitimate alignment\n\t\t\ttrimBeg = row;\n\t\t\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t\t\tbreak;\n\t\t}\n\t\tif(branch) {\n\t\t\t// Add a frame to the backtrack stack\n\t\t\tbtnstack_.expand();\n\t\t\tbtnstack_.back().init(\n\t\t\t\tned.size(),\n\t\t\t\t0,               // aed.size()\n\t\t\t\tbtcells_.size(),\n\t\t\t\trow,\n\t\t\t\tcol,\n\t\t\t\tgaps,\n\t\t\t\treadGaps,\n\t\t\t\trefGaps,\n\t\t\t\tscore,\n\t\t\t\t(int)ct);\n\t\t}\n\t\tbtcells_.expand();\n\t\tbtcells_.back().first = row;\n\t\tbtcells_.back().second = col;\n\t\tswitch(cur) {\n\t\t\t// Move up and to the left.  
If the reference nucleotide in the\n\t\t\t// source row mismatches the read nucleotide, penalize\n\t\t\t// it and add a nucleotide mismatch.\n\t\t\tcase SW_BT_OALL_DIAG: {\n\t\t\t\tassert_gt(row, 0); assert_gt(col, 0);\n\t\t\t\t// Check for color mismatch\n\t\t\t\tint readC = (*rd_)[row];\n\t\t\t\tint refNmask = (int)rf_[rfi_+col];\n\t\t\t\tassert_gt(refNmask, 0);\n\t\t\t\tint m = matchesEx(readC, refNmask);\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tif(m != 1) {\n\t\t\t\t\tEdit e(\n\t\t\t\t\t\t(int)row,\n\t\t\t\t\t\tmask2dna[refNmask],\n\t\t\t\t\t\t\"ACGTN\"[readC],\n\t\t\t\t\t\tEDIT_TYPE_MM);\n\t\t\t\t\tassert(e.repOk());\n\t\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\t\tned.push_back(e);\n\t\t\t\t\tint pen = QUAL2(row, col);\n\t\t\t\t\tscore.score_ -= pen;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t} else {\n\t\t\t\t\t// Reward a match\n\t\t\t\t\tint64_t bonus = sc_->match(30);\n\t\t\t\t\tscore.score_ += bonus;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t}\n\t\t\t\tif(m == -1) {\n\t\t\t\t\t//score.ns_++;\n\t\t\t\t}\n\t\t\t\trow--; col--;\n\t\t\t\tMOVE_ALL_UPLEFT();\n\t\t\t\tassert(VALID_AL_SCORE(score));\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  
Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_OALL_REF_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->refGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_RFGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(row, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::F;\n\t\t\t\tint pen = sc_->refGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_OALL_READ_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(col, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), 
sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->readGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_RDGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(col, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::E;\n\t\t\t\tint pen = sc_->readGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tdefault: throw 1;\n\t\t}\n\t} // while((int)row > 0)\n\tassert_eq(0, trimBeg);\n\tassert_eq(0, trimEnd);\n\tassert_geq(col, 0);\n\tassert_eq(SSEMatrix::H, ct);\n\t// The number of cells in the backtrace should equal the number of read\n\t// bases after trimming plus the number of read gaps\n\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t// Check whether we went through a core diagonal and set 'reported' flag on\n\t// each cell\n\tbool overlappedCoreDiag = false;\n\tfor(size_t i = 0; i < btcells_.size(); i++) {\n\t\tsize_t rw = btcells_[i].first;\n\t\tsize_t cl = btcells_[i].second;\n\t\t// Calculate the diagonal within the *trimmed* rectangle, i.e. 
the\n\t\t// rectangle we dealt with in align, gather and backtrack.\n\t\tint64_t diagi = cl - rw;\n\t\t// Now adjust to the diagonal within the *untrimmed* rectangle by\n\t\t// adding on the amount trimmed from the left.\n\t\tdiagi += rect_->triml;\n\t\tif(diagi >= 0) {\n\t\t\tsize_t diag = (size_t)diagi;\n\t\t\tif(diag >= rect_->corel && diag <= rect_->corer) {\n\t\t\t\toverlappedCoreDiag = true;\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t\tassert(d.mat_.reportedThrough(rw, cl));\n\t}\n\tif(!overlappedCoreDiag) {\n\t\t// Must overlap a core diagonal.  Otherwise, we run the risk of\n\t\t// reporting an alignment that overlaps (and trumps) a higher-scoring\n\t\t// alignment that lies partially outside the dynamic programming\n\t\t// rectangle.\n\t\tres.reset();\n\t\tmet.corerej++;\n\t\treturn false;\n\t}\n\tint readC = (*rd_)[rdi_+row];      // get last char in read\n\tint refNmask = (int)rf_[rfi_+col]; // get last ref char ref involved in aln\n\tassert_gt(refNmask, 0);\n\tint m = matchesEx(readC, refNmask);\n\tif(m != 1) {\n\t\tEdit e((int)row, mask2dna[refNmask], \"ACGTN\"[readC], EDIT_TYPE_MM);\n\t\tassert(e.repOk());\n\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\tned.push_back(e);\n\t\tscore.score_ -= QUAL2(row, col);\n\t\tassert_geq(score.score(), minsc_);\n\t} else {\n\t\tscore.score_ += sc_->match(30);\n\t}\n\tif(m == -1) {\n\t\t//score.ns_++;\n\t}\n#if 0\n\tif(score.ns_ > nceil_) {\n\t\t// Alignment has too many Ns in it!\n\t\tres.reset();\n\t\tmet.nrej++;\n\t\treturn false;\n\t}\n#endif\n\tres.reverse();\n\tassert(Edit::repOk(ned, (*rd_)));\n\tassert_eq(score.score(), escore);\n\tassert_leq(gaps, rdgap_ + rfgap_);\n\toff = col;\n\tassert_lt(col + (size_t)rfi_, (size_t)rff_);\n\t// score.gaps_ = gaps;\n\tres.alres.setScore(score);\n#if 0\n\tres.alres.setShape(\n\t\trefidx_,                  // ref id\n\t\toff + rfi_ + rect_->refl, // 0-based ref offset\n\t\treflen_,                  // reference length\n\t\tfw_,                      // aligned to 
Watson?\n\t\trdf_ - rdi_,              // read length\n\t\ttrue,                     // pretrim soft?\n\t\t0,                        // pretrim 5' end\n\t\t0,                        // pretrim 3' end\n\t\ttrue,                     // alignment trim soft?\n\t\tfw_ ? trimBeg : trimEnd,  // alignment trim 5' end\n\t\tfw_ ? trimEnd : trimBeg); // alignment trim 3' end\n#endif\n\tsize_t refns = 0;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\tif((int)rf_[rfi_+i] > 15) {\n\t\t\trefns++;\n\t\t}\n\t}\n\t// res.alres.setRefNs(refns);\n\tassert(Edit::repOk(ned, (*rd_), true, trimBeg, trimEnd));\n\tassert(res.repOk());\n#ifndef NDEBUG\n\tsize_t gapsCheck = 0;\n\tfor(size_t i = 0; i < ned.size(); i++) {\n\t\tif(ned[i].isGap()) gapsCheck++;\n\t}\n\tassert_eq(gaps, gapsCheck);\n\tBTDnaString refstr;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\trefstr.append(firsts5[(int)rf_[rfi_+i]]);\n\t}\n\tBTDnaString editstr;\n    // daehwan\n\t// Edit::toRef((*rd_), ned, editstr, true, trimBeg, trimEnd);\n    Edit::toRef((*rd_), ned, editstr, true, trimBeg + rdi_, trimEnd + (rd_->length() - rdf_));\n\tif(refstr != editstr) {\n\t\tcerr << \"Decoded nucleotides and edits don't match reference:\" << endl;\n\t\tcerr << \"           score: \" << score.score()\n\t\t     << \" (\" << gaps << \" gaps)\" << endl;\n\t\tcerr << \"           edits: \";\n\t\tEdit::print(cerr, ned);\n\t\tcerr << endl;\n\t\tcerr << \"    decoded nucs: \" << (*rd_) << endl;\n\t\tcerr << \"     edited nucs: \" << editstr << endl;\n\t\tcerr << \"  reference nucs: \" << refstr << endl;\n\t\tassert(0);\n\t}\n#endif\n\tmet.btsucc++; // DP backtraces succeeded\n\treturn true;\n}\n"
  },
  {
    "path": "aligner_swsse_ee_u8.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/**\n * aligner_sw_sse.cpp\n *\n * Versions of key alignment functions that use vector instructions to\n * accelerate dynamic programming.  Based chiefly on the striped Smith-Waterman\n * paper and implementation by Michael Farrar.  See:\n *\n * Farrar M. Striped Smith-Waterman speeds database searches six times over\n * other SIMD implementations. Bioinformatics. 2007 Jan 15;23(2):156-61.\n * http://sites.google.com/site/farrarmichael/smith-waterman\n *\n * While the paper describes an implementation of Smith-Waterman, we extend it\n * do end-to-end read alignment as well as local alignment.  The change\n * required for this is minor: we simply let vmax be the maximum element in the\n * score domain rather than the minimum.\n *\n * The vectorized dynamic programming implementation lacks some features that\n * make it hard to adapt to solving the entire dynamic-programming alignment\n * problem.  
For instance:\n *\n * - It doesn't respect gap barriers on either end of the read\n * - It just gives a maximum; not enough information to backtrace without\n *   redoing some alignment\n * - It's a little difficult to handle st_ and en_, especially st_.\n * - The query profile mechanism makes handling of ambiguous reference bases a\n *   little tricky (16 cols in query profile lookup table instead of 5)\n *\n * Given the drawbacks, it is tempting to use SSE dynamic programming as a\n * filter rather than as an aligner per se.  Here are a few ideas for how it\n * can be extended to handle more of the alignment problem:\n *\n * - Save calculated scores to a big array as we go.  We return to this array\n *   to find and backtrace from good solutions.\n */\n\n#include <limits>\n#include \"aligner_sw.h\"\n\nstatic const size_t NBYTES_PER_REG  = 16;\nstatic const size_t NWORDS_PER_REG  = 16;\n// static const size_t NBITS_PER_WORD  = 8;\nstatic const size_t NBYTES_PER_WORD = 1;\n\n// In end-to-end mode, we start high (255) and go low (0).  Factoring in\n// a query profile involves unsigned saturating subtraction, so all the\n// query profile elements should be expressed as a positive penalty rather\n// than a negative score.\n\ntypedef uint8_t TCScore;\n\n/**\n * Build query profile look up tables for the read.  The query profile look\n * up table is organized as a 1D array indexed by [i][j] where i is the\n * reference character in the current DP column (0=A, 1=C, etc), and j is\n * the segment of the query we're currently working on.\n */\nvoid SwAligner::buildQueryProfileEnd2EndSseU8(bool fw) {\n\tbool& done = fw ? sseU8fwBuilt_ : sseU8rcBuilt_;\n\tif(done) {\n\t\treturn;\n\t}\n\tdone = true;\n\tconst BTDnaString* rd = fw ? rdfw_ : rdrc_;\n\tconst BTString* qu = fw ? 
qufw_ : qurc_;\n    // daehwan - allows to align a portion of a read, not the whole.\n\t// const size_t len = rd->length();\n    const size_t len = dpRows();\n\tconst size_t seglen = (len + (NWORDS_PER_REG-1)) / NWORDS_PER_REG;\n\t// How many __m128i's are needed\n\tsize_t n128s =\n\t\t64 +                    // slack bytes, for alignment?\n\t\t(seglen * ALPHA_SIZE)   // query profile data\n\t\t* 2;                    // & gap barrier data\n\tassert_gt(n128s, 0);\n\tSSEData& d = fw ? sseU8fw_ : sseU8rc_;\n\td.profbuf_.resizeNoCopy(n128s);\n\tassert(!d.profbuf_.empty());\n\td.maxPen_      = d.maxBonus_ = 0;\n\td.lastIter_    = d.lastWord_ = 0;\n\td.qprofStride_ = d.gbarStride_ = 2;\n\td.bias_ = 0; // no bias needed for end-to-end alignment; just use subtraction\n\t// For each reference character A, C, G, T, N ...\n\tfor(size_t refc = 0; refc < ALPHA_SIZE; refc++) {\n\t\t// For each segment ...\n\t\tfor(size_t i = 0; i < seglen; i++) {\n\t\t\tsize_t j = i;\n\t\t\tuint8_t *qprofWords =\n\t\t\t\treinterpret_cast<uint8_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2));\n\t\t\tuint8_t *gbarWords =\n\t\t\t\treinterpret_cast<uint8_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2) + 1);\n\t\t\t// For each sub-word (byte) ...\n\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\tint sc = 0;\n\t\t\t\t*gbarWords = 0;\n\t\t\t\tif(j < len) {\n\t\t\t\t\tint readc = (*rd)[j];\n\t\t\t\t\tint readq = (*qu)[j];\n\t\t\t\t\tsc = sc_->score(readc, (int)(1 << refc), readq - 33);\n\t\t\t\t\t// Make score positive, to fit in an unsigned\n\t\t\t\t\tsc = -sc;\n\t\t\t\t\tassert_range(0, 255, sc);\n\t\t\t\t\tsize_t j_from_end = len - j - 1;\n\t\t\t\t\tif(j < (size_t)sc_->gapbar ||\n\t\t\t\t\t   j_from_end < (size_t)sc_->gapbar)\n\t\t\t\t\t{\n\t\t\t\t\t\t// Inside the gap barrier\n\t\t\t\t\t\t*gbarWords = 0xff;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(refc == 0 && j == len-1) {\n\t\t\t\t\t// Remember which 128-bit word and which smaller word has\n\t\t\t\t\t// the final 
row\n\t\t\t\t\td.lastIter_ = i;\n\t\t\t\t\td.lastWord_ = k;\n\t\t\t\t}\n\t\t\t\tif((size_t)sc > d.maxPen_) {\n\t\t\t\t\td.maxPen_ = (size_t)sc;\n\t\t\t\t}\n\t\t\t\t*qprofWords = (uint8_t)sc;\n\t\t\t\tgbarWords++;\n\t\t\t\tqprofWords++;\n\t\t\t\tj += seglen; // update offset into query\n\t\t\t}\n\t\t}\n\t}\n}\n\n#ifndef NDEBUG\n/**\n * Return true iff the cell has sane E/F/H values w/r/t its predecessors.\n */\nstatic bool cellOkEnd2EndU8(\n\tSSEData& d,\n\tsize_t row,\n\tsize_t col,\n\tint refc,\n\tint readc,\n\tint readq,\n\tconst Scoring& sc)     // scoring scheme\n{\n\tTCScore floorsc = 0;\n\tTAlScore ceilsc = MAX_I64;\n\tTAlScore offsetsc = -0xff;\n\tTAlScore sc_h_cur = (TAlScore)d.mat_.helt(row, col);\n\tTAlScore sc_e_cur = (TAlScore)d.mat_.eelt(row, col);\n\tTAlScore sc_f_cur = (TAlScore)d.mat_.felt(row, col);\n\tif(sc_h_cur > floorsc) {\n\t\tsc_h_cur += offsetsc;\n\t}\n\tif(sc_e_cur > floorsc) {\n\t\tsc_e_cur += offsetsc;\n\t}\n\tif(sc_f_cur > floorsc) {\n\t\tsc_f_cur += offsetsc;\n\t}\n\tbool gapsAllowed = true;\n\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\tif(row < (size_t)sc.gapbar || rowFromEnd < (size_t)sc.gapbar) {\n\t\tgapsAllowed = false;\n\t}\n\tbool e_left_trans = false, h_left_trans = false;\n\tbool f_up_trans   = false, h_up_trans = false;\n\tbool h_diag_trans = false;\n\tif(gapsAllowed) {\n\t\tTAlScore sc_h_left = floorsc;\n\t\tTAlScore sc_e_left = floorsc;\n\t\tTAlScore sc_h_up   = floorsc;\n\t\tTAlScore sc_f_up   = floorsc;\n\t\tif(col > 0 && sc_e_cur > floorsc && sc_e_cur <= ceilsc) {\n\t\t\tsc_h_left = d.mat_.helt(row, col-1) + offsetsc;\n\t\t\tsc_e_left = d.mat_.eelt(row, col-1) + offsetsc;\n\t\t\te_left_trans = (sc_e_left > floorsc && sc_e_cur == sc_e_left - sc.readGapExtend());\n\t\t\th_left_trans = (sc_h_left > floorsc && sc_e_cur == sc_h_left - sc.readGapOpen());\n\t\t\tassert(e_left_trans || h_left_trans);\n\t\t\t// Check that we couldn't have got a better E score\n\t\t\tassert_geq(sc_e_cur, sc_e_left - 
sc.readGapExtend());\n\t\t\tassert_geq(sc_e_cur, sc_h_left - sc.readGapOpen());\n\t\t}\n\t\tif(row > 0 && sc_f_cur > floorsc && sc_f_cur <= ceilsc) {\n\t\t\tsc_h_up = d.mat_.helt(row-1, col) + offsetsc;\n\t\t\tsc_f_up = d.mat_.felt(row-1, col) + offsetsc;\n\t\t\tf_up_trans = (sc_f_up > floorsc && sc_f_cur == sc_f_up - sc.refGapExtend());\n\t\t\th_up_trans = (sc_h_up > floorsc && sc_f_cur == sc_h_up - sc.refGapOpen());\n\t\t\tassert(f_up_trans || h_up_trans);\n\t\t\t// Check that we couldn't have got a better F score\n\t\t\tassert_geq(sc_f_cur, sc_f_up - sc.refGapExtend());\n\t\t\tassert_geq(sc_f_cur, sc_h_up - sc.refGapOpen());\n\t\t}\n\t} else {\n\t\tassert_geq(floorsc, sc_e_cur);\n\t\tassert_geq(floorsc, sc_f_cur);\n\t}\n\tif(col > 0 && row > 0 && sc_h_cur > floorsc && sc_h_cur <= ceilsc) {\n\t\tTAlScore sc_h_upleft = d.mat_.helt(row-1, col-1) + offsetsc;\n\t\tTAlScore sc_diag = sc.score(readc, (int)refc, readq - 33);\n\t\th_diag_trans = sc_h_cur == sc_h_upleft + sc_diag;\n\t}\n\tassert(\n\t\tsc_h_cur <= floorsc ||\n\t\te_left_trans ||\n\t\th_left_trans ||\n\t\tf_up_trans   ||\n\t\th_up_trans   ||\n\t\th_diag_trans ||\n\t\tsc_h_cur > ceilsc ||\n\t\trow == 0 ||\n\t\tcol == 0);\n\treturn true;\n}\n#endif /*ndef NDEBUG*/\n\n#ifdef NDEBUG\n\n#define assert_all_eq0(x)\n#define assert_all_gt(x, y)\n#define assert_all_gt_lo(x)\n#define assert_all_lt(x, y)\n#define assert_all_lt_hi(x)\n\n#else\n\n#define assert_all_eq0(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpeq_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt(x, y) { \\\n\t__m128i tmp = _mm_cmpgt_epu8(x, y); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt_lo(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpgt_epu8(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); 
\\\n}\n\n#define assert_all_lt(x, y) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\t__m128i tmp = _mm_subs_epu8(y, x); \\\n\ttmp = _mm_cmpeq_epi16(tmp, z); \\\n\tassert_eq(0x0000, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt_hi(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_cmpeq_epu8(z, z); \\\n\tz = _mm_srli_epu8(z, 1); \\\n\ttmp = _mm_cmplt_epu8(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n#endif\n\n/**\n * Aligns by filling a dynamic programming matrix with the SSE-accelerated,\n * banded DP approach of Farrar.  As it goes, it determines which cells we\n * might backtrace from and tallies the best (highest-scoring) N backtrace\n * candidate cells per diagonal.  Also returns the alignment score of the best\n * alignment in the matrix.\n *\n * This routine does *not* maintain a matrix holding the entire matrix worth of\n * scores, nor does it maintain any other dense O(mn) data structure, as this\n * would quickly exhaust memory for queries longer than about 10,000 kb.\n * Instead, in the fill stage it maintains two columns worth of scores at a\n * time (current/previous, or right/left) - these take O(m) space.  When\n * finished with the current column, it determines which cells from the\n * previous column, if any, are candidates we might backtrace from to find a\n * full alignment.  A candidate cell has a score that rises above the threshold\n * and isn't improved upon by a match in the next column.  
The best N\n * candidates per diagonal are stored in a O(m + n) data structure.\n */\nTAlScore SwAligner::alignGatherEE8(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? sseU8ExtendMet_ : sseU8MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileEnd2EndSseU8(fw_);\n\tassert(!d.profbuf_.empty());\n\n\tassert_eq(0, d.maxBonus_);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\t\n\tint dup;\n\t\n\t// Now set up the score vectors.  We just need two columns worth, which\n\t// we'll call \"left\" and \"right\".\n\td.vecbuf_.resize(4 * 2 * iter);\n\td.vecbuf_.zero();\n\t__m128i *vbuf_l = d.vecbuf_.ptr();\n\t__m128i *vbuf_r = d.vecbuf_.ptr() + (4 * iter);\n\n\t// This is the data structure that holds candidate cells per diagonal.\n\tconst size_t ndiags = rff_ - rfi_ + dpRows() - 1;\n\tif(!debug) {\n\t\tbtdiag_.init(ndiags, 2);\n\t}\n\n\t// Data structure that holds checkpointed anti-diagonals\n\tTAlScore perfectScore = sc_->perfectScore(dpRows());\n\tbool checkpoint = true;\n\tbool cpdebug = false;\n#ifndef NDEBUG\n\tcpdebug = dpRows() < 1000;\n#endif\n\tcper_.init(\n\t\tdpRows(),      // # rows\n\t\trff_ - rfi_,   // # columns\n\t\tcperPerPow2_,  // checkpoint every 1 << perpow2 diags (& next)\n\t\tperfectScore,  // perfect score (for sanity checks)\n\t\ttrue,          // matrix cells have 8-bit scores?\n\t\tcperTri_,      // triangular mini-fills?\n\t\tfalse,         // alignment is local?\n\t\tcpdebug);      // save all cells for debugging?\n\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// 
implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n\t__m128i vhd      = _mm_setzero_si128();\n\t__m128i vhdtmp   = _mm_setzero_si128();\n\t__m128i vtmp     = _mm_setzero_si128();\n\t__m128i vzero    = _mm_setzero_si128();\n\t__m128i vhilsw   = _mm_setzero_si128();\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_U8);\n\tdup = (sc_->refGapOpen() << 8) | (sc_->refGapOpen() & 0x00ff);\n\trfgapo = _mm_insert_epi16(rfgapo, dup, 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_U8);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\tdup = (sc_->refGapExtend() << 8) | (sc_->refGapExtend() & 0x00ff);\n\trfgape = _mm_insert_epi16(rfgape, dup, 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_U8);\n\tdup = (sc_->readGapOpen() << 8) | (sc_->readGapOpen() & 0x00ff);\n\trdgapo = _mm_insert_epi16(rdgapo, dup, 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_U8);\n\tassert_leq(sc_->readGapExtend(), 
sc_->readGapOpen());\n\tdup = (sc_->readGapExtend() << 8) | (sc_->readGapExtend() & 0x00ff);\n\trdgape = _mm_insert_epi16(rdgape, dup, 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\t\n\tvhi = _mm_cmpeq_epi16(vhi, vhi); // all elts = 0xffff\n\tvlo = _mm_xor_si128(vlo, vlo);   // all elts = 0\n\t\n\t// vhilsw: topmost (least sig) byte set to 0xff, all other bytes=0\n\tvhilsw = _mm_shuffle_epi32(vhi, 0);\n\tvhilsw = _mm_srli_si128(vhilsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 4 == 0, then\n\t// the block of cells is from the E matrix.  If index % 4 == 1, they're\n\t// from the F matrix.  If index % 4 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\tconst size_t colstride = ROWSTRIDE_2COL * iter;\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvELeft = vbuf_l + 0; __m128i *pvERight = vbuf_r + 0;\n\t/* __m128i *pvFLeft = vbuf_l + 1; */ __m128i *pvFRight = vbuf_r + 1;\n\t__m128i *pvHLeft = vbuf_l + 2; __m128i *pvHRight = vbuf_r + 2;\n\t\n\t// Maximum score in final row\n\tbool found = false;\n\tTCScore lrmax = MIN_U8;\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t_mm_store_si128(pvERight, vlo); pvERight += ROWSTRIDE_2COL;\n\t\t// Could initialize Hs to high or low.  
If high, cells in the lower\n\t\t// triangle will have somewhat more legitimate scores, but still won't\n\t\t// be exhaustively scored.\n\t\t_mm_store_si128(pvHRight, vlo); pvHRight += ROWSTRIDE_2COL;\n\t}\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\t// Swap left and right; vbuf_l is the vector on the left, which we\n\t\t// generally load from, and vbuf_r is the vector on the right, which we\n\t\t// generally store to.\n\t\tswap(vbuf_l, vbuf_r);\n\t\tpvELeft = vbuf_l + 0; pvERight = vbuf_r + 0;\n\t\t/* pvFLeft = vbuf_l + 1; */ pvFRight = vbuf_r + 1;\n\t\tpvHLeft = vbuf_l + 2; pvHRight = vbuf_r + 2;\n\t\t\n\t\t// Fetch the appropriate query profile.  
Note that elements of rf_ must\n\t\t// be numbers, not masks.\n\t\tconst int refc = (int)rf_[i];\n\t\t\n\t\t// Fetch the appropriate query profile\n\t\tsize_t off = (size_t)firsts5[refc] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Set all cells to low value\n\t\tvf = _mm_xor_si128(vf, vf);\n\n\t\t// Load H vector from the final row of the previous column\n\t\tvh = _mm_load_si128(pvHLeft + colstride - ROWSTRIDE_2COL);\n\t\t// Shift 1 byte down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t// Fill topmost (least sig) cell with high value\n\t\tvh = _mm_or_si128(vh, vhilsw);\n\t\t\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 0; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELeft);\n\t\t\tvhd = _mm_load_si128(pvHLeft);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELeft += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_subs_epu8(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_subs_epu8(vh, pvScore[0]);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvhdtmp = vhd;\n\t\t\tvhd = _mm_subs_epu8(vhd, rdgapo);\n\t\t\tvhd = _mm_subs_epu8(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\t\tve = _mm_max_epu8(ve, vhd);\n\t\t\tvh = _mm_max_epu8(vh, ve);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\tvtmp = vh;\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\t\n\t\t\t// Load the next h value\n\t\t\tvh = vhdtmp;\n\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvERight, 
ve);\n\t\t\tpvERight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epu8(vtmp, rfgapo);\n\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epu8(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFRight -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFRight);\n\t\t\n\t\tpvHRight -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHRight);\n\t\t\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\n\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0xffff) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFRight -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tpvHRight -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh 
ASAP\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\t\t\n\t\t// Check in the last row for the maximum so far\n\t\t__m128i *vtmp = vbuf_r + 2 /* H */ + (d.lastIter_ * ROWSTRIDE_2COL);\n\t\t// Note: we may not want to extract from the final row\n\t\tTCScore lr = ((TCScore*)(vtmp))[d.lastWord_];\n\t\tfound = true;\n\t\tif(lr > lrmax) {\n\t\t\tlrmax = lr;\n\t\t}\n\t\t\n\t\t// Now we'd like to know whether the bottommost element of the right\n\t\t// column is a candidate we might backtrace from.  First question is:\n\t\t// did it exceed the minimum score threshold?\n\t\tTAlScore score = (TAlScore)(lr - 0xff);\n\t\tif(lr == MIN_U8) {\n\t\t\tscore = MIN_I64;\n\t\t}\n\t\tif(!debug && score >= minsc_) {\n\t\t\tDpBtCandidate cand(dpRows() - 1, i - rfi_, score);\n\t\t\tbtdiag_.add(i - rfi_, cand);\n\t\t}\n\n\t\t// Save some elements to checkpoints\n\t\tif(checkpoint) {\n\t\t\t\n\t\t\t__m128i *pvE = vbuf_r + 0;\n\t\t\t__m128i *pvF = vbuf_r + 1;\n\t\t\t__m128i *pvH = vbuf_r + 2;\n\t\t\tsize_t coli = i - rfi_;\n\t\t\tif(coli < cper_.locol_) cper_.locol_ = coli;\n\t\t\tif(coli > cper_.hicol_) cper_.hicol_ = coli;\n\t\t\t\n\t\t\tif(cperTri_) {\n\t\t\t\tsize_t rc_mod = coli & cper_.lomask_;\n\t\t\t\tassert_lt(rc_mod, cper_.per_);\n\t\t\t\tint64_t row = -rc_mod-1;\n\t\t\t\tint64_t row_mod = row;\n\t\t\t\tint64_t row_div = 0;\n\t\t\t\tsize_t idx = coli >> cper_.perpow2_;\n\t\t\t\tsize_t idxrow = idx * cper_.nrow_;\n\t\t\t\tassert_eq(4, ROWSTRIDE_2COL);\n\t\t\t\tbool done = false;\n\t\t\t\twhile(true) {\n\t\t\t\t\trow += (cper_.per_ - 2);\n\t\t\t\t\trow_mod += (cper_.per_ - 2);\n\t\t\t\t\tfor(size_t j = 0; j < 2; j++) 
{\n\t\t\t\t\t\trow++;\n\t\t\t\t\t\trow_mod++;\n\t\t\t\t\t\tif(row >= 0 && (size_t)row < cper_.nrow_) {\n\t\t\t\t\t\t\t// Update row divided by iter_ and mod iter_\n\t\t\t\t\t\t\twhile(row_mod >= (int64_t)iter) {\n\t\t\t\t\t\t\t\trow_mod -= (int64_t)iter;\n\t\t\t\t\t\t\t\trow_div++;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tsize_t delt = idxrow + row;\n\t\t\t\t\t\t\tsize_t vecoff = (row_mod << 6) + row_div;\n\t\t\t\t\t\t\tassert_lt(row_div, 16);\n\t\t\t\t\t\t\tint16_t h_sc = ((uint8_t*)pvH)[vecoff];\n\t\t\t\t\t\t\tint16_t e_sc = ((uint8_t*)pvE)[vecoff];\n\t\t\t\t\t\t\tint16_t f_sc = ((uint8_t*)pvF)[vecoff];\n\t\t\t\t\t\t\tif(h_sc == 0) h_sc = MIN_I16;\n\t\t\t\t\t\t\telse h_sc -= 0xff;\n\t\t\t\t\t\t\tif(e_sc == 0) e_sc = MIN_I16;\n\t\t\t\t\t\t\telse e_sc -= 0xff;\n\t\t\t\t\t\t\tif(f_sc == 0) f_sc = MIN_I16;\n\t\t\t\t\t\t\telse f_sc -= 0xff;\n\t\t\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(e_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\t\t\tCpQuad *qdiags = ((j == 0) ? 
cper_.qdiag1s_.ptr() : cper_.qdiag2s_.ptr());\n\t\t\t\t\t\t\tqdiags[delt].sc[0] = h_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[1] = e_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[2] = f_sc;\n\t\t\t\t\t\t} // if(row >= 0 && row < nrow_)\n\t\t\t\t\t\telse if(row >= 0 && (size_t)row >= cper_.nrow_) {\n\t\t\t\t\t\t\tdone = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t} // end of loop over anti-diags\n\t\t\t\t\tif(done) {\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tidx++;\n\t\t\t\t\tidxrow += cper_.nrow_;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// If this is the first column, take this opportunity to\n\t\t\t\t// pre-calculate the coordinates of the elements we're going to\n\t\t\t\t// checkpoint.\n\t\t\t\tif(coli == 0) {\n\t\t\t\t\tsize_t cpi    = cper_.per_-1;\n\t\t\t\t\tsize_t cpimod = cper_.per_-1;\n\t\t\t\t\tsize_t cpidiv = 0;\n\t\t\t\t\tcper_.commitMap_.clear();\n\t\t\t\t\twhile(cpi < cper_.nrow_) {\n\t\t\t\t\t\twhile(cpimod >= iter) {\n\t\t\t\t\t\t\tcpimod -= iter;\n\t\t\t\t\t\t\tcpidiv++;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tsize_t vecoff = (cpimod << 6) + cpidiv;\n\t\t\t\t\t\tcper_.commitMap_.push_back(vecoff);\n\t\t\t\t\t\tcpi += cper_.per_;\n\t\t\t\t\t\tcpimod += cper_.per_;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t// Save all the rows\n\t\t\t\tsize_t rowoff = 0;\n\t\t\t\tsize_t sz = cper_.commitMap_.size();\n\t\t\t\tfor(size_t i = 0; i < sz; i++, rowoff += cper_.ncol_) {\n\t\t\t\t\tsize_t vecoff = cper_.commitMap_[i];\n\t\t\t\t\tint16_t h_sc = ((uint8_t*)pvH)[vecoff];\n\t\t\t\t\t//int16_t e_sc = ((uint8_t*)pvE)[vecoff];\n\t\t\t\t\tint16_t f_sc = ((uint8_t*)pvF)[vecoff];\n\t\t\t\t\tif(h_sc == 0) h_sc = MIN_I16;\n\t\t\t\t\telse h_sc -= 0xff;\n\t\t\t\t\t//if(e_sc == 0) e_sc = MIN_I16;\n\t\t\t\t\t//else e_sc -= 0xff;\n\t\t\t\t\tif(f_sc == 0) f_sc = MIN_I16;\n\t\t\t\t\telse f_sc -= 0xff;\n\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t//assert_leq(e_sc, cper_.perf_);\n\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\tCpQuad& dst = cper_.qrows_[rowoff + coli];\n\t\t\t\t\tdst.sc[0] = 
h_sc;\n\t\t\t\t\t//dst.sc[1] = e_sc;\n\t\t\t\t\tdst.sc[2] = f_sc;\n\t\t\t\t}\n\t\t\t\t// Is this a column we'd like to checkpoint?\n\t\t\t\tif((coli & cper_.lomask_) == cper_.lomask_) {\n\t\t\t\t\t// Save the column using memcpys\n\t\t\t\t\tassert_gt(coli, 0);\n\t\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\t\tsize_t coloff = (coli >> cper_.perpow2_) * wordspercol;\n\t\t\t\t\t__m128i *dst = cper_.qcols_.ptr() + coloff;\n\t\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(cper_.debug_) {\n\t\t\t\t// Save the column using memcpys\n\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\tsize_t coloff = coli * wordspercol;\n\t\t\t\t__m128i *dst = cper_.qcolsD_.ptr() + coloff;\n\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t}\n\t\t}\n\t}\n\t\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\n\tflag = 0;\n\n\t// Did we find a solution?\n\tTAlScore score = MIN_I64;\n\tif(!found) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn MIN_I64;\n\t} else {\n\t\tscore = (TAlScore)(lrmax - 0xff);\n\t\tif(score < minsc_) {\n\t\t\tflag = -1; // no\n\t\t\tif(!debug) met.dpfail++;\n\t\t\treturn score;\n\t\t}\n\t}\n\t\n\t// Could we have saturated?\n\tif(lrmax == MIN_U8) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\n\t// Now take all the backtrace candidates in the btdiag_ structure and\n\t// dump them into the btncand_ array.  
They'll be sorted later.\n\tif(!debug) {\n\t\tbtdiag_.dump(btncand_);\n\t\tassert(!btncand_.empty());\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn score;\n}\n\n/**\n * Solve the current alignment problem using SSE instructions that operate on 16\n * unsigned 8-bit values packed into a single 128-bit register.\n */\nTAlScore SwAligner::alignNucleotidesEnd2EndSseU8(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? sseU8ExtendMet_ : sseU8MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileEnd2EndSseU8(fw_);\n\tassert(!d.profbuf_.empty());\n\n\tassert_eq(0, d.maxBonus_);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\n\tint dup;\n\t\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n#if 0\n\t__m128i vhd      = _mm_setzero_si128();\n\t__m128i vhdtmp   = _mm_setzero_si128();\n#endif\n\t__m128i vtmp     = _mm_setzero_si128();\n\t__m128i vzero    = _mm_setzero_si128();\n\t__m128i vhilsw   = 
_mm_setzero_si128();\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_U8);\n\tdup = (sc_->refGapOpen() << 8) | (sc_->refGapOpen() & 0x00ff);\n\trfgapo = _mm_insert_epi16(rfgapo, dup, 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_U8);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\tdup = (sc_->refGapExtend() << 8) | (sc_->refGapExtend() & 0x00ff);\n\trfgape = _mm_insert_epi16(rfgape, dup, 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_U8);\n\tdup = (sc_->readGapOpen() << 8) | (sc_->readGapOpen() & 0x00ff);\n\trdgapo = _mm_insert_epi16(rdgapo, dup, 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_U8);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\tdup = (sc_->readGapExtend() << 8) | (sc_->readGapExtend() & 0x00ff);\n\trdgape = _mm_insert_epi16(rdgape, dup, 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\t\n\tvhi = _mm_cmpeq_epi16(vhi, vhi); // all elts = 0xffff\n\tvlo = _mm_xor_si128(vlo, vlo);   // all elts = 0\n\t\n\t// vhilsw: topmost (least sig) word set to 0x7fff, all other words=0\n\tvhilsw = _mm_shuffle_epi32(vhi, 0);\n\tvhilsw = _mm_srli_si128(vhilsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  
If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\td.mat_.init(dpRows(), rff_ - rfi_, NWORDS_PER_REG);\n\tconst size_t colstride = d.mat_.colstride();\n\t//const size_t rowstride = d.mat_.rowstride();\n\tassert_eq(ROWSTRIDE, colstride / iter);\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvHTmp = d.mat_.tmpvec(0, 0);\n\t__m128i *pvETmp = d.mat_.evec(0, 0);\n\t\n\t// Maximum score in final row\n\tbool found = false;\n\tTCScore lrmax = MIN_U8;\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t_mm_store_si128(pvETmp, vlo);\n\t\t_mm_store_si128(pvHTmp, vlo); // start high in end-to-end mode\n\t\tpvETmp += ROWSTRIDE;\n\t\tpvHTmp += ROWSTRIDE;\n\t}\n\t// These are swapped just before the innermost loop\n\t__m128i *pvHStore = d.mat_.hvec(0, 0);\n\t__m128i *pvHLoad  = d.mat_.tmpvec(0, 0);\n\t__m128i *pvELoad  = d.mat_.evec(0, 0);\n\t__m128i *pvEStore = d.mat_.evecUnsafe(0, 1);\n\t__m128i *pvFStore = d.mat_.fvec(0, 0);\n\t__m128i *pvFTmp   = NULL;\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\t\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  
This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\n\tcolstop_ = rff_ - 1;\n\tlastsolcol_ = 0;\n\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert(pvFStore == d.mat_.fvec(0, i - rfi_));\n\t\tassert(pvHStore == d.mat_.hvec(0, i - rfi_));\n\t\t\n\t\t// Fetch the appropriate query profile.  Note that elements of rf_ must\n\t\t// be numbers, not masks.\n\t\tconst int refc = (int)rf_[i];\n\t\tsize_t off = (size_t)firsts5[refc] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Set all cells to low value\n\t\tvf = _mm_xor_si128(vf, vf);\n\n\t\t// Load H vector from the final row of the previous column\n\t\tvh = _mm_load_si128(pvHLoad + colstride - ROWSTRIDE);\n\t\t// Shift 2 bytes down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t// Fill topmost (least sig) cell with high value\n\t\tvh = _mm_or_si128(vh, vhilsw);\n\t\t\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 0; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELoad);\n#if 0\n\t\t\tvhd = _mm_load_si128(pvHLoad);\n#endif\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_subs_epu8(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_subs_epu8(vh, pvScore[0]);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epu8(vh, ve);\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += 
ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvtmp = vh;\n#if 0\n\t\t\tvhdtmp = vhd;\n\t\t\tvhd = _mm_subs_epu8(vhd, rdgapo);\n\t\t\tvhd = _mm_subs_epu8(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\t\tve = _mm_max_epu8(ve, vhd);\n#else\n\t\t\tvh = _mm_subs_epu8(vh, rdgapo);\n\t\t\tvh = _mm_subs_epu8(vh, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\t\tve = _mm_max_epu8(ve, vh);\n#endif\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\t\n\t\t\t// Load the next h value\n#if 0\n\t\t\tvh = vhdtmp;\n#else\n\t\t\tvh = _mm_load_si128(pvHLoad);\n#endif\n\t\t\tpvHLoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epu8(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epu8(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFTmp = pvFStore;\n\t\tpvFStore -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFStore);\n\t\t\n\t\tpvHStore -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHStore);\n\t\t\n#if 0\n#else\n\t\tpvEStore -= colstride; // reset to start of column\n\t\tve = _mm_load_si128(pvEStore);\n#endif\n\t\t\n\t\tpvHLoad = pvHStore;    // new pvHLoad = pvHStore\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\n\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of 
vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0xffff) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update E in case it can be improved using our new vh\n#if 0\n#else\n\t\t\tvh = _mm_subs_epu8(vh, rdgapo);\n\t\t\tvh = _mm_subs_epu8(vh, *pvScore); // veto some read gap opens\n\t\t\tve = _mm_max_epu8(ve, vh);\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n#endif\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFStore -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tpvHStore -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n#if 0\n#else\n\t\t\t\tpvEStore -= colstride;\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n#endif\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n#if 0\n#else\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n#endif\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\t\t\n#ifndef NDEBUG\n\t\tif(true && (rand() & 15) == 0) {\n\t\t\t// This is a work-intensive sanity check; each time we finish filling\n\t\t\t// a column, we check that each H, E, and F is sensible.\n\t\t\tfor(size_t k = 0; k < dpRows(); k++) 
{\n\t\t\t\tassert(cellOkEnd2EndU8(\n\t\t\t\t\td,\n\t\t\t\t\tk,                   // row\n\t\t\t\t\ti - rfi_,            // col\n\t\t\t\t\trefc,                // reference mask\n\t\t\t\t\t(int)(*rd_)[rdi_+k], // read char\n\t\t\t\t\t(int)(*qu_)[rdi_+k], // read quality\n\t\t\t\t\t*sc_));              // scoring scheme\n\t\t\t}\n\t\t}\n#endif\n\t\t\n\t\t__m128i *vtmp = d.mat_.hvec(d.lastIter_, i-rfi_);\n\t\t// Note: we may not want to extract from the final row\n\t\tTCScore lr = ((TCScore*)(vtmp))[d.lastWord_];\n\t\tfound = true;\n\t\tif(lr > lrmax) {\n\t\t\tlrmax = lr;\n\t\t}\n\n\t\t// pvELoad and pvHLoad are already where they need to be\n\t\t\n\t\t// Adjust the load and store vectors here.  \n\t\tpvHStore = pvHLoad + colstride;\n\t\tpvEStore = pvELoad + colstride;\n\t\tpvFStore = pvFTmp;\n\t}\n\t\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\t\n\tflag = 0;\n\t\n\t// Did we find a solution?\n\tTAlScore score = MIN_I64;\n\tif(!found) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn MIN_I64;\n\t} else {\n\t\tscore = (TAlScore)(lrmax - 0xff);\n\t\tif(score < minsc_) {\n\t\t\tflag = -1; // no\n\t\t\tif(!debug) met.dpfail++;\n\t\t\treturn score;\n\t\t}\n\t}\n\t\n\t// Could we have saturated?\n\tif(lrmax == MIN_U8) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn score;\n}\n\n/**\n * Given a filled-in DP table, populate the btncand_ list with candidate cells\n * that might be at the ends of valid alignments.  No need to do this unless\n * the maximum score returned by the align*() func is >= the minimum.\n *\n * Only cells that are exhaustively scored are candidates.  
Those are the\n * cells inside the shape made of o's in this:\n *\n *  |-maxgaps-|\n *  *********************************    -\n *   ********************************    |\n *    *******************************    |\n *     ******************************    |\n *      *****************************    |\n *       **************************** read len\n *        ***************************    |\n *         **************************    |\n *          *************************    |\n *           ************************    |\n *            ***********oooooooooooo    -\n *            |-maxgaps-|\n *  |-readlen-|\n *  |-------skip--------|\n *\n * And it's possible for the shape to be truncated on the left and right sides.\n *\n * \n */\nbool SwAligner::gatherCellsNucleotidesEnd2EndSseU8(TAlScore best) {\n\t// What's the minimum number of rows that can possibly be spanned by an\n\t// alignment that meets the minimum score requirement?\n\tassert(sse8succ_);\n\tconst size_t ncol = rff_ - rfi_;\n\tconst size_t nrow = dpRows();\n\tassert_gt(nrow, 0);\n\tbtncand_.clear();\n\tbtncanddone_.clear();\n\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? 
sseU8ExtendMet_ : sseU8MateMet_;\n\tassert(!d.profbuf_.empty());\n\tconst size_t colstride = d.mat_.colstride();\n\tASSERT_ONLY(bool sawbest = false);\n\t__m128i *pvH = d.mat_.hvec(d.lastIter_, 0);\n\tfor(size_t j = 0; j < ncol; j++) {\n\t\tTAlScore sc = (TAlScore)(((TCScore*)pvH)[d.lastWord_] - 0xff);\n\t\tassert_leq(sc, best);\n\t\tASSERT_ONLY(sawbest = (sawbest || sc == best));\n\t\tif(sc >= minsc_) {\n\t\t\t// Yes, this is legit\n\t\t\tmet.gathsol++;\n\t\t\tbtncand_.expand();\n\t\t\tbtncand_.back().init(nrow-1, j, sc);\n\t\t}\n\t\tpvH += colstride;\n\t}\n\tassert(sawbest);\n\tif(!btncand_.empty()) {\n\t\td.mat_.initMasks();\n\t}\n\treturn !btncand_.empty();\n}\n\n#define MOVE_VEC_PTR_UP(vec, rowvec, rowelt) { \\\n\tif(rowvec == 0) { \\\n\t\trowvec += d.mat_.nvecrow_; \\\n\t\tvec += d.mat_.colstride_; \\\n\t\trowelt--; \\\n\t} \\\n\trowvec--; \\\n\tvec -= ROWSTRIDE; \\\n}\n\n#define MOVE_VEC_PTR_LEFT(vec, rowvec, rowelt) { vec -= d.mat_.colstride_; }\n\n#define MOVE_VEC_PTR_UPLEFT(vec, rowvec, rowelt) { \\\n \tMOVE_VEC_PTR_UP(vec, rowvec, rowelt); \\\n \tMOVE_VEC_PTR_LEFT(vec, rowvec, rowelt); \\\n}\n\n#define MOVE_ALL_LEFT() { \\\n\tMOVE_VEC_PTR_LEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UP() { \\\n\tMOVE_VEC_PTR_UP(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UP(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UP(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UPLEFT() { \\\n\tMOVE_VEC_PTR_UPLEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define NEW_ROW_COL(row, col) { \\\n\trowelt = row / d.mat_.nvecrow_; 
\\\n\trowvec = row % d.mat_.nvecrow_; \\\n\teltvec = (col * d.mat_.colstride_) + (rowvec * ROWSTRIDE); \\\n\tcur_vec = d.mat_.matbuf_.ptr() + eltvec; \\\n\tleft_vec = cur_vec; \\\n\tleft_rowelt = rowelt; \\\n\tleft_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tup_vec = cur_vec; \\\n\tup_rowelt = rowelt; \\\n\tup_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tupleft_vec = up_vec; \\\n\tupleft_rowelt = up_rowelt; \\\n\tupleft_rowvec = up_rowvec; \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n/**\n * Given the dynamic programming table and a cell, trace backwards from the\n * cell and install the edits and score/penalty in the appropriate fields\n * of res.  The RandomSource is used to break ties among equally good ways\n * of tracing back.\n *\n * Whenever we enter a cell, we check whether the read/ref coordinates of\n * that cell correspond to a cell we traversed constructing a previous\n * alignment.  If so, we backtrack to the last decision point, mask out the\n * path that led to the previously observed cell, and continue along a\n * different path; or, if there are no more paths to try, we give up.\n *\n * If an alignment is found, 'off' is set to the alignment's upstream-most\n * reference character's offset into the chromosome and true is returned.\n * Otherwise, false is returned.\n */\nbool SwAligner::backtraceNucleotidesEnd2EndSseU8(\n\tTAlScore       escore, // in: expected score\n\tSwResult&      res,    // out: store results (edits and scores) here\n\tsize_t&        off,    // out: store diagonal projection of origin\n\tsize_t&        nbts,   // out: # backtracks\n\tsize_t         row,    // start in this row\n\tsize_t         col,    // start in this column\n\tRandomSource&  rnd)    // random gen, to choose among equal paths\n{\n\tassert_lt(row, dpRows());\n\tassert_lt(col, (size_t)(rff_ - rfi_));\n\tSSEData& d = fw_ ? 
sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? sseU8ExtendMet_ : sseU8MateMet_;\n\tmet.bt++;\n\tassert(!d.profbuf_.empty());\n\tassert_lt(row, rd_->length());\n\tbtnstack_.clear(); // empty the backtrack stack\n\tbtcells_.clear();  // empty the cells-so-far list\n\tAlnScore score; score.score_ = 0;\n\t// score.gaps_ = score.ns_ = 0;\n\tsize_t origCol = col;\n\tsize_t gaps = 0, readGaps = 0, refGaps = 0;\n\tres.alres.reset();\n    EList<Edit>& ned = res.alres.ned();\n\tassert(ned.empty());\n\tassert_gt(dpRows(), row);\n\tASSERT_ONLY(size_t trimEnd = dpRows() - row - 1);\n\tsize_t trimBeg = 0;\n\tsize_t ct = SSEMatrix::H; // cell type\n\t// Row and col in terms of where they fall in the SSE vector matrix\n\tsize_t rowelt, rowvec, eltvec;\n\tsize_t left_rowelt, up_rowelt, upleft_rowelt;\n\tsize_t left_rowvec, up_rowvec, upleft_rowvec;\n\t__m128i *cur_vec, *left_vec, *up_vec, *upleft_vec;\n\tNEW_ROW_COL(row, col);\n\twhile((int)row >= 0) {\n\t\tmet.btcell++;\n\t\tnbts++;\n\t\tint readc = (*rd_)[rdi_ + row];\n\t\tint refm  = (int)rf_[rfi_ + col];\n\t\tint readq = (*qu_)[row];\n\t\tassert_leq(col, origCol);\n\t\t// Get score in this cell\n\t\tbool empty = false, reportedThru, canMoveThru, branch = false;\n\t\tint cur = SSEMatrix::H;\n\t\tif(!d.mat_.reset_[row]) {\n\t\t\td.mat_.resetRow(row);\n\t\t}\n\t\treportedThru = d.mat_.reportedThrough(row, col);\n\t\tcanMoveThru = true;\n\t\tif(reportedThru) {\n\t\t\tcanMoveThru = false;\n\t\t} else {\n\t\t\tempty = false;\n\t\t\tif(row > 0) {\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\t\t\t\tbool gapsAllowed = true;\n\t\t\t\tif(row < (size_t)sc_->gapbar ||\n\t\t\t\t   rowFromEnd < (size_t)sc_->gapbar)\n\t\t\t\t{\n\t\t\t\t\tgapsAllowed = false;\n\t\t\t\t}\n\t\t\t\tconst TAlScore floorsc = MIN_I64;\n\t\t\t\tconst int offsetsc = -0xff;\n\t\t\t\t// Move to beginning of column/row\n\t\t\t\tif(ct == SSEMatrix::E) { // AKA rdgap\n\t\t\t\t\tassert_gt(col, 0);\n\t\t\t\t\tTAlScore sc_cur = 
((TCScore*)(cur_vec + SSEMatrix::E))[rowelt] + offsetsc;\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\t// Currently in the E matrix; incoming transition must come from the\n\t\t\t\t\t// left.  It's either a gap open from the H matrix or a gap extend from\n\t\t\t\t\t// the E matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell to the left\n\t\t\t\t\tTAlScore sc_h_left = ((TCScore*)(left_vec + SSEMatrix::H))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_h_left > floorsc && sc_h_left - sc_->readGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get E score of cell to the left\n\t\t\t\t\tTAlScore sc_e_left = ((TCScore*)(left_vec + SSEMatrix::E))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_e_left > floorsc && sc_e_left - sc_->readGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isEMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 8) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// Pick E -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the E cell\n\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, 
col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else if(ct == SSEMatrix::F) { // AKA rfgap\n\t\t\t\t\tassert_gt(row, 0);\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\tTAlScore sc_h_up = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_cur  = ((TCScore*)(cur_vec + SSEMatrix::F))[rowelt] + offsetsc;\n\t\t\t\t\t// Currently in the F matrix; incoming transition must come from above.\n\t\t\t\t\t// It's either a gap open from the H matrix or a gap extend from the F\n\t\t\t\t\t// matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell above\n\t\t\t\t\tif(sc_h_up > floorsc && sc_h_up - sc_->refGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get F score of cell above\n\t\t\t\t\tif(sc_f_up > floorsc && sc_f_up - sc_->refGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isFMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 11) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\t\tcur = 
SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\t\t\tTAlScore sc_cur      = ((TCScore*)(cur_vec + SSEMatrix::H))[rowelt]    + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up     = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_up     = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::H))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_e_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::E))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_h_upleft = col > 0 ? 
(((TCScore*)(upleft_vec + SSEMatrix::H))[upleft_rowelt] + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_diag     = sc_->score(readc, refm, readq - 33);\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\tif(gapsAllowed) {\n\t\t\t\t\t\tif(sc_h_up     > floorsc && sc_cur == sc_h_up   - sc_->refGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_h_left   > floorsc && sc_cur == sc_h_left - sc_->readGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_f_up     > floorsc && sc_cur == sc_f_up   - sc_->refGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 2);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_e_left   > floorsc && sc_cur == sc_e_left - sc_->readGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 3);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tif(sc_h_upleft > floorsc && sc_cur == sc_h_upleft + sc_diag) {\n\t\t\t\t\t\tmask |= (1 << 4);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isHMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 2) & 31;\n\t\t\t\t\t}\n\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\tint opts = alts5[mask];\n\t\t\t\t\tint select = -1;\n\t\t\t\t\tif(opts == 1) {\n\t\t\t\t\t\tselect = firsts5[mask];\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, 0);\n\t\t\t\t\t} else if(opts > 1) {\n#if 1\n\t\t\t\t\t\tif(       (mask & 16) != 0) {\n\t\t\t\t\t\t\tselect = 4; // H diag\n\t\t\t\t\t\t} else if((mask & 1) != 0) {\n\t\t\t\t\t\t\tselect = 0; // H up\n\t\t\t\t\t\t} else if((mask & 4) != 0) {\n\t\t\t\t\t\t\tselect = 2; // F up\n\t\t\t\t\t\t} else if((mask & 2) != 0) {\n\t\t\t\t\t\t\tselect = 1; // H left\n\t\t\t\t\t\t} else if((mask & 8) != 0) {\n\t\t\t\t\t\t\tselect = 3; // E left\n\t\t\t\t\t\t}\n#else\n\t\t\t\t\t\tselect = randFromMask(rnd, mask);\n#endif\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\tmask &= ~(1 << 
select);\n\t\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, mask);\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else { /* No way to backtrack! */ }\n\t\t\t\t\tif(select != -1) {\n\t\t\t\t\t\tif(select == 4) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_DIAG;\n\t\t\t\t\t\t} else if(select == 0) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t} else if(select == 1) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t} else if(select == 2) {\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tassert_eq(3, select);\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tassert(!empty || !canMoveThru || ct == SSEMatrix::H);\n\t\t\t}\n\t\t}\n\t\t//cerr << \"reportedThrough rejected (\" << row << \", \" << col << \")\" << endl;\n\t\td.mat_.setReportedThrough(row, col);\n\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t// Cell was involved in a previously-reported alignment?\n\t\tif(!canMoveThru) {\n\t\t\tif(!btnstack_.empty()) {\n\t\t\t\t// Remove all the cells from list back to and including the\n\t\t\t\t// cell where the branch occurred\n\t\t\t\tbtcells_.resize(btnstack_.back().celsz);\n\t\t\t\t// Pop record off the top of the stack\n\t\t\t\tned.resize(btnstack_.back().nedsz);\n\t\t\t\t//aed.resize(btnstack_.back().aedsz);\n\t\t\t\trow      = btnstack_.back().row;\n\t\t\t\tcol      = btnstack_.back().col;\n\t\t\t\tgaps     = btnstack_.back().gaps;\n\t\t\t\treadGaps = btnstack_.back().readGaps;\n\t\t\t\trefGaps  = 
btnstack_.back().refGaps;\n\t\t\t\tscore    = btnstack_.back().score;\n\t\t\t\tct       = btnstack_.back().ct;\n\t\t\t\tbtnstack_.pop_back();\n\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\tNEW_ROW_COL(row, col);\n\t\t\t\tcontinue;\n\t\t\t} else {\n\t\t\t\t// No branch points to revisit; just give up\n\t\t\t\tres.reset();\n\t\t\t\tmet.btfail++; // DP backtraces failed\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\tassert(!reportedThru);\n\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\tif(empty || row == 0) {\n\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\tbtcells_.expand();\n\t\t\tbtcells_.back().first = row;\n\t\t\tbtcells_.back().second = col;\n\t\t\t// This cell is at the end of a legitimate alignment\n\t\t\ttrimBeg = row;\n\t\t\tassert_eq(0, trimBeg);\n\t\t\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t\t\tbreak;\n\t\t}\n\t\tif(branch) {\n\t\t\t// Add a frame to the backtrack stack\n\t\t\tbtnstack_.expand();\n\t\t\tbtnstack_.back().init(\n\t\t\t\tned.size(),\n\t\t\t\t0,               // aed.size()\n\t\t\t\tbtcells_.size(),\n\t\t\t\trow,\n\t\t\t\tcol,\n\t\t\t\tgaps,\n\t\t\t\treadGaps,\n\t\t\t\trefGaps,\n\t\t\t\tscore,\n\t\t\t\t(int)ct);\n\t\t}\n\t\tbtcells_.expand();\n\t\tbtcells_.back().first = row;\n\t\tbtcells_.back().second = col;\n\t\tswitch(cur) {\n\t\t\t// Move up and to the left.  
If the reference nucleotide in the\n\t\t\t// source row mismatches the read nucleotide, penalize\n\t\t\t// it and add a nucleotide mismatch.\n\t\t\tcase SW_BT_OALL_DIAG: {\n\t\t\t\tassert_gt(row, 0); assert_gt(col, 0);\n\t\t\t\t// Check for color mismatch\n\t\t\t\tint readC = (*rd_)[row];\n\t\t\t\tint refNmask = (int)rf_[rfi_+col];\n\t\t\t\tassert_gt(refNmask, 0);\n\t\t\t\tint m = matchesEx(readC, refNmask);\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tif(m != 1) {\n\t\t\t\t\tEdit e(\n\t\t\t\t\t\t(int)row,\n\t\t\t\t\t\tmask2dna[refNmask],\n\t\t\t\t\t\t\"ACGTN\"[readC],\n\t\t\t\t\t\tEDIT_TYPE_MM);\n\t\t\t\t\tassert(e.repOk());\n\t\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\t\tned.push_back(e);\n\t\t\t\t\tint pen = QUAL2(row, col);\n\t\t\t\t\tscore.score_ -= pen;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t} else {\n\t\t\t\t\t// Reward a match\n\t\t\t\t\tint64_t bonus = sc_->match(30);\n\t\t\t\t\tscore.score_ += bonus;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t}\n\t\t\t\tif(m == -1) {\n\t\t\t\t\t// score.ns_++;\n\t\t\t\t}\n\t\t\t\trow--; col--;\n\t\t\t\tMOVE_ALL_UPLEFT();\n\t\t\t\tassert(VALID_AL_SCORE(score));\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  
Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_OALL_REF_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->refGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_RFGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(row, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::F;\n\t\t\t\tint pen = sc_->refGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_OALL_READ_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(col, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), 
sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->readGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_RDGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(col, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::E;\n\t\t\t\tint pen = sc_->readGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tdefault: throw 1;\n\t\t}\n\t} // while((int)row > 0)\n\tassert_eq(0, trimBeg);\n\tassert_eq(0, trimEnd);\n\tassert_geq(col, 0);\n\tassert_eq(SSEMatrix::H, ct);\n\t// The number of cells in the backtrace should equal the number of read\n\t// bases after trimming plus the number of gaps\n\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t// Check whether we went through a core diagonal and set 'reported' flag on\n\t// each cell\n\tbool overlappedCoreDiag = false;\n\tfor(size_t i = 0; i < btcells_.size(); i++) {\n\t\tsize_t rw = btcells_[i].first;\n\t\tsize_t cl = btcells_[i].second;\n\t\t// Calculate the diagonal within the *trimmed* rectangle, i.e. 
the\n\t\t// rectangle we dealt with in align, gather and backtrack.\n\t\tint64_t diagi = cl - rw;\n\t\t// Now adjust to the diagonal within the *untrimmed* rectangle by\n\t\t// adding on the amount trimmed from the left.\n\t\tdiagi += rect_->triml;\n\t\tif(diagi >= 0) {\n\t\t\tsize_t diag = (size_t)diagi;\n\t\t\tif(diag >= rect_->corel && diag <= rect_->corer) {\n\t\t\t\toverlappedCoreDiag = true;\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n#ifndef NDEBUG\n\t\t//assert(!d.mat_.reportedThrough(rw, cl));\n\t\t//d.mat_.setReportedThrough(rw, cl);\n\t\tassert(d.mat_.reportedThrough(rw, cl));\n#endif\n\t}\n\tif(!overlappedCoreDiag) {\n\t\t// Must overlap a core diagonal.  Otherwise, we run the risk of\n\t\t// reporting an alignment that overlaps (and trumps) a higher-scoring\n\t\t// alignment that lies partially outside the dynamic programming\n\t\t// rectangle.\n\t\tres.reset();\n\t\tmet.corerej++;\n\t\treturn false;\n\t}\n\tint readC = (*rd_)[rdi_+row];      // get last char in read\n\tint refNmask = (int)rf_[rfi_+col]; // get last ref char ref involved in aln\n\tassert_gt(refNmask, 0);\n\tint m = matchesEx(readC, refNmask);\n\tif(m != 1) {\n\t\tEdit e((int)row, mask2dna[refNmask], \"ACGTN\"[readC], EDIT_TYPE_MM);\n\t\tassert(e.repOk());\n\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\tned.push_back(e);\n\t\tscore.score_ -= QUAL2(row, col);\n\t\tassert_geq(score.score(), minsc_);\n\t} else {\n\t\tscore.score_ += sc_->match(30);\n\t}\n\tif(m == -1) {\n\t\t// score.ns_++;\n\t}\n#if 0\n\tif(score.ns_ > nceil_) {\n\t\t// Alignment has too many Ns in it!\n\t\tres.reset();\n\t\tmet.nrej++;\n\t\treturn false;\n\t}\n#endif\n\tres.reverse();\n\tassert(Edit::repOk(ned, (*rd_)));\n\tassert_eq(score.score(), escore);\n\tassert_leq(gaps, rdgap_ + rfgap_);\n\toff = col;\n\tassert_lt(col + (size_t)rfi_, (size_t)rff_);\n\t// score.gaps_ = gaps;\n\tres.alres.setScore(score);\n#if 0\n\tres.alres.setShape(\n\t\trefidx_,                  // ref id\n\t\toff + rfi_ + rect_->refl, // 0-based 
ref offset\n\t\treflen_,                  // length of entire reference\n\t\tfw_,                      // aligned to Watson?\n\t\trdf_ - rdi_,              // read length\n\t\ttrue,                     // pretrim soft?\n\t\t0,                        // pretrim 5' end\n\t\t0,                        // pretrim 3' end\n\t\ttrue,                     // alignment trim soft?\n\t\tfw_ ? trimBeg : trimEnd,  // alignment trim 5' end\n\t\tfw_ ? trimEnd : trimBeg); // alignment trim 3' end\n#endif\n\tsize_t refns = 0;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\tif((int)rf_[rfi_+i] > 15) {\n\t\t\trefns++;\n\t\t}\n\t}\n\t// res.alres.setRefNs(refns);\n\tassert(Edit::repOk(ned, (*rd_), true, trimBeg, trimEnd));\n\tassert(res.repOk());\n#ifndef NDEBUG\n\tsize_t gapsCheck = 0;\n\tfor(size_t i = 0; i < ned.size(); i++) {\n\t\tif(ned[i].isGap()) gapsCheck++;\n\t}\n\tassert_eq(gaps, gapsCheck);\n\tBTDnaString refstr;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\trefstr.append(firsts5[(int)rf_[rfi_+i]]);\n\t}\n\tBTDnaString editstr;\n    // daehwan\n\t// Edit::toRef((*rd_), ned, editstr, true, trimBeg, trimEnd);\n    Edit::toRef((*rd_), ned, editstr, true, trimBeg + rdi_, trimEnd + (rd_->length() - rdf_));\n\tif(refstr != editstr) {\n\t\tcerr << \"Decoded nucleotides and edits don't match reference:\" << endl;\n\t\tcerr << \"           score: \" << score.score()\n\t\t     << \" (\" << gaps << \" gaps)\" << endl;\n\t\tcerr << \"           edits: \";\n\t\tEdit::print(cerr, ned);\n\t\tcerr << endl;\n\t\tcerr << \"    decoded nucs: \" << (*rd_) << endl;\n\t\tcerr << \"     edited nucs: \" << editstr << endl;\n\t\tcerr << \"  reference nucs: \" << refstr << endl;\n\t\tassert(0);\n\t}\n#endif\n\tmet.btsucc++; // DP backtraces succeeded\n\treturn true;\n}\n"
  },
  {
    "path": "aligner_swsse_loc_i16.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/**\n * aligner_sw_sse.cpp\n *\n * Versions of key alignment functions that use vector instructions to\n * accelerate dynamic programming.  Based chiefly on the striped Smith-Waterman\n * paper and implementation by Michael Farrar.  See:\n *\n * Farrar M. Striped Smith-Waterman speeds database searches six times over\n * other SIMD implementations. Bioinformatics. 2007 Jan 15;23(2):156-61.\n * http://sites.google.com/site/farrarmichael/smith-waterman\n *\n * While the paper describes an implementation of Smith-Waterman, we extend it\n * do end-to-end read alignment as well as local alignment.  The change\n * required for this is minor: we simply let vmax be the maximum element in the\n * score domain rather than the minimum.\n *\n * The vectorized dynamic programming implementation lacks some features that\n * make it hard to adapt to solving the entire dynamic-programming alignment\n * problem.  
For instance:\n *\n * - It doesn't respect gap barriers on either end of the read\n * - It just gives a maximum; not enough information to backtrace without\n *   redoing some alignment\n * - It's a little difficult to handle st_ and en_, especially st_.\n * - The query profile mechanism makes handling of ambiguous reference bases a\n *   little tricky (16 cols in query profile lookup table instead of 5)\n *\n * Given the drawbacks, it is tempting to use SSE dynamic programming as a\n * filter rather than as an aligner per se.  Here are a few ideas for how it\n * can be extended to handle more of the alignment problem:\n *\n * - Save calculated scores to a big array as we go.  We return to this array\n *   to find and backtrace from good solutions.\n */\n\n#include <limits>\n#include \"aligner_sw.h\"\n\nstatic const size_t NBYTES_PER_REG  = 16;\nstatic const size_t NWORDS_PER_REG  = 8;\nstatic const size_t NBITS_PER_WORD  = 16;\nstatic const size_t NBYTES_PER_WORD = 2;\n\n// In 16-bit local mode, we have the option of using signed saturated\n// arithmetic.  Because we have signed arithmetic, there's no need to\n// add/subtract bias when building and applying the query profile.  The lowest\n// value we can use is 0x8000, greatest is 0x7fff.\n\ntypedef int16_t TCScore;\n\n/**\n * Build query profile look up tables for the read.  The query profile look\n * up table is organized as a 1D array indexed by [i][j] where i is the\n * reference character in the current DP column (0=A, 1=C, etc), and j is\n * the segment of the query we're currently working on.\n */\nvoid SwAligner::buildQueryProfileLocalSseI16(bool fw) {\n\tbool& done = fw ? sseI16fwBuilt_ : sseI16rcBuilt_;\n\tif(done) {\n\t\treturn;\n\t}\n\tdone = true;\n\tconst BTDnaString* rd = fw ? rdfw_ : rdrc_;\n\tconst BTString* qu = fw ? 
qufw_ : qurc_;\n\tconst size_t len = rd->length();\n\tconst size_t seglen = (len + (NWORDS_PER_REG-1)) / NWORDS_PER_REG;\n\t// How many __m128i's are needed\n\tsize_t n128s =\n\t\t64 +                    // slack bytes, for alignment?\n\t\t(seglen * ALPHA_SIZE)   // query profile data\n\t\t* 2;                    // & gap barrier data\n\tassert_gt(n128s, 0);\n\tSSEData& d = fw ? sseI16fw_ : sseI16rc_;\n\td.profbuf_.resizeNoCopy(n128s);\n\tassert(!d.profbuf_.empty());\n\td.maxPen_      = d.maxBonus_ = 0;\n\td.lastIter_    = d.lastWord_ = 0;\n\td.qprofStride_ = d.gbarStride_ = 2;\n\td.bias_ = 0; // no bias when words are signed\n\t// For each reference character A, C, G, T, N ...\n\tfor(size_t refc = 0; refc < ALPHA_SIZE; refc++) {\n\t\t// For each segment ...\n\t\tfor(size_t i = 0; i < seglen; i++) {\n\t\t\tsize_t j = i;\n\t\t\tint16_t *qprofWords =\n\t\t\t\treinterpret_cast<int16_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2));\n\t\t\tint16_t *gbarWords =\n\t\t\t\treinterpret_cast<int16_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2) + 1);\n\t\t\t// For each sub-word (byte) ...\n\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\tint sc = 0;\n\t\t\t\t*gbarWords = 0;\n\t\t\t\tif(j < len) {\n\t\t\t\t\tint readc = (*rd)[j];\n\t\t\t\t\tint readq = (*qu)[j];\n\t\t\t\t\tsc = sc_->score(readc, (int)(1 << refc), readq - 33);\n\t\t\t\t\tsize_t j_from_end = len - j - 1;\n\t\t\t\t\tif(j < (size_t)sc_->gapbar ||\n\t\t\t\t\t   j_from_end < (size_t)sc_->gapbar)\n\t\t\t\t\t{\n\t\t\t\t\t\t// Inside the gap barrier\n\t\t\t\t\t\t*gbarWords = 0x8000; // add this twice\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(refc == 0 && j == len-1) {\n\t\t\t\t\t// Remember which 128-bit word and which smaller word has\n\t\t\t\t\t// the final row\n\t\t\t\t\td.lastIter_ = i;\n\t\t\t\t\td.lastWord_ = k;\n\t\t\t\t}\n\t\t\t\tif(sc < 0) {\n\t\t\t\t\tif((size_t)(-sc) > d.maxPen_) {\n\t\t\t\t\t\td.maxPen_ = (size_t)(-sc);\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tif((size_t)sc > d.maxBonus_) 
{\n\t\t\t\t\t\td.maxBonus_ = (size_t)sc;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t*qprofWords = (int16_t)sc;\n\t\t\t\tgbarWords++;\n\t\t\t\tqprofWords++;\n\t\t\t\tj += seglen; // update offset into query\n\t\t\t}\n\t\t}\n\t}\n}\n\n#ifndef NDEBUG\n/**\n * Return true iff the cell has sane E/F/H values w/r/t its predecessors.\n */\nstatic bool cellOkLocalI16(\n\tSSEData& d,\n\tsize_t row,\n\tsize_t col,\n\tint refc,\n\tint readc,\n\tint readq,\n\tconst Scoring& sc)     // scoring scheme\n{\n\tTCScore floorsc = MIN_I16;\n\tTCScore ceilsc = MIN_I16-1;\n\tTAlScore offsetsc = 0x8000;\n\tTAlScore sc_h_cur = (TAlScore)d.mat_.helt(row, col);\n\tTAlScore sc_e_cur = (TAlScore)d.mat_.eelt(row, col);\n\tTAlScore sc_f_cur = (TAlScore)d.mat_.felt(row, col);\n\tif(sc_h_cur > floorsc) {\n\t\tsc_h_cur += offsetsc;\n\t}\n\tif(sc_e_cur > floorsc) {\n\t\tsc_e_cur += offsetsc;\n\t}\n\tif(sc_f_cur > floorsc) {\n\t\tsc_f_cur += offsetsc;\n\t}\n\tbool gapsAllowed = true;\n\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\tif(row < (size_t)sc.gapbar || rowFromEnd < (size_t)sc.gapbar) {\n\t\tgapsAllowed = false;\n\t}\n\tbool e_left_trans = false, h_left_trans = false;\n\tbool f_up_trans   = false, h_up_trans = false;\n\tbool h_diag_trans = false;\n\tif(gapsAllowed) {\n\t\tTAlScore sc_h_left = floorsc;\n\t\tTAlScore sc_e_left = floorsc;\n\t\tTAlScore sc_h_up   = floorsc;\n\t\tTAlScore sc_f_up   = floorsc;\n\t\tif(col > 0 && sc_e_cur > floorsc && sc_e_cur <= ceilsc) {\n\t\t\tsc_h_left = d.mat_.helt(row, col-1) + offsetsc;\n\t\t\tsc_e_left = d.mat_.eelt(row, col-1) + offsetsc;\n\t\t\te_left_trans = (sc_e_left > floorsc && sc_e_cur == sc_e_left - sc.readGapExtend());\n\t\t\th_left_trans = (sc_h_left > floorsc && sc_e_cur == sc_h_left - sc.readGapOpen());\n\t\t\tassert(e_left_trans || h_left_trans);\n\t\t}\n\t\tif(row > 0 && sc_f_cur > floorsc && sc_f_cur <= ceilsc) {\n\t\t\tsc_h_up = d.mat_.helt(row-1, col) + offsetsc;\n\t\t\tsc_f_up = d.mat_.felt(row-1, col) + offsetsc;\n\t\t\tf_up_trans = (sc_f_up > 
floorsc && sc_f_cur == sc_f_up - sc.refGapExtend());\n\t\t\th_up_trans = (sc_h_up > floorsc && sc_f_cur == sc_h_up - sc.refGapOpen());\n\t\t\tassert(f_up_trans || h_up_trans);\n\t\t}\n\t} else {\n\t\tassert_geq(floorsc, sc_e_cur);\n\t\tassert_geq(floorsc, sc_f_cur);\n\t}\n\tif(col > 0 && row > 0 && sc_h_cur > floorsc && sc_h_cur <= ceilsc) {\n\t\tTAlScore sc_h_upleft = d.mat_.helt(row-1, col-1) + offsetsc;\n\t\tTAlScore sc_diag = sc.score(readc, (int)refc, readq - 33);\n\t\th_diag_trans = sc_h_cur == sc_h_upleft + sc_diag;\n\t}\n\tassert(\n\t\tsc_h_cur <= floorsc ||\n\t\te_left_trans ||\n\t\th_left_trans ||\n\t\tf_up_trans   ||\n\t\th_up_trans   ||\n\t\th_diag_trans ||\n\t\tsc_h_cur > ceilsc ||\n\t\trow == 0 ||\n\t\tcol == 0);\n\treturn true;\n}\n#endif /*ndef NDEBUG*/\n\n#ifdef NDEBUG\n\n#define assert_all_eq0(x)\n#define assert_all_gt(x, y)\n#define assert_all_gt_lo(x)\n#define assert_all_lt(x, y)\n#define assert_all_lt_hi(x)\n\n#else\n\n#define assert_all_eq0(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpeq_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt(x, y) { \\\n\t__m128i tmp = _mm_cmpgt_epi16(x, y); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt_lo(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpgt_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt(x, y) { \\\n\t__m128i tmp = _mm_cmplt_epi16(x, y); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_leq(x, y) { \\\n\t__m128i tmp = _mm_cmpgt_epi16(x, y); \\\n\tassert_eq(0x0000, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt_hi(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_cmpeq_epi16(z, z); \\\n\tz = _mm_srli_epi16(z, 1); \\\n\ttmp 
= _mm_cmplt_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n#endif\n\n/**\n * Aligns by filling a dynamic programming matrix with the SSE-accelerated,\n * banded DP approach of Farrar.  As it goes, it determines which cells we\n * might backtrace from and tallies the best (highest-scoring) N backtrace\n * candidate cells per diagonal.  Also returns the alignment score of the best\n * alignment in the matrix.\n *\n * This routine does *not* maintain a matrix holding the entire matrix worth of\n * scores, nor does it maintain any other dense O(mn) data structure, as this\n * would quickly exhaust memory for queries longer than about 10,000 kb.\n * Instead, in the fill stage it maintains two columns worth of scores at a\n * time (current/previous, or right/left) - these take O(m) space.  When\n * finished with the current column, it determines which cells from the\n * previous column, if any, are candidates we might backtrace from to find a\n * full alignment.  A candidate cell has a score that rises above the threshold\n * and isn't improved upon by a match in the next column.  The best N\n * candidates per diagonal are stored in a O(m + n) data structure.\n */\nTAlScore SwAligner::alignGatherLoc16(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert_gt(minsc_, 0);\n\tassert_leq(minsc_, MAX_I16);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? 
sseI16ExtendMet_ : sseI16MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileLocalSseI16(fw_);\n\tassert(!d.profbuf_.empty());\n\n\tassert_gt(d.maxBonus_, 0);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\t\n\t// Now set up the score vectors.  We just need two columns worth, which\n\t// we'll call \"left\" and \"right\".\n\td.vecbuf_.resize(ROWSTRIDE_2COL * iter * 2);\n\td.vecbuf_.zero();\n\t__m128i *vbuf_l = d.vecbuf_.ptr();\n\t__m128i *vbuf_r = d.vecbuf_.ptr() + (ROWSTRIDE_2COL * iter);\n\t\n\t// This is the data structure that holds candidate cells per diagonal.\n\tconst size_t ndiags = rff_ - rfi_ + dpRows() - 1;\n\tif(!debug) {\n\t\tbtdiag_.init(ndiags, 2);\n\t}\n\t\n\t// Data structure that holds checkpointed anti-diagonals\n\tTAlScore perfectScore = sc_->perfectScore(dpRows());\n\tbool checkpoint = true;\n\tbool cpdebug = false;\n#ifndef NDEBUG\n\tcpdebug = dpRows() < 1000;\n#endif\n\tcper_.init(\n\t\tdpRows(),      // # rows\n\t\trff_ - rfi_,   // # columns\n\t\tcperPerPow2_,  // checkpoint every 1 << perpow2 diags (& next)\n\t\tperfectScore,  // perfect score (for sanity checks)\n\t\tfalse,         // matrix cells have 8-bit scores?\n\t\tcperTri_,      // triangular mini-fills?\n\t\ttrue,          // alignment is local?\n\t\tcpdebug);      // save all cells for debugging?\n\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i vlolsw   = _mm_setzero_si128();\n\t__m128i vmax     = 
_mm_setzero_si128();\n\t__m128i vcolmax  = _mm_setzero_si128();\n\t__m128i vmaxtmp  = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n\t__m128i vhd      = _mm_setzero_si128();\n\t__m128i vhdtmp   = _mm_setzero_si128();\n\t__m128i vtmp     = _mm_setzero_si128();\n\t__m128i vzero    = _mm_setzero_si128();\n\t__m128i vminsc   = _mm_setzero_si128();\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_I16);\n\trfgapo = _mm_insert_epi16(rfgapo, sc_->refGapOpen(), 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_I16);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\trfgape = _mm_insert_epi16(rfgape, sc_->refGapExtend(), 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_I16);\n\trdgapo = _mm_insert_epi16(rdgapo, sc_->readGapOpen(), 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_I16);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\trdgape = _mm_insert_epi16(rdgape, sc_->readGapExtend(), 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\t\n\t// Set all elts to minimum score threshold.  
Actually, to 1 less than the\n\t// threshold so we can use gt instead of geq.\n\tvminsc = _mm_insert_epi16(vminsc, (int)minsc_-1, 0);\n\tvminsc = _mm_shufflelo_epi16(vminsc, 0);\n\tvminsc = _mm_shuffle_epi32(vminsc, 0);\n\n\t// Set all elts to 0x8000 (min value for signed 16-bit)\n\tvlo = _mm_cmpeq_epi16(vlo, vlo);             // all elts = 0xffff\n\tvlo = _mm_slli_epi16(vlo, NBITS_PER_WORD-1); // all elts = 0x8000\n\t\n\t// Set all elts to 0x7fff (max value for signed 16-bit)\n\tvhi = _mm_cmpeq_epi16(vhi, vhi);             // all elts = 0xffff\n\tvhi = _mm_srli_epi16(vhi, 1);                // all elts = 0x7fff\n\t\n\t// Set all elts to 0x8000 (min value for signed 16-bit)\n\tvmax = vlo;\n\t\n\t// vlolsw: topmost (least sig) word set to 0x8000, all other words=0\n\tvlolsw = _mm_shuffle_epi32(vlo, 0);\n\tvlolsw = _mm_srli_si128(vlolsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  
If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\tconst size_t colstride = ROWSTRIDE_2COL * iter;\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvELeft = vbuf_l + 0; __m128i *pvERight = vbuf_r + 0;\n\t//__m128i *pvFLeft = vbuf_l + 1;\n\t__m128i *pvFRight = vbuf_r + 1;\n\t__m128i *pvHLeft = vbuf_l + 2; __m128i *pvHRight = vbuf_r + 2;\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t// start low in local mode\n\t\t_mm_store_si128(pvERight, vlo); pvERight += ROWSTRIDE_2COL;\n\t\t_mm_store_si128(pvHRight, vlo); pvHRight += ROWSTRIDE_2COL;\n\t\t// Note: right and left are going to be swapped as soon as we enter\n\t\t// the outer loop below\n\t}\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\tTAlScore matchsc = sc_->match(30);\n\tTAlScore leftmax = MIN_I64;\n\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  
This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tsize_t off = MAX_SIZE_T, lastoff;\n\tbool bailed = false;\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\t// Swap left and right; vbuf_l is the vector on the left, which we\n\t\t// generally load from, and vbuf_r is the vector on the right, which we\n\t\t// generally store to.\n\t\tswap(vbuf_l, vbuf_r);\n\t\tpvELeft = vbuf_l + 0; pvERight = vbuf_r + 0;\n\t\t/* pvFLeft = vbuf_l + 1; */ pvFRight = vbuf_r + 1;\n\t\tpvHLeft = vbuf_l + 2; pvHRight = vbuf_r + 2;\n\t\t\n\t\t// Fetch this column's reference mask\n\t\tconst int refm = (int)rf_[i];\n\t\t\n\t\t// Fetch the appropriate query profile\n\t\tlastoff = off;\n\t\toff = (size_t)firsts5[refm] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Load H vector from the final row of the previous column.\n\t\t// ??? perhaps we should calculate the next iter's F instead of the\n\t\t// current iter's?  
The way we currently do it, seems like it will\n\t\t// almost always require at least one fixup loop iter (to recalculate\n\t\t// this topmost F).\n\t\tvh = _mm_load_si128(pvHLeft + colstride - ROWSTRIDE_2COL);\n\t\t\n\t\t// Set all F cells to low value\n\t\tvf = _mm_cmpeq_epi16(vf, vf);\n\t\tvf = _mm_slli_epi16(vf, NBITS_PER_WORD-1);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t// vf now contains the vertical contribution\n\n\t\t// Store cells in F, calculated previously\n\t\t// No need to veto ref gap extensions, they're all 0x8000s\n\t\t_mm_store_si128(pvFRight, vf);\n\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\n\t\t// Shift down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t// Fill topmost (least sig) cell with low value\n\t\tvh = _mm_or_si128(vh, vlolsw);\n\t\t\n\t\t// We pull out one loop iteration to make it easier to veto values in the top row\n\t\t\n\t\t// Load cells from E, calculated previously\n\t\tve = _mm_load_si128(pvELeft);\n\t\tvhd = _mm_load_si128(pvHLeft);\n\t\tassert_all_lt(ve, vhi);\n\t\tpvELeft += ROWSTRIDE_2COL;\n\t\t// ve now contains the horizontal contribution\n\t\t\n\t\t// Factor in query profile (matches and mismatches)\n\t\tvh = _mm_adds_epi16(vh, pvScore[0]);\n\t\t// vh now contains the diagonal contribution\n\t\t\n\t\t// Update vE value\n\t\tvhdtmp = vhd;\n\t\tvhd = _mm_subs_epi16(vhd, rdgapo);\n\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\tve = _mm_max_epi16(ve, vhd);\n\n\t\t// Update H, factoring in E and F\n\t\tvh = _mm_max_epi16(vh, ve);\n\t\t// F won't change anything!\n\n\t\tvf = vh;\n\n\t\t// Update highest score so far\n\t\tvcolmax = vh;\n\t\t\n\t\t// Save the new vH values\n\t\t_mm_store_si128(pvHRight, vh);\n\n\t\tassert_all_lt(ve, vhi);\n\n\t\tvh = vhdtmp;\n\n\t\tassert_all_lt(ve, vhi);\n\t\tpvHRight += ROWSTRIDE_2COL;\n\t\tpvHLeft += 
ROWSTRIDE_2COL;\n\t\t\n\t\t// Save E values\n\t\t_mm_store_si128(pvERight, ve);\n\t\tpvERight += ROWSTRIDE_2COL;\n\t\t\n\t\t// Update vf value\n\t\tvf = _mm_subs_epi16(vf, rfgapo);\n\t\tassert_all_lt(vf, vhi);\n\t\t\n\t\tpvScore += 2; // move on to next query profile\n\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 1; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELeft);\n\t\t\tvhd = _mm_load_si128(pvHLeft);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELeft += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[0]);\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvhdtmp = vhd;\n\t\t\tvhd = _mm_subs_epi16(vhd, rdgapo);\n\t\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tvhd = _mm_adds_epi16(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\t\tve = _mm_max_epi16(ve, vhd);\n\t\t\t\n\t\t\tvh = _mm_max_epi16(vh, ve);\n\t\t\tvtmp = vh;\n\t\t\t\n\t\t\t// Update highest score encountered this far\n\t\t\tvcolmax = _mm_max_epi16(vcolmax, vh);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\n\t\t\tvh = vhdtmp;\n\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvERight, ve);\n\t\t\tpvERight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epi16(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epi16(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; 
// move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFRight -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFRight);\n\t\t\n\t\tpvHRight -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHRight);\n\t\t\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0x0000) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update highest score encountered so far.\n\t\t\tvcolmax = _mm_max_epi16(vcolmax, vh);\n\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFRight -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tpvHRight -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap 
extension\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\n\t\t// Now we'd like to know exactly which cells in the left column are\n\t\t// candidates we might backtrace from.  First question is: did *any*\n\t\t// elements in the column exceed the minimum score threshold?\n\t\tif(!debug && leftmax >= minsc_) {\n\t\t\t// Yes.  Next question is: which cells are candidates?  We have to\n\t\t\t// allow matches in the right column to override matches above and\n\t\t\t// to the left in the left column.\n\t\t\tassert_gt(i - rfi_, 0);\n\t\t\tpvHLeft  = vbuf_l + 2;\n\t\t\tassert_lt(lastoff, MAX_SIZE_T);\n\t\t\tpvScore = d.profbuf_.ptr() + lastoff; // even elts = query profile, odd = gap barrier\n\t\t\tfor(size_t k = 0; k < iter; k++) {\n\t\t\t\tvh = _mm_load_si128(pvHLeft);\n\t\t\t\tvtmp = _mm_cmpgt_epi16(pvScore[0], vzero);\n\t\t\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\t\tif(cmp != 0) {\n\t\t\t\t\t// At least one candidate in this mask.  
Now iterate\n\t\t\t\t\t// through vm/vh to evaluate individual cells.\n\t\t\t\t\tfor(size_t m = 0; m < NWORDS_PER_REG; m++) {\n\t\t\t\t\t\tsize_t row = k + m * iter;\n\t\t\t\t\t\tif(row >= dpRows()) {\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tTAlScore sc = (TAlScore)(((TCScore *)&vh)[m] + 0x8000);\n\t\t\t\t\t\tif(sc >= minsc_) {\n\t\t\t\t\t\t\tif(((TCScore *)&vtmp)[m] != 0) {\n\t\t\t\t\t\t\t\t// Add to data structure holding all candidates\n\t\t\t\t\t\t\t\tsize_t col = i - rfi_ - 1; // -1 b/c prev col\n\t\t\t\t\t\t\t\tsize_t frombot = dpRows() - row - 1;\n\t\t\t\t\t\t\t\tDpBtCandidate cand(row, col, sc);\n\t\t\t\t\t\t\t\tbtdiag_.add(frombot + col, cand);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\t\tpvScore += 2;\n\t\t\t}\n\t\t}\n\n\t\t// Save some elements to checkpoints\n\t\tif(checkpoint) {\n\t\t\t\n\t\t\t__m128i *pvE = vbuf_r + 0;\n\t\t\t__m128i *pvF = vbuf_r + 1;\n\t\t\t__m128i *pvH = vbuf_r + 2;\n\t\t\tsize_t coli = i - rfi_;\n\t\t\tif(coli < cper_.locol_) cper_.locol_ = coli;\n\t\t\tif(coli > cper_.hicol_) cper_.hicol_ = coli;\n\t\t\t\n\t\t\tif(cperTri_) {\n\t\t\t\tsize_t rc_mod = coli & cper_.lomask_;\n\t\t\t\tassert_lt(rc_mod, cper_.per_);\n\t\t\t\tint64_t row = -rc_mod-1;\n\t\t\t\tint64_t row_mod = row;\n\t\t\t\tint64_t row_div = 0;\n\t\t\t\tsize_t idx = coli >> cper_.perpow2_;\n\t\t\t\tsize_t idxrow = idx * cper_.nrow_;\n\t\t\t\tassert_eq(4, ROWSTRIDE_2COL);\n\t\t\t\tbool done = false;\n\t\t\t\twhile(true) {\n\t\t\t\t\trow += (cper_.per_ - 2);\n\t\t\t\t\trow_mod += (cper_.per_ - 2);\n\t\t\t\t\tfor(size_t j = 0; j < 2; j++) {\n\t\t\t\t\t\trow++;\n\t\t\t\t\t\trow_mod++;\n\t\t\t\t\t\tif(row >= 0 && (size_t)row < cper_.nrow_) {\n\t\t\t\t\t\t\t// Update row divided by iter_ and mod iter_\n\t\t\t\t\t\t\twhile(row_mod >= (int64_t)iter) {\n\t\t\t\t\t\t\t\trow_mod -= (int64_t)iter;\n\t\t\t\t\t\t\t\trow_div++;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tsize_t delt = idxrow + row;\n\t\t\t\t\t\t\tsize_t vecoff = 
(row_mod << 5) + row_div;\n\t\t\t\t\t\t\tassert_lt(row_div, 8);\n\t\t\t\t\t\t\tint16_t h_sc = ((int16_t*)pvH)[vecoff];\n\t\t\t\t\t\t\tint16_t e_sc = ((int16_t*)pvE)[vecoff];\n\t\t\t\t\t\t\tint16_t f_sc = ((int16_t*)pvF)[vecoff];\n\t\t\t\t\t\t\th_sc += 0x8000; assert_geq(h_sc, 0);\n\t\t\t\t\t\t\te_sc += 0x8000; assert_geq(e_sc, 0);\n\t\t\t\t\t\t\tf_sc += 0x8000; assert_geq(f_sc, 0);\n\t\t\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(e_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\t\t\tCpQuad *qdiags = ((j == 0) ? cper_.qdiag1s_.ptr() : cper_.qdiag2s_.ptr());\n\t\t\t\t\t\t\tqdiags[delt].sc[0] = h_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[1] = e_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[2] = f_sc;\n\t\t\t\t\t\t} // if(row >= 0 && row < nrow_)\n\t\t\t\t\t\telse if(row >= 0 && (size_t)row >= cper_.nrow_) {\n\t\t\t\t\t\t\tdone = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t} // end of loop over anti-diags\n\t\t\t\t\tif(done) {\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tidx++;\n\t\t\t\t\tidxrow += cper_.nrow_;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// If this is the first column, take this opportunity to\n\t\t\t\t// pre-calculate the coordinates of the elements we're going to\n\t\t\t\t// checkpoint.\n\t\t\t\tif(coli == 0) {\n\t\t\t\t\tsize_t cpi    = cper_.per_-1;\n\t\t\t\t\tsize_t cpimod = cper_.per_-1;\n\t\t\t\t\tsize_t cpidiv = 0;\n\t\t\t\t\tcper_.commitMap_.clear();\n\t\t\t\t\twhile(cpi < cper_.nrow_) {\n\t\t\t\t\t\twhile(cpimod >= iter) {\n\t\t\t\t\t\t\tcpimod -= iter;\n\t\t\t\t\t\t\tcpidiv++;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tsize_t vecoff = (cpimod << 5) + cpidiv;\n\t\t\t\t\t\tcper_.commitMap_.push_back(vecoff);\n\t\t\t\t\t\tcpi += cper_.per_;\n\t\t\t\t\t\tcpimod += cper_.per_;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t// Save all the rows\n\t\t\t\tsize_t rowoff = 0;\n\t\t\t\tsize_t sz = cper_.commitMap_.size();\n\t\t\t\tfor(size_t i = 0; i < sz; i++, rowoff += cper_.ncol_) {\n\t\t\t\t\tsize_t vecoff = 
cper_.commitMap_[i];\n\t\t\t\t\tint16_t h_sc = ((int16_t*)pvH)[vecoff];\n\t\t\t\t\t//int16_t e_sc = ((int16_t*)pvE)[vecoff];\n\t\t\t\t\tint16_t f_sc = ((int16_t*)pvF)[vecoff];\n\t\t\t\t\th_sc += 0x8000; assert_geq(h_sc, 0);\n\t\t\t\t\t//e_sc += 0x8000; assert_geq(e_sc, 0);\n\t\t\t\t\tf_sc += 0x8000; assert_geq(f_sc, 0);\n\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t//assert_leq(e_sc, cper_.perf_);\n\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\tCpQuad& dst = cper_.qrows_[rowoff + coli];\n\t\t\t\t\tdst.sc[0] = h_sc;\n\t\t\t\t\t//dst.sc[1] = e_sc;\n\t\t\t\t\tdst.sc[2] = f_sc;\n\t\t\t\t}\n\t\t\t\t// Is this a column we'd like to checkpoint?\n\t\t\t\tif((coli & cper_.lomask_) == cper_.lomask_) {\n\t\t\t\t\t// Save the column using memcpys\n\t\t\t\t\tassert_gt(coli, 0);\n\t\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\t\tsize_t coloff = (coli >> cper_.perpow2_) * wordspercol;\n\t\t\t\t\t__m128i *dst = cper_.qcols_.ptr() + coloff;\n\t\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(cper_.debug_) {\n\t\t\t\t// Save the column using memcpys\n\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\tsize_t coloff = coli * wordspercol;\n\t\t\t\t__m128i *dst = cper_.qcolsD_.ptr() + coloff;\n\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t}\n\t\t}\n\n\t\tvmax = _mm_max_epi16(vmax, vcolmax);\n\t\t{\n\t\t\t// Get single largest score in this column\n\t\t\tvmaxtmp = vcolmax;\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 8);\n\t\t\tvmaxtmp = _mm_max_epi16(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 4);\n\t\t\tvmaxtmp = _mm_max_epi16(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 2);\n\t\t\tvmaxtmp = _mm_max_epi16(vmaxtmp, vtmp);\n\t\t\tint16_t ret = _mm_extract_epi16(vmaxtmp, 0);\n\t\t\tTAlScore score = (TAlScore)(ret + 0x8000);\n\t\t\tif(ret == MIN_I16) {\n\t\t\t\tscore = MIN_I64;\n\t\t\t}\n\t\t\t\n\t\t\tif(score < minsc_) {\n\t\t\t\tsize_t ncolleft = rff_ - i - 
1;\n\t\t\t\tif(max<TAlScore>(score, 0) + (TAlScore)ncolleft * matchsc < minsc_) {\n\t\t\t\t\t// Bail!  There can't possibly be a valid alignment that\n\t\t\t\t\t// passes through this column.\n\t\t\t\t\tbailed = true;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t\t\n\t\t\tleftmax = score;\n\t\t}\n\t}\n\t\n\tlastoff = off;\n\t\n\t// Now we'd like to know exactly which cells in the *rightmost* column are\n\t// candidates we might backtrace from.  Did *any* elements exceed the\n\t// minimum score threshold?\n\tif(!debug && !bailed && leftmax >= minsc_) {\n\t\t// Yes.  Next question is: which cells are candidates?  We have to\n\t\t// allow matches in the right column to override matches above and\n\t\t// to the left in the left column.\n\t\tpvHLeft  = vbuf_r + 2;\n\t\tassert_lt(lastoff, MAX_SIZE_T);\n\t\tpvScore = d.profbuf_.ptr() + lastoff; // even elts = query profile, odd = gap barrier\n\t\tfor(size_t k = 0; k < iter; k++) {\n\t\t\tvh = _mm_load_si128(pvHLeft);\n\t\t\tvtmp = _mm_cmpgt_epi16(pvScore[0], vzero);\n\t\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\tif(cmp != 0) {\n\t\t\t\t// At least one candidate in this mask.  
Now iterate\n\t\t\t\t// through vm/vh to evaluate individual cells.\n\t\t\t\tfor(size_t m = 0; m < NWORDS_PER_REG; m++) {\n\t\t\t\t\tsize_t row = k + m * iter;\n\t\t\t\t\tif(row >= dpRows()) {\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tTAlScore sc = (TAlScore)(((TCScore *)&vh)[m] + 0x8000);\n\t\t\t\t\tif(sc >= minsc_) {\n\t\t\t\t\t\tif(((TCScore *)&vtmp)[m] != 0) {\n\t\t\t\t\t\t\t// Add to data structure holding all candidates\n\t\t\t\t\t\t\tsize_t col = rff_ - rfi_ - 1; // index of the rightmost col\n\t\t\t\t\t\t\tsize_t frombot = dpRows() - row - 1;\n\t\t\t\t\t\t\tDpBtCandidate cand(row, col, sc);\n\t\t\t\t\t\t\tbtdiag_.add(frombot + col, cand);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\tpvScore += 2;\n\t\t}\n\t}\n\n\t// Find largest score in vmax\n\tvtmp = _mm_srli_si128(vmax, 8);\n\tvmax = _mm_max_epi16(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 4);\n\tvmax = _mm_max_epi16(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 2);\n\tvmax = _mm_max_epi16(vmax, vtmp);\n\tint16_t ret = _mm_extract_epi16(vmax, 0);\n\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\n\tflag = 0;\n\n\t// Did we find a solution?\n\tTAlScore score = MIN_I64;\n\tif(ret == MIN_I16) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn MIN_I64;\n\t} else {\n\t\tscore = (TAlScore)(ret + 0x8000);\n\t\tif(score < minsc_) {\n\t\t\tflag = -1; // no\n\t\t\tif(!debug) met.dpfail++;\n\t\t\treturn score;\n\t\t}\n\t}\n\t\n\t// Could we have saturated?\n\tif(ret == MAX_I16) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\t\n\t// Now take all the backtrace candidates in the btdiag_ structure and\n\t// dump them into the btncand_ array.  
They'll be sorted later.\n\tif(!debug) {\n\t\tbtdiag_.dump(btncand_);\t\n\t\tassert(!btncand_.empty());\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn score;\n}\n\n/**\n * Solve the current alignment problem using SSE instructions that operate on 8\n * signed 16-bit values packed into a single 128-bit register.\n */\nTAlScore SwAligner::alignNucleotidesLocalSseI16(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? sseI16ExtendMet_ : sseI16MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileLocalSseI16(fw_);\n\tassert(!d.profbuf_.empty());\n\n\tassert_gt(d.maxBonus_, 0);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i vlolsw   = _mm_setzero_si128();\n\t__m128i vmax     = _mm_setzero_si128();\n\t__m128i vcolmax  = _mm_setzero_si128();\n\t__m128i vmaxtmp  = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n\t__m128i vtmp     = 
_mm_setzero_si128();\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_I16);\n\trfgapo = _mm_insert_epi16(rfgapo, sc_->refGapOpen(), 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_I16);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\trfgape = _mm_insert_epi16(rfgape, sc_->refGapExtend(), 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_I16);\n\trdgapo = _mm_insert_epi16(rdgapo, sc_->readGapOpen(), 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_I16);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\trdgape = _mm_insert_epi16(rdgape, sc_->readGapExtend(), 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\n\t// Set all elts to 0x8000 (min value for signed 16-bit)\n\tvlo = _mm_cmpeq_epi16(vlo, vlo);             // all elts = 0xffff\n\tvlo = _mm_slli_epi16(vlo, NBITS_PER_WORD-1); // all elts = 0x8000\n\t\n\t// Set all elts to 0x7fff (max value for signed 16-bit)\n\tvhi = _mm_cmpeq_epi16(vhi, vhi);             // all elts = 0xffff\n\tvhi = _mm_srli_epi16(vhi, 1);                // all elts = 0x7fff\n\t\n\t// Set all elts to 0x8000 (min value for signed 16-bit)\n\tvmax = vlo;\n\t\n\t// vlolsw: topmost (least sig) word set to 0x8000, all other words=0\n\tvlolsw = _mm_shuffle_epi32(vlo, 0);\n\tvlolsw = _mm_srli_si128(vlolsw, NBYTES_PER_REG - NBYTES_PER_WORD);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  
If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\td.mat_.init(dpRows(), rff_ - rfi_, NWORDS_PER_REG);\n\tconst size_t colstride = d.mat_.colstride();\n\t//const size_t rowstride = d.mat_.rowstride();\n\tassert_eq(ROWSTRIDE, colstride / iter);\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvHTmp = d.mat_.tmpvec(0, 0);\n\t__m128i *pvETmp = d.mat_.evec(0, 0);\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t_mm_store_si128(pvETmp, vlo);\n\t\t_mm_store_si128(pvHTmp, vlo); // start low in local mode\n\t\tpvETmp += ROWSTRIDE;\n\t\tpvHTmp += ROWSTRIDE;\n\t}\n\t// These are swapped just before the innermost loop\n\t__m128i *pvHStore = d.mat_.hvec(0, 0);\n\t__m128i *pvHLoad  = d.mat_.tmpvec(0, 0);\n\t__m128i *pvELoad  = d.mat_.evec(0, 0);\n\t__m128i *pvEStore = d.mat_.evecUnsafe(0, 1);\n\t__m128i *pvFStore = d.mat_.fvec(0, 0);\n\t__m128i *pvFTmp   = NULL;\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\tTAlScore matchsc = sc_->match(30);\n\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  
This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tcolstop_ = rff_ - rfi_;\n\tlastsolcol_ = 0;\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert(pvFStore == d.mat_.fvec(0, i - rfi_));\n\t\tassert(pvHStore == d.mat_.hvec(0, i - rfi_));\n\t\t\n\t\t// Fetch this column's reference mask\n\t\tconst int refm = (int)rf_[i];\n\t\t\n\t\t// Fetch the appropriate query profile\n\t\tsize_t off = (size_t)firsts5[refm] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Load H vector from the final row of the previous column\n\t\tvh = _mm_load_si128(pvHLoad + colstride - ROWSTRIDE);\n\t\t\n\t\t// Set all F cells to low value\n\t\tvf = _mm_cmpeq_epi16(vf, vf);\n\t\tvf = _mm_slli_epi16(vf, NBITS_PER_WORD-1);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t// vf now contains the vertical contribution\n\n\t\t// Store cells in F, calculated previously\n\t\t// No need to veto ref gap extensions, they're all 0x8000s\n\t\t_mm_store_si128(pvFStore, vf);\n\t\tpvFStore += ROWSTRIDE;\n\t\t\n\t\t// Shift down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t// Fill topmost (least sig) cell with low value\n\t\tvh = _mm_or_si128(vh, vlolsw);\n\t\t\n\t\t// We pull out one loop iteration to make it easier to veto values in the top row\n\t\t\n\t\t// Load cells from E, calculated previously\n\t\tve = _mm_load_si128(pvELoad);\n\t\tassert_all_lt(ve, vhi);\n\t\tpvELoad += ROWSTRIDE;\n\t\t// ve now contains the horizontal contribution\n\t\t\n\t\t// Factor in query profile (matches and mismatches)\n\t\tvh = _mm_adds_epi16(vh, pvScore[0]);\n\t\t// vh now contains the diagonal contribution\n\t\t\n\t\t// Update H, factoring in E and F\n\t\tvtmp = _mm_max_epi16(vh, 
ve);\n\t\t// F won't change anything!\n\t\t\n\t\tvh = vtmp;\n\t\t\n\t\t// Update highest score so far\n\t\tvcolmax = vlo;\n\t\tvcolmax = _mm_max_epi16(vcolmax, vh);\n\t\t\n\t\t// Save the new vH values\n\t\t_mm_store_si128(pvHStore, vh);\n\t\tpvHStore += ROWSTRIDE;\n\t\t\n\t\t// Update vE value\n\t\tvf = vh;\n\t\tvh = _mm_subs_epi16(vh, rdgapo);\n\t\tvh = _mm_adds_epi16(vh, pvScore[1]); // veto some read gap opens\n\t\tvh = _mm_adds_epi16(vh, pvScore[1]); // veto some read gap opens\n\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\tve = _mm_max_epi16(ve, vh);\n\t\tassert_all_lt(ve, vhi);\n\t\t\n\t\t// Load the next h value\n\t\tvh = _mm_load_si128(pvHLoad);\n\t\tpvHLoad += ROWSTRIDE;\n\t\t\n\t\t// Save E values\n\t\t_mm_store_si128(pvEStore, ve);\n\t\tpvEStore += ROWSTRIDE;\n\t\t\n\t\t// Update vf value\n\t\tvf = _mm_subs_epi16(vf, rfgapo);\n\t\tassert_all_lt(vf, vhi);\n\t\t\n\t\tpvScore += 2; // move on to next query profile\n\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 1; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELoad);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[0]);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epi16(vh, ve);\n\t\t\tvh = _mm_max_epi16(vh, vf);\n\t\t\t\n\t\t\t// Update highest score encountered this far\n\t\t\tvcolmax = _mm_max_epi16(vcolmax, vh);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvtmp = vh;\n\t\t\tvh = _mm_subs_epi16(vh, 
rdgapo);\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[1]); // veto some read gap opens\n\t\t\tvh = _mm_adds_epi16(vh, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epi16(ve, rdgape);\n\t\t\tve = _mm_max_epi16(ve, vh);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\t\n\t\t\t// Load the next h value\n\t\t\tvh = _mm_load_si128(pvHLoad);\n\t\t\tpvHLoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epi16(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epi16(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFTmp = pvFStore;\n\t\tpvFStore -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFStore);\n\t\t\n\t\tpvHStore -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHStore);\n\t\t\n\t\tpvEStore -= colstride; // reset to start of column\n\t\tve = _mm_load_si128(pvEStore);\n\t\t\n\t\tpvHLoad = pvHStore;    // new pvHLoad = pvHStore\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0x0000) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epi16(vh, 
vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update highest score encountered so far\n\t\t\tvcolmax = _mm_max_epi16(vcolmax, vh);\n\t\t\t\n\t\t\t// Update E in case it can be improved using our new vh\n\t\t\tvh = _mm_subs_epi16(vh, rdgapo);\n\t\t\tvh = _mm_adds_epi16(vh, *pvScore); // veto some read gap opens\n\t\t\tvh = _mm_adds_epi16(vh, *pvScore); // veto some read gap opens\n\t\t\tve = _mm_max_epi16(ve, vh);\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFStore -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tpvHStore -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n\t\t\t\tpvEStore -= colstride;\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t\tvf = _mm_or_si128(vf, vlolsw);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epi16(vf, rfgape);\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_adds_epi16(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epi16(vtmp, vf);\n\t\t\tvtmp = _mm_cmpgt_epi16(vf, vtmp);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\t\t\n#ifndef NDEBUG\n\t\tif((rand() & 15) == 0) {\n\t\t\t// This is a work-intensive sanity check; each time we finish filling\n\t\t\t// a column, we check that each H, E, and F is sensible.\n\t\t\tfor(size_t k = 0; k < dpRows(); k++) {\n\t\t\t\tassert(cellOkLocalI16(\n\t\t\t\t\td,\n\t\t\t\t\tk,         
          // row\n\t\t\t\t\ti - rfi_,            // col\n\t\t\t\t\trefm,                // reference mask\n\t\t\t\t\t(int)(*rd_)[rdi_+k], // read char\n\t\t\t\t\t(int)(*qu_)[rdi_+k], // read quality\n\t\t\t\t\t*sc_));              // scoring scheme\n\t\t\t}\n\t\t}\n#endif\n\n\t\t// Store column maximum vector in first element of tmp\n\t\tvmax = _mm_max_epi16(vmax, vcolmax);\n\t\t_mm_store_si128(d.mat_.tmpvec(0, i - rfi_), vcolmax);\n\n\t\t{\n\t\t\t// Get single largest score in this column\n\t\t\tvmaxtmp = vcolmax;\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 8);\n\t\t\tvmaxtmp = _mm_max_epi16(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 4);\n\t\t\tvmaxtmp = _mm_max_epi16(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 2);\n\t\t\tvmaxtmp = _mm_max_epi16(vmaxtmp, vtmp);\n\t\t\tint16_t ret = _mm_extract_epi16(vmaxtmp, 0);\n\t\t\tTAlScore score = (TAlScore)(ret + 0x8000);\n\t\t\t\n\t\t\tif(score < minsc_) {\n\t\t\t\tsize_t ncolleft = rff_ - i - 1;\n\t\t\t\tif(score + (TAlScore)ncolleft * matchsc < minsc_) {\n\t\t\t\t\t// Bail!  We're guaranteed not to see a valid alignment in\n\t\t\t\t\t// the rest of the matrix\n\t\t\t\t\tcolstop_ = (i+1) - rfi_;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tlastsolcol_ = i - rfi_;\n\t\t\t}\n\t\t}\n\n\t\t// pvELoad and pvHLoad are already where they need to be\n\t\t\n\t\t// Adjust the load and store vectors here.  
\n\t\tpvHStore = pvHLoad + colstride;\n\t\tpvEStore = pvELoad + colstride;\n\t\tpvFStore = pvFTmp;\n\t}\n\n\t// Find largest score in vmax\n\tvtmp = _mm_srli_si128(vmax, 8);\n\tvmax = _mm_max_epi16(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 4);\n\tvmax = _mm_max_epi16(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 2);\n\tvmax = _mm_max_epi16(vmax, vtmp);\n\tint16_t ret = _mm_extract_epi16(vmax, 0);\n\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\n\tflag = 0;\n\n\t// Did we find a solution?\n\tTAlScore score = MIN_I64;\n\tif(ret == MIN_I16) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn MIN_I64;\n\t} else {\n\t\tscore = (TAlScore)(ret + 0x8000);\n\t\tif(score < minsc_) {\n\t\t\tflag = -1; // no\n\t\t\tif(!debug) met.dpfail++;\n\t\t\treturn score;\n\t\t}\n\t}\n\t\n\t// Could we have saturated?\n\tif(ret == MAX_I16) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn score;\n}\n\n/**\n * Given a filled-in DP table, populate the btncand_ list with candidate cells\n * that might be at the ends of valid alignments.  No need to do this unless\n * the maximum score returned by the align*() func is >= the minimum.\n *\n * We needn't consider cells that have no chance of reaching any of the core\n * diagonals.  
These are the cells that are more than 'maxgaps' cells away from\n * a core diagonal.\n *\n * We need to be careful to consider that the rectangle might be truncated on\n * one or both ends.\n *\n * The seed extend case looks like this:\n *\n *      |Rectangle|   0: seed diagonal\n *      **OO0oo----   o: \"RHS gap\" diagonals\n *      -**OO0oo---   O: \"LHS gap\" diagonals\n *      --**OO0oo--   *: \"LHS extra\" diagonals\n *      ---**OO0oo-   -: cells that can't possibly be involved in a valid    \n *      ----**OO0oo      alignment that overlaps one of the core diagonals\n *\n * The anchor-to-left case looks like this:\n *\n *   |Anchor|  | ---- Rectangle ---- |\n *   o---------OO0000000000000oo------  0: mate diagonal (also core diags!)\n *   -o---------OO0000000000000oo-----  o: \"RHS gap\" diagonals\n *   --o---------OO0000000000000oo----  O: \"LHS gap\" diagonals\n *   ---oo--------OO0000000000000oo---  *: \"LHS extra\" diagonals\n *   -----o--------OO0000000000000oo--  -: cells that can't possibly be\n *   ------o--------OO0000000000000oo-     involved in a valid alignment that\n *   -------o--------OO0000000000000oo     overlaps one of the core diagonals\n *                     XXXXXXXXXXXXX\n *                     | RHS Range |\n *                     ^           ^\n *                     rl          rr\n *\n * The anchor-to-right case looks like this:\n *\n *    ll          lr\n *    v           v\n *    | LHS Range |\n *    XXXXXXXXXXXXX          |Anchor|\n *  OO0000000000000oo--------o--------  0: mate diagonal (also core diags!)\n *  -OO0000000000000oo--------o-------  o: \"RHS gap\" diagonals\n *  --OO0000000000000oo--------o------  O: \"LHS gap\" diagonals\n *  ---OO0000000000000oo--------oo----  *: \"LHS extra\" diagonals\n *  ----OO0000000000000oo---------o---  -: cells that can't possibly be\n *  -----OO0000000000000oo---------o--     involved in a valid alignment that\n *  ------OO0000000000000oo---------o-     overlaps one of the core 
diagonals\n *  | ---- Rectangle ---- |\n */\nbool SwAligner::gatherCellsNucleotidesLocalSseI16(TAlScore best) {\n\t// What's the minimum number of rows that can possibly be spanned by an\n\t// alignment that meets the minimum score requirement?\n\tassert(sse16succ_);\n\tsize_t bonus = (size_t)sc_->match(30);\n\tconst size_t ncol = lastsolcol_ + 1;\n\tconst size_t nrow = dpRows();\n\tassert_gt(nrow, 0);\n\tbtncand_.clear();\n\tbtncanddone_.clear();\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? sseI16ExtendMet_ : sseI16MateMet_;\n\tassert(!d.profbuf_.empty());\n\t//const size_t rowstride = d.mat_.rowstride();\n\t//const size_t colstride = d.mat_.colstride();\n\tsize_t iter = (dpRows() + (NWORDS_PER_REG - 1)) / NWORDS_PER_REG;\n\tassert_gt(iter, 0);\n\tassert_geq(minsc_, 0);\n\tassert_gt(bonus, 0);\n\tsize_t minrow = (size_t)(((minsc_ + bonus - 1) / bonus) - 1);\n\tfor(size_t j = 0; j < ncol; j++) {\n\t\t// Establish the range of rows where a backtrace from the cell in this\n\t\t// row/col is close enough to one of the core diagonals that it could\n\t\t// conceivably count\n\t\tsize_t nrow_lo = MIN_SIZE_T;\n\t\tsize_t nrow_hi = nrow;\n\t\t// First, check if there is a cell in this column with a score\n\t\t// above the score threshold\n\t\t__m128i vmax = *d.mat_.tmpvec(0, j);\n\t\t__m128i vtmp = _mm_srli_si128(vmax, 8);\n\t\tvmax = _mm_max_epi16(vmax, vtmp);\n\t\tvtmp = _mm_srli_si128(vmax, 4);\n\t\tvmax = _mm_max_epi16(vmax, vtmp);\n\t\tvtmp = _mm_srli_si128(vmax, 2);\n\t\tvmax = _mm_max_epi16(vmax, vtmp);\n\t\tTAlScore score = (TAlScore)((int16_t)_mm_extract_epi16(vmax, 0) + 0x8000);\n\t\tassert_geq(score, 0);\n#ifndef NDEBUG\n\t\t{\n\t\t\t// Start in upper vector row and move down\n\t\t\tTAlScore max = 0;\n\t\t\tvmax = *d.mat_.tmpvec(0, j);\n\t\t\t__m128i *pvH = d.mat_.hvec(0, j);\n\t\t\tfor(size_t i = 0; i < iter; i++) {\n\t\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\t\tTAlScore sc = (TAlScore)(((TCScore*)pvH)[k] + 
0x8000);\n\t\t\t\t\tTAlScore scm = (TAlScore)(((TCScore*)&vmax)[k] + 0x8000);\n\t\t\t\t\tassert_leq(sc, scm);\n\t\t\t\t\tif(sc > max) {\n\t\t\t\t\t\tmax = sc;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tpvH += ROWSTRIDE;\n\t\t\t}\n\t\t\tassert_eq(max, score);\n\t\t}\n#endif\n\t\tif(score < minsc_) {\n\t\t\t// Scores in column aren't good enough\n\t\t\tcontinue;\n\t\t}\n\t\t// Get pointer to first cell in column to examine:\n\t\t__m128i *pvHorig = d.mat_.hvec(0, j);\n\t\t__m128i *pvH     = pvHorig;\n\t\t// Get pointer to the vector in the following column that corresponds\n\t\t// to the cells diagonally down and to the right from the cells in pvH\n\t\t__m128i *pvHSucc = (j < ncol-1) ? d.mat_.hvec(0, j+1) : NULL;\n\t\t// Start in upper vector row and move down\n\t\tfor(size_t i = 0; i < iter; i++) {\n\t\t\tif(pvHSucc != NULL) {\n\t\t\t\tpvHSucc += ROWSTRIDE;\n\t\t\t\tif(i == iter-1) {\n\t\t\t\t\tpvHSucc = d.mat_.hvec(0, j+1);\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Which elements of this vector are exhaustively scored?\n\t\t\tsize_t rdoff = i;\n\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\t// Is this row, col one that we can potential backtrace from?\n\t\t\t\t// I.e. are we close enough to a core diagonal?\n\t\t\t\tif(rdoff >= nrow_lo && rdoff < nrow_hi) {\n\t\t\t\t\t// This cell has been exhaustively scored\n\t\t\t\t\tif(rdoff >= minrow) {\n\t\t\t\t\t\t// ... 
and it could potentially score high enough\n\t\t\t\t\t\tTAlScore sc = (TAlScore)(((TCScore*)pvH)[k] + 0x8000);\n\t\t\t\t\t\tassert_leq(sc, best);\n\t\t\t\t\t\tif(sc >= minsc_) {\n\t\t\t\t\t\t\t// This is a potential solution\n\t\t\t\t\t\t\tbool matchSucc = false;\n\t\t\t\t\t\t\tint readc = (*rd_)[rdoff];\n\t\t\t\t\t\t\tint refc = rf_[j + rfi_];\n\t\t\t\t\t\t\tbool match = ((refc & (1 << readc)) != 0);\n\t\t\t\t\t\t\tif(rdoff < dpRows()-1) {\n\t\t\t\t\t\t\t\tint readcSucc = (*rd_)[rdoff+1];\n\t\t\t\t\t\t\t\tint refcSucc = rf_[j + rfi_ + 1];\n\t\t\t\t\t\t\t\tassert_range(0, 16, refcSucc);\n\t\t\t\t\t\t\t\tmatchSucc = ((refcSucc & (1 << readcSucc)) != 0);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif(match && !matchSucc) {\n\t\t\t\t\t\t\t\t// Yes, this is legit\n\t\t\t\t\t\t\t\tmet.gathsol++;\n\t\t\t\t\t\t\t\tbtncand_.expand();\n\t\t\t\t\t\t\t\tbtncand_.back().init(rdoff, j, sc);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\t// Already saw every element in the vector that's been\n\t\t\t\t\t// exhaustively scored\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\trdoff += iter;\n\t\t\t}\n\t\t\tpvH += ROWSTRIDE;\n\t\t}\n\t}\n\tif(!btncand_.empty()) {\n\t\td.mat_.initMasks();\n\t}\n\treturn !btncand_.empty();\n}\n\n#define MOVE_VEC_PTR_UP(vec, rowvec, rowelt) { \\\n\tif(rowvec == 0) { \\\n\t\trowvec += d.mat_.nvecrow_; \\\n\t\tvec += d.mat_.colstride_; \\\n\t\trowelt--; \\\n\t} \\\n\trowvec--; \\\n\tvec -= ROWSTRIDE; \\\n}\n\n#define MOVE_VEC_PTR_LEFT(vec, rowvec, rowelt) { vec -= d.mat_.colstride_; }\n\n#define MOVE_VEC_PTR_UPLEFT(vec, rowvec, rowelt) { \\\n \tMOVE_VEC_PTR_UP(vec, rowvec, rowelt); \\\n \tMOVE_VEC_PTR_LEFT(vec, rowvec, rowelt); \\\n}\n\n#define MOVE_ALL_LEFT() { \\\n\tMOVE_VEC_PTR_LEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UP() { 
\\\n\tMOVE_VEC_PTR_UP(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UP(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UP(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UPLEFT() { \\\n\tMOVE_VEC_PTR_UPLEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define NEW_ROW_COL(row, col) { \\\n\trowelt = row / d.mat_.nvecrow_; \\\n\trowvec = row % d.mat_.nvecrow_; \\\n\teltvec = (col * d.mat_.colstride_) + (rowvec * ROWSTRIDE); \\\n\tcur_vec = d.mat_.matbuf_.ptr() + eltvec; \\\n\tleft_vec = cur_vec; \\\n\tleft_rowelt = rowelt; \\\n\tleft_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tup_vec = cur_vec; \\\n\tup_rowelt = rowelt; \\\n\tup_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tupleft_vec = up_vec; \\\n\tupleft_rowelt = up_rowelt; \\\n\tupleft_rowvec = up_rowvec; \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n/**\n * Given the dynamic programming table and a cell, trace backwards from the\n * cell and install the edits and score/penalty in the appropriate fields of\n * res.  The RandomSource is used to break ties among equally good ways of\n * tracing back.\n *\n * Whenever we enter a cell, we check if its read/ref coordinates correspond to\n * a cell we traversed constructing a previous alignment.  If so, we backtrack\n * to the last decision point, mask out the path that led to the previously\n * observed cell, and continue along a different path.  If there are no more\n * paths to try, we stop.\n *\n * If an alignment is found, 'off' is set to the alignment's upstream-most\n * reference character's offset and true is returned.  
Otherwise, false is\n * returned.\n *\n * In local alignment mode, this method is liable to be slow, especially for\n * long reads.  This is chiefly because if there is one valid solution\n * (especially if it is pretty high scoring), then many, many paths shooting\n * off that solution's path will also have valid solutions.\n */\nbool SwAligner::backtraceNucleotidesLocalSseI16(\n\tTAlScore       escore, // in: expected score\n\tSwResult&      res,    // out: store results (edits and scores) here\n\tsize_t&        off,    // out: store diagonal projection of origin\n\tsize_t&        nbts,   // out: # backtracks\n\tsize_t         row,    // start in this row\n\tsize_t         col,    // start in this column\n\tRandomSource&  rnd)    // random gen, to choose among equal paths\n{\n\tassert_lt(row, dpRows());\n\tassert_lt(col, (size_t)(rff_ - rfi_));\n\tSSEData& d = fw_ ? sseI16fw_ : sseI16rc_;\n\tSSEMetrics& met = extend_ ? sseI16ExtendMet_ : sseI16MateMet_;\n\tmet.bt++;\n\tassert(!d.profbuf_.empty());\n\tassert_lt(row, rd_->length());\n\tbtnstack_.clear(); // empty the backtrack stack\n\tbtcells_.clear();  // empty the cells-so-far list\n\tAlnScore score;\n\t// score.score_ = score.gaps_ = score.ns_ = 0;\n\tsize_t origCol = col;\n\tsize_t gaps = 0, readGaps = 0, refGaps = 0;\n\tres.alres.reset();\n    EList<Edit>& ned = res.alres.ned();\n\tassert(ned.empty());\n\tassert_gt(dpRows(), row);\n\tASSERT_ONLY(size_t trimEnd = dpRows() - row - 1);\n\tsize_t trimBeg = 0;\n\tsize_t ct = SSEMatrix::H; // cell type\n\t// Row and col in terms of where they fall in the SSE vector matrix\n\tsize_t rowelt, rowvec, eltvec;\n\tsize_t left_rowelt, up_rowelt, upleft_rowelt;\n\tsize_t left_rowvec, up_rowvec, upleft_rowvec;\n\t__m128i *cur_vec, *left_vec, *up_vec, *upleft_vec;\n\tconst size_t gbar = sc_->gapbar;\n\tNEW_ROW_COL(row, col);\n\t// If 'backEliminate' is true, then every time we visit a cell, we remove\n\t// edges into the cell.  
We do this to avoid some of the thrashing around\n\t// that occurs when there are lots of valid candidates in the same DP\n\t// problem.\n\t//const bool backEliminate = true;\n\twhile((int)row >= 0) {\n\t\t// TODO: As soon as we enter a cell, set it as being reported through,\n\t\t// *and* mark all cells that point into this cell as being reported\n\t\t// through.  This will save us from having to consider quite so many\n\t\t// candidates.\n\t\t\n\t\tmet.btcell++;\n\t\tnbts++;\n\t\tint readc = (*rd_)[rdi_ + row];\n\t\tint refm  = (int)rf_[rfi_ + col];\n\t\tint readq = (*qu_)[row];\n\t\tassert_leq(col, origCol);\n\t\t// Get score in this cell\n\t\tbool empty = false, reportedThru, canMoveThru, branch = false;\n\t\tint cur = SSEMatrix::H;\n\t\tif(!d.mat_.reset_[row]) {\n\t\t\td.mat_.resetRow(row);\n\t\t}\n\t\treportedThru = d.mat_.reportedThrough(row, col);\n\t\tcanMoveThru = true;\n\t\tif(reportedThru) {\n\t\t\tcanMoveThru = false;\n\t\t} else {\n\t\t\tempty = false;\n\t\t\tif(row > 0) {\n\t\t\t\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\t\t\t\tbool gapsAllowed = !(row < gbar || rowFromEnd < gbar);\n\t\t\t\tconst int floorsc = 0;\n\t\t\t\tconst int offsetsc = 0x8000;\n\t\t\t\t// Move to beginning of column/row\n\t\t\t\tif(ct == SSEMatrix::E) { // AKA rdgap\n\t\t\t\t\tassert_gt(col, 0);\n\t\t\t\t\tTAlScore sc_cur = ((TCScore*)(cur_vec + SSEMatrix::E))[rowelt] + offsetsc;\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\t// Currently in the E matrix; incoming transition must come from the\n\t\t\t\t\t// left.  
It's either a gap open from the H matrix or a gap extend from\n\t\t\t\t\t// the E matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell to the left\n\t\t\t\t\tTAlScore sc_h_left = ((TCScore*)(left_vec + SSEMatrix::H))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_h_left > floorsc && sc_h_left - sc_->readGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0); // horiz H -> E move possible\n\t\t\t\t\t}\n\t\t\t\t\t// Get E score of cell to the left\n\t\t\t\t\tTAlScore sc_e_left = ((TCScore*)(left_vec + SSEMatrix::E))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_e_left > floorsc && sc_e_left - sc_->readGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1); // horiz E -> E move possible\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isEMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 8) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n\t\t\t\t\t\t// Horiz H -> E or horiz E -> E moves possible\n#if 1\n\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// Pick E -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// Only horiz E -> E move possible, pick it\n\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 0); // done\n\t\t\t\t\t} else 
{\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tif(!branch) {\n\t\t\t\t\t\t// Is this where we can eliminate some incoming paths as well?\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else if(ct == SSEMatrix::F) { // AKA rfgap\n\t\t\t\t\tassert_gt(row, 0);\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\tTAlScore sc_h_up = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_cur  = ((TCScore*)(cur_vec + SSEMatrix::F))[rowelt] + offsetsc;\n\t\t\t\t\t// Currently in the F matrix; incoming transition must come from above.\n\t\t\t\t\t// It's either a gap open from the H matrix or a gap extend from the F\n\t\t\t\t\t// matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell above\n\t\t\t\t\tif(sc_h_up > floorsc && sc_h_up - sc_->refGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get F score of cell above\n\t\t\t\t\tif(sc_f_up > floorsc && sc_f_up - sc_->refGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isFMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 11) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// I chose the H 
cell\n\t\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\t\t\tTAlScore sc_cur      = ((TCScore*)(cur_vec + SSEMatrix::H))[rowelt]    + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up     = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_up     = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::H))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_e_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::E))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_h_upleft = col > 0 ? 
(((TCScore*)(upleft_vec + SSEMatrix::H))[upleft_rowelt] + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_diag     = sc_->score(readc, refm, readq - 33);\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\tif(gapsAllowed) {\n\t\t\t\t\t\tif(sc_h_up     > floorsc && sc_cur == sc_h_up   - sc_->refGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_h_left   > floorsc && sc_cur == sc_h_left - sc_->readGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_f_up     > floorsc && sc_cur == sc_f_up   - sc_->refGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 2);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_e_left   > floorsc && sc_cur == sc_e_left - sc_->readGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 3);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tif(sc_h_upleft > floorsc && sc_cur == sc_h_upleft + sc_diag) {\n\t\t\t\t\t\tmask |= (1 << 4); // diagonal is \n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isHMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 2) & 31;\n\t\t\t\t\t}\n\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\tint opts = alts5[mask];\n\t\t\t\t\tint select = -1;\n\t\t\t\t\tif(opts == 1) {\n\t\t\t\t\t\tselect = firsts5[mask];\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, 0);\n\t\t\t\t\t} else if(opts > 1) {\n#if 1\n\t\t\t\t\t\tif(       (mask & 16) != 0) {\n\t\t\t\t\t\t\tselect = 4; // H diag\n\t\t\t\t\t\t} else if((mask & 1) != 0) {\n\t\t\t\t\t\t\tselect = 0; // H up\n\t\t\t\t\t\t} else if((mask & 4) != 0) {\n\t\t\t\t\t\t\tselect = 2; // F up\n\t\t\t\t\t\t} else if((mask & 2) != 0) {\n\t\t\t\t\t\t\tselect = 1; // H left\n\t\t\t\t\t\t} else if((mask & 8) != 0) {\n\t\t\t\t\t\t\tselect = 3; // E left\n\t\t\t\t\t\t}\n#else\n\t\t\t\t\t\tselect = randFromMask(rnd, mask);\n#endif\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\tmask &= ~(1 
<< select);\n\t\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, mask);\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else { /* No way to backtrack! */ }\n\t\t\t\t\tif(select != -1) {\n\t\t\t\t\t\tif(select == 4) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_DIAG;\n\t\t\t\t\t\t} else if(select == 0) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t} else if(select == 1) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t} else if(select == 2) {\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tassert_eq(3, select)\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed in terimnate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tassert(!empty || !canMoveThru || ct == SSEMatrix::H);\n\t\t\t} // if(row > 0)\n\t\t} // else clause of if(reportedThru)\n\t\tif(!reportedThru) {\n\t\t\td.mat_.setReportedThrough(row, col);\n\t\t}\n\t\tassert(d.mat_.reportedThrough(row, col));\n\t\t//if(backEliminate && row < d.mat_.nrow()-1) {\n\t\t//\t// Possibly pick off neighbors below and to the right if the\n\t\t//\t// neighbor's only way of backtracking is through this cell.\n\t\t//}\n\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t// Cell was involved in a previously-reported alignment?\n\t\tif(!canMoveThru) {\n\t\t\tif(!btnstack_.empty()) {\n\t\t\t\t// Remove all the cells from list back to and including the\n\t\t\t\t// cell where the branch occurred\n\t\t\t\tbtcells_.resize(btnstack_.back().celsz);\n\t\t\t\t// Pop record off the top of the 
stack\n\t\t\t\tned.resize(btnstack_.back().nedsz);\n\t\t\t\t//aed.resize(btnstack_.back().aedsz);\n\t\t\t\trow      = btnstack_.back().row;\n\t\t\t\tcol      = btnstack_.back().col;\n\t\t\t\tgaps     = btnstack_.back().gaps;\n\t\t\t\treadGaps = btnstack_.back().readGaps;\n\t\t\t\trefGaps  = btnstack_.back().refGaps;\n\t\t\t\tscore    = btnstack_.back().score;\n\t\t\t\tct       = btnstack_.back().ct;\n\t\t\t\tbtnstack_.pop_back();\n\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\tNEW_ROW_COL(row, col);\n\t\t\t\tcontinue;\n\t\t\t} else {\n\t\t\t\t// No branch points to revisit; just give up\n\t\t\t\tres.reset();\n\t\t\t\tmet.btfail++; // DP backtraces failed\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\tassert(!reportedThru);\n\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\tif(empty || row == 0) {\n\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\tbtcells_.expand();\n\t\t\tbtcells_.back().first = row;\n\t\t\tbtcells_.back().second = col;\n\t\t\t// This cell is at the end of a legitimate alignment\n\t\t\ttrimBeg = row;\n\t\t\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t\t\tbreak;\n\t\t}\n\t\tif(branch) {\n\t\t\t// Add a frame to the backtrack stack\n\t\t\tbtnstack_.expand();\n\t\t\tbtnstack_.back().init(\n\t\t\t\tned.size(),\n\t\t\t\t0,               // aed.size()\n\t\t\t\tbtcells_.size(),\n\t\t\t\trow,\n\t\t\t\tcol,\n\t\t\t\tgaps,\n\t\t\t\treadGaps,\n\t\t\t\trefGaps,\n\t\t\t\tscore,\n\t\t\t\t(int)ct);\n\t\t}\n\t\tbtcells_.expand();\n\t\tbtcells_.back().first = row;\n\t\tbtcells_.back().second = col;\n\t\tswitch(cur) {\n\t\t\t// Move up and to the left.  
If the reference nucleotide in the\n\t\t\t// source row mismatches the read nucleotide, penalize\n\t\t\t// it and add a nucleotide mismatch.\n\t\t\tcase SW_BT_OALL_DIAG: {\n\t\t\t\tassert_gt(row, 0); assert_gt(col, 0);\n\t\t\t\tint readC = (*rd_)[row];\n\t\t\t\tint refNmask = (int)rf_[rfi_+col];\n\t\t\t\tassert_gt(refNmask, 0);\n\t\t\t\tint m = matchesEx(readC, refNmask);\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tif(m != 1) {\n\t\t\t\t\tEdit e(\n\t\t\t\t\t\t(int)row,\n\t\t\t\t\t\tmask2dna[refNmask],\n\t\t\t\t\t\t\"ACGTN\"[readC],\n\t\t\t\t\t\tEDIT_TYPE_MM);\n\t\t\t\t\tassert(e.repOk());\n\t\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\t\tned.push_back(e);\n\t\t\t\t\tint pen = QUAL2(row, col);\n\t\t\t\t\tscore.score_ -= pen;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t} else {\n\t\t\t\t\t// Reward a match\n\t\t\t\t\tint64_t bonus = sc_->match(30);\n\t\t\t\t\tscore.score_ += bonus;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t}\n\t\t\t\tif(m == -1) {\n\t\t\t\t\t// score.ns_++;\n\t\t\t\t}\n\t\t\t\trow--; col--;\n\t\t\t\tMOVE_ALL_UPLEFT();\n\t\t\t\tassert(VALID_AL_SCORE(score));\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  
Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_OALL_REF_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->refGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_RFGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(row, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::F;\n\t\t\t\tint pen = sc_->refGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_OALL_READ_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(col, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), 
sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->readGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_RDGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(col, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::E;\n\t\t\t\tint pen = sc_->readGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tdefault: throw 1;\n\t\t}\n\t} // while((int)row >= 0)\n\tassert_geq(col, 0);\n\tassert_eq(SSEMatrix::H, ct);\n\t// The number of cells in the backtrace should equal the number of read\n\t// bases after trimming plus the number of gaps\n\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t// Check whether we went through a core diagonal and set 'reported' flag on\n\t// each cell\n\tbool overlappedCoreDiag = false;\n\tfor(size_t i = 0; i < btcells_.size(); i++) {\n\t\tsize_t rw = btcells_[i].first;\n\t\tsize_t cl = btcells_[i].second;\n\t\t// Calculate the diagonal within the *trimmed* rectangle, i.e. 
the\n\t\t// rectangle we dealt with in align, gather and backtrack.\n\t\tint64_t diagi = cl - rw;\n\t\t// Now adjust to the diagonal within the *untrimmed* rectangle by\n\t\t// adding on the amount trimmed from the left.\n\t\tdiagi += rect_->triml;\n\t\tif(diagi >= 0) {\n\t\t\tsize_t diag = (size_t)diagi;\n\t\t\tif(diag >= rect_->corel && diag <= rect_->corer) {\n\t\t\t\toverlappedCoreDiag = true;\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n#ifndef NDEBUG\n\t\t//assert(!d.mat_.reportedThrough(rw, cl));\n\t\t//d.mat_.setReportedThrough(rw, cl);\n\t\tassert(d.mat_.reportedThrough(rw, cl));\n#endif\n\t}\n\tif(!overlappedCoreDiag) {\n\t\t// Must overlap a core diagonal.  Otherwise, we run the risk of\n\t\t// reporting an alignment that overlaps (and trumps) a higher-scoring\n\t\t// alignment that lies partially outside the dynamic programming\n\t\t// rectangle.\n\t\tres.reset();\n\t\tmet.corerej++;\n\t\treturn false;\n\t}\n\tint readC = (*rd_)[rdi_+row];      // get last char in read\n\tint refNmask = (int)rf_[rfi_+col]; // get last ref char ref involved in aln\n\tassert_gt(refNmask, 0);\n\tint m = matchesEx(readC, refNmask);\n\tif(m != 1) {\n\t\tEdit e((int)row, mask2dna[refNmask], \"ACGTN\"[readC], EDIT_TYPE_MM);\n\t\tassert(e.repOk());\n\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\tned.push_back(e);\n\t\tscore.score_ -= QUAL2(row, col);\n\t\tassert_geq(score.score(), minsc_);\n\t} else {\n\t\tscore.score_ += sc_->match(30);\n\t}\n\tif(m == -1) {\n\t\t// score.ns_++;\n\t}\n#if 0\n\tif(score.ns_ > nceil_) {\n\t\t// Alignment has too many Ns in it!\n\t\tres.reset();\n\t\tmet.nrej++;\n\t\treturn false;\n\t}\n#endif\n\tres.reverse();\n\tassert(Edit::repOk(ned, (*rd_)));\n\tassert_eq(score.score(), escore);\n\tassert_leq(gaps, rdgap_ + rfgap_);\n\toff = col;\n\tassert_lt(col + (size_t)rfi_, (size_t)rff_);\n\t// score.gaps_ = gaps;\n\tres.alres.setScore(score);\n#if 0\n\tres.alres.setShape(\n\t\trefidx_,                  // ref id\n\t\toff + rfi_ + rect_->refl, // 0-based 
ref offset\n\t\treflen_,                  // reference length\n\t\tfw_,                      // aligned to Watson?\n\t\trdf_ - rdi_,              // read length\n\t\ttrue,                     // pretrim soft?\n\t\t0,                        // pretrim 5' end\n\t\t0,                        // pretrim 3' end\n\t\ttrue,                     // alignment trim soft?\n\t\tfw_ ? trimBeg : trimEnd,  // alignment trim 5' end\n\t\tfw_ ? trimEnd : trimBeg); // alignment trim 3' end\n#endif\n\tsize_t refns = 0;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\tif((int)rf_[rfi_+i] > 15) {\n\t\t\trefns++;\n\t\t}\n\t}\n\t// res.alres.setRefNs(refns);\n\tassert(Edit::repOk(ned, (*rd_), true, trimBeg, trimEnd));\n\tassert(res.repOk());\n#ifndef NDEBUG\n\tsize_t gapsCheck = 0;\n\tfor(size_t i = 0; i < ned.size(); i++) {\n\t\tif(ned[i].isGap()) gapsCheck++;\n\t}\n\tassert_eq(gaps, gapsCheck);\n\tBTDnaString refstr;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\trefstr.append(firsts5[(int)rf_[rfi_+i]]);\n\t}\n\tBTDnaString editstr;\n\tEdit::toRef((*rd_), ned, editstr, true, trimBeg, trimEnd);\n\tif(refstr != editstr) {\n\t\tcerr << \"Decoded nucleotides and edits don't match reference:\" << endl;\n\t\tcerr << \"           score: \" << score.score()\n\t\t     << \" (\" << gaps << \" gaps)\" << endl;\n\t\tcerr << \"           edits: \";\n\t\tEdit::print(cerr, ned);\n\t\tcerr << endl;\n\t\tcerr << \"    decoded nucs: \" << (*rd_) << endl;\n\t\tcerr << \"     edited nucs: \" << editstr << endl;\n\t\tcerr << \"  reference nucs: \" << refstr << endl;\n\t\tassert(0);\n\t}\n#endif\n\tmet.btsucc++; // DP backtraces succeeded\n\treturn true;\n}\n"
  },
  {
    "path": "aligner_swsse_loc_u8.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/**\n * aligner_sw_sse.cpp\n *\n * Versions of key alignment functions that use vector instructions to\n * accelerate dynamic programming.  Based chiefly on the striped Smith-Waterman\n * paper and implementation by Michael Farrar.  See:\n *\n * Farrar M. Striped Smith-Waterman speeds database searches six times over\n * other SIMD implementations. Bioinformatics. 2007 Jan 15;23(2):156-61.\n * http://sites.google.com/site/farrarmichael/smith-waterman\n *\n * While the paper describes an implementation of Smith-Waterman, we extend it\n * do end-to-end read alignment as well as local alignment.  The change\n * required for this is minor: we simply let vmax be the maximum element in the\n * score domain rather than the minimum.\n *\n * The vectorized dynamic programming implementation lacks some features that\n * make it hard to adapt to solving the entire dynamic-programming alignment\n * problem.  
For instance:\n *\n * - It doesn't respect gap barriers on either end of the read\n * - It just gives a maximum; not enough information to backtrace without\n *   redoing some alignment\n * - It's a little difficult to handle st_ and en_, especially st_.\n * - The query profile mechanism makes handling of ambiguous reference bases a\n *   little tricky (16 cols in query profile lookup table instead of 5)\n *\n * Given the drawbacks, it is tempting to use SSE dynamic programming as a\n * filter rather than as an aligner per se.  Here are a few ideas for how it\n * can be extended to handle more of the alignment problem:\n *\n * - Save calculated scores to a big array as we go.  We return to this array\n *   to find and backtrace from good solutions.\n */\n\n#include <limits>\n#include \"aligner_sw.h\"\n\n// static const size_t NBYTES_PER_REG  = 16;\nstatic const size_t NWORDS_PER_REG  = 16;\n// static const size_t NBITS_PER_WORD  = 8;\nstatic const size_t NBYTES_PER_WORD = 1;\n\n// In local mode, we start low (0) and go high (255).  Factoring in a query\n// profile involves unsigned saturating addition.  All query profile elements\n// should be expressed as a positive number; this is done by adding -min\n// where min is the smallest (negative) score in the query profile.\n\ntypedef uint8_t TCScore;\n\n/**\n * Build query profile look up tables for the read.  The query profile look\n * up table is organized as a 1D array indexed by [i][j] where i is the\n * reference character in the current DP column (0=A, 1=C, etc), and j is\n * the segment of the query we're currently working on.\n */\nvoid SwAligner::buildQueryProfileLocalSseU8(bool fw) {\n\tbool& done = fw ? sseU8fwBuilt_ : sseU8rcBuilt_;\n\tif(done) {\n\t\treturn;\n\t}\n\tdone = true;\n\tconst BTDnaString* rd = fw ? rdfw_ : rdrc_;\n\tconst BTString* qu = fw ? 
qufw_ : qurc_;\n\tconst size_t len = rd->length();\n\tconst size_t seglen = (len + (NWORDS_PER_REG-1)) / NWORDS_PER_REG;\n\t// How many __m128i's are needed\n\tsize_t n128s =\n\t\t64 +                    // slack bytes, for alignment?\n\t\t(seglen * ALPHA_SIZE)   // query profile data\n\t\t* 2;                    // & gap barrier data\n\tassert_gt(n128s, 0);\n\tSSEData& d = fw ? sseU8fw_ : sseU8rc_;\n\td.profbuf_.resizeNoCopy(n128s);\n\tassert(!d.profbuf_.empty());\n\td.maxPen_      = d.maxBonus_ = 0;\n\td.lastIter_    = d.lastWord_ = 0;\n\td.qprofStride_ = d.gbarStride_ = 2;\n\td.bias_ = 0;\n\t// Calculate bias\n\tfor(size_t refc = 0; refc < ALPHA_SIZE; refc++) {\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tint readc = (*rd)[i];\n\t\t\tint readq = (*qu)[i];\n\t\t\tint sc = sc_->score(readc, (int)(1 << refc), readq - 33);\n\t\t\tif(sc < 0 && sc < d.bias_) {\n\t\t\t\td.bias_ = sc;\n\t\t\t}\n\t\t}\n\t}\n\tassert_leq(d.bias_, 0);\n\td.bias_ = -d.bias_;\n\t// For each reference character A, C, G, T, N ...\n\tfor(size_t refc = 0; refc < ALPHA_SIZE; refc++) {\n\t\t// For each segment ...\n\t\tfor(size_t i = 0; i < seglen; i++) {\n\t\t\tsize_t j = i;\n\t\t\tuint8_t *qprofWords =\n\t\t\t\treinterpret_cast<uint8_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2));\n\t\t\tuint8_t *gbarWords =\n\t\t\t\treinterpret_cast<uint8_t*>(d.profbuf_.ptr() + (refc * seglen * 2) + (i * 2) + 1);\n\t\t\t// For each sub-word (byte) ...\n\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\tint sc = 0;\n\t\t\t\t*gbarWords = 0;\n\t\t\t\tif(j < len) {\n\t\t\t\t\tint readc = (*rd)[j];\n\t\t\t\t\tint readq = (*qu)[j];\n\t\t\t\t\tsc = sc_->score(readc, (int)(1 << refc), readq - 33);\n\t\t\t\t\tassert_range(0, 255, sc + d.bias_);\n\t\t\t\t\tsize_t j_from_end = len - j - 1;\n\t\t\t\t\tif(j < (size_t)sc_->gapbar ||\n\t\t\t\t\t   j_from_end < (size_t)sc_->gapbar)\n\t\t\t\t\t{\n\t\t\t\t\t\t// Inside the gap barrier\n\t\t\t\t\t\t*gbarWords = 0xff;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(refc == 0 
&& j == len-1) {\n\t\t\t\t\t// Remember which 128-bit word and which smaller word has\n\t\t\t\t\t// the final row\n\t\t\t\t\td.lastIter_ = i;\n\t\t\t\t\td.lastWord_ = k;\n\t\t\t\t}\n\t\t\t\tif(sc < 0) {\n\t\t\t\t\tif((size_t)(-sc) > d.maxPen_) {\n\t\t\t\t\t\td.maxPen_ = (size_t)(-sc);\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tif((size_t)sc > d.maxBonus_) {\n\t\t\t\t\t\td.maxBonus_ = (size_t)sc;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t*qprofWords = (uint8_t)(sc + d.bias_);\n\t\t\t\tgbarWords++;\n\t\t\t\tqprofWords++;\n\t\t\t\tj += seglen; // update offset into query\n\t\t\t}\n\t\t}\n\t}\n}\n\n#ifndef NDEBUG\n/**\n * Return true iff the cell has sane E/F/H values w/r/t its predecessors.\n */\nstatic bool cellOkLocalU8(\n\tSSEData& d,\n\tsize_t row,\n\tsize_t col,\n\tint refc,\n\tint readc,\n\tint readq,\n\tconst Scoring& sc)     // scoring scheme\n{\n\tTCScore floorsc = 0;\n\tTCScore ceilsc = 255 - d.bias_ - 1;\n\tTAlScore offsetsc = 0;\n\tTAlScore sc_h_cur = (TAlScore)d.mat_.helt(row, col);\n\tTAlScore sc_e_cur = (TAlScore)d.mat_.eelt(row, col);\n\tTAlScore sc_f_cur = (TAlScore)d.mat_.felt(row, col);\n\tif(sc_h_cur > floorsc) {\n\t\tsc_h_cur += offsetsc;\n\t}\n\tif(sc_e_cur > floorsc) {\n\t\tsc_e_cur += offsetsc;\n\t}\n\tif(sc_f_cur > floorsc) {\n\t\tsc_f_cur += offsetsc;\n\t}\n\tbool gapsAllowed = true;\n\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\tif(row < (size_t)sc.gapbar || rowFromEnd < (size_t)sc.gapbar) {\n\t\tgapsAllowed = false;\n\t}\n\tbool e_left_trans = false, h_left_trans = false;\n\tbool f_up_trans   = false, h_up_trans = false;\n\tbool h_diag_trans = false;\n\tif(gapsAllowed) {\n\t\tTAlScore sc_h_left = floorsc;\n\t\tTAlScore sc_e_left = floorsc;\n\t\tTAlScore sc_h_up   = floorsc;\n\t\tTAlScore sc_f_up   = floorsc;\n\t\tif(col > 0 && sc_e_cur > floorsc && sc_e_cur <= ceilsc) {\n\t\t\tsc_h_left = d.mat_.helt(row, col-1) + offsetsc;\n\t\t\tsc_e_left = d.mat_.eelt(row, col-1) + offsetsc;\n\t\t\te_left_trans = (sc_e_left > floorsc && sc_e_cur == sc_e_left - 
sc.readGapExtend());\n\t\t\th_left_trans = (sc_h_left > floorsc && sc_e_cur == sc_h_left - sc.readGapOpen());\n\t\t\tassert(e_left_trans || h_left_trans);\n\t\t}\n\t\tif(row > 0 && sc_f_cur > floorsc && sc_f_cur <= ceilsc) {\n\t\t\tsc_h_up = d.mat_.helt(row-1, col) + offsetsc;\n\t\t\tsc_f_up = d.mat_.felt(row-1, col) + offsetsc;\n\t\t\tf_up_trans = (sc_f_up > floorsc && sc_f_cur == sc_f_up - sc.refGapExtend());\n\t\t\th_up_trans = (sc_h_up > floorsc && sc_f_cur == sc_h_up - sc.refGapOpen());\n\t\t\tassert(f_up_trans || h_up_trans);\n\t\t}\n\t} else {\n\t\tassert_geq(floorsc, sc_e_cur);\n\t\tassert_geq(floorsc, sc_f_cur);\n\t}\n\tif(col > 0 && row > 0 && sc_h_cur > floorsc && sc_h_cur <= ceilsc) {\n\t\tTAlScore sc_h_upleft = d.mat_.helt(row-1, col-1) + offsetsc;\n\t\tTAlScore sc_diag = sc.score(readc, (int)refc, readq - 33);\n\t\th_diag_trans = sc_h_cur == sc_h_upleft + sc_diag;\n\t}\n\tassert(\n\t\tsc_h_cur <= floorsc ||\n\t\te_left_trans ||\n\t\th_left_trans ||\n\t\tf_up_trans   ||\n\t\th_up_trans   ||\n\t\th_diag_trans ||\n\t\tsc_h_cur > ceilsc ||\n\t\trow == 0 ||\n\t\tcol == 0);\n\treturn true;\n}\n#endif /*ndef NDEBUG*/\n\n#ifdef NDEBUG\n\n#define assert_all_eq0(x)\n#define assert_all_gt(x, y)\n#define assert_all_gt_lo(x)\n#define assert_all_lt(x, y)\n#define assert_all_lt_hi(x)\n\n#else\n\n#define assert_all_eq0(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpeq_epi16(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt(x, y) { \\\n\t__m128i tmp = _mm_cmpgt_epu8(x, y); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_gt_lo(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_xor_si128(z, z); \\\n\ttmp = _mm_cmpgt_epu8(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt(x, y) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\tz = 
_mm_xor_si128(z, z); \\\n\t__m128i tmp = _mm_subs_epu8(y, x); \\\n\ttmp = _mm_cmpeq_epi16(tmp, z); \\\n\tassert_eq(0x0000, _mm_movemask_epi8(tmp)); \\\n}\n\n#define assert_all_lt_hi(x) { \\\n\t__m128i z = _mm_setzero_si128(); \\\n\t__m128i tmp = _mm_setzero_si128(); \\\n\tz = _mm_cmpeq_epu8(z, z); \\\n\tz = _mm_srli_epu8(z, 1); \\\n\ttmp = _mm_cmplt_epu8(x, z); \\\n\tassert_eq(0xffff, _mm_movemask_epi8(tmp)); \\\n}\n#endif\n\n/**\n * Aligns by filling a dynamic programming matrix with the SSE-accelerated,\n * striped DP approach of Farrar.  As it goes, it determines which cells we\n * might backtrace from and tallies the best (highest-scoring) N backtrace\n * candidate cells per diagonal.  Also returns the alignment score of the best\n * alignment in the matrix.\n *\n * This routine does *not* maintain a matrix holding the entire matrix worth of\n * scores, nor does it maintain any other dense O(mn) data structure, as this\n * would quickly exhaust memory for queries longer than about 10,000 kb.\n * Instead, in the fill stage it maintains two columns worth of scores at a\n * time (current/previous, or right/left) - these take O(m) space.  When\n * finished with the current column, it determines which cells from the\n * previous column, if any, are candidates we might backtrace from to find a\n * full alignment.  A candidate cell has a score that rises above the threshold\n * and isn't improved upon by a match in the next column.  The best N\n * candidates per diagonal are stored in an O(m + n) data structure.\n */\nTAlScore SwAligner::alignGatherLoc8(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert_gt(minsc_, 0);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? 
sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? sseU8ExtendMet_ : sseU8MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileLocalSseU8(fw_);\n\tassert(!d.profbuf_.empty());\n\tassert_gt(d.bias_, 0);\n\tassert_lt(d.bias_, 127);\n\t\n\tassert_gt(d.maxBonus_, 0);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\t\n\t// Now set up the score vectors.  We just need two columns worth, which\n\t// we'll call \"left\" and \"right\".\n\td.vecbuf_.resize(ROWSTRIDE_2COL * iter * 2);\n\td.vecbuf_.zero();\n\t__m128i *vbuf_l = d.vecbuf_.ptr();\n\t__m128i *vbuf_r = d.vecbuf_.ptr() + (ROWSTRIDE_2COL * iter);\n\t\n\t// This is the data structure that holds candidate cells per diagonal.\n\tconst size_t ndiags = rff_ - rfi_ + dpRows() - 1;\n\tif(!debug) {\n\t\tbtdiag_.init(ndiags, 2);\n\t}\n\t\n\t// Data structure that holds checkpointed anti-diagonals\n\tTAlScore perfectScore = sc_->perfectScore(dpRows());\n\tbool checkpoint = true;\n\tbool cpdebug = false;\n#ifndef NDEBUG\n\tcpdebug = dpRows() < 1000;\n#endif\n\tcper_.init(\n\t\tdpRows(),      // # rows\n\t\trff_ - rfi_,   // # columns\n\t\tcperPerPow2_,  // checkpoint every 1 << perpow2 diags (& next)\n\t\tperfectScore,  // perfect score (for sanity checks)\n\t\ttrue,          // matrix cells have 8-bit scores?\n\t\tcperTri_,      // triangular mini-fills?\n\t\ttrue,          // alignment is local?\n\t\tcpdebug);      // save all cells for debugging?\n\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Set all elts to reference gap open penalty\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = 
_mm_setzero_si128();\n\t__m128i vmax     = _mm_setzero_si128();\n\t__m128i vcolmax  = _mm_setzero_si128();\n\t__m128i vmaxtmp  = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n\t__m128i vhd      = _mm_setzero_si128();\n\t__m128i vhdtmp   = _mm_setzero_si128();\n\t__m128i vtmp     = _mm_setzero_si128();\n\t__m128i vzero    = _mm_setzero_si128();\n\t__m128i vbias    = _mm_setzero_si128();\n\t__m128i vbiasm1  = _mm_setzero_si128();\n\t__m128i vminsc   = _mm_setzero_si128();\n\n\tint dup;\n\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_U8);\n\tdup = (sc_->refGapOpen() << 8) | (sc_->refGapOpen() & 0x00ff);\n\trfgapo = _mm_insert_epi16(rfgapo, dup, 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_U8);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\tdup = (sc_->refGapExtend() << 8) | (sc_->refGapExtend() & 0x00ff);\n\trfgape = _mm_insert_epi16(rfgape, dup, 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 0);\n\tassert_leq(sc_->readGapOpen(), MAX_U8);\n\tdup = (sc_->readGapOpen() << 8) | (sc_->readGapOpen() & 0x00ff);\n\trdgapo = _mm_insert_epi16(rdgapo, dup, 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_U8);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\tdup = (sc_->readGapExtend() << 8) | (sc_->readGapExtend() & 0x00ff);\n\trdgape = _mm_insert_epi16(rdgape, dup, 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 
0);\n\t\n\t// Set all elts to minimum score threshold.  Actually, to 1 less than the\n\t// threshold so we can use gt instead of geq.\n\tdup = (((int)minsc_ - 1) << 8) | (((int)minsc_ - 1) & 0x00ff);\n\tvminsc = _mm_insert_epi16(vminsc, dup, 0);\n\tvminsc = _mm_shufflelo_epi16(vminsc, 0);\n\tvminsc = _mm_shuffle_epi32(vminsc, 0);\n\n\tdup = ((d.bias_ - 1) << 8) | ((d.bias_ - 1) & 0x00ff);\n\tvbiasm1 = _mm_insert_epi16(vbiasm1, dup, 0);\n\tvbiasm1 = _mm_shufflelo_epi16(vbiasm1, 0);\n\tvbiasm1 = _mm_shuffle_epi32(vbiasm1, 0);\n\tvhi = _mm_cmpeq_epi16(vhi, vhi); // all elts = 0xffff\n\tvlo = _mm_xor_si128(vlo, vlo);   // all elts = 0\n\tvmax = vlo;\n\t\n\t// Make a vector of bias offsets\n\tdup = (d.bias_ << 8) | (d.bias_ & 0x00ff);\n\tvbias = _mm_insert_epi16(vbias, dup, 0);\n\tvbias = _mm_shufflelo_epi16(vbias, 0);\n\tvbias = _mm_shuffle_epi32(vbias, 0);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  
If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\tconst size_t colstride = ROWSTRIDE_2COL * iter;\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvELeft = vbuf_l + 0; __m128i *pvERight = vbuf_r + 0;\n\t/* __m128i *pvFLeft = vbuf_l + 1; */ __m128i *pvFRight = vbuf_r + 1;\n\t__m128i *pvHLeft = vbuf_l + 2; __m128i *pvHRight = vbuf_r + 2;\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t// start low in local mode\n\t\t_mm_store_si128(pvERight, vlo); pvERight += ROWSTRIDE_2COL;\n\t\t_mm_store_si128(pvHRight, vlo); pvHRight += ROWSTRIDE_2COL;\n\t}\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\tTAlScore matchsc = sc_->match(30);\n\tTAlScore leftmax = MIN_I64;\n\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  
This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tsize_t off = MAX_SIZE_T, lastoff;\n\tbool bailed = false;\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\t// Swap left and right; vbuf_l is the vector on the left, which we\n\t\t// generally load from, and vbuf_r is the vector on the right, which we\n\t\t// generally store to.\n\t\tswap(vbuf_l, vbuf_r);\n\t\tpvELeft = vbuf_l + 0; pvERight = vbuf_r + 0;\n\t\t/* pvFLeft = vbuf_l + 1; */ pvFRight = vbuf_r + 1;\n\t\tpvHLeft = vbuf_l + 2; pvHRight = vbuf_r + 2;\n\t\t\n\t\t// Fetch this column's reference mask\n\t\tconst int refm = (int)rf_[i];\n\t\t\n\t\t// Fetch the appropriate query profile\n\t\tlastoff = off;\n\t\toff = (size_t)firsts5[refm] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Load H vector from the final row of the previous column.\n\t\t// ??? perhaps we should calculate the next iter's F instead of the\n\t\t// current iter's?  
The way we currently do it, seems like it will\n\t\t// almost always require at least one fixup loop iter (to recalculate\n\t\t// this topmost F).\n\t\tvh = _mm_load_si128(pvHLeft + colstride - ROWSTRIDE_2COL);\n\t\t\n\t\t// Set all cells to low value\n\t\tvf = _mm_xor_si128(vf, vf);\n\t\t// vf now contains the vertical contribution\n\n\t\t// Store cells in F, calculated previously\n\t\t// No need to veto ref gap extensions, they're all 0x00s\n\t\t_mm_store_si128(pvFRight, vf);\n\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\n\t\t// Shift down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t\n\t\t// We pull out one loop iteration to make it easier to veto values in the top row\n\t\t\n\t\t// Load cells from E, calculated previously\n\t\tve = _mm_load_si128(pvELeft);\n\t\tvhd = _mm_load_si128(pvHLeft);\n\t\tassert_all_lt(ve, vhi);\n\t\tpvELeft += ROWSTRIDE_2COL;\n\t\t// ve now contains the horizontal contribution\n\t\t\n\t\t// Factor in query profile (matches and mismatches)\n\t\tvh = _mm_adds_epu8(vh, pvScore[0]);\n\t\tvh = _mm_subs_epu8(vh, vbias);\n\t\t// vh now contains the diagonal contribution\n\n\t\tvhdtmp = vhd;\n\t\tvhd = _mm_subs_epu8(vhd, rdgapo);\n\t\tvhd = _mm_subs_epu8(vhd, pvScore[1]); // veto some read gap opens\n\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\tve = _mm_max_epu8(ve, vhd);\n\n\t\tvh = _mm_max_epu8(vh, ve);\n\t\tvf = vh;\n\n\t\t// Update highest score so far\n\t\tvcolmax = vh;\n\t\t\n\t\t// Save the new vH values\n\t\t_mm_store_si128(pvHRight, vh);\n\n\t\tvh = vhdtmp;\n\t\tassert_all_lt(ve, vhi);\n\t\tpvHRight += ROWSTRIDE_2COL;\n\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\n\t\t// Save E values\n\t\t_mm_store_si128(pvERight, ve);\n\t\tpvERight += ROWSTRIDE_2COL;\n\t\t\n\t\t// Update vf value\n\t\tvf = _mm_subs_epu8(vf, rfgapo);\n\t\tassert_all_lt(vf, vhi);\n\t\t\n\t\tpvScore += 2; // move on to next query profile\n\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 1; j < iter; j++) 
{\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELeft);\n\t\t\tvhd = _mm_load_si128(pvHLeft);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELeft += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_subs_epu8(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_adds_epu8(vh, pvScore[0]);\n\t\t\tvh = _mm_subs_epu8(vh, vbias);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\n\t\t\tvhdtmp = vhd;\n\t\t\tvhd = _mm_subs_epu8(vhd, rdgapo);\n\t\t\tvhd = _mm_subs_epu8(vhd, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\t\tve = _mm_max_epu8(ve, vhd);\n\t\t\t\n\t\t\tvh = _mm_max_epu8(vh, ve);\n\t\t\tvtmp = vh;\n\t\t\t\n\t\t\t// Update highest score encountered this far\n\t\t\tvcolmax = _mm_max_epu8(vcolmax, vh);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\n\t\t\tvh = vhdtmp;\n\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvERight, ve);\n\t\t\tpvERight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vf value\n\t\t\tvtmp = _mm_subs_epu8(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epu8(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFRight -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFRight);\n\t\t\n\t\tpvHRight -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHRight);\n\t\t\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first 
row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\n\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\t// TODO: We're testing whether F changed.  Can't we just assume that F\n\t\t// did change and instead check whether H changed?  Might save us from\n\t\t// entering the fixup loop.\n\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0xffff) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFRight, vf);\n\t\t\tpvFRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHRight, vh);\n\t\t\tpvHRight += ROWSTRIDE_2COL;\n\t\t\t\n\t\t\t// Update highest score encountered so far.\n\t\t\tvcolmax = _mm_max_epu8(vcolmax, vh);\n\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) {\n\t\t\t\tpvFRight -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tpvHRight -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFRight);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHRight);     // load next vh ASAP\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\n\t\t// Now we'd like to know exactly which cells in the left column are\n\t\t// candidates we 
might backtrace from.  First question is: did *any*\n\t\t// elements in the column exceed the minimum score threshold?\n\t\tif(!debug && leftmax >= minsc_) {\n\t\t\t// Yes.  Next question is: which cells are candidates?  We have to\n\t\t\t// allow matches in the right column to override matches above and\n\t\t\t// to the left in the left column.\n\t\t\tassert_gt(i - rfi_, 0);\n\t\t\tpvHLeft  = vbuf_l + 2;\n\t\t\tassert_lt(lastoff, MAX_SIZE_T);\n\t\t\tpvScore = d.profbuf_.ptr() + lastoff; // even elts = query profile, odd = gap barrier\n\t\t\tfor(size_t k = 0; k < iter; k++) {\n\t\t\t\tvh = _mm_load_si128(pvHLeft);\n\t\t\t\tvtmp = _mm_cmpgt_epi8(pvScore[0], vbiasm1);\n\t\t\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\t\tif(cmp != 0xffff) {\n\t\t\t\t\t// At least one candidate in this mask.  Now iterate\n\t\t\t\t\t// through vm/vh to evaluate individual cells.\n\t\t\t\t\tfor(size_t m = 0; m < NWORDS_PER_REG; m++) {\n\t\t\t\t\t\tsize_t row = k + m * iter;\n\t\t\t\t\t\tif(row >= dpRows()) {\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(((TCScore *)&vtmp)[m] > 0 && ((TCScore *)&vh)[m] >= minsc_) {\n\t\t\t\t\t\t\tTCScore sc = ((TCScore *)&vh)[m];\n\t\t\t\t\t\t\tassert_geq(sc, minsc_);\n\t\t\t\t\t\t\t// Add to data structure holding all candidates\n\t\t\t\t\t\t\tsize_t col = i - rfi_ - 1; // -1 b/c prev col\n\t\t\t\t\t\t\tsize_t frombot = dpRows() - row - 1;\n\t\t\t\t\t\t\tDpBtCandidate cand(row, col, sc);\n\t\t\t\t\t\t\tbtdiag_.add(frombot + col, cand);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\t\tpvScore += 2;\n\t\t\t}\n\t\t}\n\n\t\t// Save some elements to checkpoints\n\t\tif(checkpoint) {\n\t\t\t\n\t\t\t__m128i *pvE = vbuf_r + 0;\n\t\t\t__m128i *pvF = vbuf_r + 1;\n\t\t\t__m128i *pvH = vbuf_r + 2;\n\t\t\tsize_t coli = i - rfi_;\n\t\t\tif(coli < cper_.locol_) cper_.locol_ = coli;\n\t\t\tif(coli > cper_.hicol_) cper_.hicol_ = coli;\n\t\t\tif(cperTri_) {\n\t\t\t\t// Checkpoint for triangular mini-fills\n\t\t\t\tsize_t rc_mod = 
coli & cper_.lomask_;\n\t\t\t\tassert_lt(rc_mod, cper_.per_);\n\t\t\t\tint64_t row = -rc_mod-1;\n\t\t\t\tint64_t row_mod = row;\n\t\t\t\tint64_t row_div = 0;\n\t\t\t\tsize_t idx = coli >> cper_.perpow2_;\n\t\t\t\tsize_t idxrow = idx * cper_.nrow_;\n\t\t\t\tassert_eq(4, ROWSTRIDE_2COL);\n\t\t\t\tbool done = false;\n\t\t\t\twhile(true) {\n\t\t\t\t\trow += (cper_.per_ - 2);\n\t\t\t\t\trow_mod += (cper_.per_ - 2);\n\t\t\t\t\tfor(size_t j = 0; j < 2; j++) {\n\t\t\t\t\t\trow++;\n\t\t\t\t\t\trow_mod++;\n\t\t\t\t\t\tif(row >= 0 && (size_t)row < cper_.nrow_) {\n\t\t\t\t\t\t\t// Update row divided by iter_ and mod iter_\n\t\t\t\t\t\t\twhile(row_mod >= (int64_t)iter) {\n\t\t\t\t\t\t\t\trow_mod -= (int64_t)iter;\n\t\t\t\t\t\t\t\trow_div++;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tsize_t delt = idxrow + row;\n\t\t\t\t\t\t\tsize_t vecoff = (row_mod << 6) + row_div;\n\t\t\t\t\t\t\tassert_lt(row_div, 16);\n\t\t\t\t\t\t\tint16_t h_sc = ((uint8_t*)pvH)[vecoff];\n\t\t\t\t\t\t\tint16_t e_sc = ((uint8_t*)pvE)[vecoff];\n\t\t\t\t\t\t\tint16_t f_sc = ((uint8_t*)pvF)[vecoff];\n\t\t\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(e_sc, cper_.perf_);\n\t\t\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\t\t\tCpQuad *qdiags = ((j == 0) ? 
cper_.qdiag1s_.ptr() : cper_.qdiag2s_.ptr());\n\t\t\t\t\t\t\tqdiags[delt].sc[0] = h_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[1] = e_sc;\n\t\t\t\t\t\t\tqdiags[delt].sc[2] = f_sc;\n\t\t\t\t\t\t} // if(row >= 0 && row < nrow_)\n\t\t\t\t\t\telse if(row >= 0 && (size_t)row >= cper_.nrow_) {\n\t\t\t\t\t\t\tdone = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t} // for(size_t j = 0; j < 2; j++)\n\t\t\t\t\tif(done) {\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tidx++;\n\t\t\t\t\tidxrow += cper_.nrow_;\n\t\t\t\t} // while(true)\n\t\t\t} else {\n\t\t\t\t// Checkpoint for square mini-fills\n\t\t\t\n\t\t\t\t// If this is the first column, take this opportunity to\n\t\t\t\t// pre-calculate the coordinates of the elements we're going to\n\t\t\t\t// checkpoint.\n\t\t\t\tif(coli == 0) {\n\t\t\t\t\tsize_t cpi    = cper_.per_-1;\n\t\t\t\t\tsize_t cpimod = cper_.per_-1;\n\t\t\t\t\tsize_t cpidiv = 0;\n\t\t\t\t\tcper_.commitMap_.clear();\n\t\t\t\t\twhile(cpi < cper_.nrow_) {\n\t\t\t\t\t\twhile(cpimod >= iter) {\n\t\t\t\t\t\t\tcpimod -= iter;\n\t\t\t\t\t\t\tcpidiv++;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tsize_t vecoff = (cpimod << 6) + cpidiv;\n\t\t\t\t\t\tcper_.commitMap_.push_back(vecoff);\n\t\t\t\t\t\tcpi += cper_.per_;\n\t\t\t\t\t\tcpimod += cper_.per_;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t// Save all the rows\n\t\t\t\tsize_t rowoff = 0;\n\t\t\t\tsize_t sz = cper_.commitMap_.size();\n\t\t\t\tfor(size_t i = 0; i < sz; i++, rowoff += cper_.ncol_) {\n\t\t\t\t\tsize_t vecoff = cper_.commitMap_[i];\n\t\t\t\t\tint16_t h_sc = ((uint8_t*)pvH)[vecoff];\n\t\t\t\t\t//int16_t e_sc = ((uint8_t*)pvE)[vecoff];\n\t\t\t\t\tint16_t f_sc = ((uint8_t*)pvF)[vecoff];\n\t\t\t\t\tassert_leq(h_sc, cper_.perf_);\n\t\t\t\t\t//assert_leq(e_sc, cper_.perf_);\n\t\t\t\t\tassert_leq(f_sc, cper_.perf_);\n\t\t\t\t\tCpQuad& dst = cper_.qrows_[rowoff + coli];\n\t\t\t\t\tdst.sc[0] = h_sc;\n\t\t\t\t\t//dst.sc[1] = e_sc;\n\t\t\t\t\tdst.sc[2] = f_sc;\n\t\t\t\t}\n\t\t\t\t// Is this a column we'd like to checkpoint?\n\t\t\t\tif((coli & 
cper_.lomask_) == cper_.lomask_) {\n\t\t\t\t\t// Save the column using memcpys\n\t\t\t\t\tassert_gt(coli, 0);\n\t\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\t\tsize_t coloff = (coli >> cper_.perpow2_) * wordspercol;\n\t\t\t\t\t__m128i *dst = cper_.qcols_.ptr() + coloff;\n\t\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(cper_.debug_) {\n\t\t\t\t// Save the column using memcpys\n\t\t\t\tsize_t wordspercol = cper_.niter_ * ROWSTRIDE_2COL;\n\t\t\t\tsize_t coloff = coli * wordspercol;\n\t\t\t\t__m128i *dst = cper_.qcolsD_.ptr() + coloff;\n\t\t\t\tmemcpy(dst, vbuf_r, sizeof(__m128i) * wordspercol);\n\t\t\t}\n\t\t}\n\n\t\t// Store column maximum vector in first element of tmp\n\t\tvmax = _mm_max_epu8(vmax, vcolmax);\n\n\t\t{\n\t\t\t// Get single largest score in this column\n\t\t\tvmaxtmp = vcolmax;\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 8);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 4);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 2);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 1);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tint score = _mm_extract_epi16(vmaxtmp, 0);\n\t\t\tscore = score & 0x00ff;\n\n\t\t\t// Could we have saturated?\n\t\t\tif(score + d.bias_ >= 255) {\n\t\t\t\tflag = -2; // yes\n\t\t\t\tif(!debug) met.dpsat++;\n\t\t\t\treturn MIN_I64;\n\t\t\t}\n\t\t\t\n\t\t\tif(score < minsc_) {\n\t\t\t\tsize_t ncolleft = rff_ - i - 1;\n\t\t\t\tif(score + (TAlScore)ncolleft * matchsc < minsc_) {\n\t\t\t\t\t// Bail!  There can't possibly be a valid alignment that\n\t\t\t\t\t// passes through this column.\n\t\t\t\t\tbailed = true;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t\t\n\t\t\tleftmax = score;\n\t\t}\n\t}\n\t\n\tlastoff = off;\n\t\n\t// Now we'd like to know exactly which cells in the *rightmost* column are\n\t// candidates we might backtrace from.  
Did *any* elements exceed the\n\t// minimum score threshold?\n\tif(!debug && !bailed && leftmax >= minsc_) {\n\t\t// Yes.  Next question is: which cells are candidates?  We have to\n\t\t// allow matches in the right column to override matches above and\n\t\t// to the left in the left column.\n\t\tpvHLeft  = vbuf_r + 2;\n\t\tassert_lt(lastoff, MAX_SIZE_T);\n\t\tpvScore = d.profbuf_.ptr() + lastoff; // even elts = query profile, odd = gap barrier\n\t\tfor(size_t k = 0; k < iter; k++) {\n\t\t\tvh = _mm_load_si128(pvHLeft);\n\t\t\tvtmp = _mm_cmpgt_epi8(pvScore[0], vbiasm1);\n\t\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\tif(cmp != 0xffff) {\n\t\t\t\t// At least one candidate in this mask.  Now iterate\n\t\t\t\t// through vm/vh to evaluate individual cells.\n\t\t\t\tfor(size_t m = 0; m < NWORDS_PER_REG; m++) {\n\t\t\t\t\tsize_t row = k + m * iter;\n\t\t\t\t\tif(row >= dpRows()) {\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t\tif(((TCScore *)&vtmp)[m] > 0 && ((TCScore *)&vh)[m] >= minsc_) {\n\t\t\t\t\t\tTCScore sc = ((TCScore *)&vh)[m];\n\t\t\t\t\t\tassert_geq(sc, minsc_);\n\t\t\t\t\t\t// Add to data structure holding all candidates\n\t\t\t\t\t\tsize_t col = rff_ - rfi_ - 1; // -1 b/c prev col\n\t\t\t\t\t\tsize_t frombot = dpRows() - row - 1;\n\t\t\t\t\t\tDpBtCandidate cand(row, col, sc);\n\t\t\t\t\t\tbtdiag_.add(frombot + col, cand);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tpvHLeft += ROWSTRIDE_2COL;\n\t\t\tpvScore += 2;\n\t\t}\n\t}\n\n\t// Find largest score in vmax\n\tvtmp = _mm_srli_si128(vmax, 8);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 4);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 2);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 1);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\t\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;       
             // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\t\n\tint score = _mm_extract_epi16(vmax, 0);\n\tscore = score & 0x00ff;\n\n\tflag = 0;\n\t\n\t// Could we have saturated?\n\tif(score + d.bias_ >= 255) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\n\t// Did we find a solution?\n\tif(score == MIN_U8 || score < minsc_) {\n\t\tflag = -1; // no\n\t\tif(!debug) met.dpfail++;\n\t\treturn (TAlScore)score;\n\t}\n\t\n\t// Now take all the backtrace candidates in the btdiag_ structure and\n\t// dump them into the btncand_ array.  They'll be sorted later.\n\tif(!debug) {\n\t\tassert(!btdiag_.empty());\n\t\tbtdiag_.dump(btncand_);\t\n\t\tassert(!btncand_.empty());\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn (TAlScore)score;\n}\n\n/**\n * Solve the current alignment problem using SSE instructions that operate on 16\n * unsigned 8-bit values packed into a single 128-bit register.\n */\nTAlScore SwAligner::alignNucleotidesLocalSseU8(int& flag, bool debug) {\n\tassert_leq(rdf_, rd_->length());\n\tassert_leq(rdf_, qu_->length());\n\tassert_lt(rfi_, rff_);\n\tassert_lt(rdi_, rdf_);\n\tassert_eq(rd_->length(), qu_->length());\n\tassert_geq(sc_->gapbar, 1);\n\tassert(repOk());\n#ifndef NDEBUG\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert_range(0, 16, (int)rf_[i]);\n\t}\n#endif\n\n\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? 
sseU8ExtendMet_ : sseU8MateMet_;\n\tif(!debug) met.dp++;\n\tbuildQueryProfileLocalSseU8(fw_);\n\tassert(!d.profbuf_.empty());\n\tassert_geq(d.bias_, 0);\n\n\tassert_gt(d.maxBonus_, 0);\n\tsize_t iter =\n\t\t(dpRows() + (NWORDS_PER_REG-1)) / NWORDS_PER_REG; // iter = segLen\n\n\tint dup;\n\t\n\t// Many thanks to Michael Farrar for releasing his striped Smith-Waterman\n\t// implementation:\n\t//\n\t//  http://sites.google.com/site/farrarmichael/smith-waterman\n\t//\n\t// Much of the implementation below is adapted from Michael's code.\n\n\t// Zero-initialize all vector registers\n\t__m128i rfgapo   = _mm_setzero_si128();\n\t__m128i rfgape   = _mm_setzero_si128();\n\t__m128i rdgapo   = _mm_setzero_si128();\n\t__m128i rdgape   = _mm_setzero_si128();\n\t__m128i vlo      = _mm_setzero_si128();\n\t__m128i vhi      = _mm_setzero_si128();\n\t__m128i vmax     = _mm_setzero_si128();\n\t__m128i vcolmax  = _mm_setzero_si128();\n\t__m128i vmaxtmp  = _mm_setzero_si128();\n\t__m128i ve       = _mm_setzero_si128();\n\t__m128i vf       = _mm_setzero_si128();\n\t__m128i vh       = _mm_setzero_si128();\n\t__m128i vtmp     = _mm_setzero_si128();\n\t__m128i vzero    = _mm_setzero_si128();\n\t__m128i vbias    = _mm_setzero_si128();\n\n\t// Set all elts to reference gap open penalty\n\tassert_gt(sc_->refGapOpen(), 0);\n\tassert_leq(sc_->refGapOpen(), MAX_U8);\n\tdup = (sc_->refGapOpen() << 8) | (sc_->refGapOpen() & 0x00ff);\n\trfgapo = _mm_insert_epi16(rfgapo, dup, 0);\n\trfgapo = _mm_shufflelo_epi16(rfgapo, 0);\n\trfgapo = _mm_shuffle_epi32(rfgapo, 0);\n\t\n\t// Set all elts to reference gap extension penalty\n\tassert_gt(sc_->refGapExtend(), 0);\n\tassert_leq(sc_->refGapExtend(), MAX_U8);\n\tassert_leq(sc_->refGapExtend(), sc_->refGapOpen());\n\tdup = (sc_->refGapExtend() << 8) | (sc_->refGapExtend() & 0x00ff);\n\trfgape = _mm_insert_epi16(rfgape, dup, 0);\n\trfgape = _mm_shufflelo_epi16(rfgape, 0);\n\trfgape = _mm_shuffle_epi32(rfgape, 0);\n\n\t// Set all elts to read gap open penalty\n\tassert_gt(sc_->readGapOpen(), 
0);\n\tassert_leq(sc_->readGapOpen(), MAX_U8);\n\tdup = (sc_->readGapOpen() << 8) | (sc_->readGapOpen() & 0x00ff);\n\trdgapo = _mm_insert_epi16(rdgapo, dup, 0);\n\trdgapo = _mm_shufflelo_epi16(rdgapo, 0);\n\trdgapo = _mm_shuffle_epi32(rdgapo, 0);\n\t\n\t// Set all elts to read gap extension penalty\n\tassert_gt(sc_->readGapExtend(), 0);\n\tassert_leq(sc_->readGapExtend(), MAX_U8);\n\tassert_leq(sc_->readGapExtend(), sc_->readGapOpen());\n\tdup = (sc_->readGapExtend() << 8) | (sc_->readGapExtend() & 0x00ff);\n\trdgape = _mm_insert_epi16(rdgape, dup, 0);\n\trdgape = _mm_shufflelo_epi16(rdgape, 0);\n\trdgape = _mm_shuffle_epi32(rdgape, 0);\n\t\n\tvhi = _mm_cmpeq_epi16(vhi, vhi); // all elts = 0xffff\n\tvlo = _mm_xor_si128(vlo, vlo);   // all elts = 0\n\tvmax = vlo;\n\t\n\t// Make a vector of bias offsets\n\tdup = (d.bias_ << 8) | (d.bias_ & 0x00ff);\n\tvbias = _mm_insert_epi16(vbias, dup, 0);\n\tvbias = _mm_shufflelo_epi16(vbias, 0);\n\tvbias = _mm_shuffle_epi32(vbias, 0);\n\t\n\t// Points to a long vector of __m128i where each element is a block of\n\t// contiguous cells in the E, F or H matrix.  If the index % 3 == 0, then\n\t// the block of cells is from the E matrix.  If index % 3 == 1, they're\n\t// from the F matrix.  
If index % 3 == 2, then they're from the H matrix.\n\t// Blocks of cells are organized in the same interleaved manner as they are\n\t// calculated by the Farrar algorithm.\n\tconst __m128i *pvScore; // points into the query profile\n\n\td.mat_.init(dpRows(), rff_ - rfi_, NWORDS_PER_REG);\n\tconst size_t colstride = d.mat_.colstride();\n\t//const size_t rowstride = d.mat_.rowstride();\n\tassert_eq(ROWSTRIDE, colstride / iter);\n\t\n\t// Initialize the H and E vectors in the first matrix column\n\t__m128i *pvHTmp = d.mat_.tmpvec(0, 0);\n\t__m128i *pvETmp = d.mat_.evec(0, 0);\n\t\n\tfor(size_t i = 0; i < iter; i++) {\n\t\t_mm_store_si128(pvETmp, vlo);\n\t\t_mm_store_si128(pvHTmp, vlo); // start low in local mode\n\t\tpvETmp += ROWSTRIDE;\n\t\tpvHTmp += ROWSTRIDE;\n\t}\n\t// These are swapped just before the innermost loop\n\t__m128i *pvHStore = d.mat_.hvec(0, 0);\n\t__m128i *pvHLoad  = d.mat_.tmpvec(0, 0);\n\t__m128i *pvELoad  = d.mat_.evec(0, 0);\n\t__m128i *pvEStore = d.mat_.evecUnsafe(0, 1);\n\t__m128i *pvFStore = d.mat_.fvec(0, 0);\n\t__m128i *pvFTmp   = NULL;\n\t\n\tassert_gt(sc_->gapbar, 0);\n\tsize_t nfixup = 0;\n\tTAlScore matchsc = sc_->match(30);\n\t\n\t// Fill in the table as usual but instead of using the same gap-penalty\n\t// vector for each iteration of the inner loop, load words out of a\n\t// pre-calculated gap vector parallel to the query profile.  The pre-\n\t// calculated gap vectors enforce the gap barrier constraint by making it\n\t// infinitely costly to introduce a gap in barrier rows.\n\t//\n\t// AND use a separate loop to fill in the first row of the table, enforcing\n\t// the st_ constraints in the process.  
This is awkward because it\n\t// separates the processing of the first row from the others and might make\n\t// it difficult to use the first-row results in the next row, but it might\n\t// be the simplest and least disruptive way to deal with the st_ constraint.\n\t\n\tcolstop_ = rff_ - rfi_;\n\tlastsolcol_ = 0;\n\tfor(size_t i = (size_t)rfi_; i < (size_t)rff_; i++) {\n\t\tassert(pvFStore == d.mat_.fvec(0, i - rfi_));\n\t\tassert(pvHStore == d.mat_.hvec(0, i - rfi_));\n\t\t\n\t\t// Fetch this column's reference mask\n\t\tconst int refm = (int)rf_[i];\n\t\t\n\t\t// Fetch the appropriate query profile\n\t\tsize_t off = (size_t)firsts5[refm] * iter * 2;\n\t\tpvScore = d.profbuf_.ptr() + off; // even elts = query profile, odd = gap barrier\n\t\t\n\t\t// Load H vector from the final row of the previous column\n\t\tvh = _mm_load_si128(pvHLoad + colstride - ROWSTRIDE);\n\t\t\n\t\t// Set all cells to low value\n\t\tvf = _mm_xor_si128(vf, vf);\n\t\t\n\t\t// Store cells in F, calculated previously\n\t\t// No need to veto ref gap extensions, they're all 0x00s\n\t\t_mm_store_si128(pvFStore, vf);\n\t\tpvFStore += ROWSTRIDE;\n\t\t\n\t\t// Shift down so that topmost (least sig) cell gets 0\n\t\tvh = _mm_slli_si128(vh, NBYTES_PER_WORD);\n\t\t\n\t\t// We pull out one loop iteration to make it easier to veto values in the top row\n\t\t\n\t\t// Load cells from E, calculated previously\n\t\tve = _mm_load_si128(pvELoad);\n\t\tassert_all_lt(ve, vhi);\n\t\tpvELoad += ROWSTRIDE;\n\t\t\n\t\t// Factor in query profile (matches and mismatches)\n\t\tvh = _mm_adds_epu8(vh, pvScore[0]);\n\t\tvh = _mm_subs_epu8(vh, vbias);\n\t\t\n\t\t// Update H, factoring in E and F\n\t\tvh = _mm_max_epu8(vh, ve);\n\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\n\t\t// Update highest score so far\n\t\tvcolmax = _mm_xor_si128(vcolmax, vcolmax);\n\t\tvcolmax = _mm_max_epu8(vcolmax, vh);\n\t\t\n\t\t// Save the new vH values\n\t\t_mm_store_si128(pvHStore, vh);\n\t\tpvHStore += ROWSTRIDE;\n\t\t\n\t\t// Update vE 
value\n\t\tvf = vh;\n\t\tvh = _mm_subs_epu8(vh, rdgapo);\n\t\tvh = _mm_subs_epu8(vh, pvScore[1]); // veto some read gap opens\n\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\tve = _mm_max_epu8(ve, vh);\n\t\tassert_all_lt(ve, vhi);\n\t\t\n\t\t// Load the next h value\n\t\tvh = _mm_load_si128(pvHLoad);\n\t\tpvHLoad += ROWSTRIDE;\n\t\t\n\t\t// Save E values\n\t\t_mm_store_si128(pvEStore, ve);\n\t\tpvEStore += ROWSTRIDE;\n\t\t\n\t\t// Update vf value\n\t\tvf = _mm_subs_epu8(vf, rfgapo);\n\t\tassert_all_lt(vf, vhi);\n\t\t\n\t\tpvScore += 2; // move on to next query profile\n\n\t\t// For each character in the reference text:\n\t\tsize_t j;\n\t\tfor(j = 1; j < iter; j++) {\n\t\t\t// Load cells from E, calculated previously\n\t\t\tve = _mm_load_si128(pvELoad);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\tpvELoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Store cells in F, calculated previously\n\t\t\tvf = _mm_subs_epu8(vf, pvScore[1]); // veto some ref gap extensions\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Factor in query profile (matches and mismatches)\n\t\t\tvh = _mm_adds_epu8(vh, pvScore[0]);\n\t\t\tvh = _mm_subs_epu8(vh, vbias);\n\t\t\t\n\t\t\t// Update H, factoring in E and F\n\t\t\tvh = _mm_max_epu8(vh, ve);\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Update highest score encountered this far\n\t\t\tvcolmax = _mm_max_epu8(vcolmax, vh);\n\t\t\t\n\t\t\t// Save the new vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vE value\n\t\t\tvtmp = vh;\n\t\t\tvh = _mm_subs_epu8(vh, rdgapo);\n\t\t\tvh = _mm_subs_epu8(vh, pvScore[1]); // veto some read gap opens\n\t\t\tve = _mm_subs_epu8(ve, rdgape);\n\t\t\tve = _mm_max_epu8(ve, vh);\n\t\t\tassert_all_lt(ve, vhi);\n\t\t\t\n\t\t\t// Load the next h value\n\t\t\tvh = _mm_load_si128(pvHLoad);\n\t\t\tpvHLoad += ROWSTRIDE;\n\t\t\t\n\t\t\t// Save E values\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vf 
value\n\t\t\tvtmp = _mm_subs_epu8(vtmp, rfgapo);\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tassert_all_lt(vf, vhi);\n\t\t\tvf = _mm_max_epu8(vf, vtmp);\n\t\t\t\n\t\t\tpvScore += 2; // move on to next query profile / gap veto\n\t\t}\n\t\t// pvHStore, pvELoad, pvEStore have all rolled over to the next column\n\t\tpvFTmp = pvFStore;\n\t\tpvFStore -= colstride; // reset to start of column\n\t\tvtmp = _mm_load_si128(pvFStore);\n\t\t\n\t\tpvHStore -= colstride; // reset to start of column\n\t\tvh = _mm_load_si128(pvHStore);\n\t\t\n\t\tpvEStore -= colstride; // reset to start of column\n\t\tve = _mm_load_si128(pvEStore);\n\t\t\n\t\tpvHLoad = pvHStore;    // new pvHLoad = pvHStore\n\t\tpvScore = d.profbuf_.ptr() + off + 1; // reset veto vector\n\t\t\n\t\t// vf from last row gets shifted down by one to overlay the first row\n\t\t// rfgape has already been subtracted from it.\n\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\n\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\tint cmp = _mm_movemask_epi8(vtmp);\n\t\t\n\t\t// If any element of vtmp is greater than H - gap-open...\n\t\tj = 0;\n\t\twhile(cmp != 0xffff) {\n\t\t\t// Store this vf\n\t\t\t_mm_store_si128(pvFStore, vf);\n\t\t\tpvFStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update vh w/r/t new vf\n\t\t\tvh = _mm_max_epu8(vh, vf);\n\t\t\t\n\t\t\t// Save vH values\n\t\t\t_mm_store_si128(pvHStore, vh);\n\t\t\tpvHStore += ROWSTRIDE;\n\t\t\t\n\t\t\t// Update highest score encountered this far\n\t\t\tvcolmax = _mm_max_epu8(vcolmax, vh);\n\t\t\t\n\t\t\t// Update E in case it can be improved using our new vh\n\t\t\tvh = _mm_subs_epu8(vh, rdgapo);\n\t\t\tvh = _mm_subs_epu8(vh, *pvScore); // veto some read gap opens\n\t\t\tve = _mm_max_epu8(ve, vh);\n\t\t\t_mm_store_si128(pvEStore, ve);\n\t\t\tpvEStore += ROWSTRIDE;\n\t\t\tpvScore += 2;\n\t\t\t\n\t\t\tassert_lt(j, iter);\n\t\t\tif(++j == iter) 
{\n\t\t\t\tpvFStore -= colstride;\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tpvHStore -= colstride;\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n\t\t\t\tpvEStore -= colstride;\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n\t\t\t\tpvScore = d.profbuf_.ptr() + off + 1;\n\t\t\t\tj = 0;\n\t\t\t\tvf = _mm_slli_si128(vf, NBYTES_PER_WORD);\n\t\t\t} else {\n\t\t\t\tvtmp = _mm_load_si128(pvFStore);   // load next vf ASAP\n\t\t\t\tvh = _mm_load_si128(pvHStore);     // load next vh ASAP\n\t\t\t\tve = _mm_load_si128(pvEStore);     // load next ve ASAP\n\t\t\t}\n\t\t\t\n\t\t\t// Update F with another gap extension\n\t\t\tvf = _mm_subs_epu8(vf, rfgape);\n\t\t\tvf = _mm_subs_epu8(vf, *pvScore); // veto some ref gap extensions\n\t\t\tvf = _mm_max_epu8(vtmp, vf);\n\t\t\tvtmp = _mm_subs_epu8(vf, vtmp);\n\t\t\tvtmp = _mm_cmpeq_epi8(vtmp, vzero);\n\t\t\tcmp = _mm_movemask_epi8(vtmp);\n\t\t\tnfixup++;\n\t\t}\n\n#ifndef NDEBUG\n\t\tif((rand() & 15) == 0) {\n\t\t\t// This is a work-intensive sanity check; each time we finish filling\n\t\t\t// a column, we check that each H, E, and F is sensible.\n\t\t\tfor(size_t k = 0; k < dpRows(); k++) {\n\t\t\t\tassert(cellOkLocalU8(\n\t\t\t\t\td,\n\t\t\t\t\tk,                   // row\n\t\t\t\t\ti - rfi_,            // col\n\t\t\t\t\trefm,                // reference mask\n\t\t\t\t\t(int)(*rd_)[rdi_+k], // read char\n\t\t\t\t\t(int)(*qu_)[rdi_+k], // read quality\n\t\t\t\t\t*sc_));              // scoring scheme\n\t\t\t}\n\t\t}\n#endif\n\n\t\t// Store column maximum vector in first element of tmp\n\t\tvmax = _mm_max_epu8(vmax, vcolmax);\n\t\t_mm_store_si128(d.mat_.tmpvec(0, i - rfi_), vcolmax);\n\n\t\t{\n\t\t\t// Get single largest score in this column\n\t\t\tvmaxtmp = vcolmax;\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 8);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 4);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tvtmp = 
_mm_srli_si128(vmaxtmp, 2);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tvtmp = _mm_srli_si128(vmaxtmp, 1);\n\t\t\tvmaxtmp = _mm_max_epu8(vmaxtmp, vtmp);\n\t\t\tint score = _mm_extract_epi16(vmaxtmp, 0);\n\t\t\tscore = score & 0x00ff;\n\n\t\t\t// Could we have saturated?\n\t\t\tif(score + d.bias_ >= 255) {\n\t\t\t\tflag = -2; // yes\n\t\t\t\tif(!debug) met.dpsat++;\n\t\t\t\treturn MIN_I64;\n\t\t\t}\n\t\t\t\n\t\t\tif(score < minsc_) {\n\t\t\t\tsize_t ncolleft = rff_ - i - 1;\n\t\t\t\tif(score + (TAlScore)ncolleft * matchsc < minsc_) {\n\t\t\t\t\t// Bail!  We're guaranteed not to see a valid alignment in\n\t\t\t\t\t// the rest of the matrix\n\t\t\t\t\tcolstop_ = (i+1) - rfi_;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tlastsolcol_ = i - rfi_;\n\t\t\t}\n\t\t}\n\t\t\n\t\t// pvELoad and pvHLoad are already where they need to be\n\t\t\n\t\t// Adjust the load and store vectors here.  \n\t\tpvHStore = pvHLoad + colstride;\n\t\tpvEStore = pvELoad + colstride;\n\t\tpvFStore = pvFTmp;\n\t}\n\n\t// Find largest score in vmax\n\tvtmp = _mm_srli_si128(vmax, 8);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 4);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 2);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\tvtmp = _mm_srli_si128(vmax, 1);\n\tvmax = _mm_max_epu8(vmax, vtmp);\n\t\n\t// Update metrics\n\tif(!debug) {\n\t\tsize_t ninner = (rff_ - rfi_) * iter;\n\t\tmet.col   += (rff_ - rfi_);             // DP columns\n\t\tmet.cell  += (ninner * NWORDS_PER_REG); // DP cells\n\t\tmet.inner += ninner;                    // DP inner loop iters\n\t\tmet.fixup += nfixup;                    // DP fixup loop iters\n\t}\n\t\n\tint score = _mm_extract_epi16(vmax, 0);\n\tscore = score & 0x00ff;\n\n\tflag = 0;\n\t\n\t// Could we have saturated?\n\tif(score + d.bias_ >= 255) {\n\t\tflag = -2; // yes\n\t\tif(!debug) met.dpsat++;\n\t\treturn MIN_I64;\n\t}\n\n\t// Did we find a solution?\n\tif(score == MIN_U8 || score < minsc_) {\n\t\tflag = -1; // 
no\n\t\tif(!debug) met.dpfail++;\n\t\treturn (TAlScore)score;\n\t}\n\t\n\t// Return largest score\n\tif(!debug) met.dpsucc++;\n\treturn (TAlScore)score;\n}\n\n/**\n * Given a filled-in DP table, populate the btncand_ list with candidate cells\n * that might be at the ends of valid alignments.  No need to do this unless\n * the maximum score returned by the align*() func is >= the minimum.\n *\n * We needn't consider cells that have no chance of reaching any of the core\n * diagonals.  These are the cells that are more than 'maxgaps' cells away from\n * a core diagonal.\n *\n * We need to be careful to consider that the rectangle might be truncated on\n * one or both ends.\n *\n * The seed extend case looks like this:\n *\n *      |Rectangle|   0: seed diagonal\n *      **OO0oo----   o: \"RHS gap\" diagonals\n *      -**OO0oo---   O: \"LHS gap\" diagonals\n *      --**OO0oo--   *: \"LHS extra\" diagonals\n *      ---**OO0oo-   -: cells that can't possibly be involved in a valid    \n *      ----**OO0oo      alignment that overlaps one of the core diagonals\n *\n * The anchor-to-left case looks like this:\n *\n *   |Anchor|  | ---- Rectangle ---- |\n *   o---------OO0000000000000oo------  0: mate diagonal (also core diags!)\n *   -o---------OO0000000000000oo-----  o: \"RHS gap\" diagonals\n *   --o---------OO0000000000000oo----  O: \"LHS gap\" diagonals\n *   ---oo--------OO0000000000000oo---  *: \"LHS extra\" diagonals\n *   -----o--------OO0000000000000oo--  -: cells that can't possibly be\n *   ------o--------OO0000000000000oo-     involved in a valid alignment that\n *   -------o--------OO0000000000000oo     overlaps one of the core diagonals\n *                     XXXXXXXXXXXXX\n *                     | RHS Range |\n *                     ^           ^\n *                     rl          rr\n *\n * The anchor-to-right case looks like this:\n *\n *    ll          lr\n *    v           v\n *    | LHS Range |\n *    XXXXXXXXXXXXX          |Anchor|\n *  
OO0000000000000oo--------o--------  0: mate diagonal (also core diags!)\n *  -OO0000000000000oo--------o-------  o: \"RHS gap\" diagonals\n *  --OO0000000000000oo--------o------  O: \"LHS gap\" diagonals\n *  ---OO0000000000000oo--------oo----  *: \"LHS extra\" diagonals\n *  ----OO0000000000000oo---------o---  -: cells that can't possibly be\n *  -----OO0000000000000oo---------o--     involved in a valid alignment that\n *  ------OO0000000000000oo---------o-     overlaps one of the core diagonals\n *  | ---- Rectangle ---- |\n */\nbool SwAligner::gatherCellsNucleotidesLocalSseU8(TAlScore best) {\n\t// What's the minimum number of rows that can possibly be spanned by an\n\t// alignment that meets the minimum score requirement?\n\tassert(sse8succ_);\n\tsize_t bonus = (size_t)sc_->match(30);\n\tconst size_t ncol = lastsolcol_ + 1;\n\tconst size_t nrow = dpRows();\n\tassert_gt(nrow, 0);\n\tbtncand_.clear();\n\tbtncanddone_.clear();\n\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? 
sseU8ExtendMet_ : sseU8MateMet_;\n\tassert(!d.profbuf_.empty());\n\t//const size_t rowstride = d.mat_.rowstride();\n\t//const size_t colstride = d.mat_.colstride();\n\tsize_t iter = (dpRows() + (NWORDS_PER_REG - 1)) / NWORDS_PER_REG;\n\tassert_gt(iter, 0);\n\tassert_geq(minsc_, 0);\n\tassert_gt(bonus, 0);\n\tsize_t minrow = (size_t)(((minsc_ + bonus - 1) / bonus) - 1);\n\tfor(size_t j = 0; j < ncol; j++) {\n\t\t// Establish the range of rows where a backtrace from the cell in this\n\t\t// row/col is close enough to one of the core diagonals that it could\n\t\t// conceivably count\n\t\tsize_t nrow_lo = MIN_SIZE_T;\n\t\tsize_t nrow_hi = nrow;\n\t\t// First, check if there is a cell in this column with a score\n\t\t// above the score threshold\n\t\t__m128i vmax = *d.mat_.tmpvec(0, j);\n\t\t__m128i vtmp = _mm_srli_si128(vmax, 8);\n\t\tvmax = _mm_max_epu8(vmax, vtmp);\n\t\tvtmp = _mm_srli_si128(vmax, 4);\n\t\tvmax = _mm_max_epu8(vmax, vtmp);\n\t\tvtmp = _mm_srli_si128(vmax, 2);\n\t\tvmax = _mm_max_epu8(vmax, vtmp);\n\t\tvtmp = _mm_srli_si128(vmax, 1);\n\t\tvmax = _mm_max_epu8(vmax, vtmp);\n\t\tint score = _mm_extract_epi16(vmax, 0);\n\t\tscore = score & 0x00ff;\n#ifndef NDEBUG\n\t\t{\n\t\t\t// Start in upper vector row and move down\n\t\t\tTAlScore max = 0;\n\t\t\t__m128i *pvH = d.mat_.hvec(0, j);\n\t\t\tfor(size_t i = 0; i < iter; i++) {\n\t\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\t\tTAlScore sc = (TAlScore)((TCScore*)pvH)[k];\n\t\t\t\t\tif(sc > max) {\n\t\t\t\t\t\tmax = sc;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tpvH += ROWSTRIDE;\n\t\t\t}\n\t\t\tassert_eq(max, score);\n\t\t}\n#endif\n\t\tif((TAlScore)score < minsc_) {\n\t\t\t// Scores in column aren't good enough\n\t\t\tcontinue;\n\t\t}\n\t\t// Get pointer to first cell in column to examine:\n\t\t__m128i *pvHorig = d.mat_.hvec(0, j);\n\t\t__m128i *pvH     = pvHorig;\n\t\t// Get pointer to the vector in the following column that corresponds\n\t\t// to the cells diagonally down and to the right from the 
cells in pvH\n\t\t__m128i *pvHSucc = (j < ncol-1) ? d.mat_.hvec(0, j+1) : NULL;\n\t\t// Start in upper vector row and move down\n\t\tfor(size_t i = 0; i < iter; i++) {\n\t\t\tif(pvHSucc != NULL) {\n\t\t\t\tpvHSucc += ROWSTRIDE;\n\t\t\t\tif(i == iter-1) {\n\t\t\t\t\tpvHSucc = d.mat_.hvec(0, j+1);\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Which elements of this vector are exhaustively scored?\n\t\t\tsize_t rdoff = i;\n\t\t\tfor(size_t k = 0; k < NWORDS_PER_REG; k++) {\n\t\t\t\t// Is this row, col one that we can potentially backtrace from?\n\t\t\t\t// I.e. are we close enough to a core diagonal?\n\t\t\t\tif(rdoff >= nrow_lo && rdoff < nrow_hi) {\n\t\t\t\t\t// This cell has been exhaustively scored\n\t\t\t\t\tif(rdoff >= minrow) {\n\t\t\t\t\t\t// ... and it could potentially score high enough\n\t\t\t\t\t\tTAlScore sc = (TAlScore)((TCScore*)pvH)[k];\n\t\t\t\t\t\tassert_leq(sc, best);\n\t\t\t\t\t\tif(sc >= minsc_) {\n\t\t\t\t\t\t\t// This is a potential solution\n\t\t\t\t\t\t\tbool matchSucc = false;\n\t\t\t\t\t\t\tint readc = (*rd_)[rdoff];\n\t\t\t\t\t\t\tint refc = rf_[j + rfi_];\n\t\t\t\t\t\t\tbool match = ((refc & (1 << readc)) != 0);\n\t\t\t\t\t\t\tif(rdoff < dpRows()-1) {\n\t\t\t\t\t\t\t\tint readcSucc = (*rd_)[rdoff+1];\n\t\t\t\t\t\t\t\tint refcSucc = rf_[j + rfi_ + 1];\n\t\t\t\t\t\t\t\tassert_range(0, 16, refcSucc);\n\t\t\t\t\t\t\t\tmatchSucc = ((refcSucc & (1 << readcSucc)) != 0);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif(match && !matchSucc) {\n\t\t\t\t\t\t\t\t// Yes, this is legit\n\t\t\t\t\t\t\t\tmet.gathsol++;\n\t\t\t\t\t\t\t\tbtncand_.expand();\n\t\t\t\t\t\t\t\tbtncand_.back().init(rdoff, j, sc);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\t// Already saw every element in the vector that's been\n\t\t\t\t\t// exhaustively scored\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\trdoff += iter;\n\t\t\t}\n\t\t\tpvH += ROWSTRIDE;\n\t\t}\n\t}\n\tif(!btncand_.empty()) {\n\t\td.mat_.initMasks();\n\t}\n\treturn !btncand_.empty();\n}\n\n#define MOVE_VEC_PTR_UP(vec, 
rowvec, rowelt) { \\\n\tif(rowvec == 0) { \\\n\t\trowvec += d.mat_.nvecrow_; \\\n\t\tvec += d.mat_.colstride_; \\\n\t\trowelt--; \\\n\t} \\\n\trowvec--; \\\n\tvec -= ROWSTRIDE; \\\n}\n\n#define MOVE_VEC_PTR_LEFT(vec, rowvec, rowelt) { vec -= d.mat_.colstride_; }\n\n#define MOVE_VEC_PTR_UPLEFT(vec, rowvec, rowelt) { \\\n \tMOVE_VEC_PTR_UP(vec, rowvec, rowelt); \\\n \tMOVE_VEC_PTR_LEFT(vec, rowvec, rowelt); \\\n}\n\n#define MOVE_ALL_LEFT() { \\\n\tMOVE_VEC_PTR_LEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UP() { \\\n\tMOVE_VEC_PTR_UP(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UP(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UP(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define MOVE_ALL_UPLEFT() { \\\n\tMOVE_VEC_PTR_UPLEFT(cur_vec, rowvec, rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(left_vec, left_rowvec, left_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(up_vec, up_rowvec, up_rowelt); \\\n\tMOVE_VEC_PTR_UPLEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n#define NEW_ROW_COL(row, col) { \\\n\trowelt = row / d.mat_.nvecrow_; \\\n\trowvec = row % d.mat_.nvecrow_; \\\n\teltvec = (col * d.mat_.colstride_) + (rowvec * ROWSTRIDE); \\\n\tcur_vec = d.mat_.matbuf_.ptr() + eltvec; \\\n\tleft_vec = cur_vec; \\\n\tleft_rowelt = rowelt; \\\n\tleft_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_LEFT(left_vec, left_rowvec, left_rowelt); \\\n\tup_vec = cur_vec; \\\n\tup_rowelt = rowelt; \\\n\tup_rowvec = rowvec; \\\n\tMOVE_VEC_PTR_UP(up_vec, up_rowvec, up_rowelt); \\\n\tupleft_vec = up_vec; \\\n\tupleft_rowelt = up_rowelt; \\\n\tupleft_rowvec = up_rowvec; \\\n\tMOVE_VEC_PTR_LEFT(upleft_vec, upleft_rowvec, upleft_rowelt); \\\n}\n\n/**\n * Given the dynamic programming table and a cell, trace backwards from the\n * cell and install the edits and 
score/penalty in the appropriate fields\n * of SwResult res, which contains an AlnRes.  The RandomSource is used to\n * break ties among equally good ways of tracing back.\n *\n * Upon entering a cell, we check if the read/ref coordinates of the cell\n * correspond to a cell we traversed constructing a previous alignment.  If so,\n * we backtrack to the last decision point, mask out the path that led to the\n * previously observed cell, and continue along a different path; or, if there\n * are no more paths to try, we give up.\n *\n * An alignment found is subject to a filtering step designed to remove\n * alignments that could spuriously trump a better alignment falling partially\n * outside the rectangle.\n *\n *          1\n *      67890123456   0: seed diagonal\n *      **OO0oo----   o: right-hand \"gap\" diagonals: band of 'maxgap' diags\n *      -**OO0oo---   O: left-hand \"gap\" diagonals: band of 'maxgap' diags\n *      --**OO0oo--   *: \"extra\" diagonals: additional band of 'maxgap' diags\n *      ---**OO0oo-   +: cells not in any of the above \n *      ----**OO0oo\n *            |-|\n *   Gotta touch one of these diags\n *\n * Basically, the filtering step removes alignments that do not at some point\n * touch a cell labeled '0' or 'O' in the diagram above.\n *\n */\nbool SwAligner::backtraceNucleotidesLocalSseU8(\n\tTAlScore       escore, // in: expected score\n\tSwResult&      res,    // out: store results (edits and scores) here\n\tsize_t&        off,    // out: store diagonal projection of origin\n\tsize_t&        nbts,   // out: # backtracks\n\tsize_t         row,    // start in this row\n\tsize_t         col,    // start in this column\n\tRandomSource&  rnd)    // random gen, to choose among equal paths\n{\n\tassert_lt(row, dpRows());\n\tassert_lt(col, (size_t)(rff_ - rfi_));\n\tSSEData& d = fw_ ? sseU8fw_ : sseU8rc_;\n\tSSEMetrics& met = extend_ ? 
sseU8ExtendMet_ : sseU8MateMet_;\n\tmet.bt++;\n\tassert(!d.profbuf_.empty());\n\tassert_lt(row, rd_->length());\n\tbtnstack_.clear(); // empty the backtrack stack\n\tbtcells_.clear();  // empty the cells-so-far list\n\tAlnScore score; score.score_ = 0;\n\t// score.gaps_ = score.ns_ = 0;\n\tASSERT_ONLY(size_t origCol = col);\n\tsize_t gaps = 0, readGaps = 0, refGaps = 0;\n\tres.alres.reset();\n    EList<Edit>& ned = res.alres.ned();\n\tassert(ned.empty());\n\tassert_gt(dpRows(), row);\n\tASSERT_ONLY(size_t trimEnd = dpRows() - row - 1);\n\tsize_t trimBeg = 0;\n\tsize_t ct = SSEMatrix::H; // cell type\n\t// Row and col in terms of where they fall in the SSE vector matrix\n\tsize_t rowelt, rowvec, eltvec;\n\tsize_t left_rowelt, up_rowelt, upleft_rowelt;\n\tsize_t left_rowvec, up_rowvec, upleft_rowvec;\n\t__m128i *cur_vec, *left_vec, *up_vec, *upleft_vec;\n\tNEW_ROW_COL(row, col);\n\twhile((int)row >= 0) {\n\t\tmet.btcell++;\n\t\tnbts++;\n\t\tint readc = (*rd_)[rdi_ + row];\n\t\tint refm  = (int)rf_[rfi_ + col];\n\t\tint readq = (*qu_)[row];\n\t\tassert_leq(col, origCol);\n\t\t// Get score in this cell\n\t\tbool empty = false, reportedThru, canMoveThru, branch = false;\n\t\tint cur = SSEMatrix::H;\n\t\tif(!d.mat_.reset_[row]) {\n\t\t\td.mat_.resetRow(row);\n\t\t}\n\t\treportedThru = d.mat_.reportedThrough(row, col);\n\t\tcanMoveThru = true;\n\t\tif(reportedThru) {\n\t\t\tcanMoveThru = false;\n\t\t} else {\n\t\t\tempty = false;\n\t\t\tif(row > 0) {\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tsize_t rowFromEnd = d.mat_.nrow() - row - 1;\n\t\t\t\tbool gapsAllowed = true;\n\t\t\t\tif(row < (size_t)sc_->gapbar ||\n\t\t\t\t   rowFromEnd < (size_t)sc_->gapbar)\n\t\t\t\t{\n\t\t\t\t\tgapsAllowed = false;\n\t\t\t\t}\n\t\t\t\tconst int floorsc = 0;\n\t\t\t\tconst int offsetsc = 0;\n\t\t\t\t// Move to beginning of column/row\n\t\t\t\tif(ct == SSEMatrix::E) { // AKA rdgap\n\t\t\t\t\tassert_gt(col, 0);\n\t\t\t\t\tTAlScore sc_cur = ((TCScore*)(cur_vec + SSEMatrix::E))[rowelt] + 
offsetsc;\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\t// Currently in the E matrix; incoming transition must come from the\n\t\t\t\t\t// left.  It's either a gap open from the H matrix or a gap extend from\n\t\t\t\t\t// the E matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell to the left\n\t\t\t\t\tTAlScore sc_h_left = ((TCScore*)(left_vec + SSEMatrix::H))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_h_left > 0 && sc_h_left - sc_->readGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get E score of cell to the left\n\t\t\t\t\tTAlScore sc_e_left = ((TCScore*)(left_vec + SSEMatrix::E))[left_rowelt] + offsetsc;\n\t\t\t\t\tif(sc_e_left > 0 && sc_e_left - sc_->readGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isEMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 8) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// Pick H -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 2); // might choose E later\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// Pick E -> E cell\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.eMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the E cell\n\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\td.mat_.eMaskSet(row, col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = 
true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else if(ct == SSEMatrix::F) { // AKA rfgap\n\t\t\t\t\tassert_gt(row, 0);\n\t\t\t\t\tassert(gapsAllowed);\n\t\t\t\t\tTAlScore sc_h_up = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_cur  = ((TCScore*)(cur_vec + SSEMatrix::F))[rowelt] + offsetsc;\n\t\t\t\t\t// Currently in the F matrix; incoming transition must come from above.\n\t\t\t\t\t// It's either a gap open from the H matrix or a gap extend from the F\n\t\t\t\t\t// matrix.\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\t// Get H score of cell above\n\t\t\t\t\tif(sc_h_up > floorsc && sc_h_up - sc_->refGapOpen() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t}\n\t\t\t\t\t// Get F score of cell above\n\t\t\t\t\tif(sc_f_up > floorsc && sc_f_up - sc_->refGapExtend() == sc_cur) {\n\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isFMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 11) & 3;\n\t\t\t\t\t}\n\t\t\t\t\tif(mask == 3) {\n#if 1\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n#else\n\t\t\t\t\t\tif(rnd.nextU2()) {\n\t\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 2); // might choose F later\n\t\t\t\t\t\t} else 
{\n\t\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t\td.mat_.fMaskSet(row, col, 1); // might choose H later\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else if(mask == 2) {\n\t\t\t\t\t\t// I chose the F cell\n\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else if(mask == 1) {\n\t\t\t\t\t\t// I chose the H cell\n\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\td.mat_.fMaskSet(row, col, 0); // done\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t\tassert(!empty || !canMoveThru);\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\t\t\tTAlScore sc_cur      = ((TCScore*)(cur_vec + SSEMatrix::H))[rowelt]    + offsetsc;\n\t\t\t\t\tTAlScore sc_f_up     = ((TCScore*)(up_vec  + SSEMatrix::F))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_up     = ((TCScore*)(up_vec  + SSEMatrix::H))[up_rowelt] + offsetsc;\n\t\t\t\t\tTAlScore sc_h_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::H))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_e_left   = col > 0 ? (((TCScore*)(left_vec   + SSEMatrix::E))[left_rowelt]   + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_h_upleft = col > 0 ? 
(((TCScore*)(upleft_vec + SSEMatrix::H))[upleft_rowelt] + offsetsc) : floorsc;\n\t\t\t\t\tTAlScore sc_diag     = sc_->score(readc, refm, readq - 33);\n\t\t\t\t\t// TODO: save and restore origMask as well as mask\n\t\t\t\t\tint origMask = 0, mask = 0;\n\t\t\t\t\tif(gapsAllowed) {\n\t\t\t\t\t\tif(sc_h_up     > floorsc && sc_cur == sc_h_up   - sc_->refGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 0);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_h_left   > floorsc && sc_cur == sc_h_left - sc_->readGapOpen()) {\n\t\t\t\t\t\t\tmask |= (1 << 1);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_f_up     > floorsc && sc_cur == sc_f_up   - sc_->refGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 2);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(sc_e_left   > floorsc && sc_cur == sc_e_left - sc_->readGapExtend()) {\n\t\t\t\t\t\t\tmask |= (1 << 3);\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tif(sc_h_upleft > floorsc && sc_cur == sc_h_upleft + sc_diag) {\n\t\t\t\t\t\tmask |= (1 << 4);\n\t\t\t\t\t}\n\t\t\t\t\torigMask = mask;\n\t\t\t\t\tassert(origMask > 0 || sc_cur <= sc_->match());\n\t\t\t\t\tif(d.mat_.isHMaskSet(row, col)) {\n\t\t\t\t\t\tmask = (d.mat_.masks_[row][col] >> 2) & 31;\n\t\t\t\t\t}\n\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\tint opts = alts5[mask];\n\t\t\t\t\tint select = -1;\n\t\t\t\t\tif(opts == 1) {\n\t\t\t\t\t\tselect = firsts5[mask];\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, 0);\n\t\t\t\t\t} else if(opts > 1) {\n#if 1\n\t\t\t\t\t\tif(       (mask & 16) != 0) {\n\t\t\t\t\t\t\tselect = 4; // H diag\n\t\t\t\t\t\t} else if((mask & 1) != 0) {\n\t\t\t\t\t\t\tselect = 0; // H up\n\t\t\t\t\t\t} else if((mask & 4) != 0) {\n\t\t\t\t\t\t\tselect = 2; // F up\n\t\t\t\t\t\t} else if((mask & 2) != 0) {\n\t\t\t\t\t\t\tselect = 1; // H left\n\t\t\t\t\t\t} else if((mask & 8) != 0) {\n\t\t\t\t\t\t\tselect = 3; // E left\n\t\t\t\t\t\t}\n#else\n\t\t\t\t\t\tselect = randFromMask(rnd, mask);\n#endif\n\t\t\t\t\t\tassert_geq(mask, 0);\n\t\t\t\t\t\tmask &= ~(1 << 
select);\n\t\t\t\t\t\tassert(gapsAllowed || mask == (1 << 4) || mask == 0);\n\t\t\t\t\t\td.mat_.hMaskSet(row, col, mask);\n\t\t\t\t\t\tbranch = true;\n\t\t\t\t\t} else { /* No way to backtrack! */ }\n\t\t\t\t\tif(select != -1) {\n\t\t\t\t\t\tif(select == 4) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_DIAG;\n\t\t\t\t\t\t} else if(select == 0) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_REF_OPEN;\n\t\t\t\t\t\t} else if(select == 1) {\n\t\t\t\t\t\t\tcur = SW_BT_OALL_READ_OPEN;\n\t\t\t\t\t\t} else if(select == 2) {\n\t\t\t\t\t\t\tcur = SW_BT_RFGAP_EXTEND;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tassert_eq(3, select)\n\t\t\t\t\t\t\tcur = SW_BT_RDGAP_EXTEND;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tempty = true;\n\t\t\t\t\t\t// It's empty, so the only question left is whether we should be\n\t\t\t\t\t\t// allowed to terminate in this cell.  If it's got a valid score\n\t\t\t\t\t\t// then we *shouldn't* be allowed to terminate here because that\n\t\t\t\t\t\t// means it's part of a larger alignment that was already reported.\n\t\t\t\t\t\tcanMoveThru = (origMask == 0);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tassert(!empty || !canMoveThru || ct == SSEMatrix::H);\n\t\t\t}\n\t\t}\n\t\td.mat_.setReportedThrough(row, col);\n\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t// Cell was involved in a previously-reported alignment?\n\t\tif(!canMoveThru) {\n\t\t\tif(!btnstack_.empty()) {\n\t\t\t\t// Remove all the cells from list back to and including the\n\t\t\t\t// cell where the branch occurred\n\t\t\t\tbtcells_.resize(btnstack_.back().celsz);\n\t\t\t\t// Pop record off the top of the stack\n\t\t\t\tned.resize(btnstack_.back().nedsz);\n\t\t\t\t//aed.resize(btnstack_.back().aedsz);\n\t\t\t\trow      = btnstack_.back().row;\n\t\t\t\tcol      = btnstack_.back().col;\n\t\t\t\tgaps     = btnstack_.back().gaps;\n\t\t\t\treadGaps = btnstack_.back().readGaps;\n\t\t\t\trefGaps  = btnstack_.back().refGaps;\n\t\t\t\tscore    = btnstack_.back().score;\n\t\t\t\tct       = 
btnstack_.back().ct;\n\t\t\t\tbtnstack_.pop_back();\n\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\tNEW_ROW_COL(row, col);\n\t\t\t\tcontinue;\n\t\t\t} else {\n\t\t\t\t// No branch points to revisit; just give up\n\t\t\t\tres.reset();\n\t\t\t\tmet.btfail++; // DP backtraces failed\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\tassert(!reportedThru);\n\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\tif(empty || row == 0) {\n\t\t\tassert_eq(SSEMatrix::H, ct);\n\t\t\tbtcells_.expand();\n\t\t\tbtcells_.back().first = row;\n\t\t\tbtcells_.back().second = col;\n\t\t\t// This cell is at the end of a legitimate alignment\n\t\t\ttrimBeg = row;\n\t\t\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t\t\tbreak;\n\t\t}\n\t\tif(branch) {\n\t\t\t// Add a frame to the backtrack stack\n\t\t\tbtnstack_.expand();\n\t\t\tbtnstack_.back().init(\n\t\t\t\tned.size(),\n\t\t\t\t0,               // aed.size()\n\t\t\t\tbtcells_.size(),\n\t\t\t\trow,\n\t\t\t\tcol,\n\t\t\t\tgaps,\n\t\t\t\treadGaps,\n\t\t\t\trefGaps,\n\t\t\t\tscore,\n\t\t\t\t(int)ct);\n\t\t}\n\t\tbtcells_.expand();\n\t\tbtcells_.back().first = row;\n\t\tbtcells_.back().second = col;\n\t\tswitch(cur) {\n\t\t\t// Move up and to the left.  
If the reference nucleotide in the\n\t\t\t// source row mismatches the read nucleotide, penalize\n\t\t\t// it and add a nucleotide mismatch.\n\t\t\tcase SW_BT_OALL_DIAG: {\n\t\t\t\tassert_gt(row, 0); assert_gt(col, 0);\n\t\t\t\t// Check for color mismatch\n\t\t\t\tint readC = (*rd_)[row];\n\t\t\t\tint refNmask = (int)rf_[rfi_+col];\n\t\t\t\tassert_gt(refNmask, 0);\n\t\t\t\tint m = matchesEx(readC, refNmask);\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tif(m != 1) {\n\t\t\t\t\tEdit e(\n\t\t\t\t\t\t(int)row,\n\t\t\t\t\t\tmask2dna[refNmask],\n\t\t\t\t\t\t\"ACGTN\"[readC],\n\t\t\t\t\t\tEDIT_TYPE_MM);\n\t\t\t\t\tassert(e.repOk());\n\t\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\t\tned.push_back(e);\n\t\t\t\t\tint pen = QUAL2(row, col);\n\t\t\t\t\tscore.score_ -= pen;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t} else {\n\t\t\t\t\t// Reward a match\n\t\t\t\t\tint64_t bonus = sc_->match(30);\n\t\t\t\t\tscore.score_ += bonus;\n\t\t\t\t\tassert(!sc_->monotone || score.score() >= escore);\n\t\t\t\t}\n\t\t\t\tif(m == -1) {\n\t\t\t\t\t// score.ns_++;\n\t\t\t\t}\n\t\t\t\trow--; col--;\n\t\t\t\tMOVE_ALL_UPLEFT();\n\t\t\t\tassert(VALID_AL_SCORE(score));\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  
Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_OALL_REF_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(row, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->refGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\t// Move up.  Add an edit encoding the ref gap.\n\t\t\tcase SW_BT_RFGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(row, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row,\n\t\t\t\t\t'-',\n\t\t\t\t\t\"ACGTN\"[(int)(*rd_)[row]],\n\t\t\t\t\tEDIT_TYPE_REF_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\trow--;\n\t\t\t\tct = SSEMatrix::F;\n\t\t\t\tint pen = sc_->refGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; refGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_UP();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_OALL_READ_OPEN:\n\t\t\t{\n\t\t\t\tassert_gt(col, 0);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), 
sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::H;\n\t\t\t\tint pen = sc_->readGapOpen();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tcase SW_BT_RDGAP_EXTEND:\n\t\t\t{\n\t\t\t\tassert_gt(col, 1);\n\t\t\t\tEdit e(\n\t\t\t\t\t(int)row+1,\n\t\t\t\t\tmask2dna[(int)rf_[rfi_+col]],\n\t\t\t\t\t'-',\n\t\t\t\t\tEDIT_TYPE_READ_GAP);\n\t\t\t\tassert(e.repOk());\n\t\t\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\t\t\tned.push_back(e);\n\t\t\t\tassert_geq(row, (size_t)sc_->gapbar);\n\t\t\t\tassert_geq((int)(rdf_-rdi_-row-1), sc_->gapbar-1);\n\t\t\t\tcol--;\n\t\t\t\tct = SSEMatrix::E;\n\t\t\t\tint pen = sc_->readGapExtend();\n\t\t\t\tscore.score_ -= pen;\n\t\t\t\tassert(!sc_->monotone || score.score() >= minsc_);\n\t\t\t\tgaps++; readGaps++;\n\t\t\t\tassert_eq(gaps, Edit::numGaps(ned));\n\t\t\t\tassert_leq(gaps, rdgap_ + rfgap_);\n\t\t\t\tMOVE_ALL_LEFT();\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tdefault: throw 1;\n\t\t}\n\t} // while((int)row >= 0)\n\tassert_geq(col, 0);\n\tassert_eq(SSEMatrix::H, ct);\n\t// The number of cells in the backtrace should equal the number of read\n\t// bases after trimming plus the number of read gaps\n\tassert_eq(btcells_.size(), dpRows() - trimBeg - trimEnd + readGaps);\n\t// Check whether we went through a core diagonal and set 'reported' flag on\n\t// each cell\n\tbool overlappedCoreDiag = false;\n\tfor(size_t i = 0; i < btcells_.size(); i++) {\n\t\tsize_t rw = btcells_[i].first;\n\t\tsize_t cl = btcells_[i].second;\n\t\t// Calculate the diagonal within the *trimmed* rectangle, i.e. 
the\n\t\t// rectangle we dealt with in align, gather and backtrack.\n\t\tint64_t diagi = cl - rw;\n\t\t// Now adjust to the diagonal within the *untrimmed* rectangle by\n\t\t// adding on the amount trimmed from the left.\n\t\tdiagi += rect_->triml;\n\t\tif(diagi >= 0) {\n\t\t\tsize_t diag = (size_t)diagi;\n\t\t\tif(diag >= rect_->corel && diag <= rect_->corer) {\n\t\t\t\toverlappedCoreDiag = true;\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n#ifndef NDEBUG\n\t\t//assert(!d.mat_.reportedThrough(rw, cl));\n\t\t//d.mat_.setReportedThrough(rw, cl);\n\t\tassert(d.mat_.reportedThrough(rw, cl));\n#endif\n\t}\n\tif(!overlappedCoreDiag) {\n\t\t// Must overlap a core diagonal.  Otherwise, we run the risk of\n\t\t// reporting an alignment that overlaps (and trumps) a higher-scoring\n\t\t// alignment that lies partially outside the dynamic programming\n\t\t// rectangle.\n\t\tres.reset();\n\t\tmet.corerej++;\n\t\treturn false;\n\t}\n\tint readC = (*rd_)[rdi_+row];      // get last char in read\n\tint refNmask = (int)rf_[rfi_+col]; // get last ref char ref involved in aln\n\tassert_gt(refNmask, 0);\n\tint m = matchesEx(readC, refNmask);\n\tif(m != 1) {\n\t\tEdit e((int)row, mask2dna[refNmask], \"ACGTN\"[readC], EDIT_TYPE_MM);\n\t\tassert(e.repOk());\n\t\tassert(ned.empty() || ned.back().pos >= row);\n\t\tned.push_back(e);\n\t\tscore.score_ -= QUAL2(row, col);\n\t\tassert_geq(score.score(), minsc_);\n\t} else {\n\t\tscore.score_ += sc_->match(30);\n\t}\n\tif(m == -1) {\n\t\t// score.ns_++;\n\t}\n#if 0\n\tif(score.ns_ > nceil_) {\n\t\t// Alignment has too many Ns in it!\n\t\tres.reset();\n\t\tmet.nrej++;\n\t\treturn false;\n\t}\n#endif\n\tres.reverse();\n\tassert(Edit::repOk(ned, (*rd_)));\n\tassert_eq(score.score(), escore);\n\tassert_leq(gaps, rdgap_ + rfgap_);\n\toff = col;\n\tassert_lt(col + (size_t)rfi_, (size_t)rff_);\n\t// score.gaps_ = gaps;\n\tres.alres.setScore(score);\n#if 0\n\tres.alres.setShape(\n\t\trefidx_,                  // ref id\n\t\toff + rfi_ + rect_->refl, // 0-based 
ref offset\n\t\treflen_,                  // reference length\n\t\tfw_,                      // aligned to Watson?\n\t\trdf_ - rdi_,              // read length\n\t\ttrue,                     // pretrim soft?\n\t\t0,                        // pretrim 5' end\n\t\t0,                        // pretrim 3' end\n\t\ttrue,                     // alignment trim soft?\n\t\tfw_ ? trimBeg : trimEnd,  // alignment trim 5' end\n\t\tfw_ ? trimEnd : trimBeg); // alignment trim 3' end\n\tsize_t refns = 0;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\tif((int)rf_[rfi_+i] > 15) {\n\t\t\trefns++;\n\t\t}\n\t}\n#endif\n\t// res.alres.setRefNs(refns);\n\tassert(Edit::repOk(ned, (*rd_), true, trimBeg, trimEnd));\n\tassert(res.repOk());\n#ifndef NDEBUG\n\tsize_t gapsCheck = 0;\n\tfor(size_t i = 0; i < ned.size(); i++) {\n\t\tif(ned[i].isGap()) gapsCheck++;\n\t}\n\tassert_eq(gaps, gapsCheck);\n\tBTDnaString refstr;\n\tfor(size_t i = col; i <= origCol; i++) {\n\t\trefstr.append(firsts5[(int)rf_[rfi_+i]]);\n\t}\n\tBTDnaString editstr;\n\tEdit::toRef((*rd_), ned, editstr, true, trimBeg, trimEnd);\n\tif(refstr != editstr) {\n\t\tcerr << \"Decoded nucleotides and edits don't match reference:\" << endl;\n\t\tcerr << \"           score: \" << score.score()\n\t\t     << \" (\" << gaps << \" gaps)\" << endl;\n\t\tcerr << \"           edits: \";\n\t\tEdit::print(cerr, ned);\n\t\tcerr << endl;\n\t\tcerr << \"    decoded nucs: \" << (*rd_) << endl;\n\t\tcerr << \"     edited nucs: \" << editstr << endl;\n\t\tcerr << \"  reference nucs: \" << refstr << endl;\n\t\tassert(0);\n\t}\n#endif\n\tmet.btsucc++; // DP backtraces succeeded\n\treturn true;\n}\n"
  },
  {
    "path": "aln_sink.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALN_SINK_H_\n#define ALN_SINK_H_\n\n#include <limits>\n#include <utility>\n#include <map>\n#include \"read.h\"\n#include \"ds.h\"\n#include \"simple_func.h\"\n#include \"outq.h\"\n#include \"aligner_result.h\"\n#include \"hyperloglogplus.h\"\n#include \"timer.h\"\n#include \"taxonomy.h\"\n\n\n// Forward decl\ntemplate <typename index_t>\nclass SeedResults;\n\nenum {\n\tOUTPUT_SAM = 1\n};\n\n\nstruct ReadCounts {\n\tuint32_t n_reads;\n\tuint32_t sum_score;\n\tdouble summed_hit_len;\n\tdouble weighted_reads;\n\tuint32_t n_unique_reads;\n};\n\n/**\n * Metrics summarizing the species level information we have\n */\nstruct SpeciesMetrics {\n    \n    //\n    struct IDs {\n        EList<uint64_t, 5> ids;\n        bool operator<(const IDs& o) const {\n            if(ids.size() != o.ids.size()) return ids.size() < o.ids.size();\n            for(size_t i = 0; i < ids.size(); i++) {\n                assert_lt(i, o.ids.size());\n                if(ids[i] != o.ids[i]) return ids[i] < o.ids[i];\n            }\n            return false;\n        }\n        \n        IDs& operator=(const IDs& other) {\n            if(this == &other)\n                return *this;\n            \n            ids = other.ids;\n          
  return *this;\n        }\n    };\n\n\n\tSpeciesMetrics():mutex_m() {\n\t    reset();\n\t}\n\n\tvoid reset() {\n\t\tspecies_counts.clear();\n\t\t//for(map<uint32_t, HyperLogLogPlusMinus<uint64_t> >::iterator it = this->species_kmers.begin(); it != this->species_kmers.end(); ++it) {\n\t\t//\tit->second.reset();\n\t\t//} //TODO: is this required?\n\t\tspecies_kmers.clear();\n        num_non_leaves = 0;\n\t}\n\n\tvoid init(\n              const map<uint64_t, ReadCounts>& species_counts_,\n              const map<uint64_t, HyperLogLogPlusMinus<uint64_t> >& species_kmers_,\n              const map<IDs, uint64_t>& observed_)\n\t{\n\t\tspecies_counts = species_counts_;\n\t\tspecies_kmers = species_kmers_;\n        observed = observed_;\n        num_non_leaves = 0;\n    }\n\n\t/**\n\t * Merge (add) the counters in the given ReportingMetrics object\n\t * into this object.  This is the only safe way to update a\n\t * ReportingMetrics shared by multiple threads.\n\t */\n\tvoid merge(const SpeciesMetrics& met, bool getLock = false) {\n        ThreadSafe ts(&mutex_m, getLock);\n\n        // update species read count\n        for(map<uint64_t, ReadCounts>::const_iterator it = met.species_counts.begin(); it != met.species_counts.end(); ++it) {\n        \tif (species_counts.find(it->first) == species_counts.end()) {\n        \t\tspecies_counts[it->first] = it->second;\n        \t} else {\n        \t\tspecies_counts[it->first].n_reads += it->second.n_reads;\n        \t\tspecies_counts[it->first].sum_score += it->second.sum_score;\n        \t\tspecies_counts[it->first].summed_hit_len += it->second.summed_hit_len;\n        \t\tspecies_counts[it->first].weighted_reads += it->second.weighted_reads;\n        \t\tspecies_counts[it->first].n_unique_reads += it->second.n_unique_reads;\n        \t}\n        }\n\n        // update species k-mers\n        for(map<uint64_t, HyperLogLogPlusMinus<uint64_t> >::const_iterator it = met.species_kmers.begin(); it != met.species_kmers.end(); ++it) 
{\n        \tspecies_kmers[it->first].merge(&(it->second));\n        }\n\n        for(map<IDs, uint64_t>::const_iterator itr = met.observed.begin(); itr != met.observed.end(); itr++) {\n            const IDs& ids = itr->first;\n            uint64_t count = itr->second;\n            \n            if(observed.find(ids) == observed.end()) {\n                observed[ids] = count;\n            } else {\n                observed[ids] += count;\n            }\n        }\n    }\n\n\tvoid addSpeciesCounts(\n                          uint64_t taxID,\n                          int64_t score,\n                          int64_t max_score,\n                          double summed_hit_len,\n                          double weighted_read,\n                          uint32_t nresult)\n    {\n\t\tspecies_counts[taxID].n_reads += 1;\n\t\tspecies_counts[taxID].sum_score += 1;\n\t\tspecies_counts[taxID].weighted_reads += weighted_read;\n\t\tspecies_counts[taxID].summed_hit_len += summed_hit_len;\n\t\tif(nresult == 1) {\n\t\t\tspecies_counts[taxID].n_unique_reads += 1;\n\t\t}\n\n        // Only consider good hits for abundance analysis\n        // DK - for debugging purposes\n        if(score >= max_score) {\n            cur_ids.ids.push_back(taxID);\n            if(cur_ids.ids.size() == nresult) {\n                cur_ids.ids.sort();\n                if(observed.find(cur_ids) == observed.end()) {\n                    observed[cur_ids] = 1;\n                } else {\n                    observed[cur_ids] += 1;\n                }\n                cur_ids.ids.clear();\n            }\n        }\n\t}\n\n\tvoid addAllKmers(\n                     uint64_t taxID,\n                     const BTDnaString &btdna,\n                     size_t begin,\n                     size_t len) {\n#ifdef FLORIAN_DEBUG\n\t\tcerr << \"add all kmers for \" << taxID << \" from \" << begin << \" for \" << len << \": \" << string(btdna.toZBuf()).substr(begin,len) << endl;\n#endif\n\t\tuint64_t kmer = 
btdna.int_kmer<uint64_t>(begin,begin+len);\n\t\tspecies_kmers[taxID].add(kmer);\n\t\tsize_t i = begin;\n\t\twhile (i+32 < len) {\n\t\t\tkmer = btdna.next_kmer(kmer,i);\n\t\t\tspecies_kmers[taxID].add(kmer);\n\t\t\t++i;\n\t\t}\n\t}\n\n\tsize_t nDistinctKmers(uint64_t taxID) {\n\t\treturn(species_kmers[taxID].cardinality());\n\t}\n    \n    static void EM(\n                   const map<IDs, uint64_t>& observed,\n                   const map<uint64_t, EList<uint64_t> >& ancestors,\n                   const map<uint64_t, uint64_t>& tid_to_num,\n                   const EList<double>& p,\n                   EList<double>& p_next,\n                   const EList<size_t>& len)\n    {\n        assert_eq(p.size(), len.size());\n        \n        // E step\n        p_next.fill(0.0);\n        // for each assigned read set\n        for(map<IDs, uint64_t>::const_iterator itr = observed.begin(); itr != observed.end(); itr++) {\n            const EList<uint64_t, 5>& ids = itr->first.ids; // all ids assigned to the read set\n            uint64_t count = itr->second; // number of reads in the read set\n            double psum = 0.0;\n            for(size_t i = 0; i < ids.size(); i++) {\n                uint64_t tid = ids[i];\n                // Leaves?\n                map<uint64_t, uint64_t>::const_iterator id_itr = tid_to_num.find(tid);\n                if(id_itr != tid_to_num.end()) {\n                    uint64_t num = id_itr->second;\n                    assert_lt(num, p.size());\n                    psum += p[num];\n                } else { // Ancestors\n                    map<uint64_t, EList<uint64_t> >::const_iterator a_itr = ancestors.find(tid);\n                    if(a_itr == ancestors.end())\n                        continue;\n                    const EList<uint64_t>& children = a_itr->second;\n                    for(size_t c = 0; c < children.size(); c++) {\n                        uint64_t c_tid = children[c];\n                        map<uint64_t, 
uint64_t>::const_iterator id_itr = tid_to_num.find(c_tid);\n                        if(id_itr == tid_to_num.end())\n                            continue;\n                        uint64_t c_num = id_itr->second;\n                        psum += p[c_num];\n                    }\n                }\n            }\n\n            if(psum == 0.0) continue;\n            \n            for(size_t i = 0; i < ids.size(); i++) {\n                uint64_t tid = ids[i];\n                // Leaves?\n                map<uint64_t, uint64_t>::const_iterator id_itr = tid_to_num.find(tid);\n                if(id_itr != tid_to_num.end()) {\n                    uint64_t num = id_itr->second;\n                    assert_leq(p[num], psum);\n                    p_next[num] += (count * (p[num] / psum));\n                } else {\n                    map<uint64_t, EList<uint64_t> >::const_iterator a_itr = ancestors.find(tid);\n                    if(a_itr == ancestors.end())\n                        continue;\n                    const EList<uint64_t>& children = a_itr->second;\n                    for(size_t c = 0; c < children.size(); c++) {\n                        uint64_t c_tid = children[c];\n                        map<uint64_t, uint64_t>::const_iterator id_itr = tid_to_num.find(c_tid);\n                        if(id_itr == tid_to_num.end())\n                            continue;\n                        uint64_t c_num = id_itr->second;\n                        p_next[c_num] += (count * (p[c_num] / psum));\n                    }\n                }\n            }\n        }\n        \n        // M step\n        double sum = 0.0;\n        for(size_t i = 0; i < p_next.size(); i++) {\n            sum += (p_next[i] / len[i]);\n        }\n        for(size_t i = 0; i < p_next.size(); i++) {\n            p_next[i] = p_next[i] / len[i] / sum;\n        }\n    }\n    \n    void calculateAbundance(const Ebwt<uint64_t>& ebwt, uint8_t rank)\n    {\n        const map<uint64_t, TaxonomyNode>& tree = 
ebwt.tree();\n        \n        // Find leaves\n        set<uint64_t> leaves;\n        for(map<IDs, uint64_t>::iterator itr = observed.begin(); itr != observed.end(); itr++) {\n            const IDs& ids = itr->first;\n            for(size_t i = 0; i < ids.ids.size(); i++) {\n                uint64_t tid = ids.ids[i];\n                map<uint64_t, TaxonomyNode>::const_iterator tree_itr = tree.find(tid);\n                if(tree_itr == tree.end())\n                    continue;\n                const TaxonomyNode& node = tree_itr->second;\n                if(!node.leaf) {\n                    //if(tax_rank_num[node.rank] > tax_rank_num[rank]) {\n                        continue;\n                    //}\n                }\n                leaves.insert(tree_itr->first);\n            }\n        }\n        \n \n#ifdef DAEHWAN_DEBUG\n        cerr << \"\\t\\tnumber of leaves: \" << leaves.size() << endl;\n#endif\n        \n        // Find all descendants coming from the same ancestor\n        map<uint64_t, EList<uint64_t> > ancestors;\n        for(map<IDs, uint64_t>::iterator itr = observed.begin(); itr != observed.end(); itr++) {\n            const IDs& ids = itr->first;\n            for(size_t i = 0; i < ids.ids.size(); i++) {\n                uint64_t tid = ids.ids[i];\n                if(leaves.find(tid) != leaves.end())\n                    continue;\n                if(ancestors.find(tid) != ancestors.end())\n                    continue;\n                ancestors[tid].clear();\n                for(set<uint64_t> ::const_iterator leaf_itr = leaves.begin(); leaf_itr != leaves.end(); leaf_itr++) {\n                    uint64_t tid2 = *leaf_itr;\n                    assert(tree.find(tid2) != tree.end());\n                    assert(tree.find(tid2)->second.leaf);\n                    uint64_t temp_tid2 = tid2;\n                    while(true) {\n                        map<uint64_t, TaxonomyNode>::const_iterator tree_itr = tree.find(temp_tid2);\n                      
  if(tree_itr == tree.end())\n                            break;\n                        const TaxonomyNode& node = tree_itr->second;\n                        if(tid == node.parent_tid) {\n                            ancestors[tid].push_back(tid2);\n                        }\n                        if(temp_tid2 == node.parent_tid)\n                            break;\n                        temp_tid2 = node.parent_tid;\n                    }\n                }\n                ancestors[tid].sort();\n            }\n        }\n        \n#ifdef DAEHWAN_DEBUG\n        cerr << \"\\t\\tnumber of ancestors: \" << ancestors.size() << endl;\n        for(map<uint64_t, EList<uint64_t> >::const_iterator itr = ancestors.begin(); itr != ancestors.end(); itr++) {\n            uint64_t tid = itr->first;\n            const EList<uint64_t>& children = itr->second;\n            if(children.size() <= 0)\n                continue;\n            map<uint64_t, TaxonomyNode>::const_iterator tree_itr = tree.find(tid);\n            if(tree_itr == tree.end())\n                continue;\n            const TaxonomyNode& node = tree_itr->second;\n            cerr << \"\\t\\t\\t\" << tid << \": \" << children.size() << \"\\t\" << get_tax_rank(node.rank) << endl;\n            cerr << \"\\t\\t\\t\\t\";\n            for(size_t i = 0; i < children.size(); i++) {\n                cerr << children[i];\n                if(i + 1 < children.size())\n                    cerr << \",\";\n                if(i > 10) {\n                    cerr << \" ...\";\n                    break;\n                }\n            }\n            cerr << endl;\n        }\n        \n        uint64_t test_tid = 0, test_tid2 = 0;\n#endif\n        // Lengths of genomes (or contigs)\n        const map<uint64_t, uint64_t>& size_table = ebwt.size();\n        \n        // Initialize probabilities\n        map<uint64_t, uint64_t> tid_to_num; // taxonomic ID to corresponding element of a list\n        EList<double> p;\n        
EList<size_t> len; // genome lengths\n        for(map<IDs, uint64_t>::iterator itr = observed.begin(); itr != observed.end(); itr++) {\n            const IDs& ids = itr->first;\n            uint64_t count = itr->second;\n            for(size_t i = 0; i < ids.ids.size(); i++) {\n                uint64_t tid = ids.ids[i];\n                if(leaves.find(tid) == leaves.end())\n                    continue;\n                \n#ifdef DAEHWAN_DEBUG\n                if((tid == test_tid || tid == test_tid2) &&\n                   count >= 100) {\n                    cerr << tid << \": \" << count << \"\\t\";\n                    for(size_t j = 0; j < ids.ids.size(); j++) {\n                        cerr << ids.ids[j];\n                        if(j + 1 < ids.ids.size())\n                            cerr << \",\";\n                    }\n                    cerr << endl;\n                }\n#endif\n                \n                if(tid_to_num.find(tid) == tid_to_num.end()) {\n                    tid_to_num[tid] = p.size();\n                    p.push_back(1.0 / ids.ids.size() * count);\n                    map<uint64_t, uint64_t>::const_iterator size_itr = size_table.find(tid);\n                    if(size_itr != size_table.end()) {\n                        len.push_back(size_itr->second);\n                    } else {\n                        len.push_back(std::numeric_limits<size_t>::max());\n                    }\n                } else {\n                    uint64_t num = tid_to_num[tid];\n                    assert_lt(num, p.size());\n                    p[num] += (1.0 / ids.ids.size() * count);\n                }\n            }\n        }\n        \n        assert_eq(p.size(), len.size());\n        \n        {\n            double sum = 0.0;\n            for(size_t i = 0; i < p.size(); i++) {\n                sum += (p[i] / len[i]);\n            }\n            for(size_t i = 0; i < p.size(); i++) {\n                p[i] = (p[i] / len[i]) / sum;\n            }\n       
 }\n        \n        EList<double> p_next; p_next.resizeExact(p.size());\n        EList<double> p_next2; p_next2.resizeExact(p.size());\n        EList<double> p_r; p_r.resizeExact(p.size());\n        EList<double> p_v; p_v.resizeExact(p.size());\n        size_t num_iteration = 0;\n        double diff = 0.0;\n        while(true) {\n#ifdef DAEHWAN_DEBUG\n            if(num_iteration % 50 == 0) {\n                if(test_tid != 0 || test_tid2 != 0)\n                    cerr << \"iter \" << num_iteration << endl;\n                if(test_tid != 0)\n                    cerr << \"\\t\" << test_tid << \": \" << p[tid_to_num[test_tid]] << endl;\n                if(test_tid2 != 0)\n                    cerr << \"\\t\" << test_tid2 << \": \" << p[tid_to_num[test_tid2]] << endl;\n            }\n#endif\n            \n            // Accelerated version of EM - SQUAREM iteration\n            //    Varadhan, R. & Roland, C. Scand. J. Stat. 35, 335–353 (2008).\n            //    Also, this algorithm is used in Sailfish - http://www.nature.com/nbt/journal/v32/n5/full/nbt.2862.html\n#if 1\n            EM(observed, ancestors, tid_to_num, p, p_next, len);\n            EM(observed, ancestors, tid_to_num, p_next, p_next2, len);\n            double sum_squared_r = 0.0, sum_squared_v = 0.0;\n            for(size_t i = 0; i < p.size(); i++) {\n                p_r[i] = p_next[i] - p[i];\n                sum_squared_r += (p_r[i] * p_r[i]);\n                p_v[i] = p_next2[i] - p_next[i] - p_r[i];\n                sum_squared_v += (p_v[i] * p_v[i]);\n            }\n            if(sum_squared_v > 0.0) {\n                double gamma = -sqrt(sum_squared_r / sum_squared_v);\n                for(size_t i = 0; i < p.size(); i++) {\n                    p_next2[i] = max(0.0, p[i] - 2 * gamma * p_r[i] + gamma * gamma * p_v[i]);\n                }\n                EM(observed, ancestors, tid_to_num, p_next2, p_next, len);\n            }\n            \n#else\n            EM(observed, ancestors, 
tid_to_num, p, p_next, len);\n#endif\n            \n            diff = 0.0;\n            for(size_t i = 0; i < p.size(); i++) {\n                diff += (p[i] > p_next[i] ? p[i] - p_next[i] : p_next[i] - p[i]);\n            }\n            if(diff < 0.0000000001) break;\n            if(++num_iteration >= 10000) break;\n            p = p_next;\n        }\n        \n        cerr << \"Number of iterations in EM algorithm: \" << num_iteration << endl;\n        cerr << \"Probability diff. (P - P_prev) in the last iteration: \" << diff << endl;\n        \n        {\n            // Calculate abundance normalized by genome size\n            abundance_len.clear();\n            double sum = 0.0;\n            for(map<uint64_t, uint64_t>::iterator itr = tid_to_num.begin(); itr != tid_to_num.end(); itr++) {\n                uint64_t tid = itr->first;\n                uint64_t num = itr->second;\n                assert_lt(num, p.size());\n                abundance_len[tid] = p[num];\n                sum += (p[num] * len[num]);\n            }\n            \n            // Calculate abundance without genome size taken into account\n            abundance.clear();\n            for(map<uint64_t, uint64_t>::iterator itr = tid_to_num.begin(); itr != tid_to_num.end(); itr++) {\n                uint64_t tid = itr->first;\n                uint64_t num = itr->second;\n                assert_lt(num, p.size());\n                abundance[tid] = (p[num] * len[num]) / sum;\n            }\n        }\n    }\n\n\tmap<uint64_t, ReadCounts> species_counts;                        // read count per species\n\tmap<uint64_t, HyperLogLogPlusMinus<uint64_t> > species_kmers;    // unique k-mer count per species\n    \n    map<IDs, uint64_t>     observed;\n    IDs                    cur_ids;\n    uint32_t               num_non_leaves;\n    map<uint64_t, double>  abundance;      // abundance without genome size taken into consideration\n    map<uint64_t, double>  abundance_len;  // abundance normalized by 
genome size\n\n\tMUTEX_T mutex_m;\n};\n\n\n/**\n * Metrics summarizing the work done by the reporter and summarizing\n * the number of reads that align, that fail to align, and that align\n * non-uniquely.\n */\nstruct ReportingMetrics {\n\n\tReportingMetrics():mutex_m() {\n\t    reset();\n\t}\n\n\tvoid reset() {\n\t\tinit(0, 0, 0, 0);\n\t}\n\n\tvoid init(\n\t\tuint64_t nread_,\n\t\tuint64_t npaired_,\n\t\tuint64_t nunpaired_,\n\t\tuint64_t nconcord_uni_)\n\t{\n\t\tnread         = nread_;\n\t\t\n\t\tnpaired       = npaired_;\n\t\tnunpaired     = nunpaired_;\n\t\t\n\t\tnconcord_uni  = nconcord_uni_;\n    }\n\t\n\t/**\n\t * Merge (add) the counters in the given ReportingMetrics object\n\t * into this object.  This is the only safe way to update a\n\t * ReportingMetrics shared by multiple threads.\n\t */\n\tvoid merge(const ReportingMetrics& met, bool getLock = false) {\n        ThreadSafe ts(&mutex_m, getLock);\n\t\tnread         += met.nread;\n\n\t\tnpaired       += met.npaired;\n\t\tnunpaired     += met.nunpaired;\n\n\t\tnconcord_uni  += met.nconcord_uni;\n    }\n\n\tuint64_t  nread;         // # reads\n\tuint64_t  npaired;       // # pairs\n\tuint64_t  nunpaired;     // # unpaired reads\n\t\n\t// Paired\n\t\n\t// Concordant\n\tuint64_t  nconcord_uni;  // # pairs with unique concordant alns\n\t\t\n\tMUTEX_T mutex_m;\n};\n\n// Type for expression numbers of hits\ntypedef int64_t THitInt;\n\n/**\n * Parameters affecting reporting of alignments, specifically -k & -a,\n * -m & -M.\n */\nstruct ReportingParams {\n\n\texplicit ReportingParams(THitInt khits_, bool compressed_)\n\t{\n\t\tinit(khits_, compressed_);\n\t}\n\n\tvoid init(THitInt khits_, bool compressed_)\n\t{\n\t\tkhits = khits_;     // -k (or high if -a)\n        if(compressed_) {\n            ihits = max<THitInt>(khits, 5) * 4;\n        } else {\n            ihits = max<THitInt>(khits, 5) * 40;\n        }\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that reporting parameters are internally consistent.\n\t 
*/\n\tbool repOk() const {\n\t\tassert_geq(khits, 1);\n\t\treturn true;\n\t}\n#endif\n\t\n\tinline THitInt mult() const {\n\t\treturn khits;\n\t}\n\n\t// Number of assignments to report\n\tTHitInt khits;\n    \n    // Number of internal assignments\n    THitInt ihits;\n};\n\n/**\n * A state machine keeping track of the number and type of alignments found so\n * far.  Its purpose is to inform the caller as to what stage the alignment is\n * in and what categories of alignment are still of interest.  This information\n * should allow the caller to short-circuit some alignment work.  Another\n * purpose is to tell the AlnSinkWrap how many and what type of alignment to\n * report.\n *\n * TODO: This class does not keep accurate information about what\n * short-circuiting took place.  If a read is identical to a previous read,\n * there should be a way to query this object to determine what work, if any,\n * has to be re-done for the new read.\n */\nclass ReportingState {\n\npublic:\n\n\tenum {\n\t\tNO_READ = 1,        // haven't got a read yet\n\t\tCONCORDANT_PAIRS,   // looking for concordant pairs\n\t\tDONE                // finished looking\n\t};\n\n\t// Flags for different ways we can finish out a category of potential\n\t// alignments.\n\t\n\tenum {\n\t\tEXIT_DID_NOT_EXIT = 1,        // haven't finished\n\t\tEXIT_DID_NOT_ENTER,           // never tried search\t\n\t\tEXIT_SHORT_CIRCUIT_k,         // -k exceeded\n\t\tEXIT_NO_ALIGNMENTS,           // none found\n\t\tEXIT_WITH_ALIGNMENTS          // some found\n\t};\n\t\n\tReportingState(const ReportingParams& p) : p_(p) { reset(); }\n\t\n\t/**\n\t * Set all state to uninitialized defaults.\n\t */\n\tvoid reset() {\n\t\tstate_ = ReportingState::NO_READ;\n\t\tpaired_ = false;\n\t\tnconcord_ = 0;\n\t\tdoneConcord_ = false;\n\t\texitConcord_ = ReportingState::EXIT_DID_NOT_ENTER;\n\t\tdone_ = false;\n\t}\n\t\n\t/**\n\t * Return true iff this ReportingState has been initialized with a call to\n\t * nextRead() since the 
last time reset() was called.\n\t */\n\tbool inited() const { return state_ != ReportingState::NO_READ; }\n\n\t/**\n\t * Initialize state machine with a new read.  The state we start in depends\n\t * on whether it's paired-end or unpaired.\n\t */\n\tvoid nextRead(bool paired);\n\n\t/**\n\t * Caller uses this member function to indicate that one additional\n\t * concordant alignment has been found.\n\t */\n\tbool foundConcordant();\n\n\t/**\n\t * Caller uses this member function to indicate that one additional\n\t * discordant alignment has been found.\n\t */\n\tbool foundUnpaired(bool mate1);\n\t\n\t/**\n\t * Called to indicate that the aligner has finished searching for\n\t * alignments.  This gives us a chance to finalize our state.\n\t *\n\t * TODO: Keep track of short-circuiting information.\n\t */\n\tvoid finish();\n\t\n\t/**\n\t * Populate given counters with the number of various kinds of alignments\n\t * to report for this read.  Concordant alignments are preferable to (and\n\t * mutually exclusive with) discordant alignments, and paired-end\n\t * alignments are preferable to unpaired alignments.\n\t *\n\t * The caller also needs some additional information for the case where a\n\t * pair or unpaired read aligns repetitively.  If the read is paired-end\n\t * and the paired-end has repetitive concordant alignments, that should be\n\t * reported, and 'pairMax' is set to true to indicate this.  
If the read is\n\t * paired-end, does not have any concordant alignments, but does have\n\t * repetitive alignments for one or both mates, then that should be\n\t * reported, and 'unpair1Max' and 'unpair2Max' are set accordingly.\n\t *\n\t * Note that it's possible in the case of a paired-end read for the read to\n\t * have repetitive concordant alignments, but for one mate to have a unique\n\t * unpaired alignment.\n\t */\n\tvoid getReport(uint64_t& nconcordAln) const; // # concordant alignments to report\n\n\t/**\n\t * Return an integer representing the alignment state we're in.\n\t */\n\tinline int state() const { return state_; }\n\t\n\t/**\n\t * If true, there's no need to solve any more dynamic programming problems\n\t * for finding opposite mates.\n\t */\n\tinline bool doneConcordant() const { return doneConcord_; }\n\t\n\t/**\n\t * Return true iff all alignment stages have been exited.\n\t */\n\tinline bool done() const { return done_; }\n\n\tinline uint64_t numConcordant() const { return nconcord_; }\n\n\tinline int exitConcordant() const { return exitConcord_; }\n\n\t/**\n\t * Return ReportingParams object governing this ReportingState.\n\t */\n\tconst ReportingParams& params() const {\n\t\treturn p_;\n\t}\n\nprotected:\n\tconst ReportingParams& p_;  // reporting parameters\n\tint state_;          // state we're currently in\n\tbool paired_;        // true iff read we're currently handling is paired\n\tuint64_t nconcord_;  // # concordants found so far\n\tbool doneConcord_;   // true iff we're no longer interested in concordants\n\tint exitConcord_;    // flag indicating how we exited concordant state\n\tbool done_;          // done with all alignments\n};\n\n/**\n * Global hit sink for hits from the MultiSeed aligner.  Encapsulates\n * all aspects of the MultiSeed aligner hitsink that are global to all\n * threads.  
This includes aspects relating to:\n *\n * (a) synchronized access to the output stream\n * (b) the policy to be enforced by the per-thread wrapper\n *\n * TODO: Implement splitting up of alignments into separate files\n * according to genomic coordinate.\n */\ntemplate <typename index_t>\nclass AlnSink {\n\n\ttypedef EList<std::string> StrList;\n\npublic:\n\n\texplicit AlnSink(\n                     OutputQueue& oq,\n                     const StrList& refnames,\n                     const EList<uint32_t>& tab_fmt_cols,\n                     bool quiet) :\n    oq_(oq),\n    numWrappers_(0),\n    refnames_(refnames),\n    tab_fmt_cols_(tab_fmt_cols),\n    quiet_(quiet)\n\t{\n\t}\n\n\t/**\n\t * Destroy the AlnSink object.\n\t */\n\tvirtual ~AlnSink() { }\n\n\t/**\n\t * Called when the AlnSink is wrapped by a new AlnSinkWrap.  This helps us\n\t * keep track of whether the main lock or any of the per-stream locks will\n\t * be contended by multiple threads.\n\t */\n\tvoid addWrapper() { numWrappers_++; }\n\n\t/**\n\t * Append a single hit to the given output stream.  If\n\t * synchronization is required, append() assumes the caller has\n\t * already grabbed the appropriate lock.\n\t */\n\tvirtual void append(\n\t\tBTString&             o,\n\t\tsize_t                threadId,\n\t\tconst Read           *rd1,\n\t\tconst Read           *rd2,\n\t\tconst TReadId         rdid,\n\t\tAlnRes               *rs1,\n\t\tAlnRes               *rs2,\n\t\tconst AlnSetSumm&     summ,\n\t\tconst PerReadMetrics& prm,\n\t\tSpeciesMetrics& sm,\n\t\tbool report2,\n\t\tsize_t n_results) = 0;\n\n\t/**\n\t * Report a given batch of hits for the given read or read pair.\n\t * Should be called just once per read pair.  Assumes all the\n\t * alignments are paired, split between rs1 and rs2.\n\t *\n\t * The caller hasn't decided which alignments get reported as primary\n\t * or secondary; that's up to the routine.  
Because the caller might\n\t * want to know this, we use the pri1 and pri2 out arguments to\n\t * convey this.\n\t */\n\tvirtual void reportHits(\n\t\tBTString&             o,              // write to this buffer\n\t\tsize_t                threadId,       // which thread am I?\n\t\tconst Read           *rd1,            // mate #1\n\t\tconst Read           *rd2,            // mate #2\n\t\tconst TReadId         rdid,           // read ID\n\t\tconst EList<size_t>&  select1,        // random subset of rd1s\n\t\tconst EList<size_t>*  select2,        // random subset of rd2s\n\t\tEList<AlnRes>        *rs1,            // alignments for mate #1\n\t\tEList<AlnRes>        *rs2,            // alignments for mate #2\n\t\tbool                  maxed,          // true iff -m/-M exceeded\n\t\tconst AlnSetSumm&     summ,           // summary\n\t\tconst PerReadMetrics& prm,            // per-read metrics\n\t\tSpeciesMetrics& sm,             // species metrics\n\t\tbool                  getLock = true) // true iff lock held by caller\n\t{\n\t\tassert(rd1 != NULL || rd2 != NULL);\n\t\tassert(rs1 != NULL || rs2 != NULL);\n\n        for(size_t i = 0; i < select1.size(); i++) {\n            AlnRes* r1 = ((rs1 != NULL) ? &rs1->get(select1[i]) : NULL);\n            AlnRes* r2 = ((rs2 != NULL) ? &rs2->get(select1[i]) : NULL);\n            append(o, threadId, rd1, rd2, rdid, r1, r2, summ, prm, sm, true, select1.size());\n        }\n\t}\n\n\t/**\n\t * Report an unaligned read.  
Typically we do nothing, but we might\n\t * want to print a placeholder when output is chained.\n\t */\n\tvirtual void reportUnaligned(\n\t\tBTString&             o,              // write to this string\n\t\tsize_t                threadId,       // which thread am I?\n\t\tconst Read           *rd1,            // mate #1\n\t\tconst Read           *rd2,            // mate #2\n\t\tconst TReadId         rdid,           // read ID\n\t\tconst AlnSetSumm&     summ,           // summary\n\t\tconst PerReadMetrics& prm,            // per-read metrics\n\t\tbool                  report2,        // report alns for both mates?\n\t\tbool                  getLock = true) // true iff lock held by caller\n\t{\n\t\t// FIXME: reportUnaligned does nothing\n\t\t//append(o, threadId, rd1, rd2, rdid, NULL, NULL, summ, prm, NULL,report2);\n\t}\n\n\t/**\n\t * Print summary of how many reads aligned, failed to align and aligned\n\t * repetitively.  Write it to stderr.  Optionally write Hadoop counter\n\t * updates.\n\t */\n\tvoid printAlSumm(\n\t\tconst ReportingMetrics& met,\n\t\tsize_t repThresh, // threshold for uniqueness, or max if no thresh\n\t\tbool discord,     // looked for discordant alignments\n\t\tbool mixed,       // looked for unpaired alignments where paired failed?\n\t\tbool hadoopOut);  // output Hadoop counters?\n\n\t/**\n\t * Called when all alignments are complete.  
It is assumed that no\n\t * synchronization is necessary.\n\t */\n\tvoid finish(\n\t\tsize_t repThresh,\n\t\tbool discord,\n\t\tbool mixed,\n\t\tbool hadoopOut)\n\t{\n\t\t// Close output streams\n\t\tif(!quiet_) {\n\t\t\tprintAlSumm(\n\t\t\t\tmet_,\n\t\t\t\trepThresh,\n\t\t\t\tdiscord,\n\t\t\t\tmixed,\n\t\t\t\thadoopOut);\n\t\t}\n\t}\n\n#ifndef NDEBUG\n\t/**\n\t * Check that hit sink is internally consistent.\n\t */\n\tbool repOk() const { return true; }\n#endif\n\t\n\t//\n\t// Related to reporting seed hits\n\t//\n\n\t/**\n\t * Given a Read and associated, filled-in SeedResults objects,\n\t * print a record summarizing the seed hits.\n\t */\n\tvoid reportSeedSummary(\n\t\tBTString&          o,\n\t\tconst Read&        rd,\n\t\tTReadId            rdid,\n\t\tsize_t             threadId,\n\t\tconst SeedResults<index_t>& rs,\n\t\tbool               getLock = true);\n\n\t/**\n\t * Given a Read, print an empty record (all 0s).\n\t */\n\tvoid reportEmptySeedSummary(\n\t\tBTString&          o,\n\t\tconst Read&        rd,\n\t\tTReadId            rdid,\n\t\tsize_t             threadId,\n\t\tbool               getLock = true);\n\n\t/**\n\t * Append a batch of unresolved seed alignment results (i.e. 
seed\n\t * alignments where all we know is the reference sequence aligned\n\t * to and its SA range, not where it falls in the reference\n\t * sequence) to the given output stream in Bowtie's seed-alignment\n\t * verbose-mode format.\n\t */\n\tvirtual void appendSeedSummary(\n\t\tBTString&     o,\n\t\tconst Read&   rd,\n\t\tconst TReadId rdid,\n\t\tsize_t        seedsTried,\n\t\tsize_t        nonzero,\n\t\tsize_t        ranges,\n\t\tsize_t        elts,\n\t\tsize_t        seedsTriedFw,\n\t\tsize_t        nonzeroFw,\n\t\tsize_t        rangesFw,\n\t\tsize_t        eltsFw,\n\t\tsize_t        seedsTriedRc,\n\t\tsize_t        nonzeroRc,\n\t\tsize_t        rangesRc,\n\t\tsize_t        eltsRc);\n\n\t/**\n\t * Merge given metrics in with ours by summing all individual metrics.\n\t */\n\tvoid mergeMetrics(const ReportingMetrics& met, bool getLock = true) {\n\t\tmet_.merge(met, getLock);\n\t}\n\n\t/**\n\t * Return mutable reference to the shared OutputQueue.\n\t */\n\tOutputQueue& outq() {\n\t\treturn oq_;\n\t}\n\nprotected:\n    \n\tOutputQueue&       oq_;           // output queue\n\tint                numWrappers_;  // # threads owning a wrapper for this HitSink\n\tconst StrList&     refnames_;     // reference names\n\tconst EList<uint32_t>&     tab_fmt_cols_; // Columns that are printed in tabular format\n\tbool               quiet_;        // true -> don't print alignment stats at the end\n\tReportingMetrics   met_;          // global repository of reporting metrics\n};\n\n/**\n * Per-thread hit sink \"wrapper\" for the MultiSeed aligner.  Encapsulates\n * aspects of the MultiSeed aligner hit sink that are per-thread.  This\n * includes aspects relating to:\n *\n * (a) Enforcement of the reporting policy\n * (b) Tallying of results\n * (c) Storing of results for the previous read in case this allows us to\n *     short-circuit some work for the next read (i.e. 
if it's identical)\n *\n * PHASED ALIGNMENT ASSUMPTION\n *\n * We make some assumptions about how alignment proceeds when we try to\n * short-circuit work for identical reads.  Specifically, we assume that for\n * each read the aligner proceeds in a series of stages (or perhaps just one\n * stage).  In each stage, the aligner either:\n *\n * (a)  Finds no alignments, or\n * (b)  Finds some alignments and short circuits out of the stage with some\n *      random reporting involved (e.g. in -k and/or -M modes), or\n * (c)  Finds all of the alignments in the stage\n *\n * In the event of (a), the aligner proceeds to the next stage and keeps\n * trying; we can skip the stage entirely for the next read if it's identical.\n * In the event of (b) or (c), the aligner stops and does not proceed to\n * further stages.  In the event of (b), if the next read is identical we\n * would like to tell the aligner to start again at the beginning of the stage\n * that was short-circuited.\n *\n * In any event, the rs1_/rs2_/rs1u_/rs2u_ fields contain the alignments found\n * in the last alignment stage attempted.\n *\n * HANDLING REPORTING LIMITS\n *\n * The user can specify reporting limits, like -k (specifies number of\n * alignments to report out of those found) and -M (specifies a ceiling s.t. if\n * there are more alignments than the ceiling, read is called repetitive and\n * best found is reported).  Enforcing these limits is straightforward for\n * unpaired alignments: if a new alignment causes us to exceed the -M ceiling,\n * we can stop looking.\n *\n * The case where both paired-end and unpaired alignments are possible is\n * trickier.  Once we have a number of unpaired alignments that exceeds the\n * ceiling, we can stop looking *for unpaired alignments* - but we can't\n * necessarily stop looking for paired-end alignments, since there may yet be\n * more to find.  However, if the input read is not a pair, then we can stop at\n * this point.  
If the input read is a pair and we have a number of paired\n * alignments that exceeds the -M ceiling, we can stop looking.\n *\n * CONCORDANT & DISCORDANT, PAIRED & UNPAIRED\n *\n * A note on paired-end alignment: Clearly, if an input read is\n * paired-end and we find either concordant or discordant paired-end\n * alignments for the read, then we would like to tally and report\n * those alignments as such (and not as groups of 2 unpaired\n * alignments).  And if we fail to find any paired-end alignments, but\n * we do find some unpaired alignments for one mate or the other, then\n * we should clearly tally and report those alignments as unpaired\n * alignments (if the user so desires).\n *\n * The situation is murkier when there are no paired-end alignments,\n * but there are unpaired alignments for *both* mates.  In this case,\n * we might want to pick out zero or more pairs of mates and classify\n * those pairs as discordant paired-end alignments.  And we might want\n * to classify the remaining alignments as unpaired.  But how do we\n * pick which pairs if any to call discordant?\n *\n * Because the most obvious use for discordant pairs is for identifying\n * large-scale variation, like rearrangements or large indels, we would\n * usually like to be conservative about what we call a discordant\n * alignment.  If there's a good chance that one or the other of the\n * two mates has a good alignment to another place on the genome, this\n * compromises the evidence for the large-scale variant.  For this\n * reason, Bowtie 2's policy is: if there are no paired-end alignments\n * and there is *exactly one alignment each* for both mates, then the\n * two alignments are paired and treated as a discordant paired-end\n * alignment.  Otherwise, all alignments are treated as unpaired\n * alignments.\n *\n * When both paired and unpaired alignments are discovered by the\n * aligner, only the paired alignments are reported by default.  
This\n * is sensible considering relative likelihoods: if a good paired-end\n * alignment is found, it is much more likely that the placement of\n * the two mates implied by that paired alignment is correct than any\n * placement implied by an unpaired alignment.\n *\n * \n */\ntemplate <typename index_t>\nclass AlnSinkWrap {\npublic:\n\n\tAlnSinkWrap(\n\t\tAlnSink<index_t>& g,       // AlnSink being wrapped\n\t\tconst ReportingParams& rp, // Parameters governing reporting\n\t\tsize_t threadId,           // Thread ID\n        bool secondary = false) :  // Secondary alignments\n\t\tg_(g),\n\t\trp_(rp),\n        threadid_(threadId),\n    \tsecondary_(secondary),\n\t\tinit_(false),   \n\t\tmaxed1_(false),       // read is pair and we maxed out mate 1 unp alns\n\t\tmaxed2_(false),       // read is pair and we maxed out mate 2 unp alns\n\t\tmaxedOverall_(false), // alignments found so far exceed -m/-M ceiling\n\t\tbestPair_(std::numeric_limits<TAlScore>::min()),\n\t\tbest2Pair_(std::numeric_limits<TAlScore>::min()),\n\t\tbestUnp1_(std::numeric_limits<TAlScore>::min()),\n\t\tbest2Unp1_(std::numeric_limits<TAlScore>::min()),\n\t\tbestUnp2_(std::numeric_limits<TAlScore>::min()),\n\t\tbest2Unp2_(std::numeric_limits<TAlScore>::min()),\n        bestSplicedPair_(0),\n        best2SplicedPair_(0),\n        bestSplicedUnp1_(0),\n        best2SplicedUnp1_(0),\n        bestSplicedUnp2_(0),\n        best2SplicedUnp2_(0),\n\t\trd1_(NULL),    // mate 1\n\t\trd2_(NULL),    // mate 2\n\t\trdid_(std::numeric_limits<TReadId>::max()), // read id\n\t\trs_(),         // mate 1 alignments for paired-end alignments\n\t\tselect_(),     // for selecting random subsets for mate 1\n\t\tst_(rp)        // reporting state - what's left to do?\n\t{\n\t\tassert(rp_.repOk());\n\t}\n\n\tAlnSink<index_t>& getSink() {\n\t\treturn(g_);\n\t}\n\n\t/**\n\t * Initialize the wrapper with a new read pair and return an\n\t * integer >= -1 indicating which stage the aligner should start\n\t * at.  
If -1 is returned, the aligner can skip the read entirely.\n\t * Checks if the new read pair is identical to the\n\t * previous pair.  If it is, then we return the id of the first\n\t * stage to run.\n\t */\n\tint nextRead(\n\t\t// One or the other of rd1, rd2 will = NULL if read is unpaired\n\t\tconst Read* rd1,      // new mate #1\n\t\tconst Read* rd2,      // new mate #2\n\t\tTReadId rdid,         // read ID for new pair\n\t\tbool qualitiesMatter);// aln policy distinguishes b/t quals?\n\n\t/**\n\t * Inform global, shared AlnSink object that we're finished with\n\t * this read.  The global AlnSink is responsible for updating\n\t * counters, creating the output record, and delivering the record\n\t * to the appropriate output stream.\n\t */\n\tvoid finishRead(\n\t\tconst SeedResults<index_t> *sr1, // seed alignment results for mate 1\n\t\tconst SeedResults<index_t> *sr2, // seed alignment results for mate 2\n\t\tbool               exhaust1,     // mate 1 exhausted?\n\t\tbool               exhaust2,     // mate 2 exhausted?\n\t\tbool               nfilt1,       // mate 1 N-filtered?\n\t\tbool               nfilt2,       // mate 2 N-filtered?\n\t\tbool               scfilt1,      // mate 1 score-filtered?\n\t\tbool               scfilt2,      // mate 2 score-filtered?\n\t\tbool               lenfilt1,     // mate 1 length-filtered?\n\t\tbool               lenfilt2,     // mate 2 length-filtered?\n\t\tbool               qcfilt1,      // mate 1 qc-filtered?\n\t\tbool               qcfilt2,      // mate 2 qc-filtered?\n\t\tbool               sortByScore,  // prioritize alignments by score\n\t\tRandomSource&      rnd,          // pseudo-random generator\n\t\tReportingMetrics&  met,          // reporting metrics\n\t\tSpeciesMetrics&    smet,         // species metrics\n\t\tconst PerReadMetrics& prm,       // per-read metrics\n\t\tbool suppressSeedSummary = true,\n\t\tbool suppressAlignments = false);\n\t\n\t/**\n\t * Called by the aligner when a new unpaired 
or paired alignment is\n\t * discovered in the given stage.  This function checks whether the\n\t * addition of this alignment causes the reporting policy to be\n\t * violated (by meeting or exceeding the limits set by -k, -m, -M),\n\t * in which case true is returned immediately and the aligner is\n\t * short circuited.  Otherwise, the alignment is tallied and false\n\t * is returned.\n\t */\n\tbool report(\n\t\tint stage,\n        const AlnRes* rs);\n\n#ifndef NDEBUG\n\t/**\n\t * Check that hit sink wrapper is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tif(init_) {\n\t\t\tassert(rd1_ != NULL);\n\t\t\tassert_neq(std::numeric_limits<TReadId>::max(), rdid_);\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Return true iff no alignments have been reported to this wrapper\n\t * since the last call to nextRead().\n\t */\n\tbool empty() const {\n\t\treturn rs_.empty();\n\t}\n\t\n\t/**\n\t * Return true iff we have already encountered a number of alignments that\n\t * exceeds the -m/-M ceiling.  
TODO: how does this distinguish between\n\t * pairs and mates?\n\t */\n\tbool maxed() const {\n\t\treturn maxedOverall_;\n\t}\n\t\n\t/**\n\t * Return true if the current read is paired.\n\t */\n\tbool readIsPair() const {\n\t\treturn rd1_ != NULL && rd2_ != NULL;\n\t}\n\t\n\t/**\n\t * Return true iff nextRead() has been called since the last time\n\t * finishRead() was called.\n\t */\n\tbool inited() const { return init_; }\n\n\t/**\n\t * Return a const ref to the ReportingState object associated with the\n\t * AlnSinkWrap.\n\t */\n\tconst ReportingState& state() const { return st_; }\n    \n    const ReportingParams& reportingParams() { return rp_;}\n\t\n    SpeciesMetrics& speciesMetrics() { return g_.speciesMetrics(); }\n\t\n\t/**\n\t * Return true iff at least two alignments have been reported so far for an\n\t * unpaired read or mate 1.\n\t */\n\tbool hasSecondBestUnp1() const {\n\t\treturn best2Unp1_ != std::numeric_limits<TAlScore>::min();\n\t}\n\n\t/**\n\t * Return true iff at least two alignments have been reported so far for\n\t * mate 2.\n\t */\n\tbool hasSecondBestUnp2() const {\n\t\treturn best2Unp2_ != std::numeric_limits<TAlScore>::min();\n\t}\n\n\t/**\n\t * Return true iff at least two paired-end alignments have been reported so\n\t * far.\n\t */\n\tbool hasSecondBestPair() const {\n\t\treturn best2Pair_ != std::numeric_limits<TAlScore>::min();\n\t}\n\t\n\t/**\n\t * Get best score observed so far for an unpaired read or mate 1.\n\t */\n\tTAlScore bestUnp1() const {\n\t\treturn bestUnp1_;\n\t}\n\n\t/**\n\t * Get second-best score observed so far for an unpaired read or mate 1.\n\t */\n\tTAlScore secondBestUnp1() const {\n\t\treturn best2Unp1_;\n\t}\n\n\t/**\n\t * Get best score observed so far for mate 2.\n\t */\n\tTAlScore bestUnp2() const {\n\t\treturn bestUnp2_;\n\t}\n\n\t/**\n\t * Get second-best score observed so far for mate 2.\n\t */\n\tTAlScore secondBestUnp2() const {\n\t\treturn best2Unp2_;\n\t}\n\n\t/**\n\t * Get best score observed so far 
for paired-end read.\n\t */\n\tTAlScore bestPair() const {\n\t\treturn bestPair_;\n\t}\n\n\t/**\n\t * Get second-best score observed so far for paired-end read.\n\t */\n\tTAlScore secondBestPair() const {\n\t\treturn best2Pair_;\n\t}\n    \n    \n    /**\n     *\n     */\n    void getPair(const EList<AlnRes>*& rs) const { rs = &rs_; }\n\nprotected:\n\n\t/**\n\t * Return true iff the read in rd1/rd2 matches the last read handled, which\n\t * should still be in rd1_/rd2_.\n\t */\n\tbool sameRead(\n\t\tconst Read* rd1,\n\t\tconst Read* rd2,\n\t\tbool qualitiesMatter);\n\n\t/**\n\t * Given that rs is already populated with alignments, consider the\n\t * alignment policy and make random selections where necessary.  E.g. if we\n\t * found 10 alignments and the policy is -k 2 -m 20, select 2 alignments at\n\t * random.  We \"select\" an alignment by setting the parallel entry in the\n\t * 'select' list to true.\n\t */\n\tsize_t selectAlnsToReport(\n\t\tconst EList<AlnRes>& rs,     // alignments to select from\n\t\tuint64_t             num,    // number of alignments to select\n\t\tEList<size_t>&       select, // list to put results in\n\t\tRandomSource&        rnd)\n\t\tconst;\n\n\t/**\n\t * rs1 (possibly together with rs2 if reads are paired) are populated with\n\t * alignments.  Here we prioritize them according to alignment score, and\n\t * some randomness to break ties.  
Priorities are returned in the 'select'\n\t * list.\n\t */\n\tsize_t selectByScore(\n\t\tconst EList<AlnRes>* rs,    // alignments to select from (mate 1)\n\t\tuint64_t             num,    // number of alignments to select\n\t\tEList<size_t>&       select, // prioritized list to put results in\n\t\tRandomSource&        rnd)\n\t\tconst;\n\n\tAlnSink<index_t>& g_;     // global alignment sink\n\tReportingParams   rp_;    // reporting parameters: khits, mhits etc\n\tsize_t            threadid_; // thread ID\n    bool              secondary_; // allow for secondary alignments\n\tbool              init_;  // whether we're initialized w/ read pair\n\tbool              maxed1_; // true iff # unpaired mate-1 alns reported so far exceeded -m/-M\n\tbool              maxed2_; // true iff # unpaired mate-2 alns reported so far exceeded -m/-M\n\tbool              maxedOverall_; // true iff # paired-end alns reported so far exceeded -m/-M\n\tTAlScore          bestPair_;     // greatest score so far for paired-end\n\tTAlScore          best2Pair_;    // second-greatest score so far for paired-end\n\tTAlScore          bestUnp1_;     // greatest score so far for unpaired/mate1\n\tTAlScore          best2Unp1_;    // second-greatest score so far for unpaired/mate1\n\tTAlScore          bestUnp2_;     // greatest score so far for mate 2\n\tTAlScore          best2Unp2_;    // second-greatest score so far for mate 2\n    index_t           bestSplicedPair_;\n    index_t           best2SplicedPair_;\n    index_t           bestSplicedUnp1_;\n    index_t           best2SplicedUnp1_;\n    index_t           bestSplicedUnp2_;\n    index_t           best2SplicedUnp2_;\n\tconst Read*       rd1_;   // mate #1\n\tconst Read*       rd2_;   // mate #2\n\tTReadId           rdid_;  // read ID (potentially used for ordering)\n\tEList<AlnRes>     rs_;   // paired alignments for mate #1\n\tEList<size_t>     select_; // parallel to rs1_/rs2_ - which to report\n\tReportingState    st_;      // reporting 
state - what's left to do?\n\t\n\tEList<std::pair<TAlScore, size_t> > selectBuf_;\n\tBTString obuf_;\n};\n\n/**\n * An AlnSink concrete subclass for printing SAM alignments.  The user might\n * want to customize SAM output in various ways.  We encapsulate all these\n * customizations, and some of the key printing routines, in the SamConfig\n * class in sam.h/sam.cpp.\n */\ntemplate <typename index_t>\nclass AlnSinkSam : public AlnSink<index_t> {\n\n\ttypedef EList<std::string> StrList;\n\npublic:\n\n\tAlnSinkSam(\n               Ebwt<index_t>*   ebwt,\n               OutputQueue&     oq,           // output queue\n               const StrList&   refnames,     // reference names\n\t\t\t   const EList<uint32_t>&   tab_fmt_cols, // columns to output in the tabular format\n               bool             quiet) :\n    AlnSink<index_t>(oq,\n                     refnames,\n\t\t\t\t\t tab_fmt_cols,\n                     quiet),\n    ebwt_(ebwt)\n    { }\n\t\n\tvirtual ~AlnSinkSam() { }\n\n\t/**\n\t * Append a single alignment result, which might be paired or\n\t * unpaired, to the given output stream in Bowtie's verbose-mode\n\t * format.  
If the alignment is paired-end, print mate1's alignment\n\t * then mate2's alignment.\n\t */\n\tvirtual void append(\n\t\tBTString&     o,           // write output to this string\n\t\tsize_t        threadId,    // which thread am I?\n\t\tconst Read*   rd1,         // mate #1\n\t\tconst Read*   rd2,         // mate #2\n\t\tconst TReadId rdid,        // read ID\n\t\tAlnRes* rs1,               // alignments for mate #1\n\t\tAlnRes* rs2,               // alignments for mate #2\n\t\tconst AlnSetSumm& summ,    // summary\n\t\tconst PerReadMetrics& prm, // per-read metrics\n\t\tSpeciesMetrics& sm,  // species metrics\n\t\tbool report2,              // report alns for both mates\n\t\tsize_t n_results)          // number of results for read\n\t{\n\t\tassert(rd1 != NULL || rd2 != NULL);\n        appendMate(*ebwt_, o, *rd1, rd2, rdid, rs1, rs2, summ, prm, sm, n_results);\n\t}\n\nprotected:\n\n\t/**\n\t * Append a single per-mate alignment result to the given output\n\t * stream.  If the alignment is part of a pair, information about\n\t * the opposite mate and its alignment are given in rdo/rso.\n\t */\n\tvoid appendMate(\n                    Ebwt<index_t>& ebwt,\n                    BTString&     o,\n                    const Read&   rd,\n                    const Read*   rdo,\n                    const TReadId rdid,\n                    AlnRes* rs,\n                    AlnRes* rso,\n                    const AlnSetSumm& summ,\n                    const PerReadMetrics& prm, // per-read metrics\n                    SpeciesMetrics& sm,   // species metrics\n                    size_t n_results);\n\n\n    Ebwt<index_t>*   ebwt_;\n\tBTDnaString      dseq_;    // buffer for decoded read sequence\n\tBTString         dqual_;   // buffer for decoded quality sequence\n};\n\nstatic inline std::ostream& printPct(\n\t\t\t\t\t\t\t  std::ostream& os,\n\t\t\t\t\t\t\t  uint64_t num,\n\t\t\t\t\t\t\t  uint64_t denom)\n{\n\tdouble pct = 0.0f;\n\tif(denom != 0) { pct = 100.0 * (double)num / 
(double)denom; }\n\tos << fixed << setprecision(2) << pct << '%';\n\treturn os;\n}\n\n/**\n * Print a friendly summary of:\n *\n *  1. How many reads were aligned and had one or more alignments\n *     reported\n *  2. How many reads exceeded the -m or -M ceiling and therefore had\n *     their alignments suppressed or sampled\n *  3. How many reads failed to align entirely\n *\n * Optionally print a series of Hadoop streaming-style counter updates\n * with similar information.\n */\ntemplate <typename index_t>\nvoid AlnSink<index_t>::printAlSumm(\n\t\t\t\t\t\t\t\t   const ReportingMetrics& met,\n\t\t\t\t\t\t\t\t   size_t repThresh,   // threshold for uniqueness, or max if no thresh\n\t\t\t\t\t\t\t\t   bool discord,       // looked for discordant alignments\n\t\t\t\t\t\t\t\t   bool mixed,         // looked for unpaired alignments where paired failed?\n\t\t\t\t\t\t\t\t   bool hadoopOut)     // output Hadoop counters?\n{\n\t// NOTE: there's a filtering step at the very beginning, so everything\n\t// being reported here is post filtering\n#if 0\n\tbool canRep = repThresh != MAX_SIZE_T;\n\tif(hadoopOut) {\n\t\tcerr << \"reporter:counter:Centrifuge,Reads processed,\" << met.nread << endl;\n\t}\n\tuint64_t totread = met.nread;\n\tif(totread > 0) {\n\t\tcerr << \"\" << met.nread << \" reads (or pairs); of these:\" << endl;\n\t} else {\n\t\tassert_eq(0, met.npaired);\n\t\tassert_eq(0, met.nunpaired);\n\t\tcerr << \"\" << totread << \" reads (or pairs)\" << endl;\n\t}\n\tif(totread > 0) {\n\t\t// Concordants\n\t\tcerr << \"    \" << met.nconcord << \" (\";\n\t\tprintPct(cerr, met.nconcord, met.npaired);\n\t\tcerr << \") classified 0 times\" << endl;\n\t\t\n        // Print the number that aligned concordantly exactly once\n        cerr << \"    \" << met.nconcord_uni << \" (\";\n        printPct(cerr, met.nconcord_uni, met.npaired);\n        cerr << \") classified exactly 1 time\" << endl;\n        \n#if 0\n        // Print the number that aligned concordantly more than 
once\n        cerr << \"    \" << met.nconcord_uni2 << \" (\";\n        printPct(cerr, met.nconcord_uni2, met.npaired);\n        cerr << \") classified >1 times\" << endl;\n#endif\n\t}\n    \n#if 0\n\tuint64_t totunpair = met.nunpaired;\n\tuint64_t tot_al_cand = totunpair + totpair*2;\n\tuint64_t tot_al =\n\t(met.nconcord_uni + met.nconcord_rep)*2 +\n\t(met.ndiscord)*2 +\n\tmet.nunp_0_uni +\n\tmet.nunp_0_rep + \n\tmet.nunp_uni +\n\tmet.nunp_rep;\n\tassert_leq(tot_al, tot_al_cand);\n\tprintPct(cerr, tot_al, tot_al_cand);\n#endif\n\tcerr << \" overall classification rate\" << endl;\n#endif\n}\n\n/**\n * Return true iff the read in rd1/rd2 matches the last read handled, which\n * should still be in rd1_/rd2_.\n */\ntemplate <typename index_t>\nbool AlnSinkWrap<index_t>::sameRead(\n\t\t\t\t\t\t\t\t\t// One or the other of rd1, rd2 will be NULL if read is unpaired\n\t\t\t\t\t\t\t\t\tconst Read* rd1,      // new mate #1\n\t\t\t\t\t\t\t\t\tconst Read* rd2,      // new mate #2\n\t\t\t\t\t\t\t\t\tbool qualitiesMatter) // aln policy distinguishes b/t quals?\n{\n\tbool same = false;\n\tif(rd1_ != NULL || rd2_ != NULL) {\n\t\t// This is not the first time the sink was initialized with\n\t\t// a read.  
Check if new read/pair is identical to previous\n\t\t// read/pair\n\t\tif((rd1_ == NULL) == (rd1 == NULL) &&\n\t\t   (rd2_ == NULL) == (rd2 == NULL))\n\t\t{\n\t\t\tbool m1same = (rd1 == NULL && rd1_ == NULL);\n\t\t\tif(!m1same) {\n\t\t\t\tassert(rd1 != NULL);\n\t\t\t\tassert(rd1_ != NULL);\n\t\t\t\tm1same = Read::same(\n\t\t\t\t\t\t\t\t\trd1->patFw,  // new seq\n\t\t\t\t\t\t\t\t\trd1->qual,   // new quals\n\t\t\t\t\t\t\t\t\trd1_->patFw, // old seq\n\t\t\t\t\t\t\t\t\trd1_->qual,  // old quals\n\t\t\t\t\t\t\t\t\tqualitiesMatter);\n\t\t\t}\n\t\t\tif(m1same) {\n\t\t\t\tbool m2same = (rd2 == NULL && rd2_ == NULL);\n\t\t\t\tif(!m2same) {\n\t\t\t\t\tm2same = Read::same(\n\t\t\t\t\t\t\t\t\t\trd2->patFw,  // new seq\n\t\t\t\t\t\t\t\t\t\trd2->qual,   // new quals\n\t\t\t\t\t\t\t\t\t\trd2_->patFw, // old seq\n\t\t\t\t\t\t\t\t\t\trd2_->qual,  // old quals\n\t\t\t\t\t\t\t\t\t\tqualitiesMatter);\n\t\t\t\t}\n\t\t\t\tsame = m2same;\n\t\t\t}\n\t\t}\n\t}\n\treturn same;\n}\n\n/**\n * Initialize the wrapper with a new read pair and return an integer >= -1\n * indicating which stage the aligner should start at.  If -1 is returned, the\n * aligner can skip the read entirely.  Checks if the new read pair is\n * identical to the previous pair.  
If it is, then we return the id of the\n * first stage to run.\n */\ntemplate <typename index_t>\nint AlnSinkWrap<index_t>::nextRead(\n\t\t\t\t\t\t\t\t   // One or the other of rd1, rd2 will be NULL if read is unpaired\n\t\t\t\t\t\t\t\t   const Read* rd1,      // new mate #1\n\t\t\t\t\t\t\t\t   const Read* rd2,      // new mate #2\n\t\t\t\t\t\t\t\t   TReadId rdid,         // read ID for new pair\n\t\t\t\t\t\t\t\t   bool qualitiesMatter) // aln policy distinguishes b/t quals?\n{\n\tassert(!init_);\n\tassert(rd1 != NULL || rd2 != NULL);\n\tinit_ = true;\n\t// Keep copy of new read, so that we can compare it with the\n\t// next one\n\tif(rd1 != NULL) {\n\t\trd1_ = rd1;\n\t} else rd1_ = NULL;\n\tif(rd2 != NULL) {\n\t\trd2_ = rd2;\n\t} else rd2_ = NULL;\n\trdid_ = rdid;\n\t// Caller must now align the read\n\tmaxed1_ = false;\n\tmaxed2_ = false;\n\tmaxedOverall_ = false;\n\tbestPair_ = best2Pair_ =\n\tbestUnp1_ = best2Unp1_ =\n\tbestUnp2_ = best2Unp2_ = std::numeric_limits<THitInt>::min();\n    bestSplicedPair_ = best2SplicedPair_ =\n    bestSplicedUnp1_ = best2SplicedUnp1_ =\n    bestSplicedUnp2_ = best2SplicedUnp2_ = 0;\n\trs_.clear();     // clear out paired-end alignments\n\tst_.nextRead(readIsPair()); // reset state\n\tassert(empty());\n\tassert(!maxed());\n\t// Start from the first stage\n\treturn 0;\n}\n\n/**\n * Inform global, shared AlnSink object that we're finished with this read.\n * The global AlnSink is responsible for updating counters, creating the output\n * record, and delivering the record to the appropriate output stream.\n *\n * What gets reported for a paired-end alignment?\n *\n * 1. If there are reportable concordant alignments, report those and stop\n * 2. If there are reportable discordant alignments, report those and stop\n * 3. If unpaired alignments can be reported:\n *    3a. Report \n *\n * Update metrics.  
Only ambiguity is: what if a pair aligns repetitively and\n * one of its mates aligns uniquely?\n *\n * \tuint64_t al;   // # mates w/ >= 1 reported alignment\n *  uint64_t unal; // # mates w/ 0 alignments\n *  uint64_t max;  // # mates withheld for exceeding -M/-m ceiling\n *  uint64_t al_concord;  // # pairs w/ >= 1 concordant alignment\n *  uint64_t al_discord;  // # pairs w/ >= 1 discordant alignment\n *  uint64_t max_concord; // # pairs maxed out\n *  uint64_t unal_pair;   // # pairs where neither mate aligned\n */\ntemplate <typename index_t>\nvoid AlnSinkWrap<index_t>::finishRead(\n\t\t\t\t\t\t\t\t\t  const SeedResults<index_t> *sr1, // seed alignment results for mate 1\n\t\t\t\t\t\t\t\t\t  const SeedResults<index_t> *sr2, // seed alignment results for mate 2\n\t\t\t\t\t\t\t\t\t  bool               exhaust1,     // mate 1 exhausted?\n\t\t\t\t\t\t\t\t\t  bool               exhaust2,     // mate 2 exhausted?\n\t\t\t\t\t\t\t\t\t  bool               nfilt1,       // mate 1 N-filtered?\n\t\t\t\t\t\t\t\t\t  bool               nfilt2,       // mate 2 N-filtered?\n\t\t\t\t\t\t\t\t\t  bool               scfilt1,      // mate 1 score-filtered?\n\t\t\t\t\t\t\t\t\t  bool               scfilt2,      // mate 2 score-filtered?\n\t\t\t\t\t\t\t\t\t  bool               lenfilt1,     // mate 1 length-filtered?\n\t\t\t\t\t\t\t\t\t  bool               lenfilt2,     // mate 2 length-filtered?\n\t\t\t\t\t\t\t\t\t  bool               qcfilt1,      // mate 1 qc-filtered?\n\t\t\t\t\t\t\t\t\t  bool               qcfilt2,      // mate 2 qc-filtered?\n\t\t\t\t\t\t\t\t\t  bool               sortByScore,  // prioritize alignments by score\n\t\t\t\t\t\t\t\t\t  RandomSource&      rnd,          // pseudo-random generator\n\t\t\t\t\t\t\t\t\t  ReportingMetrics&  met,          // reporting metrics\n\t\t\t\t\t\t\t\t\t  SpeciesMetrics&    smet,         // species metrics\n\t\t\t\t\t\t\t\t\t  const PerReadMetrics& prm,       // per-read metrics\n\t\t\t\t\t\t\t\t\t  bool suppressSeedSummary,        
// = true\n\t\t\t\t\t\t\t\t\t  bool suppressAlignments)         // = false\n{\n\tobuf_.clear();\n\tOutputQueueMark qqm(g_.outq(), obuf_, rdid_, threadid_);\n\tassert(init_);\n\tif(!suppressSeedSummary) {\n\t\tif(sr1 != NULL) {\n\t\t\tassert(rd1_ != NULL);\n\t\t\t// Mate exists and has non-empty SeedResults\n\t\t\tg_.reportSeedSummary(obuf_, *rd1_, rdid_, threadid_, *sr1, true);\n\t\t} else if(rd1_ != NULL) {\n\t\t\t// Mate exists but has NULL SeedResults\n\t\t\tg_.reportEmptySeedSummary(obuf_, *rd1_, rdid_, true);\n\t\t}\n\t\tif(sr2 != NULL) {\n\t\t\tassert(rd2_ != NULL);\n\t\t\t// Mate exists and has non-empty SeedResults\n\t\t\tg_.reportSeedSummary(obuf_, *rd2_, rdid_, threadid_, *sr2, true);\n\t\t} else if(rd2_ != NULL) {\n\t\t\t// Mate exists but has NULL SeedResults\n\t\t\tg_.reportEmptySeedSummary(obuf_, *rd2_, rdid_, true);\n\t\t}\n\t}\n\n\t// TODO FB: Consider counting species here, and allow counting to be disabled\n\n\tif(!suppressAlignments) {\n\t\t// Ask the ReportingState what to report\n\t\tst_.finish();\n\t\tuint64_t nconcord = 0;\n\t\tbool pairMax = false;\n\t\tst_.getReport(nconcord);\n\t\tassert_leq(nconcord, rs_.size());\n\t\tassert_gt(rp_.khits, 0);\n\t\tmet.nread++;\n\n\t\tif(readIsPair()) {\n\t\t\tmet.npaired++;\n\t\t} else {\n\t\t\tmet.nunpaired++;\n\t\t}\n\t\t// Report concordant paired-end alignments if possible\n\t\tif(nconcord > 0) {\n\t\t\tAlnSetSumm concordSumm(rd1_, rd2_, &rs_);\n\t\t\t// Possibly select a random subset\n\t\t\tsize_t off;\n\t\t\tif(sortByScore) {\n\t\t\t\t// Sort by score then pick from low to high\n\t\t\t\toff = selectByScore(&rs_, nconcord, select_, rnd);\n\t\t\t} else {\n\t\t\t\t// Select subset randomly\n\t\t\t\toff = selectAlnsToReport(rs_, nconcord, select_, rnd);\n\t\t\t}\n\t\t\tassert_lt(off, rs_.size());\n\t\t\t_unused(off); // make production build happy\n\t\t\tassert(!select_.empty());\n\t\t\tg_.reportHits(\n\t\t\t\t\t\t  obuf_,\n\t\t\t\t\t\t  threadid_,\n\t\t\t\t\t\t  rd1_,\n\t\t\t\t\t\t  rd2_,\n\t\t\t\t\t\t  
rdid_,\n\t\t\t\t\t\t  select_,\n\t\t\t\t\t\t  NULL,\n\t\t\t\t\t\t  &rs_,\n\t\t\t\t\t\t  NULL,\n\t\t\t\t\t\t  pairMax,\n\t\t\t\t\t\t  concordSumm,\n                          prm,\n\t\t\t\t\t\t  smet);\n\t\t\tif(pairMax) {\n\t\t\t\t// met.nconcord_rep++;\n\t\t\t} else {\n\t\t\t\tmet.nconcord_uni++;\n\t\t\t\tassert(!rs_.empty());\n\t\t\t\tif(rs_.size() == 1) {\n\t\t\t\t\t// met.nconcord_uni1++;\n\t\t\t\t} else {\n\t\t\t\t\t// met.nconcord_uni2++;\n\t\t\t\t}\n\t\t\t}\n\t\t\tinit_ = false;\n\t\t\t// write read to file\n\t\t\t//g_.outq().finishRead(obuf_, rdid_, threadid_);\n\t\t\treturn;\n\t\t}\n\t\t\n#if 0\n\t\t// Update counters given that one mate didn't align\n\t\tif(readIsPair()) {\n\t\t\tmet.nconcord_0++;\n\t\t}\n\t\tif(rd1_ != NULL) {\n\t\t\tif(nunpair1 > 0) {\n\t\t\t\t// Update counters\n\t\t\t\tif(readIsPair()) {\n\t\t\t\t\tif(unpair1Max) met.nunp_0_rep++;\n\t\t\t\t\telse {\n\t\t\t\t\t\tmet.nunp_0_uni++;\n\t\t\t\t\t\tassert(!rs1u_.empty());\n\t\t\t\t\t\tif(rs1u_.size() == 1) {\n\t\t\t\t\t\t\tmet.nunp_0_uni1++;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tmet.nunp_0_uni2++;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tif(unpair1Max) met.nunp_rep++;\n\t\t\t\t\telse {\n\t\t\t\t\t\tmet.nunp_uni++;\n\t\t\t\t\t\tassert(!rs1u_.empty());\n\t\t\t\t\t\tif(rs1u_.size() == 1) {\n\t\t\t\t\t\t\tmet.nunp_uni1++;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tmet.nunp_uni2++;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t} else if(unpair1Max) {\n\t\t\t\t// Update counters\n\t\t\t\tif(readIsPair())   met.nunp_0_rep++;\n\t\t\t\telse               met.nunp_rep++;\n\t\t\t} else {\n\t\t\t\t// Update counters\n\t\t\t\tif(readIsPair())   met.nunp_0_0++;\n\t\t\t\telse               met.nunp_0++;\n\t\t\t}\n\t\t}\n\t\tif(rd2_ != NULL) {\n\t\t\tif(nunpair2 > 0) {\n\t\t\t\t// Update counters\n\t\t\t\tif(readIsPair()) {\n\t\t\t\t\tif(unpair2Max) met.nunp_0_rep++;\n\t\t\t\t\telse {\n\t\t\t\t\t\tassert(!rs2u_.empty());\n\t\t\t\t\t\tmet.nunp_0_uni++;\n\t\t\t\t\t\tif(rs2u_.size() == 1) 
{\n\t\t\t\t\t\t\tmet.nunp_0_uni1++;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tmet.nunp_0_uni2++;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tif(unpair2Max) met.nunp_rep++;\n\t\t\t\t\telse {\n\t\t\t\t\t\tassert(!rs2u_.empty());\n\t\t\t\t\t\tmet.nunp_uni++;\n\t\t\t\t\t\tif(rs2u_.size() == 1) {\n\t\t\t\t\t\t\tmet.nunp_uni1++;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tmet.nunp_uni2++;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t} else if(unpair2Max) {\n\t\t\t\t// Update counters\n\t\t\t\tif(readIsPair())   met.nunp_0_rep++;\n\t\t\t\telse               met.nunp_rep++;\n\t\t\t} else {\n\t\t\t\t// Update counters\n\t\t\t\tif(readIsPair())   met.nunp_0_0++;\n\t\t\t\telse               met.nunp_0++;\n\t\t\t}\n\t\t}\n        \n#endif\n\t} // if(suppress alignments)\n\tinit_ = false;\n\treturn;\n}\n\n/**\n * Called by the aligner when a new unpaired or paired alignment is\n * discovered in the given stage.  This function checks whether the\n * addition of this alignment causes the reporting policy to be\n * violated (by meeting or exceeding the limits set by -k, -m, -M),\n * in which case true is returned immediately and the aligner is\n * short circuited.  Otherwise, the alignment is tallied and false\n * is returned.\n */\ntemplate <typename index_t>\nbool AlnSinkWrap<index_t>::report(int stage,\n\t\t\t\t\t\t\t\t  const AlnRes* rs)\n{\n\tassert(init_);\n\tassert(rs != NULL);\n    st_.foundConcordant();\n    rs_.push_back(*rs);\n\n\t// Tally overall alignment score\n\tTAlScore score = rs->score();\n\t// Update best score so far\n    if(score > bestPair_) {\n        best2Pair_ = bestPair_;\n        bestPair_ = score;\n    } else if(score > best2Pair_) {\n        best2Pair_ = score;\n    }\n\treturn st_.done();\n}\n\n/**\n * rs1 (possibly together with rs2 if reads are paired) are populated with\n * alignments.  Here we prioritize them according to alignment score, and\n * some randomness to break ties.  
Priorities are returned in the 'select'\n * list.\n */\ntemplate <typename index_t>\nsize_t AlnSinkWrap<index_t>::selectByScore(\n\t\t\t\t\t\t\t\t\t\t   const EList<AlnRes>* rs,    // alignments to select from (mate 1)\n\t\t\t\t\t\t\t\t\t\t   uint64_t             num,    // number of alignments to select\n\t\t\t\t\t\t\t\t\t\t   EList<size_t>&       select, // prioritized list to put results in\n\t\t\t\t\t\t\t\t\t\t   RandomSource&        rnd)\nconst\n{\n\tassert(init_);\n\tassert(repOk());\n\tassert_gt(num, 0);\n\tassert(rs != NULL);\n\tsize_t sz = rs->size(); // sz = # alignments found\n\tassert_leq(num, sz);\n\tif(sz < num) {\n\t\tnum = sz;\n\t}\n\t// num = # to select\n\tif(sz < 1) {\n\t\treturn 0;\n\t}\n\tselect.resize((size_t)num);\n\t// Use 'selectBuf_' as a temporary list for sorting purposes\n\tEList<std::pair<TAlScore, size_t> >& buf =\n\tconst_cast<EList<std::pair<TAlScore, size_t> >& >(selectBuf_);\n\tbuf.resize(sz);\n\t// Sort by score.  If reads are pairs, sort by sum of mate scores.\n\tfor(size_t i = 0; i < sz; i++) {\n        buf[i].first = (*rs)[i].score();\n\t\tbuf[i].second = i; // original offset\n\t}\n\tbuf.sort(); buf.reverse(); // sort in descending order by score\n\t\n\t// Randomize streaks of alignments that are equal by score\n\tsize_t streak = 0;\n\tfor(size_t i = 1; i < buf.size(); i++) {\n\t\tif(buf[i].first == buf[i-1].first) {\n\t\t\tif(streak == 0) { streak = 1; }\n\t\t\tstreak++;\n\t\t} else {\n\t\t\tif(streak > 1) {\n\t\t\t\tassert_geq(i, streak);\n\t\t\t\tbuf.shufflePortion(i-streak, streak, rnd);\n\t\t\t}\n\t\t\tstreak = 0;\n\t\t}\n\t}\n\tif(streak > 1) {\n\t\tbuf.shufflePortion(buf.size() - streak, streak, rnd);\n\t}\n\t\n\tfor(size_t i = 0; i < num; i++) { select[i] = buf[i].second; }\n    \n    if(!secondary_) {\n        assert_geq(buf.size(), select.size());\n        for(size_t i = 0; i + 1 < select.size(); i++) {\n            if(buf[i].first != buf[i+1].first) {\n                select.resize(i+1);\n                break;\n   
         }\n        }\n    }\n    \n\t// Returns index of the representative alignment, but in 'select' also\n\t// returns the indexes of the next best selected alignments in order by\n\t// score.\n\treturn selectBuf_[0].second;\n}\n\n/**\n * Given that rs is already populated with alignments, consider the\n * alignment policy and make random selections where necessary.  E.g. if we\n * found 10 alignments and the policy is -k 2 -m 20, select 2 alignments at\n * random.  We \"select\" an alignment by setting the parallel entry in the\n * 'select' list to true.\n *\n * Return the \"representative\" alignment.  This is simply the first one\n * selected.  That will also be what SAM calls the \"primary\" alignment.\n */\ntemplate <typename index_t>\nsize_t AlnSinkWrap<index_t>::selectAlnsToReport(\n\t\t\t\t\t\t\t\t\t\t\t\tconst EList<AlnRes>& rs,     // alignments to select from\n\t\t\t\t\t\t\t\t\t\t\t\tuint64_t             num,    // number of alignments to select\n\t\t\t\t\t\t\t\t\t\t\t\tEList<size_t>&       select, // list to put results in\n\t\t\t\t\t\t\t\t\t\t\t\tRandomSource&        rnd)\nconst\n{\n\tassert(init_);\n\tassert(repOk());\n\tassert_gt(num, 0);\n\tsize_t sz = rs.size();\n\tif(sz < num) {\n\t\tnum = sz;\n\t}\n\tif(sz < 1) {\n\t\treturn 0;\n\t}\n\tselect.resize((size_t)num);\n\tif(sz == 1) {\n\t\tassert_eq(1, num);\n\t\tselect[0] = 0;\n\t\treturn 0;\n\t}\n\t// Select a random offset into the list of alignments\n\tuint32_t off = rnd.nextU32() % (uint32_t)sz;\n\tuint32_t offOrig = off;\n\t// Now take elements starting at that offset, wrapping around to 0 if\n\t// necessary.  
Leave the rest.\n\tfor(size_t i = 0; i < num; i++) {\n\t\tselect[i] = off;\n\t\toff++;\n\t\tif(off == sz) {\n\t\t\toff = 0;\n\t\t}\n\t}\n\treturn offOrig;\n}\n\n#define NOT_SUPPRESSED !suppress_[field++]\n#define BEGIN_FIELD { \\\nif(firstfield) firstfield = false; \\\nelse o.append('\\t'); \\\n}\n#define WRITE_TAB { \\\nif(firstfield) firstfield = false; \\\nelse o.append('\\t'); \\\n}\n#define WRITE_NUM(o, x) { \\\nitoa10(x, buf); \\\no.append(buf); \\\n}\n\n#define WRITE_STRING(o, x) { \\\no.append(x.c_str()); \\\n}\n\ntemplate <typename T> inline \nvoid appendNumber(BTString& o, const T x, char* buf) {\n\titoa10<T>(x, buf);\n\to.append(buf);\n}\n\n\n/**\n * Print a seed summary to the first output stream in the outs_ list.\n */\ntemplate <typename index_t>\nvoid AlnSink<index_t>::reportSeedSummary(\n\t\t\t\t\t\t\t\t\t\t BTString&          o,\n\t\t\t\t\t\t\t\t\t\t const Read&        rd,\n\t\t\t\t\t\t\t\t\t\t TReadId            rdid,\n\t\t\t\t\t\t\t\t\t\t size_t             threadId,\n\t\t\t\t\t\t\t\t\t\t const SeedResults<index_t>& rs,\n\t\t\t\t\t\t\t\t\t\t bool               getLock)\n{\n#if 0\n\tappendSeedSummary(\n\t\t\t\t\t  o,                     // string to write to\n\t\t\t\t\t  rd,                    // read\n\t\t\t\t\t  rdid,                  // read id\n\t\t\t\t\t  rs.numOffs()*2,        // # seeds tried\n\t\t\t\t\t  rs.nonzeroOffsets(),   // # seeds with non-empty results\n\t\t\t\t\t  rs.numRanges(),        // # ranges for all seed hits\n\t\t\t\t\t  rs.numElts(),          // # elements for all seed hits\n\t\t\t\t\t  rs.numOffs(),          // # seeds tried from fw read\n\t\t\t\t\t  rs.nonzeroOffsetsFw(), // # seeds with non-empty results from fw read\n\t\t\t\t\t  rs.numRangesFw(),      // # ranges for seed hits from fw read\n\t\t\t\t\t  rs.numEltsFw(),        // # elements for seed hits from fw read\n\t\t\t\t\t  rs.numOffs(),          // # seeds tried from rc read\n\t\t\t\t\t  rs.nonzeroOffsetsRc(), // # seeds with non-empty results from fw 
read\n\t\t\t\t\t  rs.numRangesRc(),      // # ranges for seed hits from rc read\n\t\t\t\t\t  rs.numEltsRc());       // # elements for seed hits from rc read\n#endif\n}\n\n/**\n * Print an empty seed summary to the first output stream in the outs_ list.\n */\ntemplate <typename index_t>\nvoid AlnSink<index_t>::reportEmptySeedSummary(\n\t\t\t\t\t\t\t\t\t\t\t  BTString&          o,\n\t\t\t\t\t\t\t\t\t\t\t  const Read&        rd,\n\t\t\t\t\t\t\t\t\t\t\t  TReadId            rdid,\n\t\t\t\t\t\t\t\t\t\t\t  size_t             threadId,\n\t\t\t\t\t\t\t\t\t\t\t  bool               getLock)\n{\n\tappendSeedSummary(\n\t\t\t\t\t  o,                     // string to append to\n\t\t\t\t\t  rd,                    // read\n\t\t\t\t\t  rdid,                  // read id\n\t\t\t\t\t  0,                     // # seeds tried\n\t\t\t\t\t  0,                     // # seeds with non-empty results\n\t\t\t\t\t  0,                     // # ranges for all seed hits\n\t\t\t\t\t  0,                     // # elements for all seed hits\n\t\t\t\t\t  0,                     // # seeds tried from fw read\n\t\t\t\t\t  0,                     // # seeds with non-empty results from fw read\n\t\t\t\t\t  0,                     // # ranges for seed hits from fw read\n\t\t\t\t\t  0,                     // # elements for seed hits from fw read\n\t\t\t\t\t  0,                     // # seeds tried from rc read\n\t\t\t\t\t  0,                     // # seeds with non-empty results from rc read\n\t\t\t\t\t  0,                     // # ranges for seed hits from rc read\n\t\t\t\t\t  0);                    // # elements for seed hits from rc read\n}\n\n/**\n * Print the given string.  If chopws = true, print only up to and not\n * including the first space or tab.  
Useful for printing reference\n * names.\n */\ntemplate<typename T>\nstatic inline void printUptoWs(\n\t\t\t\t\t\t\t   BTString& s,\n\t\t\t\t\t\t\t   const T& str,\n\t\t\t\t\t\t\t   bool chopws)\n{\n\tsize_t len = str.length();\n\tfor(size_t i = 0; i < len; i++) {\n\t\tif(!chopws || (str[i] != ' ' && str[i] != '\\t')) {\n\t\t\ts.append(str[i]);\n\t\t} else {\n\t\t\tbreak;\n\t\t}\n\t}\n}\n\n/**\n * Append a batch of unresolved seed alignment summary results (i.e.\n * seed alignments where all we know is the reference sequence aligned\n * to and its SA range, not where it falls in the reference\n * sequence) to the given output stream in Bowtie's seed-summary\n * verbose-mode format.\n *\n * The seed summary format is:\n *\n *  - One line per read\n *  - A typical line consists of a set of tab-delimited fields:\n *\n *    1. Read name\n *    2. Total number of seeds extracted from the read\n *    3. Total number of seeds that aligned to the reference at\n *       least once (always <= field 2)\n *    4. Total number of distinct BW ranges found in all seed hits\n *       (always >= field 3)\n *    5. Total number of distinct BW elements found in all seed\n *       hits (always >= field 4)\n *    6-9.:   Like 2-5. but just for seeds extracted from the\n *            forward representation of the read\n *    10-13.: Like 2-5. but just for seeds extracted from the\n *            reverse-complement representation of the read\n *\n *    Note that fields 6 and 10 should add to field 2, 7 and 11\n *    should add to 3, etc.\n *\n *  - Lines for reads that are filtered out for any reason (e.g. 
too\n *    many Ns) have columns 2 through 13 set to 0.\n */\ntemplate <typename index_t>\nvoid AlnSink<index_t>::appendSeedSummary(\n\t\t\t\t\t\t\t\t\t\t BTString&     o,\n\t\t\t\t\t\t\t\t\t\t const Read&   rd,\n\t\t\t\t\t\t\t\t\t\t const TReadId rdid,\n\t\t\t\t\t\t\t\t\t\t size_t        seedsTried,\n\t\t\t\t\t\t\t\t\t\t size_t        nonzero,\n\t\t\t\t\t\t\t\t\t\t size_t        ranges,\n\t\t\t\t\t\t\t\t\t\t size_t        elts,\n\t\t\t\t\t\t\t\t\t\t size_t        seedsTriedFw,\n\t\t\t\t\t\t\t\t\t\t size_t        nonzeroFw,\n\t\t\t\t\t\t\t\t\t\t size_t        rangesFw,\n\t\t\t\t\t\t\t\t\t\t size_t        eltsFw,\n\t\t\t\t\t\t\t\t\t\t size_t        seedsTriedRc,\n\t\t\t\t\t\t\t\t\t\t size_t        nonzeroRc,\n\t\t\t\t\t\t\t\t\t\t size_t        rangesRc,\n\t\t\t\t\t\t\t\t\t\t size_t        eltsRc)\n{\n\tchar buf[1024];\n\tbool firstfield = true;\n\t//\n\t// Read name\n\t//\n\tBEGIN_FIELD;\n\tprintUptoWs(o, rd.name, true);\n\t\n\t//\n\t// Total number of seeds tried\n\t//\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, seedsTried);\n\t\n\t//\n\t// Total number of seeds tried where at least one range was found.\n\t//\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, nonzero);\n\t\n\t//\n\t// Total number of ranges found\n\t//\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, ranges);\n\t\n\t//\n\t// Total number of elements found\n\t//\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, elts);\n\t\n\t//\n\t// The same four numbers, but only for seeds extracted from the\n\t// forward read representation.\n\t//\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, seedsTriedFw);\n\t\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, nonzeroFw);\n\t\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, rangesFw);\n\t\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, eltsFw);\n\t\n\t//\n\t// The same four numbers, but only for seeds extracted from the\n\t// reverse complement read representation.\n\t//\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, seedsTriedRc);\n\t\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, nonzeroRc);\n\t\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, rangesRc);\n\t\n\tBEGIN_FIELD;\n\tWRITE_NUM(o, 
eltsRc);\n\t\n\to.append('\\n');\n}\n\n\ninline\nvoid appendReadID(BTString& o, const BTString & rd_name) {\n    size_t namelen = rd_name.length();\n    if(namelen >= 2 &&\n       rd_name[namelen-2] == '/' &&\n       (rd_name[namelen-1] == '1' || rd_name[namelen-1] == '2' || rd_name[namelen-1] == '3'))\n    {\n        namelen -= 2;\n    }\n    for(size_t i = 0; i < namelen; i++) {\n        if(isspace(rd_name[i])) {\n            break;\n        }\n        o.append(rd_name[i]);\n    }\n}\n\ninline\nvoid appendSeqID(BTString& o, const AlnRes* rs, const std::map<uint64_t, TaxonomyNode>& tree) { \n    bool leaf = true;\n    std::map<uint64_t, TaxonomyNode>::const_iterator itr = tree.find(rs->taxID());\n    if(itr != tree.end()) {\n        const TaxonomyNode& node = itr->second;\n        leaf = node.leaf;\n    }\n\n    // unique ID\n    if(leaf) {\n        o.append(rs->uid().c_str());\n    } else {\n        o.append(get_tax_rank_string(rs->taxRank()));\n    }\n}\n\ninline \nvoid appendTaxID(BTString& o, uint64_t tid) {\n\n\tchar buf[1024];\n    uint64_t tid1 = tid & 0xffffffff;\n    uint64_t tid2 = tid >> 32;\n    itoa10<int64_t>(tid1, buf);\n    o.append(buf);\n\n    if(tid2 > 0) {\n        o.append(\".\");\n        itoa10<int64_t>(tid2, buf);\n        o.append(buf);\n    }\n}\n\n\nenum FIELD_DEF {\n\tPLACEHOLDER=0,\n\tPLACEHOLDER_STAR,\n\tPLACEHOLDER_ZERO,\n    READ_ID,\n\tSEQ_ID,\n\tTAX_ID,\n\tTAX_RANK,\n\tTAX_NAME,\n\tSCORE,\n\tSCORE2,\n\tHIT_LENGTH,\n\tQUERY_LENGTH,\n\tNUM_MATCHES,\n\tSEQ,\n\tSEQ1,\n\tSEQ2,\n\tQUAL,\n\tQUAL1,\n\tQUAL2\n};\n\n/**\n * Append a single hit to the given output stream in Bowtie's\n * verbose-mode format.\n */\ntemplate <typename index_t>\nvoid AlnSinkSam<index_t>::appendMate(\n                                     Ebwt<index_t>& ebwt,\n\t\t\t\t\t\t\t\t\t BTString&      o,           // append to this string\n\t\t\t\t\t\t\t\t\t const Read&    rd,\n\t\t\t\t\t\t\t\t\t const Read*    rdo,\n\t\t\t\t\t\t\t\t\t const TReadId  
rdid,\n\t\t\t\t\t\t\t\t\t AlnRes* rs,\n\t\t\t\t\t\t\t\t\t AlnRes* rso,\n\t\t\t\t\t\t\t\t\t const AlnSetSumm& summ,\n\t\t\t\t\t\t\t\t\t const PerReadMetrics& prm,\n\t\t\t\t\t\t\t\t\t SpeciesMetrics& sm,\n\t\t\t\t\t\t\t\t\t size_t n_results)\n{\n\tif(rs == NULL) {\n\t\treturn;\n\t}\n\n\tchar buf[1024];\n\tbool firstfield = true;\n\tuint64_t taxid =  rs->taxID();\n\tconst basic_string<char> empty_string = \"\";\n\n\tfor (size_t i=0; i < this->tab_fmt_cols_.size(); ++i) {\n\t\tBEGIN_FIELD;\n\t\tswitch (this->tab_fmt_cols_[i]) {\n\t\t\tcase READ_ID:      appendReadID(o, rd.name); break;\n\t\t\tcase SEQ_ID:       appendSeqID(o, rs, ebwt.tree()); break;\n\t\t\tcase SEQ:          o.append((string(rd.patFw.toZBuf()) + \n\t\t\t\t\t\t\t\t\t\t(rdo == NULL? \"\" : \"_\" + string(rdo->patFw.toZBuf()))).c_str()); break;\n\t\t\tcase QUAL:         o.append((string(rd.qual.toZBuf()) + \n\t\t\t\t\t\t\t\t\t\t(rdo == NULL? \"\" : \"_\" + string(rdo->qual.toZBuf()))).c_str()); break;\n\n\t\t\tcase SEQ1:         o.append(rd.patFw.toZBuf()); break;\n\t\t\tcase QUAL1:        o.append(rd.qual.toZBuf()); break;\n\t\t\tcase SEQ2:         o.append(rdo == NULL? empty_string.c_str() : rdo->patFw.toZBuf()); break;\n\t\t\tcase QUAL2:        o.append(rdo == NULL? empty_string.c_str() : rdo->qual.toZBuf()); break;\n\n\t\t\tcase TAX_ID:       appendTaxID(o, taxid); break;\n\t\t\tcase TAX_RANK:     o.append(get_tax_rank_string(rs->taxRank())); break;\n\t\t\tcase TAX_NAME:     o.append(find_or_use_default(ebwt.name(), taxid, empty_string).c_str()); break;\n\n\t\t\tcase SCORE:        appendNumber<uint64_t>(o, rs->score(), buf); break;\n\t\t\tcase SCORE2:       appendNumber<uint64_t>(o,\n\t\t\t\t\t\t\t\t\t   (summ.secbest().valid()? summ.secbest().score() : 0),\n\t\t\t\t\t\t\t\t\t   buf); break;\n\t\t\tcase HIT_LENGTH:   appendNumber<uint64_t>(o, rs->summedHitLen(), buf); break;\n\t\t\tcase QUERY_LENGTH: appendNumber<uint64_t>(o, \n\t\t\t\t\t\t\t\t\t   (rd.patFw.length() + (rdo != NULL ? 
rdo->patFw.length() : 0)),\n\t\t\t\t\t\t\t\t\t   buf); break;\n\t\t\tcase NUM_MATCHES:  appendNumber<uint64_t>(o, n_results, buf); break;\n\t\t\tcase PLACEHOLDER:  o.append(\"\"); break;\n\t\t\tcase PLACEHOLDER_STAR:  o.append(\"*\"); break;\n\t\t\tcase PLACEHOLDER_ZERO:  o.append(\"0\"); break;\n\t\t\tdefault: ;\n\t\t}\n\n\t}\n\to.append(\"\\n\");\n\n\n\t// species counting\n\tsm.addSpeciesCounts(\n                        rs->taxID(),\n                        rs->score(),\n                        rs->max_score(),\n                        rs->summedHitLen(),\n                        1.0 / n_results,\n                        (uint32_t)n_results);\n\n\t// only count k-mers if the read is unique\n    if (n_results == 1) {\n\t\tfor (size_t i = 0; i< rs->nReadPositions(); ++i) {\n\t\t\tsm.addAllKmers(rs->taxID(),\n                           rs->isFw()? rd.patFw : rd.patRc,\n                           rs->readPositions(i).first,\n                           rs->readPositions(i).second);\n\t\t}\n\t}\n\n//    (sc[rs->speciesID_])++;\n   \n}\n\n// #include <iomanip>\n\n/**\n * Initialize state machine with a new read.  The state we start in depends\n * on whether it's paired-end or unpaired.\n */\nvoid ReportingState::nextRead(bool paired) {\n    paired_ = paired;\n    state_ = CONCORDANT_PAIRS;\n    doneConcord_ = false;\n    exitConcord_ = ReportingState::EXIT_DID_NOT_EXIT;\n    done_ = false;\n    nconcord_ = 0;\n}\n\n/**\n * Caller uses this member function to indicate that one additional\n * concordant alignment has been found.\n */\nbool ReportingState::foundConcordant() {\n    assert_geq(state_, ReportingState::CONCORDANT_PAIRS);\n    assert(!doneConcord_);\n    nconcord_++;\n    if(doneConcord_) {\n        // If we're finished looking for concordant alignments, do we have to\n        // continue on to search for unpaired alignments?  Only if our exit\n        // from the concordant stage is EXIT_SHORT_CIRCUIT_M.  
If it's\n        // EXIT_SHORT_CIRCUIT_k or EXIT_WITH_ALIGNMENTS, we can skip unpaired.\n        assert_neq(ReportingState::EXIT_NO_ALIGNMENTS, exitConcord_);\n    }\n    return done();\n}\n\n/**\n * Caller uses this member function to indicate that one additional unpaired\n * mate alignment has been found for the specified mate.\n */\nbool ReportingState::foundUnpaired(bool mate1) {\n    return done();\n}\n\n/**\n * Called to indicate that the aligner has finished searching for\n * alignments.  This gives us a chance to finalize our state.\n *\n * TODO: Keep track of short-circuiting information.\n */\nvoid ReportingState::finish() {\n    if(!doneConcord_) {\n        doneConcord_ = true;\n        exitConcord_ =\n        ((nconcord_ > 0) ?\n         ReportingState::EXIT_WITH_ALIGNMENTS :\n         ReportingState::EXIT_NO_ALIGNMENTS);\n    }\n    assert_gt(exitConcord_, EXIT_DID_NOT_EXIT);\n    done_ = true;\n    assert(done());\n}\n\n\n/**\n * Populate given counters with the number of various kinds of alignments\n * to report for this read.  Concordant alignments are preferable to (and\n * mutually exclusive with) discordant alignments, and paired-end\n * alignments are preferable to unpaired alignments.\n *\n * The caller also needs some additional information for the case where a\n * pair or unpaired read aligns repetitively.  If the read is paired-end\n * and the paired-end has repetitive concordant alignments, that should be\n * reported, and 'pairMax' is set to true to indicate this.  
If the read is\n * paired-end, does not have any concordant alignments, but does have\n * repetitive alignments for one or both mates, then that should be\n * reported, and 'unpair1Max' and 'unpair2Max' are set accordingly.\n *\n * Note that it's possible in the case of a paired-end read for the read to\n * have repetitive concordant alignments, but for one mate to have a unique\n * unpaired alignment.\n */\nvoid ReportingState::getReport(uint64_t& nconcordAln) const // # concordant alignments to report\n{\n    nconcordAln = 0;\n    assert_gt(p_.khits, 0);\n    // Do we have 1 or more concordant alignments to report?\n    if(exitConcord_ == ReportingState::EXIT_SHORT_CIRCUIT_k) {\n        // k at random\n        assert_geq(nconcord_, (uint64_t)p_.khits);\n        nconcordAln = p_.khits;\n        return;\n    } else if(exitConcord_ == ReportingState::EXIT_WITH_ALIGNMENTS) {\n        assert_gt(nconcord_, 0);\n        // <= k at random\n        nconcordAln = min<uint64_t>(nconcord_, p_.khits);\n        return;\n    }\n}\n\n#if 0\n/**\n * Given the number of alignments in a category, check whether we\n * short-circuited out of the category.  Set the done and exit arguments to\n * indicate whether and how we short-circuited.\n */\ninline void ReportingState::areDone(\n                                    uint64_t cnt,    // # alignments in category\n                                    bool& done,      // out: whether we short-circuited out of category\n                                    int& exit) const // out: if done, how we short-circuited (-k? -m? 
etc)\n{\n    assert(!done);\n    // Have we exceeded the -k limit?\n    assert_gt(p_.khits, 0);\n    assert_gt(p_.mhits, 0);\n    if(cnt >= (uint64_t)p_.khits && !p_.mhitsSet()) {\n        done = true;\n        exit = ReportingState::EXIT_SHORT_CIRCUIT_k;\n    }\n    // Have we exceeded the -m or -M limit?\n    else if(p_.mhitsSet() && cnt > (uint64_t)p_.mhits) {\n        done = true;\n        assert(p_.msample);\n        exit = ReportingState::EXIT_SHORT_CIRCUIT_M;\n    }\n}\n#endif\n\n#endif /*ndef ALN_SINK_H_*/\n"
  },
  {
    "path": "alphabet.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <stdint.h>\n#include <cassert>\n#include <string>\n#include \"alphabet.h\"\n\nusing namespace std;\n\n/**\n * Mapping from ASCII characters to DNA categories:\n *\n * 0 = invalid - error\n * 1 = DNA\n * 2 = IUPAC (ambiguous DNA)\n * 3 = not an error, but unmatchable; alignments containing this\n *     character are invalid\n */\nuint8_t asc2dnacat[] = {\n\t/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,\n\t       /*                                        - */\n\t/*  48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  64 */ 0, 1, 2, 1, 2, 0, 0, 1, 2, 0, 0, 2, 0, 2, 2, 0,\n\t       /*    A  B  C  D        G  H        K     M  N */\n\t/*  80 */ 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0,\n\t       /*       R  S  T     V  W  X  Y */\n\t/*  96 */ 0, 1, 2, 1, 2, 0, 0, 1, 2, 0, 0, 2, 0, 2, 2, 0,\n\t       /*    a  b  c  d        g  h        k     m  n */\n\t/* 112 */ 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0,\n\t       /*       r  s  t     v  w  x  y */\n\t/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0,\n\t/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n};\n\n// 5-bit pop count\nint mask2popcnt[] = {\n\t0, 1, 1, 2, 1, 2, 2, 3,\n\t1, 2, 2, 3, 2, 3, 3, 4,\n\t1, 2, 2, 3, 2, 3, 3, 4,\n\t2, 3, 3, 4, 3, 4, 4, 5\n};\n\n/**\n * Mapping from masks to ASCII characters for ambiguous nucleotides.\n */\nchar mask2dna[] = {\n\t'?', // 0\n\t'A', // 1\n\t'C', // 2\n\t'M', // 3\n\t'G', // 4\n\t'R', // 5\n\t'S', // 6\n\t'V', // 7\n\t'T', // 8\n\t'W', // 9\n\t'Y', // 10\n\t'H', // 11\n\t'K', // 12\n\t'D', // 13\n\t'B', // 14\n\t'N', // 15 (inclusive N)\n\t'N'  // 16 (exclusive N)\n};\n\n/**\n * Mapping from ASCII characters for ambiguous nucleotides into masks:\n */\nuint8_t asc2dnamask[] = {\n\t/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  64 */ 0, 1,14, 2,13, 0, 0, 4,11, 0, 0,12, 0, 3,15, 0,\n\t       /*    A  B  C  D        G  H        K     M  N */\n\t/*  80 */ 0, 0, 5, 6, 8, 0, 7, 9, 0,10, 0, 0, 0, 0, 0, 0,\n\t       /*       R  S  T     V  W     Y */\n\t/*  96 */ 0, 1,14, 2,13, 0, 0, 4,11, 0, 0,12, 0, 3,15, 0,\n\t       /*    a  b  c  d        g  h        k     m  n */\n\t/* 112 */ 0, 0, 5, 6, 8, 0, 7, 9, 0,10, 0, 0, 0, 0, 0, 0,\n\t       /*       r  s  t     v  w     y */\n\t/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0,\n\t/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n};\n\n/**\n * Convert a pair of DNA masks to a color mask\n *\n * \n */ \nuint8_t dnamasks2colormask[16][16] = {\n\t         /* 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 */\n\t/*  0 */ {  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\n\t/*  1 */ {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },\n\t/*  2 */ {  0,  2,  1,  3,  8, 10,  9, 11,  4,  6,  5,  7, 12, 14, 13, 15 },\n\t/*  3 */ {  0,  3,  3,  3, 12, 15, 15, 15, 12, 15, 15, 15, 12, 15, 15, 15 },\n\t/*  4 */ {  0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15 },\n\t/*  5 */ {  0,  5, 10, 15,  5,  5, 15, 15, 10, 15, 10, 15, 15, 15, 15, 15 },\n\t/*  6 */ {  0,  6,  9, 15,  9, 15,  9, 15,  6,  6, 15, 15, 15, 15, 15, 15 },\n\t/*  7 */ {  0,  7, 11, 15, 13, 15, 15, 15, 14, 15, 15, 15, 15, 15, 15, 15 },\n\t/*  8 */ {  0,  8,  4, 12,  2, 10,  6, 14,  1,  9,  5, 13,  3, 11,  7, 15 },\n\t/*  9 */ {  0,  9,  6, 15,  6, 15,  6, 15,  9,  9, 15, 15, 15, 15, 15, 15 },\n\t/* 10 */ {  0, 10,  5, 15, 10, 10, 15, 15,  5, 15,  5, 15, 15, 15, 15, 15 },\n\t/* 11 */ {  0, 11,  7, 15, 14, 15, 15, 15, 13, 15, 15, 15, 15, 15, 15, 15 },\n\t/* 12 */ {  0, 12, 12, 12,  3, 15, 15, 15,  3, 15, 15, 15,  3, 15, 15, 15 },\n\t/* 13 */ {  0, 13, 14, 15,  7, 15, 15, 15, 11, 15, 15, 15, 15, 15, 15, 15 },\n\t/* 14 */ {  0, 14, 13, 15, 11, 15, 15, 15,  7, 15, 15, 15, 15, 15, 15, 15 },\n\t/* 15 */ {  0, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 }\n};\n\n/**\n * Mapping from ASCII characters for ambiguous nucleotides into masks:\n */\nchar asc2dnacomp[] = {\n\t/*   0 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  16 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  32 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  
0,'-',  0,  0,\n\t/*  48 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  64 */ 0,'T','V','G','H',  0,  0,'C','D',  0,  0,'M',  0,'K','N',  0,\n\t       /*    A   B   C   D           G   H           K       M   N */\n\t/*  80 */ 0,  0,'Y','S','A',  0,'B','W',  0,'R',  0,  0,  0,  0,  0,  0,\n\t       /*        R   S   T       V   W       Y */\n\t/*  96 */ 0,'T','V','G','H',  0,  0,'C','D',  0,  0,'M',  0,'K','N',  0,\n\t        /*   a   b   c   d           g   h           k       m   n */\n\t/* 112 */ 0,  0,'Y','S','A',  0,'B','W',  0,'R',  0,  0,  0,  0,  0,  0,\n\t       /*        r   s   t       v   w       y */\n\t/* 128 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 144 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 160 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 176 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 192 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 208 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 224 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 240 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0\n};\n\n/**\n * Mapping from ASCII characters for ambiguous nucleotides into masks:\n */\nchar col2dna[] = {\n\t/*   0 */  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  16 */  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  32 */  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,'-','N',  0,\n\t       /*                                                     -   . 
*/\n\t/*  48 */'A','C','G','T','N',  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t       /* 0   1   2   3   4  */\n\t/*  64 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  80 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  96 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 112 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 128 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 144 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 160 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 176 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 192 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 208 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 224 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 240 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0\n};\n\n/**\n * Mapping from ASCII DNA characters to ASCII color characters:\n */\nchar dna2col[] = {\n\t/*   0 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  16 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  32 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,'-',  0,  0,\n\t/*  48 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/*  64 */ 0,'0',  0,'1',  0,  0,  0,'2',  0,  0,  0,  0,  0,  0,'.',  0,\n\t       /*    A       C               G                           N */\n\t/*  80 */ 0,  0,  0,  0,'3',  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t       /*                T */\n\t/*  96 */ 0,'0',  0,'1',  0,  0,  0,'2',  0,  0,  0,  0,  0,  0,'.',  0,\n\t       /*    a       c               g                           n */\n\t/* 112 */ 0,  0,  0,  0,'3',  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t       /*                t */\n\t/* 128 */ 0,  0,  0,  0,  0,  0,  0,  0,  
0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 144 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 160 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 176 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 192 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 208 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 224 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n\t/* 240 */ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0\n};\n\n/**\n * Mapping from ASCII DNA characters (including IUPAC codes) to strings\n * listing the possible color characters, separated by '|':\n */\nconst char* dna2colstr[] = {\n\t/*   0 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/*  16 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/*  32 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"-\",  \"?\",  \"?\",\n\t/*  48 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/*  64 */ \"?\",  \"0\",\"1|2|3\",\"1\",\"0|2|3\",\"?\",  \"?\",  \"2\",\"0|1|3\",\"?\",  \"?\", \"2|3\", \"?\", \"0|1\", \".\",  \"?\",\n\t/*               A     B     C     D                 G     H                 K           M     N */\n\t/*  80 */ \"?\",  \"?\", \"0|2\",\"1|2\", \"3\",  \"?\",\"0|1|2\",\"0|3\",\"?\", \"1|3\", \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/*                     R     S     T           V     W           Y */\n\t/*  96 */ \"?\",  \"0\",\"1|2|3\",\"1\",\"0|2|3\",\"?\",  \"?\",  \"2\",\"0|1|3\",\"?\",  \"?\", \"2|3\", \"?\", \"0|1\", \".\",  \"?\",\n\t/*               a     b     c     d                 g     h                 k           m     n */\n\t/* 112 */ \"?\",  \"?\", \"0|2\",\"1|2\", \"3\",  
\"?\",\"0|1|2\",\"0|3\",\"?\", \"1|3\", \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/*                     r     s     t           v     w           y */\n\t/* 128 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 144 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 160 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 176 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 192 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 208 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 224 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",\n\t/* 240 */ \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\",  \"?\"\n};\n\n/**\n * Mapping from ASCII characters to color categories:\n *\n * 0 = invalid - error\n * 1 = valid color\n * 2 = IUPAC (ambiguous DNA) - there is no such thing for colors to my\n *     knowledge\n * 3 = not an error, but unmatchable; alignments containing this\n *     character are invalid\n */\nuint8_t asc2colcat[] = {\n\t/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0,\n\t       /*                                        -  . 
*/\n\t/*  48 */ 1, 1, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t       /* 0  1  2  3  4  */\n\t/*  64 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  96 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 112 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n};\n\n/**\n * Set the category for all IUPAC codes.  By default they're in\n * category 2 (IUPAC), but sometimes we'd like to put them in category\n * 3 (unmatchable), for example.\n */\nvoid setIupacsCat(uint8_t cat) {\n\tassert(cat < 4);\n\tasc2dnacat[(int)'B'] = asc2dnacat[(int)'b'] =\n\tasc2dnacat[(int)'D'] = asc2dnacat[(int)'d'] =\n\tasc2dnacat[(int)'H'] = asc2dnacat[(int)'h'] =\n\tasc2dnacat[(int)'K'] = asc2dnacat[(int)'k'] =\n\tasc2dnacat[(int)'M'] = asc2dnacat[(int)'m'] =\n\tasc2dnacat[(int)'N'] = asc2dnacat[(int)'n'] =\n\tasc2dnacat[(int)'R'] = asc2dnacat[(int)'r'] =\n\tasc2dnacat[(int)'S'] = asc2dnacat[(int)'s'] =\n\tasc2dnacat[(int)'V'] = asc2dnacat[(int)'v'] =\n\tasc2dnacat[(int)'W'] = asc2dnacat[(int)'w'] =\n\tasc2dnacat[(int)'X'] = asc2dnacat[(int)'x'] =\n\tasc2dnacat[(int)'Y'] = asc2dnacat[(int)'y'] = cat;\n}\n\n/// For converting from ASCII to the Dna5 code where A=0, C=1, G=2,\n/// T=3, N=4\nuint8_t asc2dna[] = {\n\t/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0,\n\t/*  64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,\n\t       /*    A     C           G                    N */\n\t/*  80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t       /*             T */\n\t/*  96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,\n\t       /*    a     c           g                    n */\n\t/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t       /*             t */\n\t/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n};\n\n/// Convert an ascii char representing a base or a color to a 2-bit\n/// code: 0=A,0; 1=C,1; 2=G,2; 3=T,3; 4=N,.\nuint8_t asc2dnaOrCol[] = {\n\t/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0,\n\t/*                                               -  . 
*/\n\t/*  48 */ 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*        0  1  2  3 */\n\t/*  64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,\n\t/*           A     C           G                    N */\n\t/*  80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*                    T */\n\t/*  96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,\n\t/*           a     c           g                    n */\n\t/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*                    t */\n\t/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n};\n\n/// For converting from an ASCII color character to a color code where\n/// '0'=0, '1'=1, '2'=2, '3'=3, and '-' or '.'=4\nuint8_t asc2col[] = {\n\t/*   0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0,\n\t       /*                                        -  . 
*/\n\t/*  48 */ 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t       /* 0  1  2  3 */\n\t/*  64 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/*  96 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 112 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n\t/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n};\n\n/**\n * Convert a nucleotide and a color to the paired nucleotide.  Indexed\n * first by nucleotide then by color.  Note that this is exactly the\n * same as the dinuc2color array.\n */\nuint8_t nuccol2nuc[5][5] = {\n\t/*       B  G  O  R  . */\n\t/* A */ {0, 1, 2, 3, 4},\n\t/* C */ {1, 0, 3, 2, 4},\n\t/* G */ {2, 3, 0, 1, 4},\n\t/* T */ {3, 2, 1, 0, 4},\n\t/* N */ {4, 4, 4, 4, 4}\n};\n\n/**\n * Convert a pair of nucleotides to a color.\n */\nuint8_t dinuc2color[5][5] = {\n\t/* A */ {0, 1, 2, 3, 4},\n\t/* C */ {1, 0, 3, 2, 4},\n\t/* G */ {2, 3, 0, 1, 4},\n\t/* T */ {3, 2, 1, 0, 4},\n\t/* N */ {4, 4, 4, 4, 4}\n};\n\n/// Convert bit encoded DNA char to its complement\nint dnacomp[5] = {\n\t3, 2, 1, 0, 4\n};\n\nconst char *iupacs = \"!ACMGRSVTWYHKDBN!acmgrsvtwyhkdbn\";\n\nchar mask2iupac[16] = {\n\t-1,\n\t'A', // 0001\n\t'C', // 0010\n\t'M', // 0011\n\t'G', // 0100\n\t'R', // 0101\n\t'S', // 0110\n\t'V', // 0111\n\t'T', // 1000\n\t'W', // 1001\n\t'Y', // 1010\n\t'H', // 1011\n\t'K', // 1100\n\t'D', // 1101\n\t'B', // 1110\n\t'N', // 1111\n};\n\nint maskcomp[16] = {\n\t0,  // 0000 (!) 
-> 0000 (!)\n\t8,  // 0001 (A) -> 1000 (T)\n\t4,  // 0010 (C) -> 0100 (G)\n\t12, // 0011 (M) -> 1100 (K)\n\t2,  // 0100 (G) -> 0010 (C)\n\t10, // 0101 (R) -> 1010 (Y)\n\t6,  // 0110 (S) -> 0110 (S)\n\t14, // 0111 (V) -> 1110 (B)\n\t1,  // 1000 (T) -> 0001 (A)\n\t9,  // 1001 (W) -> 1001 (W)\n\t5,  // 1010 (Y) -> 0101 (R)\n\t13, // 1011 (H) -> 1101 (D)\n\t3,  // 1100 (K) -> 0011 (M)\n\t11, // 1101 (D) -> 1011 (H)\n\t7,  // 1110 (B) -> 0111 (V)\n\t15, // 1111 (N) -> 1111 (N)\n};\n\n"
  },
  {
    "path": "alphabet.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ALPHABETS_H_\n#define ALPHABETS_H_\n\n#include <stdexcept>\n#include <string>\n#include <sstream>\n#include <stdint.h>\n#include \"assert_helpers.h\"\n\nusing namespace std;\n\n/// Convert an ascii char to a DNA category.  Categories are:\n/// 0 -> invalid\n/// 1 -> unambiguous a, c, g or t\n/// 2 -> ambiguous\n/// 3 -> unmatchable\nextern uint8_t asc2dnacat[];\n/// Convert masks to ambiguous nucleotides\nextern char mask2dna[];\n/// Convert ambiguous ASCII nuceleotide to mask\nextern uint8_t asc2dnamask[];\n/// Convert mask to # of alternative in the mask\nextern int mask2popcnt[];\n/// Convert an ascii char to a 2-bit base: 0=A, 1=C, 2=G, 3=T, 4=N\nextern uint8_t asc2dna[];\n/// Convert an ascii char representing a base or a color to a 2-bit\n/// code: 0=A,0; 1=C,1; 2=G,2; 3=T,3; 4=N,.\nextern uint8_t asc2dnaOrCol[];\n/// Convert a pair of DNA masks to a color mask\nextern uint8_t dnamasks2colormask[16][16];\n\n/// Convert an ascii char to a color category.  
Categories are:\n/// 0 -> invalid\n/// 1 -> unambiguous 0, 1, 2 or 3\n/// 2 -> ambiguous (not applicable for colors)\n/// 3 -> unmatchable\nextern uint8_t asc2colcat[];\n/// Convert an ascii color char to a color code: '0'=0, '1'=1, '2'=2,\n/// '3'=3, and '-' or '.'=4\nextern uint8_t asc2col[];\n/// Convert an ascii char to its DNA complement, including IUPACs\nextern char asc2dnacomp[];\n\n/// Convert a pair of 2-bit (and 4=N) encoded DNA bases to a color\nextern uint8_t dinuc2color[5][5];\n/// Convert a 2-bit nucleotide (and 4=N) and a color to the\n/// corresponding 2-bit nucleotide\nextern uint8_t nuccol2nuc[5][5];\n/// Convert a 4-bit mask into an IUPAC code\nextern char mask2iupac[16];\n\n/// Convert an ascii color to an ascii dna char\nextern char col2dna[];\n/// Convert an ascii dna to a color char\nextern char dna2col[];\n/// Convert an ascii dna (including IUPACs) to a string listing the\n/// possible color chars, separated by '|'\nextern const char* dna2colstr[];\n\n/// Convert bit encoded DNA char to its complement\nextern int dnacomp[5];\n\n/// String of all DNA and IUPAC characters\nextern const char *iupacs;\n\n/// Map from masks to their reverse-complement masks\nextern int maskcomp[16];\n\n/**\n * Return true iff c is a Dna character.\n */\nstatic inline bool isDna(char c) {\n\treturn asc2dnacat[(int)c] > 0;\n}\n\n/**\n * Return true iff c is a color character.\n */\nstatic inline bool isColor(char c) {\n\treturn asc2colcat[(int)c] > 0;\n}\n\n/**\n * Return true iff c is an ambiguous Dna character.\n */\nstatic inline bool isAmbigNuc(char c) {\n\treturn asc2dnacat[(int)c] == 2;\n}\n\n/**\n * Return true iff c is an ambiguous color character.\n */\nstatic inline bool isAmbigColor(char c) {\n\treturn asc2colcat[(int)c] == 2;\n}\n\n/**\n * Return true iff c is an ambiguous character.\n */\nstatic inline bool isAmbig(char c, bool color) {\n\treturn (color ? 
asc2colcat[(int)c] : asc2dnacat[(int)c]) == 2;\n}\n\n/**\n * Return true iff c is an unambiguous DNA character.\n */\nstatic inline bool isUnambigNuc(char c) {\n\treturn asc2dnacat[(int)c] == 1;\n}\n\n/**\n * Return the DNA complement of the given ASCII char.\n */\nstatic inline char comp(char c) {\n\tswitch(c) {\n\tcase 'a': return 't';\n\tcase 'A': return 'T';\n\tcase 'c': return 'g';\n\tcase 'C': return 'G';\n\tcase 'g': return 'c';\n\tcase 'G': return 'C';\n\tcase 't': return 'a';\n\tcase 'T': return 'A';\n\tdefault: return c;\n\t}\n}\n\n/**\n * Return the reverse complement of a bit-encoded nucleotide.\n */\nstatic inline int compDna(int c) {\n\tassert_leq(c, 4);\n\treturn dnacomp[c];\n}\n\n/**\n * Return true iff c is an unambiguous Dna character.\n */\nstatic inline bool isUnambigDna(char c) {\n\treturn asc2dnacat[(int)c] == 1;\n}\n\n/**\n * Return true iff c is an unambiguous color character (0,1,2,3).\n */\nstatic inline bool isUnambigColor(char c) {\n\treturn asc2colcat[(int)c] == 1;\n}\n\n/// Convert a pair of 2-bit (and 4=N) encoded DNA bases to a color\nextern uint8_t dinuc2color[5][5];\n\n/**\n * Decode a not-necessarily-ambiguous nucleotide.\n */\nstatic inline void decodeNuc(char c , int& num, int *alts) {\n\tswitch(c) {\n\tcase 'A': alts[0] = 0; num = 1; break;\n\tcase 'C': alts[0] = 1; num = 1; break;\n\tcase 'G': alts[0] = 2; num = 1; break;\n\tcase 'T': alts[0] = 3; num = 1; break;\n\tcase 'M': alts[0] = 0; alts[1] = 1; num = 2; break;\n\tcase 'R': alts[0] = 0; alts[1] = 2; num = 2; break;\n\tcase 'W': alts[0] = 0; alts[1] = 3; num = 2; break;\n\tcase 'S': alts[0] = 1; alts[1] = 2; num = 2; break;\n\tcase 'Y': alts[0] = 1; alts[1] = 3; num = 2; break;\n\tcase 'K': alts[0] = 2; alts[1] = 3; num = 2; break;\n\tcase 'V': alts[0] = 0; alts[1] = 1; alts[2] = 2; num = 3; break;\n\tcase 'H': alts[0] = 0; alts[1] = 1; alts[2] = 3; num = 3; break;\n\tcase 'D': alts[0] = 0; alts[1] = 2; alts[2] = 3; num = 3; break;\n\tcase 'B': alts[0] = 1; alts[1] = 2; 
alts[2] = 3; num = 3; break;\n\tcase 'N': alts[0] = 0; alts[1] = 1; alts[2] = 2; alts[3] = 3; num = 4; break;\n\tdefault: {\n\t\tstd::cerr << \"Bad IUPAC code: \" << c << \", (int: \" << (int)c << \")\" << std::endl;\n\t\tthrow std::runtime_error(\"\");\n\t}\n\t}\n}\n\nextern void setIupacsCat(uint8_t cat);\n\n#endif /*ALPHABETS_H_*/\n"
  },
  {
    "path": "assert_helpers.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ASSERT_HELPERS_H_\n#define ASSERT_HELPERS_H_\n\n#include <stdexcept>\n#include <string>\n#include <cassert>\n#include <iostream>\n\n/**\n * Assertion for release-enabled assertions\n */\nclass ReleaseAssertException : public std::runtime_error {\npublic:\n\tReleaseAssertException(const std::string& msg = \"\") : std::runtime_error(msg) {}\n};\n\n/**\n * Macros for release-enabled assertions, and helper macros to make\n * all assertion error messages more helpful.\n */\n#ifndef NDEBUG\n#define ASSERT_ONLY(...) 
__VA_ARGS__\n#else\n#define ASSERT_ONLY(...)\n#endif\n\n#define rt_assert(b)  \\\n\tif(!(b)) { \\\n\t\tstd::cerr << \"rt_assert at \" << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_msg(b,msg)  \\\n\tif(!(b)) { \\\n\t\tstd::cerr << msg <<  \" at \" << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#define rt_assert_eq(ex,ac)  \\\n\tif(!((ex) == (ac))) { \\\n\t\tstd::cerr << \"rt_assert_eq: expected (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_eq_msg(ex,ac,msg)  \\\n\tif(!((ex) == (ac))) { \\\n\t\tstd::cerr << \"rt_assert_eq: \" << msg <<  \": (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#ifndef NDEBUG\n#define assert_eq(ex,ac)  \\\n\tif(!((ex) == (ac))) { \\\n\t\tstd::cerr << \"assert_eq: expected (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#define assert_eq_msg(ex,ac,msg)  \\\n\tif(!((ex) == (ac))) { \\\n\t\tstd::cerr << \"assert_eq: \" << msg <<  \": (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#else\n#define assert_eq(ex,ac)\n#define assert_eq_msg(ex,ac,msg)\n#endif\n\n#define rt_assert_neq(ex,ac)  \\\n\tif(!((ex) != 
(ac))) { \\\n\t\tstd::cerr << \"rt_assert_neq: expected not (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_neq_msg(ex,ac,msg)  \\\n\tif(!((ex) != (ac))) { \\\n\t\tstd::cerr << \"rt_assert_neq: \" << msg << \": (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#ifndef NDEBUG\n#define assert_neq(ex,ac)  \\\n\tif(!((ex) != (ac))) { \\\n\t\tstd::cerr << \"assert_neq: expected not (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#define assert_neq_msg(ex,ac,msg)  \\\n\tif(!((ex) != (ac))) { \\\n\t\tstd::cerr << \"assert_neq: \" << msg << \": (\" << (ex) << \", 0x\" << std::hex << (ex) << std::dec << \") got (\" << (ac) << \", 0x\" << std::hex << (ac) << std::dec << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#else\n#define assert_neq(ex,ac)\n#define assert_neq_msg(ex,ac,msg)\n#endif\n\n#define rt_assert_gt(a,b) \\\n\tif(!((a) > (b))) { \\\n\t\tstd::cerr << \"rt_assert_gt: expected (\" << (a) << \") > (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_gt_msg(a,b,msg) \\\n\tif(!((a) > (b))) { \\\n\t\tstd::cerr << \"rt_assert_gt: \" << msg << \": (\" << (a) << \") > (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; 
\\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#ifndef NDEBUG\n#define assert_gt(a,b) \\\n\tif(!((a) > (b))) { \\\n\t\tstd::cerr << \"assert_gt: expected (\" << (a) << \") > (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#define assert_gt_msg(a,b,msg) \\\n\tif(!((a) > (b))) { \\\n\t\tstd::cerr << \"assert_gt: \" << msg << \": (\" << (a) << \") > (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#else\n#define assert_gt(a,b)\n#define assert_gt_msg(a,b,msg)\n#endif\n\n#define rt_assert_geq(a,b) \\\n\tif(!((a) >= (b))) { \\\n\t\tstd::cerr << \"rt_assert_geq: expected (\" << (a) << \") >= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_geq_msg(a,b,msg) \\\n\tif(!((a) >= (b))) { \\\n\t\tstd::cerr << \"rt_assert_geq: \" << msg << \": (\" << (a) << \") >= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#ifndef NDEBUG\n#define assert_geq(a,b) \\\n\tif(!((a) >= (b))) { \\\n\t\tstd::cerr << \"assert_geq: expected (\" << (a) << \") >= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#define assert_geq_msg(a,b,msg) \\\n\tif(!((a) >= (b))) { \\\n\t\tstd::cerr << \"assert_geq: \" << msg << \": (\" << (a) << \") >= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#else\n#define assert_geq(a,b)\n#define assert_geq_msg(a,b,msg)\n#endif\n\n#define rt_assert_lt(a,b) \\\n\tif(!((a) < (b))) { \\\n\t\tstd::cerr << \"rt_assert_lt: expected (\" << (a) << \") < (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; 
\\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_lt_msg(a,b,msg) \\\n\tif(!((a) < (b))) { \\\n\t\tstd::cerr << \"rt_assert_lt: \" << msg << \": (\" << (a) << \") < (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#ifndef NDEBUG\n#define assert_lt(a,b) \\\n\tif(!((a) < (b))) { \\\n\t\tstd::cerr << \"assert_lt: expected (\" << (a) << \") < (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#define assert_lt_msg(a,b,msg) \\\n\tif(!((a) < (b))) { \\\n\t\tstd::cerr << \"assert_lt: \" << msg << \": (\" << (a) << \") < (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#else\n#define assert_lt(a,b)\n#define assert_lt_msg(a,b,msg)\n#endif\n\n#define rt_assert_leq(a,b) \\\n\tif(!((a) <= (b))) { \\\n\t\tstd::cerr << \"rt_assert_leq: expected (\" << (a) << \") <= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(); \\\n\t}\n#define rt_assert_leq_msg(a,b,msg) \\\n\tif(!((a) <= (b))) { \\\n\t\tstd::cerr << \"rt_assert_leq: \" << msg << \": (\" << (a) << \") <= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tthrow ReleaseAssertException(msg); \\\n\t}\n\n#ifndef NDEBUG\n#define assert_leq(a,b) \\\n\tif(!((a) <= (b))) { \\\n\t\tstd::cerr << \"assert_leq: expected (\" << (a) << \") <= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#define assert_leq_msg(a,b,msg) \\\n\tif(!((a) <= (b))) { \\\n\t\tstd::cerr << \"assert_leq: \" << msg << \": (\" << (a) << \") <= (\" << (b) << \")\" << std::endl; \\\n\t\tstd::cerr << __FILE__ << \":\" << __LINE__ << std::endl; \\\n\t\tassert(0); \\\n\t}\n#else\n#define 
assert_leq(a,b)\n#define assert_leq_msg(a,b,msg)\n#endif\n\n#ifndef NDEBUG\n#define assert_in(c, s) assert_in2(c, s, __FILE__, __LINE__)\nstatic inline void assert_in2(char c, const char *str, const char *file, int line) {\n\tconst char *s = str;\n\twhile(*s != '\\0') {\n\t\tif(c == *s) return;\n\t\ts++;\n\t}\n\tstd::cerr << \"assert_in: (\" << c << \") not in  (\" << str << \")\" << std::endl;\n\tstd::cerr << file << \":\" << line << std::endl;\n\tassert(0);\n}\n#else\n#define assert_in(c, s)\n#endif\n\n#ifndef NDEBUG\n#define assert_range(b, e, v) assert_range_helper(b, e, v, __FILE__, __LINE__)\ntemplate<typename T>\ninline static void assert_range_helper(const T& begin,\n                                       const T& end,\n                                       const T& val,\n                                       const char *file,\n                                       int line)\n{\n\tif(val < begin || val > end) {\n\t\tstd::cerr << \"assert_range: (\" << val << \") not in  [\"\n\t\t          << begin << \", \" << end << \"]\" << std::endl;\n\t\tstd::cerr << file << \":\" << line << std::endl;\n\t\tassert(0);\n\t}\n}\n#else\n#define assert_range(b, e, v)\n#endif\n\n// Mark a variable that is only read inside asserts as used, so that release\n// (NDEBUG) builds do not emit \"warning: variable ‘x’ set but not used [-Wunused-but-set-variable]\"\n#define _unused(x) ((void)(x))\n\n#endif /*ASSERT_HELPERS_H_*/\n"
  },
  {
    "path": "binary_sa_search.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef BINARY_SA_SEARCH_H_\n#define BINARY_SA_SEARCH_H_\n\n#include <stdint.h>\n#include <iostream>\n#include <limits>\n#include \"alphabet.h\"\n#include \"assert_helpers.h\"\n#include \"ds.h\"\n#include \"btypes.h\"\n\n/**\n * Do a binary search using the suffix of 'host' beginning at offset\n * 'qry' as the query and 'sa' as an already-lexicographically-sorted\n * list of suffixes of host.  'sa' may be all suffixes of host or just\n * a subset.  
Returns the index in sa of the smallest suffix of host\n * that is larger than qry, or length(sa) if all suffixes of host are\n * less than qry.\n *\n * We use the Manber and Myers optimization of maintaining a pair of\n * counters for the longest lcp observed so far on the left- and right-\n * hand sides and using the min of the two as a way of skipping over\n * characters at the beginning of a new round.\n *\n * Returns maximum value if the query suffix matches an element of sa.\n */\ntemplate<typename TStr, typename TSufElt> inline\nTIndexOffU binarySASearch(\n\tconst TStr& host,\n\tTIndexOffU qry,\n\tconst EList<TSufElt>& sa)\n{\n\tTIndexOffU lLcp = 0, rLcp = 0; // greatest observed LCPs on left and right\n\tTIndexOffU l = 0, r = (TIndexOffU)sa.size()+1; // binary-search window\n\tTIndexOffU hostLen = (TIndexOffU)host.length();\n\twhile(true) {\n\t\tassert_gt(r, l);\n\t\tTIndexOffU m = (l+r) >> 1;\n\t\tif(m == l) {\n\t\t\t// Binary-search window has closed: we have an answer\n\t\t\tif(m > 0 && sa[m-1] == qry) {\n\t\t\t\treturn std::numeric_limits<TIndexOffU>::max(); // qry matches\n\t\t\t}\n\t\t\tassert_leq(m, sa.size());\n\t\t\treturn m; // Return index of right-hand suffix\n\t\t}\n\t\tassert_gt(m, 0);\n\t\tTIndexOffU suf = sa[m-1];\n\t\tif(suf == qry) {\n\t\t\treturn std::numeric_limits<TIndexOffU>::max(); // query matches an elt of sa\n\t\t}\n\t\tTIndexOffU lcp = min(lLcp, rLcp);\n#ifndef NDEBUG\n\t\tif(sstr_suf_upto_neq(host, qry, host, suf, lcp)) {\n\t\t\tassert(0);\n\t\t}\n#endif\n\t\t// Keep advancing lcp, but stop when query mismatches host or\n\t\t// when the counter falls off either the query or the suffix\n\t\twhile(suf+lcp < hostLen && qry+lcp < hostLen && host[suf+lcp] == host[qry+lcp]) {\n\t\t\tlcp++;\n\t\t}\n\t\t// Fell off the end of either the query or the sa elt?\n\t\tbool fell = (suf+lcp == hostLen || qry+lcp == hostLen);\n\t\tif((fell && qry+lcp == hostLen) || (!fell && host[suf+lcp] < host[qry+lcp])) {\n\t\t\t// Query is greater than sa 
elt\n\t\t\tl = m;                 // update left bound\n\t\t\tlLcp = max(lLcp, lcp); // update left lcp\n\t\t}\n\t\telse if((fell && suf+lcp == hostLen) || (!fell && host[suf+lcp] > host[qry+lcp])) {\n\t\t\t// Query is less than sa elt\n\t\t\tr = m;                 // update right bound\n\t\t\trLcp = max(rLcp, lcp); // update right lcp\n\t\t} else {\n\t\t\tassert(false); // Must be one or the other!\n\t\t}\n\t}\n\t// Shouldn't get here\n\tassert(false);\n\treturn std::numeric_limits<TIndexOffU>::max();\n}\n\n#endif /*BINARY_SA_SEARCH_H_*/\n"
  },
  {
    "path": "bitpack.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef BITPACK_H_\n#define BITPACK_H_\n\n#include <stdint.h>\n#include \"assert_helpers.h\"\n\n/**\n * Routines for marshalling 2-bit values into and out of 8-bit or\n * 32-bit hosts\n */\n\nstatic inline void pack_2b_in_8b(const int two, uint8_t& eight, const int off) {\n\tassert_lt(two, 4);\n\tassert_lt(off, 4);\n\teight |= (two << (off*2));\n}\n\nstatic inline int unpack_2b_from_8b(const uint8_t eight, const int off) {\n\tassert_lt(off, 4);\n\treturn ((eight >> (off*2)) & 0x3);\n}\n\nstatic inline void pack_2b_in_32b(const int two, uint32_t& thirty2, const int off) {\n\tassert_lt(two, 4);\n\tassert_lt(off, 16);\n\tthirty2 |= (two << (off*2));\n}\n\nstatic inline int unpack_2b_from_32b(const uint32_t thirty2, const int off) {\n\tassert_lt(off, 16);\n\treturn ((thirty2 >> (off*2)) & 0x3);\n}\n\n#endif /*BITPACK_H_*/\n"
  },
  {
    "path": "blockwise_sa.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef BLOCKWISE_SA_H_\n#define BLOCKWISE_SA_H_\n\n#include <stdint.h>\n#include <stdlib.h>\n#include <iostream>\n#include <sstream>\n#include <stdexcept>\n#include \"assert_helpers.h\"\n#include \"diff_sample.h\"\n#include \"multikey_qsort.h\"\n#include \"random_source.h\"\n#include \"binary_sa_search.h\"\n#include \"zbox.h\"\n#include \"alphabet.h\"\n#include \"timer.h\"\n#include \"ds.h\"\n#include \"mem_ids.h\"\n#include \"word_io.h\"\n\nusing namespace std;\n\n// Helpers for printing verbose messages\n\n#ifndef VMSG_NL\n#define VMSG_NL(...) \\\nif(this->verbose()) { \\\n\tstringstream tmp; \\\n\ttmp << __VA_ARGS__ << endl; \\\n\tthis->verbose(tmp.str()); \\\n}\n#endif\n\n#ifndef VMSG\n#define VMSG(...) 
\\\nif(this->verbose()) { \\\n\tstringstream tmp; \\\n\ttmp << __VA_ARGS__; \\\n\tthis->verbose(tmp.str()); \\\n}\n#endif\n\n/**\n * Abstract parent class for blockwise suffix-array building schemes.\n */\ntemplate<typename TStr>\nclass BlockwiseSA {\npublic:\n\tBlockwiseSA(const TStr& __text,\n\t            TIndexOffU __bucketSz,\n                int  __nthreads = 1,\n\t            bool __sanityCheck = false,\n\t            bool __passMemExc = false,\n\t            bool __verbose = false,\n\t            ostream& __logger = cout) :\n\t_text(__text),\n\t_bucketSz(max<TIndexOffU>(__bucketSz, 2u)),\n    _nthreads(__nthreads),\n\t_sanityCheck(__sanityCheck),\n\t_passMemExc(__passMemExc),\n\t_verbose(__verbose),\n\t_itrBucket(EBWTB_CAT),\n    _itrBucketIdx(0),\n\t_itrBucketPos(OFF_MASK),\n\t_itrPushedBackSuffix(OFF_MASK),\n\t_logger(__logger)\n\t{\n    }\n\n\tvirtual ~BlockwiseSA() { }\n\n\t/**\n\t * Get the next suffix; compute the next bucket if necessary.\n\t */\n    virtual TIndexOffU nextSuffix() = 0;\n\n\t/**\n\t * Return true iff the next call to nextSuffix will succeed.\n\t */\n\tbool hasMoreSuffixes() {\n\t\tif(_itrPushedBackSuffix != OFF_MASK) return true;\n\t\ttry {\n\t\t\t_itrPushedBackSuffix = nextSuffix();\n\t\t} catch(out_of_range& e) {\n\t\t\tassert_eq(OFF_MASK, _itrPushedBackSuffix);\n\t\t\treturn false;\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Reset the suffix iterator so that the next call to nextSuffix()\n\t * returns the lexicographically-first suffix.\n\t */\n\tvoid resetSuffixItr() {\n\t\t_itrBucket.clear();\n        _itrBucketIdx = 0;\n\t\t_itrBucketPos = OFF_MASK;\n\t\t_itrPushedBackSuffix = OFF_MASK;\n\t\treset();\n\t\tassert(suffixItrIsReset());\n\t}\n\n\t/**\n\t * Returns true iff the next call to nextSuffix() returns the\n\t * lexicographically-first suffix.\n\t */\n\tbool suffixItrIsReset() {\n\t\treturn _itrBucketIdx                       == 0 &&\n               _itrBucket.size()                   == 0 &&\n\t\t       _itrBucketPos      
                 == OFF_MASK &&\n\t\t       _itrPushedBackSuffix                == OFF_MASK &&\n\t\t       isReset();\n\t}\n\n\tconst TStr& text()  const { return _text; }\n\tTIndexOffU bucketSz() const { return _bucketSz; }\n\tbool sanityCheck()  const { return _sanityCheck; }\n\tbool verbose()      const { return _verbose; }\n\tostream& log()      const { return _logger; }\n\tsize_t size()       const { return _text.length()+1; }\n\nprotected:\n\t/// Reset back to the first block\n\tvirtual void reset() = 0;\n\t/// Return true iff reset to the first block\n\tvirtual bool isReset() = 0;\n\n\t/**\n\t * Grab the next block of sorted suffixes.  The block is guaranteed\n\t * to have at most _bucketSz elements.\n\t */\n\tvirtual void nextBlock(int cur_block, int tid = 0) = 0;\n\t/// Return true iff more blocks are available\n\tvirtual bool hasMoreBlocks() const = 0;\n\t/// Optionally output a verbose message\n\tvoid verbose(const string& s) const {\n\t\tif(this->verbose()) {\n\t\t\tthis->log() << s.c_str();\n\t\t\tthis->log().flush();\n\t\t}\n\t}\n\n\tconst TStr&        _text;        /// original string\n\tconst TIndexOffU   _bucketSz;    /// target maximum bucket size\n    const int          _nthreads;    /// number of threads\n\tconst bool         _sanityCheck; /// whether to perform sanity checks\n\tconst bool         _passMemExc;  /// true -> pass on memory exceptions\n\tconst bool         _verbose;     /// be talkative\n\tEList<TIndexOffU>  _itrBucket;   /// current bucket\n    TIndexOffU         _itrBucketIdx;\n\tTIndexOffU         _itrBucketPos;/// offset into current bucket\n\tTIndexOffU         _itrPushedBackSuffix; /// temporary slot for lookahead\n\tostream&           _logger;      /// write log messages here\n};\n\n/**\n * Abstract parent class for a blockwise suffix array builder that\n * always doles out blocks in lexicographical order.\n */\ntemplate<typename TStr>\nclass InorderBlockwiseSA : public BlockwiseSA<TStr> 
{\npublic:\n\tInorderBlockwiseSA(const TStr& __text,\n\t                   TIndexOffU __bucketSz,\n                       int  __nthreads = 1,\n\t                   bool __sanityCheck = false,\n\t   \t               bool __passMemExc = false,\n\t                   bool __verbose = false,\n\t                   ostream& __logger = cout) :\n\tBlockwiseSA<TStr>(__text, __bucketSz, __nthreads, __sanityCheck, __passMemExc, __verbose, __logger)\n\t{}\n};\n\n/**\n * Build the SA a block at a time according to the scheme outlined in\n * Karkkainen's \"Fast BWT\" paper.\n */\ntemplate<typename TStr>\nclass KarkkainenBlockwiseSA : public InorderBlockwiseSA<TStr> {\npublic:\n\ttypedef DifferenceCoverSample<TStr> TDC;\n\n\tKarkkainenBlockwiseSA(const TStr& __text,\n\t                      TIndexOffU __bucketSz,\n                          int      __nthreads,\n\t                      uint32_t __dcV,\n\t                      uint32_t __seed = 0,\n\t      \t              bool __sanityCheck = false,\n\t   \t                  bool __passMemExc = false,\n\t      \t              bool __verbose = false,\n                          string base_fname = \"\",\n\t      \t              ostream& __logger = cout) :\n\tInorderBlockwiseSA<TStr>(__text, __bucketSz, __nthreads, __sanityCheck, __passMemExc, __verbose, __logger),\n\t_sampleSuffs(EBWTB_CAT), _cur(0), _dcV(__dcV), _dc(EBWTB_CAT), _built(false), _base_fname(base_fname), _bigEndian(currentlyBigEndian())\n\t{ _randomSrc.init(__seed); reset(); }\n\n\t~KarkkainenBlockwiseSA()\n    {\n        if(_threads.size() > 0) {\n            for (size_t tid = 0; tid < _threads.size(); tid++) {\n                _threads[tid]->join();\n                delete _threads[tid];\n            }\n        }\n    }\n\n\t/**\n\t * Allocate an amount of memory that simulates the peak memory\n\t * usage of the DifferenceCoverSample with the given text and v.\n\t * Throws bad_alloc if it's not going to fit in memory.  
Returns\n\t * the approximate number of bytes the Cover takes at all times.\n\t */\n\tstatic size_t simulateAllocs(const TStr& text, TIndexOffU bucketSz) {\n\t\tsize_t len = text.length();\n\t\t// _sampleSuffs and _itrBucket are in memory at the peak\n\t\tsize_t bsz = bucketSz;\n\t\tsize_t sssz = len / max<TIndexOffU>(bucketSz-1, 1);\n\t\tAutoArray<TIndexOffU> tmp(bsz + sssz + (1024 * 1024 /*out of caution*/), EBWT_CAT);\n\t\treturn bsz;\n\t}\n    \n    static void nextBlock_Worker(void *vp) {\n        pair<KarkkainenBlockwiseSA*, int> param = *(pair<KarkkainenBlockwiseSA*, int>*)vp;\n        KarkkainenBlockwiseSA* sa = param.first;\n        int tid = param.second;\n        while(true) {\n            size_t cur = 0;\n            {\n                ThreadSafe ts(&sa->_mutex, sa->_nthreads > 1);\n                cur = sa->_cur;\n                if(cur > sa->_sampleSuffs.size()) break;\n                sa->_cur++;\n            }\n            sa->nextBlock((int)cur, tid);\n            // Write suffixes into a file\n            std::ostringstream number; number << cur;\n            const string fname = sa->_base_fname + \".\" + number.str() + \".sa\";\n            ofstream sa_file(fname.c_str(), ios::binary);\n            if(!sa_file.good()) {\n                cerr << \"Could not open file for writing a reference graph: \\\"\" << fname << \"\\\"\" << endl;\n                throw 1;\n            }\n            const EList<TIndexOffU>& bucket = sa->_itrBuckets[tid];\n            writeIndex<TIndexOffU>(sa_file, bucket.size(), sa->_bigEndian);\n            for(size_t i = 0; i < bucket.size(); i++) {\n                writeIndex<TIndexOffU>(sa_file, bucket[i], sa->_bigEndian);\n            }\n            sa_file.close();\n            sa->_itrBuckets[tid].clear();\n            sa->_done[cur] = true;\n        }\n    }\n    \n    /**\n     * Get the next suffix; compute the next bucket if necessary.\n     */\n    virtual TIndexOffU nextSuffix() {\n        // Launch threads if 
not\n        if(this->_nthreads > 1) {\n            if(_threads.size() == 0) {\n                _done.resize(_sampleSuffs.size() + 1);\n                _done.fill(false);\n                _itrBuckets.resize(this->_nthreads);\n                for(int tid = 0; tid < this->_nthreads; tid++) {\n                    _tparams.expand();\n                    _tparams.back().first = this;\n                    _tparams.back().second = tid;\n                    _threads.push_back(new tthread::thread(nextBlock_Worker, (void*)&_tparams.back()));\n                }\n                assert_eq(_threads.size(), (size_t)this->_nthreads);\n            }\n        }\n        if(this->_itrPushedBackSuffix != OFF_MASK) {\n            TIndexOffU tmp = this->_itrPushedBackSuffix;\n            this->_itrPushedBackSuffix = OFF_MASK;\n            return tmp;\n        }\n        while(this->_itrBucketPos >= this->_itrBucket.size() ||\n              this->_itrBucket.size() == 0)\n        {\n            if(!hasMoreBlocks()) {\n                throw out_of_range(\"No more suffixes\");\n            }\n            if(this->_nthreads == 1) {\n                nextBlock((int)_cur);\n                _cur++;\n            } else {\n                while(!_done[this->_itrBucketIdx]) {\n#if defined(_TTHREAD_WIN32_)\n                    Sleep(1);\n#elif defined(_TTHREAD_POSIX_)\n                    const static timespec ts = {0, 1000000};  // 1 millisecond\n                    nanosleep(&ts, NULL);\n#endif\n                }\n                // Read suffixes from a file\n                std::ostringstream number; number << this->_itrBucketIdx;\n                const string fname = _base_fname + \".\" + number.str() + \".sa\";\n                ifstream sa_file(fname.c_str(), ios::binary);\n                if(!sa_file.good()) {\n                    cerr << \"Could not open file for reading a reference graph: \\\"\" << fname << \"\\\"\" << endl;\n                    throw 1;\n                }\n                
size_t numSAs = readIndex<TIndexOffU>(sa_file, _bigEndian);\n                this->_itrBucket.resizeExact(numSAs);\n                for(size_t i = 0; i < numSAs; i++) {\n                    this->_itrBucket[i] = readIndex<TIndexOffU>(sa_file, _bigEndian);\n                }\n                sa_file.close();\n                std::remove(fname.c_str());\n            }\n            this->_itrBucketIdx++;\n            this->_itrBucketPos = 0;\n        }\n        return this->_itrBucket[this->_itrBucketPos++];\n    }\n\n\t/// Defined in blockwise_sa.cpp\n\tvirtual void nextBlock(int cur_block, int tid = 0);\n\n\t/// Defined in blockwise_sa.cpp\n\tvirtual void qsort(EList<TIndexOffU>& bucket);\n\n\t/// Return true iff more blocks are available\n\tvirtual bool hasMoreBlocks() const {\n        return this->_itrBucketIdx <= _sampleSuffs.size();\n\t}\n\n\t/// Return the difference-cover period\n\tuint32_t dcV() const { return _dcV; }\n\nprotected:\n\n\t/**\n\t * Initialize the state of the blockwise suffix sort.  If the\n\t * difference cover sample and the sample set have not yet been\n\t * built, build them.  
Then reset the block cursor to point to\n\t * the first block.\n\t */\n\tvirtual void reset() {\n\t\tif(!_built) {\n\t\t\tbuild();\n\t\t}\n\t\tassert(_built);\n\t\t_cur = 0;\n\t}\n\n\t/// Return true iff we're about to dole out the first bucket\n\tvirtual bool isReset() {\n\t\treturn _cur == 0;\n\t}\n\nprivate:\n\n\t/**\n\t * Calculate the difference-cover sample and sample suffixes.\n\t */\n\tvoid build() {\n\t\t// Calculate difference-cover sample\n\t\tassert(_dc.get() == NULL);\n\t\tif(_dcV != 0) {\n\t\t\t_dc.init(new TDC(this->text(), _dcV, this->verbose(), this->sanityCheck()));\n\t\t\t_dc.get()->build(this->_nthreads);\n\t\t}\n\t\t// Calculate sample suffixes\n\t\tif(this->bucketSz() <= this->text().length()) {\n\t\t\tVMSG_NL(\"Building samples\");\n\t\t\tbuildSamples();\n\t\t} else {\n\t\t\tVMSG_NL(\"Skipping building samples since text length \" <<\n\t\t\t        this->text().length() << \" is less than bucket size: \" <<\n\t\t\t        this->bucketSz());\n\t\t}\n\t\t_built = true;\n\t}\n\n\t/**\n\t * Calculate the lcp between two suffixes using the difference\n\t * cover as a tie-breaker.  
If the tie-breaker is employed, then\n\t * the calculated lcp may be an underestimate.\n\t *\n\t * Defined in blockwise_sa.cpp\n\t */\n\tinline bool tieBreakingLcp(TIndexOffU aOff,\n\t                           TIndexOffU bOff,\n\t                           TIndexOffU& lcp,\n\t                           bool& lcpIsSoft);\n\n\t/**\n\t * Compare two suffixes using the difference-cover sample.\n\t */\n\tinline bool suffixCmp(TIndexOffU cmp,\n\t                      TIndexOffU i,\n\t                      int64_t& j,\n\t                      int64_t& k,\n\t                      bool& kSoft,\n\t                      const EList<TIndexOffU>& z);\n\n\tvoid buildSamples();\n\n\tEList<TIndexOffU>  _sampleSuffs; /// sample suffixes\n\tTIndexOffU         _cur;         /// offset to 1st elt of next block\n\tconst uint32_t     _dcV;         /// difference-cover periodicity\n\tPtrWrap<TDC>       _dc;          /// queryable difference-cover data\n\tbool               _built;       /// whether samples/DC have been built\n\tRandomSource       _randomSrc;   /// source of pseudo-randoms\n    \n\n    MUTEX_T                 _mutex;       /// synchronization of output message\n    string                  _base_fname;  /// base file name for storing SA blocks\n    bool                    _bigEndian;   /// bigEndian?\n    EList<tthread::thread*> _threads;     /// thread list\n    EList<pair<KarkkainenBlockwiseSA*, int> > _tparams;\n    ELList<TIndexOffU>      _itrBuckets;  /// buckets\n    EList<bool>             _done;        /// is a block processed?\n};\n\n/**\n * Qsort the set of suffixes whose offsets are in 'bucket'.\n */\ntemplate<typename TStr>\ninline void KarkkainenBlockwiseSA<TStr>::qsort(EList<TIndexOffU>& bucket) {\n\tconst TStr& t = this->text();\n\tTIndexOffU *s = bucket.ptr();\n\tsize_t slen = bucket.size();\n\tTIndexOffU len = (TIndexOffU)t.length();\n\tif(_dc.get() != NULL) {\n\t\t// Use the difference cover as a tie-breaker if we have it\n\t\tVMSG_NL(\"  
(Using difference cover)\");\n\t\t// Extract the 'host' array because it's faster to work\n\t\t// with than the EList<> container\n\t\tconst uint8_t *host = (const uint8_t *)t.buf();\n\t\tassert(_dc.get() != NULL);\n\t\tmkeyQSortSufDcU8(t, host, len, s, slen, *_dc.get(), 4,\n\t\t                 this->verbose(), this->sanityCheck());\n\t} else {\n\t\tVMSG_NL(\"  (Not using difference cover)\");\n\t\t// We don't have a difference cover - just do a normal\n\t\t// suffix sort\n\t\tmkeyQSortSuf(t, s, slen, 4,\n\t\t             this->verbose(), this->sanityCheck());\n\t}\n}\n\n/**\n * Qsort the set of suffixes whose offsets are in 'bucket'.  This\n * specialization for packed strings does not attempt to extract and\n * operate directly on the host string; the fact that the string is\n * packed means that the array cannot be sorted directly.\n */\ntemplate<>\ninline void KarkkainenBlockwiseSA<S2bDnaString>::qsort(\n\tEList<TIndexOffU>& bucket)\n{\n\tconst S2bDnaString& t = this->text();\n\tTIndexOffU *s = bucket.ptr();\n\tsize_t slen = bucket.size();\n\tsize_t len = t.length();\n\tif(_dc.get() != NULL) {\n\t\t// Use the difference cover as a tie-breaker if we have it\n\t\tVMSG_NL(\"  (Using difference cover)\");\n\t\t// Can't use the text's 'host' array because the backing\n\t\t// store for the packed string is not one-char-per-elt.\n\t\tmkeyQSortSufDcU8(t, t, len, s, slen, *_dc.get(), 4,\n\t\t                 this->verbose(), this->sanityCheck());\n\t} else {\n\t\tVMSG_NL(\"  (Not using difference cover)\");\n\t\t// We don't have a difference cover - just do a normal\n\t\t// suffix sort\n\t\tmkeyQSortSuf(t, s, slen, 4,\n\t\t             this->verbose(), this->sanityCheck());\n\t}\n}\n\ntemplate<typename TStr>\nstruct BinarySortingParam {\n    const TStr*              t;\n    const EList<TIndexOffU>* sampleSuffs;\n    EList<TIndexOffU>        bucketSzs;\n    EList<TIndexOffU>        bucketReps;\n    size_t                   begin;\n    size_t                   
end;\n};\n\ntemplate<typename TStr>\nstatic void BinarySorting_worker(void *vp)\n{\n    BinarySortingParam<TStr>* param = (BinarySortingParam<TStr>*)vp;\n    const TStr& t = *(param->t);\n    size_t len = t.length();\n    const EList<TIndexOffU>& sampleSuffs = *(param->sampleSuffs);\n    EList<TIndexOffU>& bucketSzs = param->bucketSzs;\n    EList<TIndexOffU>& bucketReps = param->bucketReps;\n    ASSERT_ONLY(size_t numBuckets = bucketSzs.size());\n    size_t begin = param->begin;\n    size_t end = param->end;\n    // Iterate through every suffix in this worker's range\n    // [begin, end), determine which bucket it falls into by doing\n    // a binary search across the sorted list of samples, and\n    // increment a counter associated with that bucket.  Also, keep\n    // one representative for each bucket so that we can split it\n    // later.  (This step can take a long time.)\n    for(TIndexOffU i = begin; i < end && i < len; i++) {\n        TIndexOffU r = binarySASearch(t, i, sampleSuffs);\n        if(r == std::numeric_limits<TIndexOffU>::max()) continue; // r was one of the samples\n        assert_lt(r, numBuckets);\n        bucketSzs[r]++;\n        assert_lt(bucketSzs[r], len);\n        if(bucketReps[r] == OFF_MASK || (i & 100) == 0) {\n            bucketReps[r] = i; // clobbers previous one, but that's OK\n        }\n    }\n}\n\n/**\n * Select a set of bucket-delineating sample suffixes such that no\n * bucket is greater than the requested upper limit.  
Some care is\n * taken to make each bucket's size close to the limit without\n * going over.\n */\ntemplate<typename TStr>\nvoid KarkkainenBlockwiseSA<TStr>::buildSamples() {\n\tconst TStr& t = this->text();\n    TIndexOffU bsz = this->bucketSz()-1; // subtract 1 to leave room for sample\n\tsize_t len = this->text().length();\n\t// Prepare _sampleSuffs array\n\t_sampleSuffs.clear();\n\tTIndexOffU numSamples = (TIndexOffU)((len/bsz)+1)<<1; // ~len/bsz x 2\n\tassert_gt(numSamples, 0);\n\tVMSG_NL(\"Reserving space for \" << numSamples << \" sample suffixes\");\n\tif(this->_passMemExc) {\n\t\t_sampleSuffs.resizeExact(numSamples);\n\t\t// Randomly generate samples.  Allow duplicates for now.\n\t\tVMSG_NL(\"Generating random suffixes\");\n\t\tfor(size_t i = 0; i < numSamples; i++) {\n#ifdef BOWTIE_64BIT_INDEX         \n\t\t\t_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU64() % len); \n#else\n\t\t\t_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU32() % len); \n#endif\n\t\t}\n\t} else {\n\t\ttry {\n\t\t\t_sampleSuffs.resizeExact(numSamples);\n\t\t\t// Randomly generate samples.  
Allow duplicates for now.\n\t\t\tVMSG_NL(\"Generating random suffixes\");\n\t\t\tfor(size_t i = 0; i < numSamples; i++) {\n#ifdef BOWTIE_64BIT_INDEX\n\t\t\t\t_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU64() % len); \n#else\n\t\t\t\t_sampleSuffs[i] = (TIndexOffU)(_randomSrc.nextU32() % len); \n#endif                \n\t\t\t}\n\t\t} catch(bad_alloc &e) {\n\t\t\tif(this->_passMemExc) {\n\t\t\t\tthrow e; // rethrow immediately\n\t\t\t} else {\n\t\t\t\tcerr << \"Could not allocate sample suffix container of \" << (numSamples * OFF_SIZE) << \" bytes.\" << endl\n\t\t\t\t     << \"Please try using a smaller number of blocks by specifying a larger --bmax or\" << endl\n\t\t\t\t     << \"a smaller --bmaxdivn\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\t}\n\t// Remove duplicates; very important to do this before the call to\n\t// mkeyQSortSuf so that it doesn't try to calculate lexicographical\n\t// relationships between very long, identical strings, which takes\n\t// an extremely long time in general, and causes the stack to grow\n\t// linearly with the size of the input\n\t{\n\t\tTimer timer(cout, \"QSorting sample offsets, eliminating duplicates time: \", this->verbose());\n\t\tVMSG_NL(\"QSorting \" << _sampleSuffs.size() << \" sample offsets, eliminating duplicates\");\n\t\t_sampleSuffs.sort();\n\t\tsize_t sslen = _sampleSuffs.size();\n\t\tfor(size_t i = 0; i < sslen-1; i++) {\n\t\t\tif(_sampleSuffs[i] == _sampleSuffs[i+1]) {\n\t\t\t\t_sampleSuffs.erase(i--);\n\t\t\t\tsslen--;\n\t\t\t}\n\t\t}\n\t}\n\t// Multikey quicksort the samples\n\t{\n\t\tTimer timer(cout, \"  Multikey QSorting samples time: \", this->verbose());\n\t\tVMSG_NL(\"Multikey QSorting \" << _sampleSuffs.size() << \" samples\");\n\t\tthis->qsort(_sampleSuffs);\n\t}\n\t// Calculate bucket sizes\n\tVMSG_NL(\"Calculating bucket sizes\");\n\tint limit = 5;\n\t// Iterate until no bucket needs to be split (i.e. none exceeds\n\t// bsz), or until the iteration limit is reached\n\twhile(--limit >= 0) {\n        TIndexOffU numBuckets = (TIndexOffU)_sampleSuffs.size()+1;\n        
AutoArray<tthread::thread*> threads(this->_nthreads);\n        EList<BinarySortingParam<TStr> > tparams;\n        for(int tid = 0; tid < this->_nthreads; tid++) {\n            // Calculate bucket sizes by doing a binary search for each\n            // suffix and noting where it lands\n            tparams.expand();\n            try {\n                // Allocate and initialize containers for holding bucket\n                // sizes and representatives.\n                tparams.back().bucketSzs.resizeExact(numBuckets);\n                tparams.back().bucketReps.resizeExact(numBuckets);\n                tparams.back().bucketSzs.fillZero();\n                tparams.back().bucketReps.fill(OFF_MASK);\n            } catch(bad_alloc &e) {\n                if(this->_passMemExc) {\n                    throw e; // rethrow immediately\n                } else {\n                    cerr << \"Could not allocate sizes, representatives (\" << ((numBuckets*8)>>10) << \" KB) for blocks.\" << endl\n                    << \"Please try using a smaller number of blocks by specifying a larger --bmax or a\" << endl\n                    << \"smaller --bmaxdivn.\" << endl;\n                    throw 1;\n                }\n            }\n            tparams.back().t = &t;\n            tparams.back().sampleSuffs = &_sampleSuffs;\n            tparams.back().begin = (tid == 0 ? 0 : len / this->_nthreads * tid);\n            tparams.back().end = (tid + 1 == this->_nthreads ? 
len : len / this->_nthreads * (tid + 1));\n            if(this->_nthreads == 1) {\n                BinarySorting_worker<TStr>((void*)&tparams.back());\n            } else {\n                threads[tid] = new tthread::thread(BinarySorting_worker<TStr>, (void*)&tparams.back());\n            }\n        }\n        \n        if(this->_nthreads > 1) {\n            for (int tid = 0; tid < this->_nthreads; tid++) {\n                threads[tid]->join();\n            }\n        }\n        \n        EList<TIndexOffU>& bucketSzs = tparams[0].bucketSzs;\n        EList<TIndexOffU>& bucketReps = tparams[0].bucketReps;\n        for(int tid = 1; tid < this->_nthreads; tid++) {\n            for(size_t j = 0; j < numBuckets; j++) {\n                bucketSzs[j] += tparams[tid].bucketSzs[j];\n                if(bucketReps[j] == OFF_MASK) {\n                    bucketReps[j] = tparams[tid].bucketReps[j];\n                }\n            }\n        }\n\t\t// Check for large buckets and mergeable pairs of small buckets\n\t\t// and split/merge as necessary\n\t\tTIndexOff added = 0;\n\t\tTIndexOff merged = 0;\n\t\tassert_eq(bucketSzs.size(), numBuckets);\n\t\tassert_eq(bucketReps.size(), numBuckets);\n\t\t{\n\t\t\tTimer timer(cout, \"  Splitting and merging time: \", this->verbose());\n\t\t\tVMSG_NL(\"Splitting and merging\");\n\t\t\tfor(TIndexOffU i = 0; i < numBuckets; i++) {\n\t\t\t\tTIndexOffU mergedSz = bsz + 1;\n\t\t\t\tassert(bucketSzs[(size_t)i] == 0 || bucketReps[(size_t)i] != OFF_MASK);\n\t\t\t\tif(i < numBuckets-1) {\n\t\t\t\t\tmergedSz = bucketSzs[(size_t)i] + bucketSzs[(size_t)i+1] + 1;\n\t\t\t\t}\n\t\t\t\t// Merge?\n\t\t\t\tif(mergedSz <= bsz) {\n\t\t\t\t\tbucketSzs[(size_t)i+1] += (bucketSzs[(size_t)i]+1);\n\t\t\t\t\t// The following may look strange, but it's necessary\n\t\t\t\t\t// to ensure that the merged bucket has a representative\n\t\t\t\t\tbucketReps[(size_t)i+1] = 
_sampleSuffs[(size_t)i+added];\n\t\t\t\t\t_sampleSuffs.erase((size_t)i+added);\n\t\t\t\t\tbucketSzs.erase((size_t)i);\n\t\t\t\t\tbucketReps.erase((size_t)i);\n\t\t\t\t\ti--; // might go to -1 but ++ will overflow back to 0\n\t\t\t\t\tnumBuckets--;\n\t\t\t\t\tmerged++;\n\t\t\t\t\tassert_eq(numBuckets, _sampleSuffs.size()+1-added);\n\t\t\t\t\tassert_eq(numBuckets, bucketSzs.size());\n\t\t\t\t}\n\t\t\t\t// Split?\n\t\t\t\telse if(bucketSzs[(size_t)i] > bsz) {\n\t\t\t\t\t// Add an additional sample from the bucketReps[]\n\t\t\t\t\t// set accumulated in the binarySASearch loop; this\n\t\t\t\t\t// effectively splits the bucket\n\t\t\t\t\t_sampleSuffs.insert(bucketReps[(size_t)i], (TIndexOffU)(i + (added++)));\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(added == 0) {\n\t\t\t//if(this->verbose()) {\n\t\t\t//\tcout << \"Final bucket sizes:\" << endl;\n\t\t\t//\tcout << \"  (begin): \" << bucketSzs[0] << \" (\" << (int)(bsz - bucketSzs[0]) << \")\" << endl;\n\t\t\t//\tfor(uint32_t i = 1; i < numBuckets; i++) {\n\t\t\t//\t\tcout << \"  \" << bucketSzs[i] << \" (\" << (int)(bsz - bucketSzs[i]) << \")\" << endl;\n\t\t\t//\t}\n\t\t\t//}\n\t\t\tbreak;\n\t\t}\n\t\t// Otherwise, continue until no more buckets need to be\n\t\t// split\n\t\tVMSG_NL(\"Split \" << added << \", merged \" << merged << \"; iterating...\");\n\t}\n\t// Do *not* force a do-over\n//\tif(limit == 0) {\n//\t\tVMSG_NL(\"Iterated too many times; trying again...\");\n//\t\tbuildSamples();\n//\t}\n\tVMSG_NL(\"Avg bucket size: \" << ((double)(len-_sampleSuffs.size()) / (_sampleSuffs.size()+1)) << \" (target: \" << bsz << \")\");\n}\n\n/**\n * Do a simple LCP calculation on two strings.\n */\ntemplate<typename T> inline\nstatic TIndexOffU suffixLcp(const T& t, TIndexOffU aOff, TIndexOffU bOff) {\n\tTIndexOffU c = 0;\n\tsize_t len = t.length();\n\tassert_leq(aOff, len);\n\tassert_leq(bOff, len);\n\twhile(aOff + c < len && bOff + c < len && t[aOff + c] == t[bOff + c]) c++;\n\treturn c;\n}\n\n/**\n * Calculate the lcp between 
two suffixes using the difference\n * cover as a tie-breaker.  If the tie-breaker is employed, then\n * the calculated lcp may be an underestimate.  If the tie-breaker is\n * employed, lcpIsSoft will be set to true (otherwise, false).\n */\ntemplate<typename TStr> inline\nbool KarkkainenBlockwiseSA<TStr>::tieBreakingLcp(TIndexOffU aOff,\n                                                 TIndexOffU bOff,\n                                                 TIndexOffU& lcp,\n                                                 bool& lcpIsSoft)\n{\n\tconst TStr& t = this->text();\n\tTIndexOffU c = 0;\n\tTIndexOffU tlen = (TIndexOffU)t.length();\n\tassert_leq(aOff, tlen);\n\tassert_leq(bOff, tlen);\n\tassert(_dc.get() != NULL);\n\tuint32_t dcDist = _dc.get()->tieBreakOff(aOff, bOff);\n\tlcpIsSoft = false; // hard until proven soft\n\twhile(c < dcDist &&    // we haven't hit the tie breaker\n\t      c < tlen-aOff && // we haven't fallen off of LHS suffix\n\t      c < tlen-bOff && // we haven't fallen off of RHS suffix\n\t      t[aOff+c] == t[bOff+c]) // we haven't hit a mismatch\n\t\tc++;\n\tlcp = c;\n\tif(c == tlen-aOff) {\n\t\t// Fell off LHS (a), a is greater\n\t\treturn false;\n\t} else if(c == tlen-bOff) {\n\t\t// Fell off RHS (b), b is greater\n\t\treturn true;\n\t} else if(c == dcDist) {\n\t\t// Hit a tie-breaker element\n\t\tlcpIsSoft = true;\n\t\tassert_neq(dcDist, 0xffffffff);\n\t\treturn _dc.get()->breakTie(aOff+c, bOff+c) < 0;\n\t} else {\n\t\tassert_neq(t[aOff+c], t[bOff+c]);\n\t\treturn t[aOff+c] < t[bOff+c];\n\t}\n}\n\n/**\n * Lookup a suffix LCP in the given z array; if the element is not\n * filled in then calculate it from scratch.\n */\ntemplate<typename T>\nstatic TIndexOffU lookupSuffixZ(\n\tconst T& t,\n\tTIndexOffU zOff,\n\tTIndexOffU off,\n\tconst EList<TIndexOffU>& z)\n{\n\tif(zOff < z.size()) {\n\t\tTIndexOffU ret = z[zOff];\n\t\tassert_eq(ret, suffixLcp(t, off + zOff, off));\n\t\treturn ret;\n\t}\n\tassert_leq(off + zOff, t.length());\n\treturn 
suffixLcp(t, off + zOff, off);\n}\n\n/**\n * true -> i < cmp\n * false -> i > cmp\n */\ntemplate<typename TStr> inline\nbool KarkkainenBlockwiseSA<TStr>::suffixCmp(\n\tTIndexOffU cmp,\n\tTIndexOffU i,\n\tint64_t& j,\n\tint64_t& k,\n\tbool& kSoft,\n\tconst EList<TIndexOffU>& z)\n{\n\tconst TStr& t = this->text();\n\tTIndexOffU len = (TIndexOffU)t.length();\n\t// i is not covered by any previous match\n\tTIndexOffU l;\n\tif((int64_t)i > k) {\n\t\tk = i; // so that i + lHi == kHi\n\t\tl = 0; // erase any previous l\n\t\tkSoft = false;\n\t\t// To be extended\n\t}\n\t// i is covered by a previous match\n\telse /* i <= k */ {\n\t\tassert_gt((int64_t)i, j);\n\t\tTIndexOffU zIdx = (TIndexOffU)(i-j);\n\t\tassert_leq(zIdx, len-cmp);\n\t\tif(zIdx < _dcV || _dc.get() == NULL) {\n\t\t\t// Go as far as the Z-box says\n\t\t\tl = lookupSuffixZ(t, zIdx, cmp, z);\n\t\t\tif(i + l > len) {\n\t\t\t\tl = len-i;\n\t\t\t}\n\t\t\tassert_leq(i + l, len);\n\t\t\t// Possibly to be extended\n\t\t} else {\n\t\t\t// But we're past the point of no-more-Z-boxes\n\t\t\tbool ret = tieBreakingLcp(i, cmp, l, kSoft);\n\t\t\t// Sanity-check tie-breaker\n\t\t\tif(this->sanityCheck()) {\n\t\t\t\tif(ret) assert(sstr_suf_lt(t, i, t, cmp, false));\n\t\t\t\telse    assert(sstr_suf_gt(t, i, t, cmp, false));\n\t\t\t}\n\t\t\tj = i;\n\t\t\tk = i + l;\n\t\t\tif(this->sanityCheck()) {\n\t\t\t\tif(kSoft) { assert_leq(l, suffixLcp(t, i, cmp)); }\n\t\t\t\telse      { assert_eq (l, suffixLcp(t, i, cmp)); }\n\t\t\t}\n\t\t\treturn ret;\n\t\t}\n\t}\n\n\t// Z box extends exactly as far as previous match (or there\n\t// is neither a Z box nor a previous match)\n\tif((int64_t)(i + l) == k) {\n\t\t// Extend\n\t\twhile(l < len-cmp && k < (int64_t)len && t[(size_t)(cmp+l)] == t[(size_t)k]) {\n\t\t\tk++; l++;\n\t\t}\n\t\tj = i; // update furthest-extending LHS\n\t\tkSoft = false;\n\t\tassert_eq(l, suffixLcp(t, i, cmp));\n\t}\n\t// Z box extends further than previous match\n\telse if((int64_t)(i + l) > k) {\n\t\tl = 
(TIndexOffU)(k - i); // point to just after previous match\n\t\tj = i; // update furthest-extending LHS\n\t\tif(kSoft) {\n\t\t\twhile(l < len-cmp && k < (int64_t)len && t[(size_t)(cmp+l)] == t[(size_t)k]) {\n\t\t\t\tk++; l++;\n\t\t\t}\n\t\t\tkSoft = false;\n\t\t\tassert_eq(l, suffixLcp(t, i, cmp));\n\t\t} else assert_eq(l, suffixLcp(t, i, cmp));\n\t}\n\n\t// Check that calculated lcp matches actual lcp\n\tif(this->sanityCheck()) {\n\t\tif(!kSoft) {\n\t\t\t// l should exactly match lcp\n\t\t\tassert_eq(l, suffixLcp(t, i, cmp));\n\t\t} else {\n\t\t\t// l is an underestimate of LCP\n\t\t\tassert_leq(l, suffixLcp(t, i, cmp));\n\t\t}\n\t}\n\tassert_leq(l+i, len);\n\tassert_leq(l, len-cmp);\n\n\t// i and cmp should not be the same suffix\n\tassert(l != len-cmp || i+l != len);\n\n\t// Now we're ready to do a comparison on the next char\n\tif(l+i != len && (\n\t   l == len-cmp || // departure from paper algorithm:\n\t                   // falling off pattern implies\n\t                   // pattern is *greater* in our case\n\t   t[i + l] < t[cmp + l]))\n\t{\n\t\t// Case 2: Text suffix is less than upper sample suffix\n#ifndef NDEBUG\n\t\tif(this->sanityCheck()) {\n\t\t\tassert(sstr_suf_lt(t, i, t, cmp, false));\n\t\t}\n#endif\n\t\treturn true; // suffix at i is less than suffix at cmp\n\t}\n\telse {\n\t\t// Case 3: Text suffix is greater than upper sample suffix\n#ifndef NDEBUG\n\t\tif(this->sanityCheck()) {\n\t\t\tassert(sstr_suf_gt(t, i, t, cmp, false));\n\t\t}\n#endif\n\t\treturn false; // suffix at i is greater than suffix at cmp\n\t}\n}\n\n/**\n * Retrieve the next block.  This is the most performance-critical part\n * of the blockwise suffix sorting process.\n */\ntemplate<typename TStr>\nvoid KarkkainenBlockwiseSA<TStr>::nextBlock(int cur_block, int tid) {\n#ifndef NDEBUG\n    if(this->_nthreads > 1) {\n        assert_lt(tid, this->_itrBuckets.size());\n    }\n#endif\n    EList<TIndexOffU>& bucket = (this->_nthreads > 1 ? 
this->_itrBuckets[tid] : this->_itrBucket);\n    {\n        ThreadSafe ts(&_mutex, this->_nthreads > 1);\n        VMSG_NL(\"Getting block \" << (cur_block+1) << \" of \" << _sampleSuffs.size()+1);\n    }\n\tassert(_built);\n\tassert_gt(_dcV, 3);\n\tassert_leq(cur_block, _sampleSuffs.size());\n\tconst TStr& t = this->text();\n\tTIndexOffU len = (TIndexOffU)t.length();\n\t// Set up the bucket\n\tbucket.clear();\n\tTIndexOffU lo = OFF_MASK, hi = OFF_MASK;\n\tif(_sampleSuffs.size() == 0) {\n\t\t// Special case: if _sampleSuffs is 0, then multikey-quicksort\n\t\t// everything\n        {\n            ThreadSafe ts(&_mutex, this->_nthreads > 1);\n            VMSG_NL(\"  No samples; assembling all-inclusive block\");\n        }\n\t\tassert_eq(0, cur_block);\n\t\ttry {\n\t\t\tif(bucket.capacity() < this->bucketSz()) {\n\t\t\t\tbucket.reserveExact(len+1);\n\t\t\t}\n\t\t\tbucket.resize(len);\n\t\t\tfor(TIndexOffU i = 0; i < len; i++) {\n\t\t\t\tbucket[i] = i;\n\t\t\t}\n\t\t} catch(bad_alloc &e) {\n\t\t\tif(this->_passMemExc) {\n\t\t\t\tthrow e; // rethrow immediately\n\t\t\t} else {\n\t\t\t\tcerr << \"Could not allocate a master suffix-array block of \" << ((len+1) * 4) << \" bytes\" << endl\n\t\t\t\t     << \"Please try using a larger number of blocks by specifying a smaller --bmax or\" << endl\n\t\t\t\t     << \"a larger --bmaxdivn\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\t} else {\n\t\ttry {\n            {\n                ThreadSafe ts(&_mutex, this->_nthreads > 1);\n                VMSG_NL(\"  Reserving size (\" << this->bucketSz() << \") for bucket \" << (cur_block+1));\n            }\n\t\t\t// BTL: Add a +100 fudge factor; there seem to be instances\n\t\t\t// where a bucket ends up having one more elt than bucketSz()\n\t\t\tif(bucket.size() < this->bucketSz()+100) {\n\t\t\t\tbucket.reserveExact(this->bucketSz()+100);\n\t\t\t}\n\t\t} catch(bad_alloc &e) {\n\t\t\tif(this->_passMemExc) {\n\t\t\t\tthrow e; // rethrow immediately\n\t\t\t} else {\n\t\t\t\tcerr << 
\"Could not allocate a suffix-array block of \" << ((this->bucketSz()+1) * 4) << \" bytes\" << endl;\n\t\t\t\tcerr << \"Please try using a larger number of blocks by specifying a smaller --bmax or\" << endl\n\t\t\t\t     << \"a larger --bmaxdivn\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\t\t// Select upper and lower bounds from _sampleSuffs[] and\n\t\t// calculate the Z array up to the difference-cover periodicity\n\t\t// for both.  Be careful about first/last buckets.\n\t\tEList<TIndexOffU> zLo(EBWTB_CAT), zHi(EBWTB_CAT);\n\t\tassert_geq(cur_block, 0);\n\t\tassert_leq((size_t)cur_block, _sampleSuffs.size());\n\t\tbool first = (cur_block == 0);\n\t\tbool last  = ((size_t)cur_block == _sampleSuffs.size());\n\t\ttry {\n\t\t\t// Timer timer(cout, \"  Calculating Z arrays time: \", this->verbose());\n            {\n                ThreadSafe ts(&_mutex, this->_nthreads > 1);\n                VMSG_NL(\"  Calculating Z arrays for bucket \" << (cur_block+1));\n            }\n\t\t\tif(!last) {\n\t\t\t\t// Not the last bucket\n\t\t\t\tassert_lt(cur_block, _sampleSuffs.size());\n\t\t\t\thi = _sampleSuffs[cur_block];\n\t\t\t\tzHi.resizeExact(_dcV);\n\t\t\t\tzHi.fillZero();\n\t\t\t\tassert_eq(zHi[0], 0);\n\t\t\t\tcalcZ(t, hi, zHi, this->verbose(), this->sanityCheck());\n\t\t\t}\n\t\t\tif(!first) {\n\t\t\t\t// Not the first bucket\n\t\t\t\tassert_gt(cur_block, 0);\n\t\t\t\tassert_leq(cur_block, _sampleSuffs.size());\n\t\t\t\tlo = _sampleSuffs[cur_block-1];\n\t\t\t\tzLo.resizeExact(_dcV);\n\t\t\t\tzLo.fillZero();\n\t\t\t\tassert_gt(_dcV, 3);\n\t\t\t\tassert_eq(zLo[0], 0);\n\t\t\t\tcalcZ(t, lo, zLo, this->verbose(), this->sanityCheck());\n\t\t\t}\n\t\t} catch(bad_alloc &e) {\n\t\t\tif(this->_passMemExc) {\n\t\t\t\tthrow e; // rethrow immediately\n\t\t\t} else {\n\t\t\t\tcerr << \"Could not allocate a z-array of \" << (_dcV * 4) << \" bytes\" << endl;\n\t\t\t\tcerr << \"Please try using a larger number of blocks by specifying a smaller --bmax or\" << endl\n\t\t\t\t     << \"a 
larger --bmaxdivn\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\n\t\t// This is the most critical loop in the algorithm; this is where\n\t\t// we iterate over all suffixes in the text and pick out those that\n\t\t// fall into the current bucket.\n\t\t//\n\t\t// This loop is based on the SMALLERSUFFIXES function outlined on\n\t\t// p7 of the \"Fast BWT\" paper\n\t\t//\n\t\tint64_t kHi = -1, kLo = -1;\n\t\tint64_t jHi = -1, jLo = -1;\n\t\tbool kHiSoft = false, kLoSoft = false;\n\t\tassert_eq(0, bucket.size());\n\t\t{\n\t\t\t// Timer timer(cout, \"  Block accumulator loop time: \", this->verbose());\n            {\n                ThreadSafe ts(&_mutex, this->_nthreads > 1);\n                VMSG_NL(\"  Entering block accumulator loop for bucket \" << (cur_block+1) << \":\");\n            }\n\t\t\tTIndexOffU lenDiv10 = (len + 9) / 10;\n\t\t\tfor(TIndexOffU iten = 0, ten = 0; iten < len; iten += lenDiv10, ten++) {\n                TIndexOffU itenNext = iten + lenDiv10;\n                {\n                    ThreadSafe ts(&_mutex, this->_nthreads > 1);\n                    if(ten > 0) VMSG_NL(\"  bucket \" << (cur_block+1) << \": \" << (ten * 10) << \"%\");\n                }\n                for(TIndexOffU i = iten; i < itenNext && i < len; i++) {\n                    assert_lt(jLo, (TIndexOff)i); assert_lt(jHi, (TIndexOff)i);\n                    // Advance the upper-bound comparison by one character\n                    if(i == hi || i == lo) continue; // equal to one of the bookends\n                    if(hi != OFF_MASK && !suffixCmp(hi, i, jHi, kHi, kHiSoft, zHi)) {\n                        continue; // not in the bucket\n                    }\n                    if(lo != OFF_MASK && suffixCmp(lo, i, jLo, kLo, kLoSoft, zLo)) {\n                        continue; // not in the bucket\n                    }\n                    // In the bucket! 
- add it\n                    assert_lt(i, len);\n                    try {\n                        bucket.push_back(i);\n                    } catch(bad_alloc &e) {\n                        cerr << \"Could not append element to block of \" << ((bucket.size()) * OFF_SIZE) << \" bytes\" << endl;\n                        if(this->_passMemExc) {\n                            throw e; // rethrow immediately\n                        } else {\n                            cerr << \"Please try using a larger number of blocks by specifying a smaller --bmax or\" << endl\n                            << \"a larger --bmaxdivn\" << endl;\n                            throw 1;\n                        }\n                    }\n                    // Not necessarily true; we allow overflowing buckets\n                    // since we can't guarantee that a good set of sample\n                    // suffixes can be found in a reasonable amount of time\n                    //assert_lt(bucket.size(), this->bucketSz());\n                }\n            } // end loop over all suffixes of t\n            {\n                ThreadSafe ts(&_mutex, this->_nthreads > 1);\n                VMSG_NL(\"  bucket \" << (cur_block+1) << \": 100%\");\n            }\n\t\t}\n\t} // end else clause of if(_sampleSuffs.size() == 0)\n\t// Sort the bucket\n\tif(bucket.size() > 0) {\n\t\tTimer timer(cout, \"  Sorting block time: \", this->verbose());\n        {\n            ThreadSafe ts(&_mutex, this->_nthreads > 1);\n            VMSG_NL(\"  Sorting block of length \" << bucket.size() << \" for bucket \" << (cur_block+1));\n        }\n\t\tthis->qsort(bucket);\n\t}\n\tif(hi != OFF_MASK) {\n\t\t// Not the final bucket; throw in the sample on the RHS\n\t\tbucket.push_back(hi);\n\t} else {\n\t\t// Final bucket; throw in $ suffix\n\t\tbucket.push_back(len);\n\t}\n    {\n        ThreadSafe ts(&_mutex, this->_nthreads > 1);\n        VMSG_NL(\"Returning block of \" << bucket.size() << \" for bucket \" << 
(cur_block+1));\n    }\n}\n\n#endif /*BLOCKWISE_SA_H_*/\n"
  },
  {
    "path": "bt2_idx.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <string>\n#include <stdexcept>\n#include <iostream>\n#include <fstream>\n#include <stdlib.h>\n#include \"bt2_idx.h\"\n\nusing namespace std;\n\nconst std::string gEbwt_ext(\"cf\");\n\n/**\n * Try to find the Bowtie index specified by the user.  First try the\n * exact path given by the user.  
Then try the user-provided string\n * appended onto \"$CENTRIFUGE_INDEXES/\", if that environment variable is\n * set.\n */\nstring adjustEbwtBase(const string& cmdline,\n\t\t\t\t\t  const string& ebwtFileBase,\n\t\t\t\t\t  bool verbose = false)\n{\n\tstring str = ebwtFileBase;\n\tifstream in;\n\tif(verbose) cout << \"Trying \" << str.c_str() << endl;\n\tin.open((str + \".1.\" + gEbwt_ext).c_str(), ios_base::in | ios::binary);\n\tif(!in.is_open()) {\n\t\tif(verbose) cout << \"  didn't work\" << endl;\n\t\tin.close();\n\t\tif(getenv(\"CENTRIFUGE_INDEXES\") != NULL) {\n\t\t\tstr = string(getenv(\"CENTRIFUGE_INDEXES\")) + \"/\" + ebwtFileBase;\n\t\t\tif(verbose) cout << \"Trying \" << str.c_str() << endl;\n\t\t\tin.open((str + \".1.\" + gEbwt_ext).c_str(), ios_base::in | ios::binary);\n\t\t\tif(!in.is_open()) {\n\t\t\t\tif(verbose) cout << \"  didn't work\" << endl;\n\t\t\t\tin.close();\n\t\t\t} else {\n\t\t\t\tif(verbose) cout << \"  worked\" << endl;\n\t\t\t}\n\t\t}\n\t}\n\tif(!in.is_open()) {\n\t\tcerr << \"Could not locate a Centrifuge index corresponding to basename \\\"\" << ebwtFileBase.c_str() << \"\\\"\" << endl;\n\t\tthrow 1;\n\t}\n\treturn str;\n}\n\nstring gLastIOErrMsg;\n\nuint8_t tax_rank_num[RANK_MAX];\n"
  },
  {
    "path": "bt2_idx.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef EBWT_H_\n#define EBWT_H_\n\n#include <stdint.h>\n#include <string.h>\n#include <iostream>\n#include <fstream>\n#include <sstream>\n#include <memory>\n#include <fcntl.h>\n#include <math.h>\n#include <errno.h>\n#include <stdexcept>\n#include <sys/stat.h>\n#include <map>\n#include <set>\n#ifdef BOWTIE_MM\n#include <sys/mman.h>\n#include <sys/shm.h>\n#endif\n#include \"shmem.h\"\n#include \"alphabet.h\"\n#include \"assert_helpers.h\"\n#include \"bitpack.h\"\n#include \"blockwise_sa.h\"\n#include \"endian_swap.h\"\n#include \"word_io.h\"\n#include \"random_source.h\"\n#include \"ref_read.h\"\n#include \"threading.h\"\n#include \"str_util.h\"\n#include \"mm.h\"\n#include \"timer.h\"\n#include \"reference.h\"\n#include \"search_globals.h\"\n#include \"ds.h\"\n#include \"random_source.h\"\n#include \"mem_ids.h\"\n#include \"btypes.h\"\n#include \"taxonomy.h\"\n\n#ifdef POPCNT_CAPABILITY\n#include \"processor_support.h\"\n#endif\n\nusing namespace std;\n\n// From ccnt_lut.cpp, automatically generated by gen_lookup_tables.pl\nextern uint8_t cCntLUT_4[4][4][256];\nextern uint8_t cCntLUT_4_rev[4][4][256];\n\nstatic const uint64_t c_table[4] = {\n    0xffffffffffffffff,\n    0xaaaaaaaaaaaaaaaa,\n    
0x5555555555555555,\n    0x0000000000000000\n};\n\n#ifndef VMSG_NL\n#define VMSG_NL(...) \\\nif(this->verbose()) { \\\n\tstringstream tmp; \\\n\ttmp << __VA_ARGS__ << endl; \\\n\tthis->verbose(tmp.str()); \\\n}\n#endif\n\n#ifndef VMSG\n#define VMSG(...) \\\nif(this->verbose()) { \\\n\tstringstream tmp; \\\n\ttmp << __VA_ARGS__; \\\n\tthis->verbose(tmp.str()); \\\n}\n#endif\n\n/**\n * Flags describing type of Ebwt.\n */\nenum EBWT_FLAGS {\n\tEBWT_COLOR = 2,     // true -> Ebwt is colorspace\n\tEBWT_ENTIRE_REV = 4 // true -> reverse Ebwt is the whole\n\t                    // concatenated string reversed, rather than\n\t\t\t\t\t\t// each stretch reversed\n};\n\n/**\n * Extended Burrows-Wheeler transform header.  This together with the\n * actual data arrays and other text-specific parameters defined in\n * class Ebwt constitute the entire Ebwt.\n */\ntemplate <typename index_t = uint32_t>\nclass EbwtParams {\n\npublic:\n\tEbwtParams() { }\n\n\tEbwtParams(\n\t\tindex_t len,\n\t\tint32_t lineRate,\n\t\tint32_t offRate,\n\t\tint32_t ftabChars,\n\t\tbool color,\n\t\tbool entireReverse)\n\t{\n\t\tinit(len, lineRate, offRate, ftabChars, color, entireReverse);\n\t}\n\n\tEbwtParams(const EbwtParams& eh) {\n\t\tinit(eh._len, eh._lineRate, eh._offRate,\n\t\t     eh._ftabChars, eh._color, eh._entireReverse);\n\t}\n\n\tvoid init(\n\t\tindex_t len,\n\t\tint32_t lineRate,\n\t\tint32_t offRate,\n\t\tint32_t ftabChars,\n\t\tbool color,\n\t\tbool entireReverse)\n\t{\n\t\t_color = color;\n\t\t_entireReverse = entireReverse;\n\t\t_len = len;\n\t\t_bwtLen = _len + 1;\n\t\t_sz = (len+3)/4;\n\t\t_bwtSz = (len/4 + 1);\n\t\t_lineRate = lineRate;\n\t\t_origOffRate = offRate;\n\t\t_offRate = offRate;\n\t\t_offMask = std::numeric_limits<index_t>::max() << _offRate;\n\t\t_ftabChars = ftabChars;\n\t\t_eftabLen = _ftabChars*2;\n\t\t_eftabSz = _eftabLen*sizeof(index_t);\n\t\t_ftabLen = (1 << (_ftabChars*2))+1;\n\t\t_ftabSz = _ftabLen*sizeof(index_t);\n\t\t_offsLen = (_bwtLen + (1 << _offRate) - 1) 
>> _offRate;\n\t\t_offsSz = _offsLen*sizeof(index_t);\n\t\t_lineSz = 1 << _lineRate;\n\t\t_sideSz = _lineSz * 1 /* lines per side */;\n\t\t_sideBwtSz = _sideSz - (sizeof(index_t) * 4);\n\t\t_sideBwtLen = _sideBwtSz*4;\n\t\t_numSides = (_bwtSz+(_sideBwtSz)-1)/(_sideBwtSz);\n\t\t_numLines = _numSides * 1 /* lines per side */;\n\t\t_ebwtTotLen = _numSides * _sideSz;\n\t\t_ebwtTotSz = _ebwtTotLen;\n\t\tassert(repOk());\n\t}\n\n\tindex_t len() const           { return _len; }\n\tindex_t lenNucs() const       { return _len + (_color ? 1 : 0); }\n\tindex_t bwtLen() const        { return _bwtLen; }\n\tindex_t sz() const            { return _sz; }\n\tindex_t bwtSz() const         { return _bwtSz; }\n\tint32_t lineRate() const      { return _lineRate; }\n\tint32_t origOffRate() const   { return _origOffRate; }\n\tint32_t offRate() const       { return _offRate; }\n\tindex_t offMask() const       { return _offMask; }\n\tint32_t ftabChars() const     { return _ftabChars; }\n\tindex_t eftabLen() const      { return _eftabLen; }\n\tindex_t eftabSz() const       { return _eftabSz; }\n\tindex_t ftabLen() const       { return _ftabLen; }\n\tindex_t ftabSz() const        { return _ftabSz; }\n\tindex_t offsLen() const       { return _offsLen; }\n\tindex_t offsSz() const        { return _offsSz; }\n\tindex_t lineSz() const        { return _lineSz; }\n\tindex_t sideSz() const        { return _sideSz; }\n\tindex_t sideBwtSz() const     { return _sideBwtSz; }\n\tindex_t sideBwtLen() const    { return _sideBwtLen; }\n\tindex_t numSides() const      { return _numSides; }\n\tindex_t numLines() const      { return _numLines; }\n\tindex_t ebwtTotLen() const    { return _ebwtTotLen; }\n\tindex_t ebwtTotSz() const     { return _ebwtTotSz; }\n\tbool color() const            { return _color; }\n\tbool entireReverse() const    { return _entireReverse; }\n\n\t/**\n\t * Set a new suffix-array sampling rate, which involves updating\n\t * rate, mask, sample length, and sample size.\n\t */\n\tvoid 
setOffRate(int __offRate) {\n\t\t_offRate = __offRate;\n\t\t_offMask = std::numeric_limits<index_t>::max() << _offRate;\n\t\t_offsLen = (_bwtLen + (1 << _offRate) - 1) >> _offRate;\n\t\t_offsSz = _offsLen*sizeof(index_t);\n\t}\n\n#ifndef NDEBUG\n\t/// Check that this EbwtParams is internally consistent\n\tbool repOk() const {\n\t\t// assert_gt(_len, 0);\n\t\tassert_gt(_lineRate, 3);\n\t\tassert_geq(_offRate, 0);\n\t\tassert_leq(_ftabChars, 16);\n\t\tassert_geq(_ftabChars, 1);\n        assert_lt(_lineRate, 32);\n\t\tassert_lt(_ftabChars, 32);\n\t\tassert_eq(0, _ebwtTotSz % _lineSz);\n\t\treturn true;\n\t}\n#endif\n\n\t/**\n\t * Pretty-print the header contents to the given output stream.\n\t */\n\tvoid print(ostream& out) const {\n\t\tout << \"Headers:\" << endl\n\t\t    << \"    len: \"          << _len << endl\n\t\t    << \"    bwtLen: \"       << _bwtLen << endl\n\t\t    << \"    sz: \"           << _sz << endl\n\t\t    << \"    bwtSz: \"        << _bwtSz << endl\n\t\t    << \"    lineRate: \"     << _lineRate << endl\n\t\t    << \"    offRate: \"      << _offRate << endl\n\t\t    << \"    offMask: 0x\"    << hex << _offMask << dec << endl\n\t\t    << \"    ftabChars: \"    << _ftabChars << endl\n\t\t    << \"    eftabLen: \"     << _eftabLen << endl\n\t\t    << \"    eftabSz: \"      << _eftabSz << endl\n\t\t    << \"    ftabLen: \"      << _ftabLen << endl\n\t\t    << \"    ftabSz: \"       << _ftabSz << endl\n\t\t    << \"    offsLen: \"      << _offsLen << endl\n\t\t    << \"    offsSz: \"       << _offsSz << endl\n\t\t    << \"    lineSz: \"       << _lineSz << endl\n\t\t    << \"    sideSz: \"       << _sideSz << endl\n\t\t    << \"    sideBwtSz: \"    << _sideBwtSz << endl\n\t\t    << \"    sideBwtLen: \"   << _sideBwtLen << endl\n\t\t    << \"    numSides: \"     << _numSides << endl\n\t\t    << \"    numLines: \"     << _numLines << endl\n\t\t    << \"    ebwtTotLen: \"   << _ebwtTotLen << endl\n\t\t    << \"    ebwtTotSz: \"    << _ebwtTotSz << 
endl\n\t\t    << \"    color: \"        << _color << endl\n\t\t    << \"    reverse: \"      << _entireReverse << endl;\n\t}\n\n\tindex_t _len;\n\tindex_t _bwtLen;\n\tindex_t _sz;\n\tindex_t _bwtSz;\n\tint32_t _lineRate;\n\tint32_t _origOffRate;\n\tint32_t _offRate;\n\tindex_t _offMask;\n\tint32_t _ftabChars;\n\tindex_t _eftabLen;\n\tindex_t _eftabSz;\n\tindex_t _ftabLen;\n\tindex_t _ftabSz;\n\tindex_t _offsLen;\n\tindex_t _offsSz;\n\tindex_t _lineSz;\n\tindex_t _sideSz;\n\tindex_t _sideBwtSz;\n\tindex_t _sideBwtLen;\n\tindex_t _numSides;\n\tindex_t _numLines;\n\tindex_t _ebwtTotLen;\n\tindex_t _ebwtTotSz;\n\tbool     _color;\n\tbool     _entireReverse;\n};\n\n/**\n * Exception to throw when a file-related error occurs.\n */\nclass EbwtFileOpenException : public std::runtime_error {\npublic:\n\tEbwtFileOpenException(const std::string& msg = \"\") :\n\t\tstd::runtime_error(msg) { }\n};\n\n/**\n * Calculate size of file with given name.\n */\nstatic inline int64_t fileSize(const char* name) {\n\tstd::ifstream f;\n\tf.open(name, std::ios_base::binary | std::ios_base::in);\n\tif (!f.good() || f.eof() || !f.is_open()) { return 0; }\n\tf.seekg(0, std::ios_base::beg);\n\tstd::ifstream::pos_type begin_pos = f.tellg();\n\tf.seekg(0, std::ios_base::end);\n\treturn static_cast<int64_t>(f.tellg() - begin_pos);\n}\n\n/**\n * Encapsulates a location in the bwt text in terms of the side it\n * occurs in and its offset within the side.\n */\ntemplate <typename index_t = uint32_t>\nstruct SideLocus {\n\tSideLocus() :\n\t_sideByteOff(0),\n\t_sideNum(0),\n\t_charOff(0),\n\t_by(-1),\n\t_bp(-1) { }\n\n\t/**\n\t * Construct from row and other relevant information about the Ebwt.\n\t */\n\tSideLocus(index_t row, const EbwtParams<index_t>& ep, const uint8_t* ebwt) {\n\t\tinitFromRow(row, ep, ebwt);\n\t}\n\n\t/**\n\t * Init two SideLocus objects from a top/bot pair, using the result\n\t * from one call to initFromRow to possibly avoid a second call.\n\t */\n\tstatic void 
initFromTopBot(\n\t\tindex_t top,\n\t\tindex_t bot,\n\t\tconst EbwtParams<index_t>& ep,\n\t\tconst uint8_t* ebwt,\n\t\tSideLocus& ltop,\n\t\tSideLocus& lbot)\n\t{\n\t\tconst index_t sideBwtLen = ep._sideBwtLen;\n\t\tassert_gt(bot, top);\n\t\tltop.initFromRow(top, ep, ebwt);\n\t\tindex_t spread = bot - top;\n\t\t// Many cache misses on the following lines\n\t\tif(ltop._charOff + spread < sideBwtLen) {\n\t\t\tlbot._charOff = ltop._charOff + spread;\n\t\t\tlbot._sideNum = ltop._sideNum;\n\t\t\tlbot._sideByteOff = ltop._sideByteOff;\n\t\t\tlbot._by = (int)(lbot._charOff >> 2);\n\t\t\tassert_lt(lbot._by, (int)ep._sideBwtSz);\n\t\t\tlbot._bp = lbot._charOff & 3;\n\t\t} else {\n\t\t\tlbot.initFromRow(bot, ep, ebwt);\n\t\t}\n\t}\n\n\t/**\n\t * Calculate SideLocus based on a row and other relevant\n\t * information about the shape of the Ebwt.\n\t */\n\tvoid initFromRow(index_t row, const EbwtParams<index_t>& ep, const uint8_t* ebwt) {\n\t\tconst index_t sideSz      = ep._sideSz;\n\t\t// Side length is hard-coded for now; this allows the compiler\n\t\t// to do clever things to accelerate / and %.\n\t\t_sideNum                  = row / ep._sideBwtLen;\n\t\tassert_lt(_sideNum, ep._numSides);\n\t\t_charOff                  = row % ep._sideBwtLen;\n\t\t_sideByteOff              = _sideNum * sideSz;\n\t\tassert_leq(row, ep._len);\n\t\tassert_leq(_sideByteOff + sideSz, ep._ebwtTotSz);\n\t\t// Tons of cache misses on the next line\n\t\t_by = (int)(_charOff >> 2); // byte within side\n\t\tassert_lt(_by, (int)ep._sideBwtSz);\n\t\t_bp = _charOff & 3;  // bit-pair within byte\n\t}\n\t\n\t/**\n\t * Transform this SideLocus to refer to the next side (i.e. the one\n\t * corresponding to the next side downstream).  
Set all cursors to\n\t * point to the beginning of the side.\n\t */\n\tvoid nextSide(const EbwtParams<index_t>& ep) {\n\t\tassert(valid());\n\t\t_sideByteOff += ep.sideSz();\n\t\t_sideNum++;\n\t\t_by = _bp = _charOff = 0;\n\t\tassert(valid());\n\t}\n\n\t/**\n\t * Return true iff this is an initialized SideLocus\n\t */\n\tbool valid() const {\n\t\tif(_bp != -1) {\n\t\t\treturn true;\n\t\t}\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Convert locus to BW row it corresponds to.\n\t */\n    index_t toBWRow() const;\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that SideLocus is internally consistent and consistent\n\t * with the (provided) EbwtParams.\n\t */\n\tbool repOk(const EbwtParams<index_t>& ep) const {\n\t\tASSERT_ONLY(index_t row = toBWRow());\n\t\tassert_leq(row, ep._len);\n\t\tassert_range(-1, 3, _bp);\n\t\tassert_range(0, (int)ep._sideBwtSz, _by);\n\t\treturn true;\n\t}\n#endif\n\n\t/// Make this look like an invalid SideLocus\n\tvoid invalidate() {\n\t\t_bp = -1;\n\t}\n\n\t/**\n\t * Return a read-only pointer to the beginning of the top side.\n\t */\n\tconst uint8_t *side(const uint8_t* ebwt) const {\n\t\treturn ebwt + _sideByteOff;\n\t}\n    \n    /**\n\t * Return a read-only pointer to the beginning of the side that\n\t * follows the top side, or NULL if the top side is the last one.\n\t */\n\tconst uint8_t *next_side(const EbwtParams<index_t>& ep, const uint8_t* ebwt) const {\n        if(_sideByteOff + ep._sideSz < ep._ebwtTotSz) {\n            return ebwt + _sideByteOff + ep._sideSz;\n        } else {\n            return NULL;\n        }\n\t}\n    \n\tindex_t _sideByteOff; // offset of top side within ebwt[]\n\tindex_t _sideNum;     // index of side\n\tindex_t _charOff;     // character offset within side\n\tint32_t _by;          // byte within side (not adjusted for bw sides)\n\tint32_t _bp;          // bitpair within byte (not adjusted for bw sides)\n};\n\n/**\n * Convert locus to BW row it corresponds to.\n */\ntemplate <typename index_t>\ninline index_t SideLocus<index_t>::toBWRow() const {\n    if(sizeof(index_t) == 8) {\n    
    return _sideNum * (512 - 16 * sizeof(index_t)) + _charOff;\n    } else {\n        return _sideNum * (256 - 16 * sizeof(index_t)) + _charOff;\n    }\n}\n\ntemplate <>\ninline uint64_t SideLocus<uint64_t>::toBWRow() const {\n    return _sideNum * (512 - 16 * sizeof(uint64_t)) + _charOff;\n}\n\ntemplate <>\ninline uint32_t SideLocus<uint32_t>::toBWRow() const {\n    return _sideNum * (256 - 16 * sizeof(uint32_t)) + _charOff;\n}\n\ntemplate <>\ninline uint16_t SideLocus<uint16_t>::toBWRow() const {\n    return _sideNum * (256 - 16 * sizeof(uint16_t)) + _charOff;\n}\n\n#ifdef POPCNT_CAPABILITY   // wrapping of \"struct\"\nstruct USE_POPCNT_GENERIC {\n#endif\n    // Use this standard bit-bashing population count\n    inline static int pop64(uint64_t x) {\n        // Lots of cache misses on following lines (>10K)\n        x = x - ((x >> 1) & 0x5555555555555555llu);\n        x = (x & 0x3333333333333333llu) + ((x >> 2) & 0x3333333333333333llu);\n        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Fllu;\n        x = x + (x >> 8);\n        x = x + (x >> 16);\n        x = x + (x >> 32);\n        return (int)(x & 0x3Fllu);\n    }\n#ifdef POPCNT_CAPABILITY  // wrapping a \"struct\"\n};\n#endif\n\n#ifdef POPCNT_CAPABILITY\nstruct USE_POPCNT_INSTRUCTION {\n    inline static int pop64(uint64_t x) {\n        int64_t count;\n        asm (\"popcntq %[x],%[count]\\n\": [count] \"=&r\" (count): [x] \"r\" (x));\n        return (int)count;\n    }\n};\n#endif\n\n/**\n * Tricky-bit-bashing bitpair counting for given two-bit value (0-3)\n * within a 64-bit argument.\n */\n#ifdef POPCNT_CAPABILITY\ntemplate<typename Operation>\n#endif\ninline static int countInU64(int c, uint64_t dw) {\n    uint64_t c0 = c_table[c];\n\tuint64_t x0 = dw ^ c0;\n    uint64_t x1 = (x0 >> 1);\n    uint64_t x2 = x1 & (0x5555555555555555);\n    uint64_t x3 = x0 & x2;\n#ifdef POPCNT_CAPABILITY\n    uint64_t tmp = Operation().pop64(x3);\n#else\n    uint64_t tmp = pop64(x3);\n#endif\n    return (int) tmp;\n}\n\n// 
Forward declarations for Ebwt class\nclass EbwtSearchParams;\n\n/**\n * Extended Burrows-Wheeler transform data.\n *\n * An Ebwt may be transferred to and from RAM with calls to\n * evictFromMemory() and loadIntoMemory().  By default, a newly-created\n * Ebwt is not loaded into memory; if the user would like to use a\n * newly-created Ebwt to answer queries, they must first call\n * loadIntoMemory().\n */\ntemplate <class index_t = uint32_t>\nclass Ebwt {\npublic:\n\t#define Ebwt_INITS \\\n\t    _toBigEndian(currentlyBigEndian()), \\\n\t    _overrideOffRate(overrideOffRate), \\\n\t    _verbose(verbose), \\\n\t    _passMemExc(passMemExc), \\\n\t    _sanity(sanityCheck), \\\n\t    fw_(fw), \\\n\t    _in1(NULL), \\\n\t    _in2(NULL), \\\n\t    _zOff(std::numeric_limits<index_t>::max()), \\\n\t    _zEbwtByteOff(std::numeric_limits<index_t>::max()), \\\n\t    _zEbwtBpOff(-1), \\\n\t    _nPat(0), \\\n\t    _nFrag(0), \\\n\t    _plen(EBWT_CAT), \\\n\t    _rstarts(EBWT_CAT), \\\n\t    _fchr(EBWT_CAT), \\\n\t    _ftab(EBWT_CAT), \\\n\t    _eftab(EBWT_CAT), \\\n        _offw(false), \\\n\t    _offs(EBWT_CAT), \\\n        _offsw(EBWT_CAT), \\\n\t    _ebwt(EBWT_CAT), \\\n\t    _useMm(false), \\\n\t    useShmem_(false), \\\n\t    _refnames(EBWT_CAT), \\\n\t    mmFile1_(NULL), \\\n\t    mmFile2_(NULL), \\\n        _compressed(false), \\\n\t   _boundaryCheck( 1 ) \n\n\t/// Construct an Ebwt from the given input file\n\tEbwt(const string& in,\n\t     int color,\n\t\t int needEntireReverse,\n\t     bool fw,\n\t     int32_t overrideOffRate, // = -1,\n\t     int32_t offRatePlus, // = -1,\n\t     bool useMm, // = false,\n\t     bool useShmem, // = false,\n\t     bool mmSweep, // = false,\n\t     bool loadNames, // = false,\n\t\t bool loadSASamp, // = true,\n\t\t bool loadFtab, // = true,\n\t\t bool loadRstarts, // = true,\n\t     bool verbose, // = false,\n\t     bool startVerbose, // = false,\n\t     bool passMemExc, // = false,\n\t     bool sanityCheck, // = false)\n\t\t bool 
skipLoading = false) : \n\t     Ebwt_INITS\n\t{\n\t\tassert(!useMm || !useShmem);\n\n#ifdef POPCNT_CAPABILITY\n\t\tProcessorSupport ps;\n\t\t_usePOPCNTinstruction = ps.POPCNTenabled();\n#endif\n\n\t\tpacked_ = false;\n\t\t_useMm = useMm;\n\t\tuseShmem_ = useShmem;\n\t\t_in1Str = in + \".1.\" + gEbwt_ext;\n\t\t_in2Str = in + \".2.\" + gEbwt_ext;\n\n\t\tif(!skipLoading) {\n\t\t\treadIntoMemory(\n\t\t\t\t\tcolor,       // expect index to be colorspace?\n\t\t\t\t\tfw ? -1 : needEntireReverse, // need REF_READ_REVERSE\n\t\t\t\t\tloadSASamp,  // load the SA sample portion?\n\t\t\t\t\tloadFtab,    // load the ftab & eftab?\n\t\t\t\t\tloadRstarts, // load the rstarts array?\n\t\t\t\t\ttrue,        // stop after loading the header portion?\n\t\t\t\t\t&_eh,        // params\n\t\t\t\t\tmmSweep,     // mmSweep\n\t\t\t\t\tloadNames,   // loadNames\n\t\t\t\t\tstartVerbose); // startVerbose\n\t\t\t// If the offRate has been overridden, reflect that in the\n\t\t\t// _eh._offRate field\n\t\t\tif(offRatePlus > 0 && _overrideOffRate == -1) {\n\t\t\t\t_overrideOffRate = _eh._offRate + offRatePlus;\n\t\t\t}\n\t\t\tif(_overrideOffRate > _eh._offRate) {\n\t\t\t\t_eh.setOffRate(_overrideOffRate);\n\t\t\t\tassert_eq(_overrideOffRate, _eh._offRate);\n\t\t\t}\n\t\t\tassert(repOk());\n\t\t}\n\n\t\t// Read conversion table, genome size table, and taxonomy tree\n\t\tstring in3Str = in + \".3.\" + gEbwt_ext;\n\t\tif(verbose || startVerbose) cerr << \"Opening \\\"\" << in3Str.c_str() << \"\\\"\" << endl;\n\t\tifstream in3(in3Str.c_str(), ios::binary);\n\t\tif(!in3.good()) {\n\t\t\tcerr << \"Could not open index file \" << in3Str.c_str() << endl;\n\t\t}\n\n\t\tinitial_tax_rank_num();\n\n\t\tset<uint64_t> leaves;\n\t\tsize_t num_cids = 0; // number of compressed sequences\n\t\t_uid_to_tid.clear();\n\t\treadU32(in3, this->toBe());\n\t\tuint64_t nref = readIndex<uint64_t>(in3, this->toBe());\n\t\tif(nref > 0) {\n\t\t\twhile(!in3.eof()) {\n\t\t\t\tstring uid;\n\t\t\t\tuint64_t 
tid;\n\t\t\t\twhile(true) {\n\t\t\t\t\tchar c = '\\0';\n\t\t\t\t\tin3 >> c;\n\t\t\t\t\tif(c == '\\0' || c == '\\n') break;\n\t\t\t\t\tuid.push_back(c);\n\t\t\t\t}\n\t\t\t\tif(uid.find(\"cid\") == 0) {\n\t\t\t\t\tnum_cids++;\n\t\t\t\t}\n\t\t\t\ttid = readIndex<uint64_t>(in3, this->toBe());\n\t\t\t\t_uid_to_tid.expand();\n\t\t\t\t_uid_to_tid.back().first = uid;\n\t\t\t\t_uid_to_tid.back().second = tid;\n\t\t\t\tleaves.insert(tid);\n\t\t\t\tif(nref == _uid_to_tid.size()) break;\n\t\t\t}\n\t\t\tassert_eq(nref, _uid_to_tid.size());\n\t\t}\n\n\t\tif(num_cids >= 10) {\n\t\t\tthis->_compressed = true;\n\t\t}\n\n\t\t_tree.clear();\n\t\tuint64_t ntid = readIndex<uint64_t>(in3, this->toBe());\n\t\tif(ntid > 0) {\n\t\t\twhile(!in3.eof()) {\n\t\t\t\tTaxonomyNode node;\n\t\t\t\tuint64_t tid = readIndex<uint64_t>(in3, this->toBe());\n\t\t\t\tnode.parent_tid = readIndex<uint64_t>(in3, this->toBe());\n\t\t\t\tnode.rank = readIndex<uint16_t>(in3, this->toBe());\n\t\t\t\tnode.leaf = (leaves.find(tid) != leaves.end());\n\t\t\t\t_tree[tid] = node;\n\t\t\t\tif(ntid == _tree.size()) break;\n\t\t\t}\n\t\t\tassert_eq(ntid, _tree.size());\n\t\t}\n\n\t\t_name.clear();\n\t\tuint64_t nname = readIndex<uint64_t>(in3, this->toBe());\n\t\tif(nname > 0) {\n\t\t\tstring name;\n\t\t\twhile(!in3.eof()) {\n\t\t\t\tuint64_t tid = readIndex<uint64_t>(in3, this->toBe());\n\t\t\t\tin3 >> name;\n\t\t\t\tin3.seekg(1, ios_base::cur);\n\t\t\t\tassert(_name.find(tid) == _name.end());\n\t\t\t\tstd::replace(name.begin(), name.end(), '@', ' ');\n\t\t\t\t_name[tid] = name;\n\t\t\t\tif(_name.size() == nname)\n\t\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\n\t\t_size.clear();\n\t\tuint64_t nsize = readIndex<uint64_t>(in3, this->toBe());\n\t\tif(nsize > 0) {\n\t\t\twhile(!in3.eof()) {\n\t\t\t\tuint64_t tid = readIndex<uint64_t>(in3, this->toBe());\n\t\t\t\tuint64_t size = readIndex<uint64_t>(in3, this->toBe());\n\t\t\t\tassert(_size.find(tid) == _size.end());\n\t\t\t\t_size[tid] = size;\n\t\t\t\tif(_size.size() == 
nsize)\n\t\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\n\t\t// Calculate average genome size\n\t\tif(!this->_offw) { // Skip if there are many sequences (e.g. >64K)\n\t\t\tfor(map<uint64_t, TaxonomyNode>::const_iterator tree_itr = _tree.begin(); tree_itr != _tree.end(); tree_itr++) {\n\t\t\t\tuint64_t tid = tree_itr->first;\n\t\t\t\tconst TaxonomyNode& node = tree_itr->second;\n\t\t\t\tif(node.rank == RANK_SPECIES || node.rank == RANK_GENUS || node.rank == RANK_FAMILY ||\n\t\t\t\t\t\tnode.rank == RANK_ORDER || node.rank == RANK_CLASS || node.rank == RANK_PHYLUM) {\n\t\t\t\t\tsize_t sum = 0, count = 0;\n\t\t\t\t\tfor(map<uint64_t, uint64_t>::const_iterator size_itr = _size.begin(); size_itr != _size.end(); size_itr++) {\n\t\t\t\t\t\tuint64_t c_tid = size_itr->first;\n\t\t\t\t\t\tmap<uint64_t, TaxonomyNode>::const_iterator tree_itr2 = _tree.find(c_tid);\n\t\t\t\t\t\tif(tree_itr2 == _tree.end())\n\t\t\t\t\t\t\tcontinue;\n\n\t\t\t\t\t\tassert(tree_itr2 != _tree.end());\n\t\t\t\t\t\tconst TaxonomyNode& c_node = tree_itr2->second;\n\t\t\t\t\t\tif((c_node.rank == RANK_UNKNOWN && c_node.leaf) ||\n\t\t\t\t\t\t\t\ttax_rank_num[c_node.rank] < tax_rank_num[RANK_SPECIES]) {\n\t\t\t\t\t\t\tc_tid = c_node.parent_tid;\n\t\t\t\t\t\t\twhile(true) {\n\t\t\t\t\t\t\t\tif(c_tid == tid) {\n\t\t\t\t\t\t\t\t\tsum += size_itr->second;\n\t\t\t\t\t\t\t\t\tcount += 1;\n\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\ttree_itr2 = _tree.find(c_tid);\n\t\t\t\t\t\t\t\tif(tree_itr2 == _tree.end())\n\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\tif(c_tid == tree_itr2->second.parent_tid)\n\t\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t\tc_tid = tree_itr2->second.parent_tid;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tif(count > 0) {\n\t\t\t\t\t\t_size[tid] = sum / count;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\t_paths.buildPaths(_uid_to_tid, _tree);\n\n\t\tin3.close();\n\n\t\t// Read in the information provided by Li. 
The SA coordinate that corresponds to the start of genome.\n\t\tstring in4Str = in + \".4.\" + gEbwt_ext;\n\t\tif(verbose || startVerbose) cerr << \"Opening \\\"\" << in4Str.c_str() << \"\\\"\" << endl;\n\t\tifstream in4(in4Str.c_str(), ios::binary);\n\t\tif(!in4.good()) {\n\t\t\tif(verbose || startVerbose) cerr << \"Could not open index file \" << in4Str.c_str() << endl;\n\t\t}\n\t\telse\n\t\t{\n\t\t\treadU32(in4, this->toBe());\n\n\t\t\t_saGenomeBoundary.clear() ;\n\t\t\tnsize = readIndex<uint64_t>( in4, this->toBe() ) ;\n\t\t\t//cout<<nsize<<\" \"<<_uid_to_tid.size()<<endl ;\n\t\t\t\n\t\t\t_lastGenomeBoundary = 0 ;\n\t\t\tif ( nsize > 0 )\n\t\t\t{\n\t\t\t\tuint64_t t ;\n\t\t\t\tfor ( t = 0 ; t < nsize ; ++t )\n\t\t\t\t{\n\t\t\t\t\tuint64_t saCoord = readIndex<uint64_t>( in4, this->toBe() ) ;\n\t\t\t\t\tuint32_t refIdx = readIndex<uint32_t>( in4, this->toBe() ) ;\n\t\t\t\t\t/*string uid;\n\t\t\t\t\t  while(true) {\n\t\t\t\t\t  char c = '\\0';\n\t\t\t\t\t  in4 >> c;\n\t\t\t\t\t  if(c == '\\0' || c == '\\n') break;\n\t\t\t\t\t  uid.push_back(c);\n\t\t\t\t\t  }*/\n\t\t\t\t\t//cout<<saCoord<<\" \"<<uid<<\" \"<< uidStrToIdx[ uid ] <<endl ;\n\t\t\t\t\t_saGenomeBoundary[ saCoord ] = refIdx ;\n\n\t\t\t\t\tif ( saCoord > _lastGenomeBoundary )\n\t\t\t\t\t\t_lastGenomeBoundary = saCoord ;\n\t\t\t\t}\n\n\t\t\t\t_boundaryCheckShift = 8 ;\n\t\t\t\twhile ( 1 )\t\n\t\t\t\t{\t\n\t\t\t\t\tuint64_t blockSize = ((uint64_t)1)<<_boundaryCheckShift ;\n\t\t\t\t\tif ( blockSize > _lastGenomeBoundary + 1 )\n\t\t\t\t\t\tbreak ;\n\t\t\t\t\tuint64_t blockCnt =  ( _lastGenomeBoundary + 1 ) / blockSize + 1 ;\n\t\t\t\t\tif ( blockCnt < 100 * nsize )\n\t\t\t\t\t{\n\t\t\t\t\t\tif ( _boundaryCheckShift > 8 )\n\t\t\t\t\t\t\t--_boundaryCheckShift ;\n\t\t\t\t\t\tbreak ;\n\t\t\t\t\t}\n\n\t\t\t\t\t++_boundaryCheckShift ;\n\t\t\t\t}\n\n\t\t\t\t//cout<<nsize<<\" \"<<_lastGenomeBoundary<<\" \"<<_boundaryCheckShift<<endl ;\n\n\t\t\t\t_boundaryCheck.resize( ( ( _lastGenomeBoundary + 1 ) >> 
_boundaryCheckShift ) + 1 ) ;\n\t\t\t\t_boundaryCheck.reset() ;\n\t\t\t\tfor ( std::map<uint64_t, uint32_t>::iterator it = _saGenomeBoundary.begin() ; it != _saGenomeBoundary.end() ; ++it )\n\t\t\t\t{\n\t\t\t\t\t_boundaryCheck.set( (it->first) >> _boundaryCheckShift ) ;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tin4.close() ;\n\t}\n\t\n\t/// Construct an Ebwt from the given header parameters and string\n\t/// vector, optionally using a blockwise suffix sorter with the\n\t/// given 'bmax' and 'dcv' parameters.  The string vector is\n\t/// ultimately joined and the joined string is passed to buildToDisk().\n\tEbwt(\n\t\t bool packed,\n\t\t int color,\n\t\t int needEntireReverse,\n\t\t int32_t lineRate,\n\t\t int32_t offRate,\n\t\t int32_t ftabChars,\n\t\t const string& file,   // base filename for EBWT files\n\t\t bool fw,\n\t\t int dcv,\n\t\t EList<RefRecord>& szs,\n\t\t index_t sztot,\n\t\t const RefReadInParams& refparams,\n\t\t uint32_t seed,\n\t\t int32_t overrideOffRate = -1,\n\t\t bool verbose = false,\n\t\t bool passMemExc = false,\n\t\t bool sanityCheck = false) :\n\tEbwt_INITS,\n\t_eh(\n\t\tjoinedLen(szs),\n\t\tlineRate,\n\t\toffRate,\n\t\tftabChars,\n\t\tcolor,\n\t\trefparams.reverse == REF_READ_REVERSE)\n\t{\n#ifdef POPCNT_CAPABILITY\n        ProcessorSupport ps;\n        _usePOPCNTinstruction = ps.POPCNTenabled();\n#endif\n\t\tpacked_ = packed;\n\t}\n\n\t/// Construct an Ebwt from the given header parameters and string\n\t/// vector, optionally using a blockwise suffix sorter with the\n\t/// given 'bmax' and 'dcv' parameters.  
The string vector is\n\t/// ultimately joined and the joined string is passed to buildToDisk().\n\ttemplate<typename TStr>\n\tEbwt(\n         TStr& s,\n         bool packed,\n         int color,\n         int needEntireReverse,\n         int32_t lineRate,\n         int32_t offRate,\n         int32_t ftabChars,\n         const string& file,   // base filename for EBWT files\n         bool fw,\n         bool useBlockwise,\n         index_t bmax,\n         index_t bmaxSqrtMult,\n         index_t bmaxDivN,\n         int dcv,\n         int nthreads,\n         EList<FileBuf*>& is,\n         EList<RefRecord>& szs,\n         index_t sztot,\n         const string& conversion_table_fname,\n         const string& taxonomy_fname,\n         const string& name_table_fname,\n         const string& size_table_fname,\n         const RefReadInParams& refparams,\n         uint32_t seed,\n         int32_t overrideOffRate = -1,\n         bool doSaFile = false,\n         bool doBwtFile = false,\n         int kmer_size = 0,\n         bool verbose = false,\n         bool passMemExc = false,\n         bool sanityCheck = false) :\n    Ebwt_INITS,\n    _eh(\n        joinedLen(szs),\n        lineRate,\n        offRate,\n        ftabChars,\n        color,\n        refparams.reverse == REF_READ_REVERSE)\n\t{\n#ifdef POPCNT_CAPABILITY\n        ProcessorSupport ps;\n        _usePOPCNTinstruction = ps.POPCNTenabled();\n#endif\n\t\t_in1Str = file + \".1.\" + gEbwt_ext;\n\t\t_in2Str = file + \".2.\" + gEbwt_ext;\n\t\tpacked_ = packed;\n\t\t// Open output files\n\t\tofstream fout1(_in1Str.c_str(), ios::binary);\n\t\tif(!fout1.good()) {\n\t\t\tcerr << \"Could not open index file for writing: \\\"\" << _in1Str.c_str() << \"\\\"\" << endl\n\t\t\t     << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n\t\t\t     << \"Bowtie.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tofstream fout2(_in2Str.c_str(), ios::binary);\n\t\tif(!fout2.good()) {\n\t\t\tcerr << \"Could not open 
index file for writing: \\\"\" << _in2Str.c_str() << \"\\\"\" << endl\n\t\t\t     << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n\t\t\t     << \"Bowtie.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n        _inSaStr = file + \".sa\";\n        _inBwtStr = file + \".bwt\";\n        ofstream *saOut = NULL, *bwtOut = NULL;\n        if(doSaFile) {\n            saOut = new ofstream(_inSaStr.c_str(), ios::binary);\n            if(!saOut->good()) {\n                cerr << \"Could not open suffix-array file for writing: \\\"\" << _inSaStr.c_str() << \"\\\"\" << endl\n                << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n                << \"Bowtie.\" << endl;\n                throw 1;\n            }\n        }\n        if(doBwtFile) {\n            bwtOut = new ofstream(_inBwtStr.c_str(), ios::binary);\n            if(!bwtOut->good()) {\n                cerr << \"Could not open BWT file for writing: \\\"\" << _inBwtStr.c_str() << \"\\\"\" << endl\n                << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n                << \"Bowtie.\" << endl;\n                throw 1;\n            }\n        }\n\t\t// Build\n\t\tinitFromVector<TStr>(\n\t\t\t\t\t\t\t s,\n\t\t\t\t\t\t\t is,\n\t\t\t\t\t\t\t szs,\n\t\t\t\t\t\t\t sztot,\n\t\t\t\t\t\t\t refparams,\n\t\t\t\t\t\t\t fout1,\n\t\t\t\t\t\t\t fout2,\n                             saOut,\n                             bwtOut,\n                             kmer_size,\n                             file,\n                             conversion_table_fname,\n                             taxonomy_fname,\n                             name_table_fname,\n                             size_table_fname,\n\t\t\t\t\t\t\t useBlockwise,\n\t\t\t\t\t\t\t bmax,\n\t\t\t\t\t\t\t bmaxSqrtMult,\n\t\t\t\t\t\t\t bmaxDivN,\n\t\t\t\t\t\t\t dcv,\n                             nthreads,\n\t\t\t\t\t\t\t 
seed,\n\t\t\t\t\t\t\t verbose);\n\t\t// Close output files\n\t\tfout1.flush();\n\t\tint64_t tellpSz1 = (int64_t)fout1.tellp();\n\t\tVMSG_NL(\"Wrote \" << fout1.tellp() << \" bytes to primary EBWT file: \" << _in1Str.c_str());\n\t\tfout1.close();\n\t\tbool err = false;\n\t\tif(tellpSz1 > fileSize(_in1Str.c_str())) {\n\t\t\terr = true;\n\t\t\tcerr << \"Index is corrupt: File size for \" << _in1Str.c_str() << \" should have been \" << tellpSz1\n\t\t\t     << \" but is actually \" << fileSize(_in1Str.c_str()) << \".\" << endl;\n\t\t}\n\t\tfout2.flush();\n\t\tint64_t tellpSz2 = (int64_t)fout2.tellp();\n\t\tVMSG_NL(\"Wrote \" << fout2.tellp() << \" bytes to secondary EBWT file: \" << _in2Str.c_str());\n\t\tfout2.close();\n\t\tif(tellpSz2 > fileSize(_in2Str.c_str())) {\n\t\t\terr = true;\n\t\t\tcerr << \"Index is corrupt: File size for \" << _in2Str.c_str() << \" should have been \" << tellpSz2\n\t\t\t     << \" but is actually \" << fileSize(_in2Str.c_str()) << \".\" << endl;\n\t\t}\n        if(saOut != NULL) {\n            // Check on suffix array output file size\n            int64_t tellpSzSa = (int64_t)saOut->tellp();\n            VMSG_NL(\"Wrote \" << tellpSzSa << \" bytes to suffix-array file: \" << _inSaStr.c_str());\n            saOut->close();\n            if(tellpSzSa > fileSize(_inSaStr.c_str())) {\n                err = true;\n                cerr << \"Index is corrupt: File size for \" << _inSaStr.c_str() << \" should have been \" << tellpSzSa\n                << \" but is actually \" << fileSize(_inSaStr.c_str()) << \".\" << endl;\n            }\n        }\n        if(bwtOut != NULL) {\n            // Check on BWT output file size\n            int64_t tellpSzBwt = (int64_t)bwtOut->tellp();\n            VMSG_NL(\"Wrote \" << tellpSzBwt << \" bytes to BWT file: \" << _inBwtStr.c_str());\n            bwtOut->close();\n            if(tellpSzBwt > fileSize(_inBwtStr.c_str())) {\n                err = true;\n                cerr << \"Index is corrupt: 
File size for \" << _inBwtStr.c_str() << \" should have been \" << tellpSzBwt\n                << \" but is actually \" << fileSize(_inBwtStr.c_str()) << \".\" << endl;\n            }\n        }\n\t\tif(err) {\n\t\t\tcerr << \"Please check if there is a problem with the disk or if disk is full.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\t// Reopen as input streams\n\t\tVMSG_NL(\"Re-opening _in1 and _in2 as input streams\");\n\t\tif(_sanity) {\n\t\t\tVMSG_NL(\"Sanity-checking Bt2\");\n\t\t\tassert(!isInMemory());\n\t\t\treadIntoMemory(\n\t\t\t\tcolor,                       // colorspace?\n\t\t\t\tfw ? -1 : needEntireReverse, // 1 -> need the reverse to be reverse-of-concat\n\t\t\t\ttrue,                        // load SA sample (_offs[])?\n\t\t\t\ttrue,                        // load ftab (_ftab[] & _eftab[])?\n\t\t\t\ttrue,                        // load r-starts (_rstarts[])?\n\t\t\t\tfalse,                       // just load header?\n\t\t\t\tNULL,                        // Params object to fill\n\t\t\t\tfalse,                       // mm sweep?\n\t\t\t\ttrue,                        // load names?\n\t\t\t\tfalse);                      // verbose startup?\n\t\t\t// sanityCheckAll(refparams.reverse);\n\t\t\tevictFromMemory();\n\t\t\tassert(!isInMemory());\n\t\t}\n\t\tVMSG_NL(\"Returning from Ebwt constructor\");\n\t}\n\t\n\t/**\n\t * Static constructor for a pair of forward/reverse indexes for the\n\t * given reference string.\n\t */\n\ttemplate<typename TStr>\n\tstatic pair<Ebwt*, Ebwt*>\n\tfromString(\n\t\tconst char* str,\n\t\tbool packed,\n\t\tint color,\n\t\tint reverse,\n\t\tbool bigEndian,\n\t\tint32_t lineRate,\n\t\tint32_t offRate,\n\t\tint32_t ftabChars,\n\t\tconst string& file,\n\t\tbool useBlockwise,\n\t\tindex_t bmax,\n\t\tindex_t bmaxSqrtMult,\n\t\tindex_t bmaxDivN,\n\t\tint dcv,\n\t\tuint32_t seed,\n\t\tbool verbose,\n\t\tbool autoMem,\n\t\tbool sanity)\n\t{\n\t\tEList<std::string> strs(EBWT_CAT);\n\t\tstrs.push_back(std::string(str));\n\t\treturn 
fromStrings<TStr>(\n\t\t\tstrs,\n\t\t\tpacked,\n\t\t\tcolor,\n\t\t\treverse,\n\t\t\tbigEndian,\n\t\t\tlineRate,\n\t\t\toffRate,\n\t\t\tftabChars,\n\t\t\tfile,\n\t\t\tuseBlockwise,\n\t\t\tbmax,\n\t\t\tbmaxSqrtMult,\n\t\t\tbmaxDivN,\n\t\t\tdcv,\n\t\t\tseed,\n\t\t\tverbose,\n\t\t\tautoMem,\n\t\t\tsanity);\n\t}\n\t\n\t/**\n\t * Static constructor for a pair of forward/reverse indexes for the\n\t * given list of reference strings.\n\t */\n\ttemplate<typename TStr>\n\tstatic pair<Ebwt*, Ebwt*>\n\tfromStrings(\n\t\tconst EList<std::string>& strs,\n\t\tbool packed,\n\t\tint color,\n\t\tint reverse,\n\t\tbool bigEndian,\n\t\tint32_t lineRate,\n\t\tint32_t offRate,\n\t\tint32_t ftabChars,\n\t\tconst string& file,\n\t\tbool useBlockwise,\n\t\tindex_t bmax,\n\t\tindex_t bmaxSqrtMult,\n\t\tindex_t bmaxDivN,\n\t\tint dcv,\n\t\tuint32_t seed,\n\t\tbool verbose,\n\t\tbool autoMem,\n\t\tbool sanity)\n\t{\n        assert(!strs.empty());\n\t\tEList<FileBuf*> is(EBWT_CAT);\n\t\tRefReadInParams refparams(color, REF_READ_FORWARD, false, false);\n\t\t// Adapt sequence strings to stringstreams open for input\n\t\tunique_ptr<stringstream> ss(new stringstream());\n\t\tfor(index_t i = 0; i < strs.size(); i++) {\n\t\t\t(*ss) << \">\" << i << endl << strs[i] << endl;\n\t\t}\n\t\tunique_ptr<FileBuf> fb(new FileBuf(ss.get()));\n\t\tassert(!fb->eof());\n\t\tassert(fb->get() == '>');\n\t\tASSERT_ONLY(fb->reset());\n\t\tassert(!fb->eof());\n\t\tis.push_back(fb.get());\n\t\t// Vector for the ordered list of \"records\" comprising the input\n\t\t// sequences.  
A record represents a stretch of unambiguous\n\t\t// characters in one of the input sequences.\n\t\tEList<RefRecord> szs(EBWT_CAT);\n\t\tstd::pair<index_t, index_t> sztot;\n\t\tsztot = BitPairReference::szsFromFasta(is, file, bigEndian, refparams, szs, sanity);\n\t\t// Construct Ebwt from input strings and parameters\n\t\tEbwt<index_t> *ebwtFw = new Ebwt<index_t>(\n\t\t\t\t\t\t\t\t\t\t\t\t  TStr(),\n\t\t\t\t\t\t\t\t\t\t\t\t  packed,\n\t\t\t\t\t\t\t\t\t\t\t\t  refparams.color ? 1 : 0,\n\t\t\t\t\t\t\t\t\t\t\t\t  -1,           // fw\n\t\t\t\t\t\t\t\t\t\t\t\t  lineRate,\n\t\t\t\t\t\t\t\t\t\t\t\t  offRate,      // suffix-array sampling rate\n\t\t\t\t\t\t\t\t\t\t\t\t  ftabChars,    // number of chars in initial arrow-pair calc\n\t\t\t\t\t\t\t\t\t\t\t\t  file,         // basename for .?.ebwt files\n\t\t\t\t\t\t\t\t\t\t\t\t  true,         // fw?\n\t\t\t\t\t\t\t\t\t\t\t\t  useBlockwise, // useBlockwise\n\t\t\t\t\t\t\t\t\t\t\t\t  bmax,         // block size for blockwise SA builder\n\t\t\t\t\t\t\t\t\t\t\t\t  bmaxSqrtMult, // block size as multiplier of sqrt(len)\n\t\t\t\t\t\t\t\t\t\t\t\t  bmaxDivN,     // block size as divisor of len\n\t\t\t\t\t\t\t\t\t\t\t\t  dcv,          // difference-cover period\n\t\t\t\t\t\t\t\t\t\t\t\t  is,           // list of input streams\n\t\t\t\t\t\t\t\t\t\t\t\t  szs,          // list of reference sizes\n\t\t\t\t\t\t\t\t\t\t\t\t  sztot.first,  // total size of all unambiguous ref chars\n\t\t\t\t\t\t\t\t\t\t\t\t  refparams,    // reference read-in parameters\n\t\t\t\t\t\t\t\t\t\t\t\t  seed,         // pseudo-random number generator seed\n\t\t\t\t\t\t\t\t\t\t\t\t  -1,           // override offRate\n\t\t\t\t\t\t\t\t\t\t\t\t  verbose,      // be talkative\n\t\t\t\t\t\t\t\t\t\t\t\t  autoMem,      // pass exceptions up to the toplevel so that we can adjust memory settings automatically\n\t\t\t\t\t\t\t\t\t\t\t\t  sanity);      // verify results and internal consistency\n\t\trefparams.reverse = reverse;\n\t\tszs.clear();\n\t\tsztot = 
BitPairReference::szsFromFasta(is, file, bigEndian, refparams, szs, sanity);\n\t\t// Construct Ebwt from input strings and parameters\n\t\tEbwt<index_t> *ebwtBw = new Ebwt<index_t>(\n\t\t\t\t\t\t\t\t\t\t\t\t  TStr(),\n\t\t\t\t\t\t\t\t\t\t\t\t  packed,\n\t\t\t\t\t\t\t\t\t\t\t\t  refparams.color ? 1 : 0,\n\t\t\t\t\t\t\t\t\t\t\t\t  reverse == REF_READ_REVERSE,\n\t\t\t\t\t\t\t\t\t\t\t\t  lineRate,\n\t\t\t\t\t\t\t\t\t\t\t\t  offRate,      // suffix-array sampling rate\n\t\t\t\t\t\t\t\t\t\t\t\t  ftabChars,    // number of chars in initial arrow-pair calc\n\t\t\t\t\t\t\t\t\t\t\t\t  file + \".rev\",// basename for .?.ebwt files\n\t\t\t\t\t\t\t\t\t\t\t\t  false,        // fw?\n\t\t\t\t\t\t\t\t\t\t\t\t  useBlockwise, // useBlockwise\n\t\t\t\t\t\t\t\t\t\t\t\t  bmax,         // block size for blockwise SA builder\n\t\t\t\t\t\t\t\t\t\t\t\t  bmaxSqrtMult, // block size as multiplier of sqrt(len)\n\t\t\t\t\t\t\t\t\t\t\t\t  bmaxDivN,     // block size as divisor of len\n\t\t\t\t\t\t\t\t\t\t\t\t  dcv,          // difference-cover period\n\t\t\t\t\t\t\t\t\t\t\t\t  is,           // list of input streams\n\t\t\t\t\t\t\t\t\t\t\t\t  szs,          // list of reference sizes\n\t\t\t\t\t\t\t\t\t\t\t\t  sztot.first,  // total size of all unambiguous ref chars\n\t\t\t\t\t\t\t\t\t\t\t\t  refparams,    // reference read-in parameters\n\t\t\t\t\t\t\t\t\t\t\t\t  seed,         // pseudo-random number generator seed\n\t\t\t\t\t\t\t\t\t\t\t\t  -1,           // override offRate\n\t\t\t\t\t\t\t\t\t\t\t\t  verbose,      // be talkative\n\t\t\t\t\t\t\t\t\t\t\t\t  autoMem,      // pass exceptions up to the toplevel so that we can adjust memory settings automatically\n\t\t\t\t\t\t\t\t\t\t\t\t  sanity);      // verify results and internal consistency\n\t\treturn make_pair(ebwtFw, ebwtBw);\n\t}\n\t\n\t/// Return true iff the Ebwt is packed\n\tbool isPacked() { return packed_; }\n\n\t/**\n\t * Write the rstarts array given the szs array for the reference.\n\t */\n\tvoid szsToDisk(const EList<RefRecord>& 
szs, ostream& os, int reverse);\n\t\n\t/**\n\t * Helper for the constructors above.  Takes a vector of text\n\t * strings and joins them into a single string with a call to\n\t * joinToDisk, which does a join (with padding) and writes some of\n\t * the resulting data directly to disk rather than keep it in\n\t * memory.  It then constructs a suffix-array producer (what kind\n\t * depends on 'useBlockwise') for the resulting sequence.  The\n\t * suffix-array producer can then be used to obtain chunks of the\n\t * joined string's suffix array.\n\t */\n\ttemplate <typename TStr>\n\tvoid initFromVector(TStr& s,\n\t\t\t\t\t\tEList<FileBuf*>& is,\n\t                    EList<RefRecord>& szs,\n\t                    index_t sztot,\n\t                    const RefReadInParams& refparams,\n\t                    ofstream& out1,\n\t                    ofstream& out2,\n                        ofstream* saOut,\n                        ofstream* bwtOut,\n                        int kmer_size,\n                        const string& base_fname,\n                        const string& conversion_table_fname,\n                        const string& taxonomy_fname,\n                        const string& size_table_fname,\n                        const string& name_table_fname,\n\t                    bool useBlockwise,\n\t                    index_t bmax,\n\t                    index_t bmaxSqrtMult,\n\t                    index_t bmaxDivN,\n\t                    int dcv,\n                        int nthreads,\n\t                    uint32_t seed,\n\t\t\t\t\t\tbool verbose)\n\t{\n\t\t// Compose text strings into single string\n\t\tVMSG_NL(\"Calculating joined length\");\n\t\tindex_t jlen;\n\t\tjlen = joinedLen(szs);\n\t\tassert_geq(jlen, sztot);\n\t\tVMSG_NL(\"Writing header\");\n\t\twriteFromMemory(true, out1, out2);\n\t\ttry {\n\t\t\tVMSG_NL(\"Reserving space for joined string\");\n\t\t\ts.resize(jlen);\n\t\t\tVMSG_NL(\"Joining reference sequences\");\n\t\t\tif(refparams.reverse == 
REF_READ_REVERSE) {\n\t\t\t\t{\n\t\t\t\t\tTimer timer(cout, \"  Time to join reference sequences: \", _verbose);\n\t\t\t\t\tjoinToDisk(is, szs, sztot, refparams, s, out1, out2);\n\t\t\t\t} {\n\t\t\t\t\tTimer timer(cout, \"  Time to reverse reference sequence: \", _verbose);\n\t\t\t\t\tEList<RefRecord> tmp(EBWT_CAT);\n\t\t\t\t\ts.reverse();\n\t\t\t\t\treverseRefRecords(szs, tmp, false, verbose);\n\t\t\t\t\tszsToDisk(tmp, out1, refparams.reverse);\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tTimer timer(cout, \"  Time to join reference sequences: \", _verbose);\n\t\t\t\tjoinToDisk(is, szs, sztot, refparams, s, out1, out2);\n\t\t\t\tszsToDisk(szs, out1, refparams.reverse);\n\t\t\t}\n\t\t\t// Joined reference sequence now in 's'\n\t\t} catch(bad_alloc& e) {\n\t\t\t// If we throw an allocation exception in the try block,\n\t\t\t// that means that the joined version of the reference\n\t\t\t// string itself is too large to fit in memory.  The only\n\t\t\t// alternatives are to tell the user to give us more memory\n\t\t\t// or to try again with a packed representation of the\n\t\t\t// reference (if we haven't tried that already).\n\t\t\tcerr << \"Could not allocate space for a joined string of \" << jlen << \" elements.\" << endl;\n\t\t\tif(!isPacked() && _passMemExc) {\n\t\t\t\t// Pass the exception up so that we can retry using a\n\t\t\t\t// packed string representation\n\t\t\t\tthrow e;\n\t\t\t}\n\t\t\t// There's no point passing this exception on.  
The fact\n\t\t\t// that we couldn't allocate the joined string means that\n\t\t\t// --bmax is irrelevant - the user should re-run in\n\t\t\t// packed mode\n\t\t\tif(isPacked()) {\n\t\t\t\tcerr << \"Please try running centrifuge-build on a computer with more memory.\" << endl;\n\t\t\t} else {\n\t\t\t\tcerr << \"Please try running centrifuge-build in packed mode (-p/--packed) or in automatic\" << endl\n\t\t\t\t     << \"mode (-a/--auto), or try again on a computer with more memory.\" << endl;\n\t\t\t}\n\t\t\tif(sizeof(void*) == 4) {\n\t\t\t\tcerr << \"If this computer has more than 4 GB of memory, try using a 64-bit executable;\" << endl\n\t\t\t\t     << \"this executable is 32-bit.\" << endl;\n\t\t\t}\n\t\t\tthrow 1;\n\t\t}\n        \n        this->_offw = this->_nPat > std::numeric_limits<uint16_t>::max();\n        \n        std::set<string> uids;\n        for(size_t i = 0; i < _refnames.size(); i++) {\n            const string& refname = _refnames[i];\n            string uid = get_uid(refname);\n            uids.insert(uid);\n        }\n\t\n\t\n        std::map<string, uint64_t> uid_to_tid; // map from unique id to taxonomy id\n        {\n            ifstream table_file(conversion_table_fname.c_str(), ios::in);\n            if(table_file.is_open()) {\n                while(!table_file.eof()) {\n                    string uid;\n                    table_file >> uid;\n                    if(uid.length() == 0 || uid[0] == '#') continue;\n                    string stid;\n                    table_file >> stid;\n                    uint64_t tid = get_tid(stid);\n                    if(uids.find(uid) == uids.end()) continue;\n                    if(uid_to_tid.find(uid) != uid_to_tid.end()) {\n\t\t\t\t\t\tif(uid_to_tid[uid] != tid) {\n\t\t\t\t\t\t\tcerr << \"Warning: Diverging taxonomy IDs for \" << uid << \" in \" << conversion_table_fname << \": \"\n                                 << uid_to_tid[uid] << \" and \" << tid << \". Taking first. 
\" << endl;\n\t\t\t\t\t\t}\n                        continue;\n                    }\n                    uid_to_tid[uid] = tid;\n                }\n                table_file.close();\n            } else {\n                cerr << \"Error: \" << conversion_table_fname << \" doesn't exist!\" << endl;\n                throw 1;\n            }\n        }\n        // Open output stream for the '.3.cf' file which will hold conversion table and taxonomy tree\n        string fname3 = base_fname + \".3.\" + gEbwt_ext;\n        ofstream fout3(fname3.c_str(), ios::binary);\n        if(!fout3.good()) {\n            cerr << \"Could not open index file for writing: \\\"\" << fname3 << \"\\\"\" << endl\n            << \"Please make sure the directory exists and that permissions allow writing by Centrifuge\" << endl;\n            throw 1;\n        }\n        std::set<uint64_t> tids;\n        writeIndex<int32_t>(fout3, 1, this->toBe()); // endianness sentinel\n        writeIndex<uint64_t>(fout3, _refnames.size(), this->toBe());\n        for(size_t i = 0; i < _refnames.size(); i++) {\n            const string& refname = _refnames[i];\n            string uid = get_uid(refname);\n            for(size_t c = 0; c < uid.length(); c++) {\n                fout3 << uid[c];\n            }\n            fout3 << '\\0';\n            if(uid_to_tid.find(uid) != uid_to_tid.end()) {\n                uint64_t tid = uid_to_tid[uid];\n                writeIndex<uint64_t>(fout3, tid, this->toBe());\n                tids.insert(tid);\n            } else {\n                cerr << \"Warning: taxonomy ID doesn't exist for \" << uid << \"!\" << endl;\n                writeIndex<uint64_t>(fout3, 0, this->toBe());\n            }\n        }\n\n        // Read taxonomy\n        {\n            TaxonomyTree tree = read_taxonomy_tree(taxonomy_fname);\n            std::set<uint64_t> tree_color;\n\n            for(std::set<uint64_t>::iterator itr = tids.begin(); itr != tids.end(); itr++) {\n                
uint64_t tid = *itr;\n                if(tree.find(tid) == tree.end()) {\n                    cerr << \"Warning: Taxonomy ID \" << tid << \" is not in the provided taxonomy tree (\" << taxonomy_fname << \")!\" << endl;\n\n                }\n                while(tree.find(tid) != tree.end()) {\n                    uint64_t parent_tid = tree[tid].parent_tid;\n                    tree_color.insert(tid);\n                    if(parent_tid == tid) break;\n                    tid = parent_tid;\n                }\n            }\n            writeIndex<uint64_t>(fout3, tree_color.size(), this->toBe());\n            for(std::set<uint64_t>::iterator itr = tree_color.begin(); itr != tree_color.end(); itr++) {\n                uint64_t tid = *itr;\n                writeIndex<uint64_t>(fout3, tid, this->toBe());\n                assert(tree.find(tid) != tree.end());\n                const TaxonomyNode& node = tree[tid];\n                writeIndex<uint64_t>(fout3, node.parent_tid, this->toBe());\n                writeIndex<uint16_t>(fout3, node.rank, this->toBe());\n            }\n        \n            // Read name table\n            _name.clear();\n            if(name_table_fname != \"\") {\n                ifstream table_file(name_table_fname.c_str(), ios::in);\n                if(table_file.is_open()) {\n                    char line[1024];\n                    while(!table_file.eof()) {\n                        line[0] = 0;\n                        table_file.getline(line, sizeof(line));\n                        if(line[0] == 0 || line[0] == '#') continue;\n                        if(!strstr(line, \"scientific name\")) continue;\n                        istringstream cline(line);\n                        uint64_t tid;\n                        char dummy;\n                        string scientific_name;\n                        cline >> tid >> dummy >> scientific_name;\n                        if(tree_color.find(tid) == tree_color.end()) continue;\n                        
string temp;\n                        while(true) {\n                            cline >> temp;\n                            if(temp == \"|\") break;\n                            scientific_name.push_back('@');\n                            scientific_name += temp;\n                        }\n                        _name[tid] = scientific_name;\n                    }\n                    table_file.close();\n                } else {\n                    cerr << \"Error: \" << name_table_fname << \" doesn't exist!\" << endl;\n                    throw 1;\n                }\n            }\n            \n            writeIndex<uint64_t>(fout3, _name.size(), this->toBe());\n            for(std::map<uint64_t, string>::const_iterator itr = _name.begin(); itr != _name.end(); itr++) {\n                writeIndex<uint64_t>(fout3, itr->first, this->toBe());\n                fout3 << itr->second << endl;\n            }\n        }\n        \n        // Read size table\n        {\n            _size.clear();\n            \n            // Calculate contig (or genome) sizes corresponding to each taxonomic ID\n            for(size_t i = 0; i < _refnames.size(); i++) {\n                string uid = get_uid(_refnames[i]);\n                if(uid_to_tid.find(uid) == uid_to_tid.end())\n                    continue;\n                uint64_t tid = uid_to_tid[uid];\n                uint64_t contig_size = plen()[i];\n                if(_size.find(tid) == _size.end()) {\n                    _size[tid] = contig_size;\n                } else {\n                    _size[tid] += contig_size;\n                }\n            }\n            \n            if(size_table_fname != \"\") {\n                ifstream table_file(size_table_fname.c_str(), ios::in);\n                if(table_file.is_open()) {\n                    while(!table_file.eof()) {\n                        string stid;\n                        table_file >> stid;\n                        if(stid.length() == 0 || stid[0] == '#') 
continue;\n                        uint64_t tid = get_tid(stid);\n                        uint64_t size;\n                        table_file >> size;\n                        _size[tid] = size;\n                    }\n                    table_file.close();\n                } else {\n                    cerr << \"Error: \" << size_table_fname << \" doesn't exist!\" << endl;\n                    throw 1;\n                }\n            }\n            \n            writeIndex<uint64_t>(fout3, _size.size(), this->toBe());\n            for(std::map<uint64_t, uint64_t>::const_iterator itr = _size.begin(); itr != _size.end(); itr++) {\n                writeIndex<uint64_t>(fout3, itr->first, this->toBe());\n                writeIndex<uint64_t>(fout3, itr->second, this->toBe());\n            }\n        }\n        \n        fout3.close();\n    \n\t\t// Successfully obtained joined reference string\n\t\tassert_geq(s.length(), jlen);\n\t\tif(bmax != (index_t)OFF_MASK) {\n\t\t\tVMSG_NL(\"bmax according to bmax setting: \" << bmax);\n\t\t}\n\t\telse if(bmaxSqrtMult != (index_t)OFF_MASK) {\n\t\t\tbmax *= bmaxSqrtMult;\n\t\t\tVMSG_NL(\"bmax according to bmaxSqrtMult setting: \" << bmax);\n\t\t}\n\t\telse if(bmaxDivN != (index_t)OFF_MASK) {\n\t\t\tbmax = max<uint32_t>((uint32_t)(jlen / bmaxDivN), 1);\n\t\t\tVMSG_NL(\"bmax according to bmaxDivN setting: \" << bmax);\n\t\t}\n\t\telse {\n\t\t\tbmax = (uint32_t)sqrt(s.length());\n\t\t\tVMSG_NL(\"bmax defaulted to: \" << bmax);\n\t\t}\n\t\tint iter = 0;\n\t\tbool first = true;\n\t\tstreampos out1pos = out1.tellp();\n\t\tstreampos out2pos = out2.tellp();\n\t\t// Look for bmax/dcv parameters that work.\n\t\twhile(true) {\n\t\t\tif(!first && bmax < 40 && _passMemExc) {\n\t\t\t\tcerr << \"Could not find appropriate bmax/dcv settings for building this index.\" << endl;\n\t\t\t\tif(!isPacked()) {\n\t\t\t\t\t// Throw an exception so that we can\n\t\t\t\t\t// retry using a packed string representation\n\t\t\t\t\tthrow 
bad_alloc();\n\t\t\t\t} else {\n\t\t\t\t\tcerr << \"Already tried a packed string representation.\" << endl;\n\t\t\t\t}\n\t\t\t\tcerr << \"Please try indexing this reference on a computer with more memory.\" << endl;\n\t\t\t\tif(sizeof(void*) == 4) {\n\t\t\t\t\tcerr << \"If this computer has more than 4 GB of memory, try using a 64-bit executable;\" << endl\n\t\t\t\t\t\t << \"this executable is 32-bit.\" << endl;\n\t\t\t\t}\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tif(!first) {\n\t\t\t\tout1.seekp(out1pos);\n\t\t\t\tout2.seekp(out2pos);\n\t\t\t}\n\t\t\tif(dcv > 4096) dcv = 4096;\n\t\t\tif((iter % 6) == 5 && dcv < 4096 && dcv != 0) {\n\t\t\t\tdcv <<= 1; // double difference-cover period\n\t\t\t} else {\n\t\t\t\tbmax -= (bmax >> 2); // reduce by 25%\n\t\t\t}\n\t\t\tVMSG(\"Using parameters --bmax \" << bmax);\n\t\t\tif(dcv == 0) {\n\t\t\t\tVMSG_NL(\" and *no difference cover*\");\n\t\t\t} else {\n\t\t\t\tVMSG_NL(\" --dcv \" << dcv);\n\t\t\t}\n\t\t\titer++;\n\t\t\ttry {\n\t\t\t\t{\n\t\t\t\t\tVMSG_NL(\"  Doing ahead-of-time memory usage test\");\n\t\t\t\t\t// Make a quick-and-dirty attempt to force a bad_alloc iff\n\t\t\t\t\t// we would have thrown one eventually as part of\n\t\t\t\t\t// constructing the DifferenceCoverSample\n\t\t\t\t\tdcv <<= 1;\n\t\t\t\t\tindex_t sz = (index_t)DifferenceCoverSample<TStr>::simulateAllocs(s, dcv >> 1);\n\t\t\t\t\tAutoArray<uint8_t> tmp(sz, EBWT_CAT);\n\t\t\t\t\tdcv >>= 1;\n\t\t\t\t\t// Likewise with the KarkkainenBlockwiseSA\n\t\t\t\t\tsz = (index_t)KarkkainenBlockwiseSA<TStr>::simulateAllocs(s, bmax);\n                    if(nthreads > 1) sz *= (nthreads + 1);\n\t\t\t\t\tAutoArray<uint8_t> tmp2(sz, EBWT_CAT);\n\t\t\t\t\t// Now throw in the 'ftab' and 'isaSample' structures\n\t\t\t\t\t// that we'll eventually allocate in buildToDisk\n\t\t\t\t\tAutoArray<index_t> ftab(_eh._ftabLen * 2, EBWT_CAT);\n\t\t\t\t\tAutoArray<uint8_t> side(_eh._sideSz, EBWT_CAT);\n\t\t\t\t\t// Grab another 20 MB out of caution\n\t\t\t\t\tAutoArray<uint32_t> 
extra(20*1024*1024, EBWT_CAT);\n\t\t\t\t\t// If we made it here without throwing bad_alloc, then we\n\t\t\t\t\t// passed the memory-usage stress test\n\t\t\t\t\tVMSG(\"  Passed!  Constructing with these parameters: --bmax \" << bmax << \" --dcv \" << dcv);\n\t\t\t\t\tif(isPacked()) {\n\t\t\t\t\t\tVMSG(\" --packed\");\n\t\t\t\t\t}\n\t\t\t\t\tVMSG_NL(\"\");\n\t\t\t\t}\n\t\t\t\tVMSG_NL(\"Constructing suffix-array element generator\");\n\t\t\t\tKarkkainenBlockwiseSA<TStr> bsa(s, bmax, nthreads, dcv, seed, _sanity, _passMemExc, _verbose, base_fname);\n\t\t\t\tassert(bsa.suffixItrIsReset());\n\t\t\t\tassert_eq(bsa.size(), s.length()+1);\n\t\t\t\tVMSG_NL(\"Converting suffix-array elements to index image\");\n\n\t\t\t\tstring fname4 = base_fname + \".4.\" + gEbwt_ext;\n\t\t\t\tofstream fout4(fname4.c_str(), ios::binary);\n\t\t\t\tif(!fout4.good()) {\n\t\t\t\t\tcerr << \"Could not open index file for writing: \\\"\" << fname4 << \"\\\"\" << endl\n\t\t\t\t\t\t<< \"Please make sure the directory exists and that permissions allow writing by Centrifuge\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tbuildToDisk(bsa, s, out1, out2, saOut, bwtOut, fout4, szs, kmer_size);\n\t\t\t\tfout4.close() ;\n\t\t\t\tout1.flush(); out2.flush();\n                bool failed = out1.fail() || out2.fail();\n                if(saOut != NULL) {\n                    saOut->flush();\n                    failed = failed || saOut->fail();\n                }\n                if(bwtOut != NULL) {\n                    bwtOut->flush();\n                    failed = failed || bwtOut->fail();\n                }\n\t\t\t\tbreak;\n\t\t\t} catch(bad_alloc& e) {\n\t\t\t\tif(_passMemExc) {\n\t\t\t\t\tVMSG_NL(\"  Ran out of memory; automatically trying more memory-economical parameters.\");\n\t\t\t\t} else {\n\t\t\t\t\tcerr << \"Out of memory while constructing suffix array.  
Please try using a smaller\" << endl\n\t\t\t\t\t\t << \"number of blocks by specifying a smaller --bmax or a larger --bmaxdivn\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t}\n\t\t\tfirst = false;\n\t\t}\n\t\tassert(repOk());\n\t\t// Now write reference sequence names on the end\n\t\tassert_eq(this->_refnames.size(), this->_nPat);\n\t\tfor(index_t i = 0; i < this->_refnames.size(); i++) {\n\t\t\tout1 << this->_refnames[i].c_str() << endl;\n\t\t}\n\t\tout1 << '\\0';\n\t\tout1.flush(); out2.flush();\n\t\tif(out1.fail() || out2.fail()) {\n\t\t\tcerr << \"An error occurred writing the index to disk.  Please check if the disk is full.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tVMSG_NL(\"Returning from initFromVector\");\n\t}\n\t\n\t/**\n\t * Return the length that the joined string of the given string\n\t * list will have.  Note that this is indifferent to how the text\n\t * fragments correspond to input sequences - it just cares about\n\t * the lengths of the fragments.\n\t */\n\tindex_t joinedLen(EList<RefRecord>& szs) {\n\t\tindex_t ret = 0;\n\t\tfor(unsigned int i = 0; i < szs.size(); i++) {\n\t\t\tret += (index_t)szs[i].len;\n\t\t}\n\t\treturn ret;\n\t}\n\n\t/// Destruct an Ebwt\n\t~Ebwt() {\n\t\t_fchr.reset();\n\t\t_ftab.reset();\n\t\t_eftab.reset();\n\t\t_plen.reset();\n\t\t_rstarts.reset();\n\t\t_offs.reset();\n        _offsw.reset();\n\t\t_ebwt.reset();\n\t\tif(offs() != NULL && useShmem_) {\n\t\t\tFREE_SHARED(offs());\n\t\t}\n        if(offsw() != NULL && useShmem_) {\n            FREE_SHARED(offsw());\n        }\n\t\tif(ebwt() != NULL && useShmem_) {\n\t\t\tFREE_SHARED(ebwt());\n\t\t}\n\t\tif (_in1 != NULL) fclose(_in1);\n\t\tif (_in2 != NULL) fclose(_in2);\n\t}\n\n\t/// Accessors\n\tinline const EbwtParams<index_t>& eh() const     { return _eh; }\n\tindex_t    zOff() const         { return _zOff; }\n\tindex_t    zEbwtByteOff() const { return _zEbwtByteOff; }\n\tint        zEbwtBpOff() const   { return _zEbwtBpOff; }\n\tindex_t    nPat() const        { return 
_nPat; }\n\tindex_t    nFrag() const       { return _nFrag; }\n\tinline index_t*   fchr()              { return _fchr.get(); }\n\tinline index_t*   ftab()              { return _ftab.get(); }\n\tinline index_t*   eftab()             { return _eftab.get(); }\n\tinline uint16_t*   offs()              { return _offs.get(); }\n    inline uint32_t*   offsw()             { return _offsw.get(); }\n\tinline index_t*   plen()              { return _plen.get(); }\n\tinline index_t*   rstarts()           { return _rstarts.get(); }\n\tinline uint8_t*    ebwt()              { return _ebwt.get(); }\n\tinline const index_t* fchr() const    { return _fchr.get(); }\n\tinline const index_t* ftab() const    { return _ftab.get(); }\n\tinline const index_t* eftab() const   { return _eftab.get(); }\n    inline const uint16_t* offs() const    { return _offs.get(); }\n    inline const uint32_t* offsw() const    { return _offsw.get(); }\n\tinline const index_t* plen() const    { return _plen.get(); }\n\tinline const index_t* rstarts() const { return _rstarts.get(); }\n\tinline const uint8_t*  ebwt() const    { return _ebwt.get(); }\n\tbool        toBe() const         { return _toBigEndian; }\n\tbool        verbose() const      { return _verbose; }\n\tbool        sanityCheck() const  { return _sanity; }\n\tEList<string>& refnames()        { return _refnames; }\n\tbool        fw() const           { return fw_; }\n    \n    const EList<pair<string, uint64_t> >&   uid_to_tid() const { return _uid_to_tid; }\n    const TaxonomyTree& tree() const { return _tree; }\n    const TaxonomyPathTable&                paths() const { return _paths; }\n    const std::map<uint64_t, string>&       name() const { return _name; }\n    const std::map<uint64_t, uint64_t>&     size() const { return _size; }\n    inline const bool \t    saGenomeBoundaryHas( uint64_t key ) const { return _saGenomeBoundary.find( key ) != _saGenomeBoundary.end() ; }\n    inline const uint32_t saGenomeBoundaryVal( uint64_t key ) const 
{ return _saGenomeBoundary.at(key) ; }\n    bool                                    compressed() const { return _compressed; }\n    \n    \n#ifdef POPCNT_CAPABILITY\n    bool _usePOPCNTinstruction;\n#endif\n\n\t/**\n\t * Returns true iff the index contains the given string (exactly).  The\n\t * given string must contain only unambiguous characters.  TODO:\n\t * support skipping of ambiguous characters.\n\t */\n\tbool contains(\n\t\tconst BTDnaString& str,\n\t\tindex_t *top = NULL,\n\t\tindex_t *bot = NULL) const;\n\n\t/**\n\t * Returns true iff the index contains the given string (exactly).  The\n\t * given string must contain only unambiguous characters.  TODO:\n\t * support skipping of ambiguous characters.\n\t */\n\tbool contains(\n\t\tconst char *str,\n\t\tindex_t *top = NULL,\n\t\tindex_t *bot = NULL) const\n\t{\n\t\treturn contains(BTDnaString(str, true), top, bot);\n\t}\n\t\n\t/// Return true iff the Ebwt is currently in memory\n\tbool isInMemory() const {\n\t\tif(ebwt() != NULL) {\n\t\t\t// Note: We might have skipped loading _offs, _ftab,\n\t\t\t// _eftab, and _rstarts depending on whether this is the\n\t\t\t// reverse index and what algorithm is being used.\n\t\t\tassert(_eh.repOk());\n\t\t\t//assert(_ftab != NULL);\n\t\t\t//assert(_eftab != NULL);\n\t\t\tassert(fchr() != NULL);\n\t\t\t//assert(_offs != NULL);\n\t\t\t//assert(_rstarts != NULL);\n\t\t\tassert_neq(_zEbwtByteOff, (index_t)OFF_MASK);\n\t\t\tassert_neq(_zEbwtBpOff, -1);\n\t\t\treturn true;\n\t\t} else {\n\t\t\tassert(ftab() == NULL);\n\t\t\tassert(eftab() == NULL);\n\t\t\tassert(fchr() == NULL);\n\t\t\tassert(offs() == NULL);\n            assert(offsw() == NULL);\n\t\t\t// assert(rstarts() == NULL); // FIXME FB: Assertion fails when calling centrifuge-build-bin-debug\n\t\t\tassert_eq(_zEbwtByteOff, (index_t)OFF_MASK);\n\t\t\tassert_eq(_zEbwtBpOff, -1);\n\t\t\treturn false;\n\t\t}\n\t}\n\n\t/// Return true iff the Ebwt is currently stored on disk\n\tbool isEvicted() const {\n\t\treturn 
!isInMemory();\n\t}\n\n\t/**\n\t * Load this Ebwt into memory by reading it in from the _in1 and\n\t * _in2 streams.\n\t */\n\tvoid loadIntoMemory(\n\t\tint color,\n\t\tint needEntireReverse,\n\t\tbool loadSASamp,\n\t\tbool loadFtab,\n\t\tbool loadRstarts,\n\t\tbool loadNames,\n\t\tbool verbose)\n\t{\n\t\treadIntoMemory(\n\t\t\tcolor,       // expect index to be colorspace?\n\t\t\tneedEntireReverse, // require reverse index to be concatenated reference reversed\n\t\t\tloadSASamp,  // load the SA sample portion?\n\t\t\tloadFtab,    // load the ftab (_ftab[] and _eftab[])?\n\t\t\tloadRstarts, // load the r-starts (_rstarts[])?\n\t\t\tfalse,       // stop after loading the header portion?\n\t\t\tNULL,        // params\n\t\t\tfalse,       // mmSweep\n\t\t\tloadNames,   // loadNames\n\t\t\tverbose);    // startVerbose\n\t}\n\n\t/**\n\t * Frees memory associated with the Ebwt.\n\t */\n\tvoid evictFromMemory() {\n\t\tassert(isInMemory());\n\t\t_fchr.free();\n\t\t_ftab.free();\n\t\t_eftab.free();\n\t\t_rstarts.free();\n\t\t_offs.free(); // might not be under control of APtrWrap\n        _offsw.free(); // might not be under control of APtrWrap\n\t\t_ebwt.free(); // might not be under control of APtrWrap\n\t\t// Keep plen; it's small and the client may want to seq it\n\t\t// even when the others are evicted.\n\t\t//_plen  = NULL;\n\t\t_zEbwtByteOff = (index_t)OFF_MASK;\n\t\t_zEbwtBpOff = -1;\n\t}\n\n\t/**\n\t * Turn a substring of 'seq' starting at offset 'off' and having\n\t * length equal to the index's 'ftabChars' into an int that can be\n\t * used to index into the ftab array.\n\t */\n\tindex_t ftabSeqToInt(\n\t\tconst BTDnaString& seq,\n\t\tindex_t off,\n\t\tbool rev) const\n\t{\n\t\tint fc = _eh._ftabChars;\n\t\tindex_t lo = off, hi = lo + fc;\n\t\tassert_leq(hi, seq.length());\n\t\tindex_t ftabOff = 0;\n\t\tfor(int i = 0; i < fc; i++) {\n\t\t\tbool fwex = fw();\n\t\t\tif(rev) fwex = !fwex;\n\t\t\t// We add characters to the ftabOff in the order they would\n\t\t\t// 
have been consumed in a normal search.  For BWT, this\n\t\t\t// means right-to-left order; for BWT' it's left-to-right.\n\t\t\tint c = (fwex ? seq[lo + i] : seq[hi - i - 1]);\n\t\t\tif(c > 3) {\n\t\t\t\treturn std::numeric_limits<index_t>::max();\n\t\t\t}\n\t\t\tassert_range(0, 3, c);\n\t\t\tftabOff <<= 2;\n\t\t\tftabOff |= c;\n\t\t}\n\t\treturn ftabOff;\n\t}\n\t\n\t/**\n\t * Non-static facade for static function ftabHi.\n\t */\n\tindex_t ftabHi(index_t i) const {\n\t\treturn Ebwt<index_t>::ftabHi(\n\t\t\tftab(),\n\t\t\teftab(),\n\t\t\t_eh._len,\n\t\t\t_eh._ftabLen,\n\t\t    _eh._eftabLen,\n\t\t\ti);\n\t}\n\n\t/**\n\t * Get \"high interpretation\" of ftab entry at index i.  The high\n\t * interpretation of a regular ftab entry is just the entry\n\t * itself.  The high interpretation of an extended entry is the\n\t * second corresponding ui32 in the eftab.\n\t *\n\t * It's a static member because it's convenient to ask this\n\t * question before the Ebwt is fully initialized.\n\t */\n\tstatic index_t ftabHi(\n\t\tconst index_t *ftab,\n\t\tconst index_t *eftab,\n\t\tindex_t len,\n\t\tindex_t ftabLen,\n\t\tindex_t eftabLen,\n\t\tindex_t i)\n\t{\n\t\tassert_lt(i, ftabLen);\n\t\tif(ftab[i] <= len) {\n\t\t\treturn ftab[i];\n\t\t} else {\n\t\t\tindex_t efIdx = ftab[i] ^ (index_t)OFF_MASK;\n\t\t\tassert_lt(efIdx*2+1, eftabLen);\n\t\t\treturn eftab[efIdx*2+1];\n\t\t}\n\t}\n\n\t/**\n\t * Non-static facade for static function ftabLo.\n\t */\n\tindex_t ftabLo(index_t i) const {\n\t\treturn Ebwt<index_t>::ftabLo(\n\t\t\tftab(),\n\t\t\teftab(),\n\t\t\t_eh._len,\n\t\t\t_eh._ftabLen,\n\t\t    _eh._eftabLen,\n\t\t\ti);\n\t}\n\t\n\t/**\n\t * Get low bound of ftab range.\n\t */\n\tindex_t ftabLo(const BTDnaString& seq, index_t off) const {\n\t\treturn ftabLo(ftabSeqToInt(seq, off, false));\n\t}\n\n\t/**\n\t * Get high bound of ftab range.\n\t */\n\tindex_t ftabHi(const BTDnaString& seq, index_t off) const {\n\t\treturn ftabHi(ftabSeqToInt(seq, off, false));\n\t}\n\t\n\t/**\n\t * 
Extract characters from seq starting at offset 'off' and going either\n\t * forward or backward, depending on 'rev'.  Order matters when compiling\n\t * the integer that gets looked up in the ftab.  Each successive character\n\t * is ORed into the least significant bit-pair, and characters are\n\t * integrated in the direction of the search.\n\t */\n\tbool\n\tftabLoHi(\n\t\tconst BTDnaString& seq, // sequence to extract from\n\t\tindex_t off,             // offset into seq to begin extracting\n\t\tbool rev,               // reverse while extracting\n\t\tindex_t& top,\n\t\tindex_t& bot) const\n\t{\n\t\tindex_t fi = ftabSeqToInt(seq, off, rev);\n\t\tif(fi == std::numeric_limits<index_t>::max()) {\n\t\t\treturn false;\n\t\t}\n\t\ttop = ftabHi(fi);\n\t\tbot = ftabLo(fi+1);\n\t\tassert_geq(bot, top);\n\t\treturn true;\n\t}\n\t\n\t/**\n\t * Get \"low interpretation\" of ftab entry at index i.  The low\n\t * interpretation of a regular ftab entry is just the entry\n\t * itself.  The low interpretation of an extended entry is the\n\t * first corresponding ui32 in the eftab.\n\t *\n\t * It's a static member because it's convenient to ask this\n\t * question before the Ebwt is fully initialized.\n\t */\n\tstatic index_t ftabLo(\n\t\tconst index_t *ftab,\n\t\tconst index_t *eftab,\n\t\tindex_t len,\n\t\tindex_t ftabLen,\n\t\tindex_t eftabLen,\n\t\tindex_t i)\n\t{\n\t\tassert_lt(i, ftabLen);\n\t\tif(ftab[i] <= len) {\n\t\t\treturn ftab[i];\n\t\t} else {\n\t\t\tindex_t efIdx = ftab[i] ^ (index_t)OFF_MASK;\n\t\t\tassert_lt(efIdx*2+1, eftabLen);\n\t\t\treturn eftab[efIdx*2];\n\t\t}\n\t}\n\n\t/**\n\t * Try to resolve the reference offset of the BW element 'elt'.  If\n\t * it can be resolved immediately, return the reference offset.  
If\n\t * it cannot be resolved immediately, return 0xffffffff.\n\t */\n\tindex_t tryOffset(index_t elt) const {\n#ifndef NDEBUG\n\t\tif(this->_offw) {\n\t\t\tassert(offsw() != NULL);\n\t\t} else {\n\t\t\tassert(offs() != NULL);\n\t\t}\n#endif\n\t\tif(elt == _zOff) return 0;\n\t\tif((elt & _eh._offMask) == elt) {\n\t\t\tindex_t eltOff = elt >> _eh._offRate;\n\t\t\tassert_lt(eltOff, _eh._offsLen);\n\t\t\tindex_t off;\n\t\t\tif(this->_offw) {\n\t\t\t\toff = offsw()[eltOff];\n\t\t\t} else {\n\t\t\t\toff = offs()[eltOff];\n\t\t\t}\n\t\t\tassert_neq((index_t)OFF_MASK, off);\n\t\t\treturn off;\n\t\t} else {\n\t\t\t// Try looking at zoff, the first check > 0 makes sure that we load the .4 index\n\t\t\tif ( _lastGenomeBoundary > 0 && elt <= _lastGenomeBoundary && _boundaryCheck.test( elt >> _boundaryCheckShift ) && saGenomeBoundaryHas( elt ) )\n\t\t\t{\n\t\t\t\tuint32_t ret = (index_t)saGenomeBoundaryVal( elt ) ;\n\t\t\t\tif ( this->_offw )\n\t\t\t\t\treturn ret ;\t\n\t\t\t\telse\n\t\t\t\t\treturn (uint16_t)ret ;\n\n\t\t\t}\n\n\t\t\treturn (index_t)OFF_MASK;\n\t\t}\n\t}\n\n\t/**\n\t * Try to resolve the reference offset of the BW element 'elt' such\n\t * that the offset returned is at the right-hand side of the\n\t * forward reference substring involved in the hit.\n\t */\n\tindex_t tryOffset(\n\t\tindex_t elt,\n\t\tbool fw,\n\t\tindex_t hitlen) const\n\t{\n\t\tindex_t off = tryOffset(elt);\n\t\tif(off != (index_t)OFF_MASK && !fw) {\n\t\t\tassert_lt(off, _eh._len);\n\t\t\toff = _eh._len - off - 1;\n\t\t\tassert_geq(off, hitlen-1);\n\t\t\toff -= (hitlen-1);\n\t\t\tassert_lt(off, _eh._len);\n\t\t}\n\t\treturn off;\n\t}\n\n\t/**\n\t * Walk 'steps' steps to the left and return the row arrived at.\n\t */\n\tindex_t walkLeft(index_t row, index_t steps) const;\n\n\t/**\n\t * Resolve the reference offset of the BW element 'elt'.\n\t */\n\tindex_t getOffset(index_t row) const;\n\n\t/**\n\t * Resolve the reference offset of the BW element 'elt' such that\n\t * the offset returned is 
at the right-hand side of the forward\n\t * reference substring involved in the hit.\n\t */\n\tindex_t getOffset(\n\t\tindex_t elt,\n\t\tbool fw,\n\t\tindex_t hitlen) const;\n\n\t/**\n\t * When using read() to create an Ebwt, we have to set a couple of\n\t * additional fields in the Ebwt object that aren't part of the\n\t * parameter list and are not stored explicitly in the file.  Right\n\t * now, this just involves initializing _zEbwtByteOff and\n\t * _zEbwtBpOff from _zOff.\n\t */\n\tvoid postReadInit(EbwtParams<index_t>& eh) {\n\t\tindex_t sideNum     = _zOff / eh._sideBwtLen;\n\t\tindex_t sideCharOff = _zOff % eh._sideBwtLen;\n\t\tindex_t sideByteOff = sideNum * eh._sideSz;\n\t\t_zEbwtByteOff = sideCharOff >> 2;\n\t\tassert_lt(_zEbwtByteOff, eh._sideBwtSz);\n\t\t_zEbwtBpOff = sideCharOff & 3;\n\t\tassert_lt(_zEbwtBpOff, 4);\n\t\t_zEbwtByteOff += sideByteOff;\n\t\tassert(repOk(eh)); // Ebwt should be fully initialized now\n\t}\n\n\t/**\n\t * Given basename of an Ebwt index, read and return its flag.\n\t */\n\tstatic int32_t readFlags(const string& instr);\n\n\t/**\n\t * Pretty-print the Ebwt to the given output stream.\n\t */\n\tvoid print(ostream& out) const {\n\t\tprint(out, _eh);\n\t}\n\t\n\t/**\n\t * Pretty-print the Ebwt and given EbwtParams to the given output\n\t * stream.\n\t */\n\tvoid print(ostream& out, const EbwtParams<index_t>& eh) const {\n\t\teh.print(out); // print params\n        return;\n\t\tout << \"Ebwt (\" << (isInMemory()? 
\"memory\" : \"disk\") << \"):\" << endl\n\t\t    << \"    zOff: \"         << _zOff << endl\n\t\t    << \"    zEbwtByteOff: \" << _zEbwtByteOff << endl\n\t\t    << \"    zEbwtBpOff: \"   << _zEbwtBpOff << endl\n\t\t    << \"    nPat: \"  << _nPat << endl\n\t\t    << \"    plen: \";\n\t\tif(plen() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << plen()[0] << endl;\n\t\t}\n\t\tout << \"    rstarts: \";\n\t\tif(rstarts() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << rstarts()[0] << endl;\n\t\t}\n\t\tout << \"    ebwt: \";\n\t\tif(ebwt() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << ebwt()[0] << endl;\n\t\t}\n\t\tout << \"    fchr: \";\n\t\tif(fchr() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << fchr()[0] << endl;\n\t\t}\n\t\tout << \"    ftab: \";\n\t\tif(ftab() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << ftab()[0] << endl;\n\t\t}\n\t\tout << \"    eftab: \";\n\t\tif(eftab() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << eftab()[0] << endl;\n\t\t}\n\t\tout << \"    offs: \";\n\t\tif(offs() == NULL) {\n\t\t\tout << \"NULL\" << endl;\n\t\t} else {\n\t\t\tout << \"non-NULL, [0] = \" << offs()[0] << endl;\n\t\t}\n\t}\n\n\t// Building\n\ttemplate <typename TStr> static TStr join(EList<TStr>& l, uint32_t seed);\n\ttemplate <typename TStr> static TStr join(EList<FileBuf*>& l, EList<RefRecord>& szs, index_t sztot, const RefReadInParams& refparams, uint32_t seed);\n\ttemplate <typename TStr> void joinToDisk(EList<FileBuf*>& l, EList<RefRecord>& szs, index_t sztot, const RefReadInParams& refparams, TStr& ret, ostream& out1, ostream& out2);\n\ttemplate <typename TStr> void buildToDisk(InorderBlockwiseSA<TStr>& sa, const TStr& s, ostream& out1, ostream& out2, ostream* saOut, ostream* bwtOut, 
ostream& out4, const EList<RefRecord>& szs, int kmer_size );\n\n\t// I/O\n\tvoid readIntoMemory(int color, int needEntireRev, bool loadSASamp, bool loadFtab, bool loadRstarts, bool justHeader, EbwtParams<index_t> *params, bool mmSweep, bool loadNames, bool startVerbose);\n\tvoid writeFromMemory(bool justHeader, ostream& out1, ostream& out2) const;\n\tvoid writeFromMemory(bool justHeader, const string& out1, const string& out2) const;\n\n\t// Sanity checking\n\tvoid sanityCheckUpToSide(int upToSide) const;\n\tvoid sanityCheckAll(int reverse) const;\n\tvoid restore(SString<char>& s) const;\n\tvoid checkOrigs(const EList<SString<char> >& os, bool color, bool mirror) const;\n\n\t// Searching and reporting\n\tvoid joinedToTextOff(index_t qlen, index_t off, index_t& tidx, index_t& textoff, index_t& tlen, bool rejectStraddle, bool& straddled) const;\n\n#define WITHIN_BWT_LEN(x) \\\n\tassert_leq(x[0], this->_eh._sideBwtLen); \\\n\tassert_leq(x[1], this->_eh._sideBwtLen); \\\n\tassert_leq(x[2], this->_eh._sideBwtLen); \\\n\tassert_leq(x[3], this->_eh._sideBwtLen)\n\n#define WITHIN_FCHR(x) \\\n\tassert_leq(x[0], this->fchr()[1]); \\\n\tassert_leq(x[1], this->fchr()[2]); \\\n\tassert_leq(x[2], this->fchr()[3]); \\\n\tassert_leq(x[3], this->fchr()[4])\n\n#define WITHIN_FCHR_DOLLARA(x) \\\n\tassert_leq(x[0], this->fchr()[1]+1); \\\n\tassert_leq(x[1], this->fchr()[2]); \\\n\tassert_leq(x[2], this->fchr()[3]); \\\n\tassert_leq(x[3], this->fchr()[4])\n\n\t/**\n\t * Count all occurrences of character c from the beginning of the\n\t * forward side to <by,bp> and add in the occ[] count up to the side\n\t * break just prior to the side.\n\t *\n\t * A Bowtie 2 side is shaped like:\n\t *\n\t * XXXXXXXXXXXXXXXX [A] [C] [G] [T]\n\t * --------48------ -4- -4- -4- -4-  (numbers in bytes)\n\t */\n\tinline index_t countBt2Side(const SideLocus<index_t>& l, int c) const {\n        assert_range(0, 3, c);\n        assert_range(0, (int)this->_eh._sideBwtSz-1, (int)l._by);\n        assert_range(0, 
3, (int)l._bp);\n        const uint8_t *side = l.side(this->ebwt());\n        index_t cCnt = countUpTo(l, c);\n        assert_leq(cCnt, l.toBWRow());\n        assert_leq(cCnt, this->_eh._sideBwtLen);\n        if(c == 0 && l._sideByteOff <= _zEbwtByteOff && l._sideByteOff + l._by >= _zEbwtByteOff) {\n            // Adjust for the fact that we represented $ with an 'A', but\n            // shouldn't count it as an 'A' here\n            if((l._sideByteOff + l._by > _zEbwtByteOff) ||\n               (l._sideByteOff + l._by == _zEbwtByteOff && l._bp > _zEbwtBpOff))\n            {\n                cCnt--; // Adjust for '$' looking like an 'A'\n            }\n        }\n        index_t ret;\n        // Now factor in the occ[] count at the side break\n        const uint8_t *acgt8 = side + _eh._sideBwtSz;\n        const index_t *acgt = reinterpret_cast<const index_t*>(acgt8);\n        assert_leq(acgt[0], this->_eh._numSides * this->_eh._sideBwtLen); // b/c it's used as padding\n        assert_leq(acgt[1], this->_eh._len);\n        assert_leq(acgt[2], this->_eh._len);\n        assert_leq(acgt[3], this->_eh._len);\n        ret = acgt[c] + cCnt + this->fchr()[c];\n#ifndef NDEBUG\n        assert_leq(ret, this->fchr()[c+1]); // can't have jumped into next char's section\n        if(c == 0) {\n            assert_leq(cCnt, this->_eh._sideBwtLen);\n        } else {\n            assert_leq(ret, this->_eh._bwtLen);\n        }\n#endif\n        return ret;\n\t}\n\n\t/**\n\t * Count all occurrences of all four nucleotides up to the starting\n\t * point (which must be in a forward side) given by 'l' storing the\n\t * result in 'cntsUpto', then count nucleotide occurrences within the\n\t * range of length 'num' storing the result in 'cntsIn'.  
Also, keep\n\t * track of the characters occurring within the range by setting\n\t * 'masks' accordingly (masks[1][10] == true -> 11th character is a\n\t * 'C', and masks[0][10] == masks[2][10] == masks[3][10] == false).\n\t */\n\tinline void countBt2SideRange(\n\t\tSideLocus<index_t>& l,        // top locus\n\t\tindex_t num,        // number of elts in range to tally\n\t\tindex_t* cntsUpto,  // A/C/G/T counts up to top\n\t\tindex_t* cntsIn,    // A/C/G/T counts within range\n\t\tEList<bool> *masks) const // masks indicating which range elts = A/C/G/T\n\t{\n\t\tassert_gt(num, 0);\n\t\tassert_range(0, (int)this->_eh._sideBwtSz-1, (int)l._by);\n\t\tassert_range(0, 3, (int)l._bp);\n\t\tcountUpToEx(l, cntsUpto);\n\t\tWITHIN_FCHR_DOLLARA(cntsUpto);\n\t\tWITHIN_BWT_LEN(cntsUpto);\n\t\tconst uint8_t *side = l.side(this->ebwt());\n\t\tif(l._sideByteOff <= _zEbwtByteOff && l._sideByteOff + l._by >= _zEbwtByteOff) {\n\t\t\t// Adjust for the fact that we represented $ with an 'A', but\n\t\t\t// shouldn't count it as an 'A' here\n\t\t\tif((l._sideByteOff + l._by > _zEbwtByteOff) ||\n\t\t\t   (l._sideByteOff + l._by == _zEbwtByteOff && l._bp > _zEbwtBpOff))\n\t\t\t{\n\t\t\t\tcntsUpto[0]--; // Adjust for '$' looking like an 'A'\n\t\t\t}\n\t\t}\n\t\t// Now factor in the occ[] count at the side break\n\t\tconst index_t *acgt = reinterpret_cast<const index_t*>(side + _eh._sideBwtSz);\n\t\tassert_leq(acgt[0], this->fchr()[1] + this->_eh.sideBwtLen());\n\t\tassert_leq(acgt[1], this->fchr()[2]-this->fchr()[1]);\n\t\tassert_leq(acgt[2], this->fchr()[3]-this->fchr()[2]);\n\t\tassert_leq(acgt[3], this->fchr()[4]-this->fchr()[3]);\n\t\tassert_leq(acgt[0], this->_eh._len + this->_eh.sideBwtLen());\n\t\tassert_leq(acgt[1], this->_eh._len);\n\t\tassert_leq(acgt[2], this->_eh._len);\n\t\tassert_leq(acgt[3], this->_eh._len);\n\t\tcntsUpto[0] += (acgt[0] + this->fchr()[0]);\n\t\tcntsUpto[1] += (acgt[1] + this->fchr()[1]);\n\t\tcntsUpto[2] += (acgt[2] + this->fchr()[2]);\n\t\tcntsUpto[3] += 
(acgt[3] + this->fchr()[3]);\n\t\tmasks[0].resize(num);\n\t\tmasks[1].resize(num);\n\t\tmasks[2].resize(num);\n\t\tmasks[3].resize(num);\n\t\tWITHIN_FCHR_DOLLARA(cntsUpto);\n\t\tWITHIN_FCHR_DOLLARA(cntsIn);\n\t\t// 'cntsUpto' is complete now.\n\t\t// Walk forward until we've tallied the entire 'In' range\n\t\tindex_t nm = 0;\n\t\t// Rest of this side\n\t\tnm += countBt2SideRange2(l, true, num - nm, cntsIn, masks, nm);\n\t\tassert_eq(nm, cntsIn[0] + cntsIn[1] + cntsIn[2] + cntsIn[3]);\n\t\tassert_leq(nm, num);\n\t\tSideLocus<index_t> lcopy = l;\n\t\twhile(nm < num) {\n\t\t\t// Subsequent sides, if necessary\n\t\t\tlcopy.nextSide(this->_eh);\n\t\t\tnm += countBt2SideRange2(lcopy, false, num - nm, cntsIn, masks, nm);\n\t\t\tWITHIN_FCHR_DOLLARA(cntsIn);\n\t\t\tassert_leq(nm, num);\n\t\t\tassert_eq(nm, cntsIn[0] + cntsIn[1] + cntsIn[2] + cntsIn[3]);\n\t\t}\n\t\tassert_eq(num, cntsIn[0] + cntsIn[1] + cntsIn[2] + cntsIn[3]);\n\t\tWITHIN_FCHR_DOLLARA(cntsIn);\n\t}\n\n\t/**\n\t * Count all occurrences of character c from the beginning of the\n\t * forward side to <by,bp> and add in the occ[] count up to the side\n\t * break just prior to the side.\n\t *\n\t * A forward side is shaped like:\n\t *\n\t * [A] [C] XXXXXXXXXXXXXXXX\n\t * -4- -4- --------56------ (numbers in bytes)\n\t *         ^\n\t *         Side ptr (result from SideLocus.side())\n\t *\n\t * And following it is a reverse side shaped like:\n\t * \n\t * [G] [T] XXXXXXXXXXXXXXXX\n\t * -4- -4- --------56------ (numbers in bytes)\n\t *         ^\n\t *         Side ptr (result from SideLocus.side())\n\t *\n\t */\n\tinline void countBt2SideEx(const SideLocus<index_t>& l, index_t* arrs) const {\n\t\tassert_range(0, (int)this->_eh._sideBwtSz-1, (int)l._by);\n\t\tassert_range(0, 3, (int)l._bp);\n\t\tcountUpToEx(l, arrs);\n\t\tif(l._sideByteOff <= _zEbwtByteOff && l._sideByteOff + l._by >= _zEbwtByteOff) {\n\t\t\t// Adjust for the fact that we represented $ with an 'A', but\n\t\t\t// shouldn't count it as an 'A' 
here\n\t\t\tif((l._sideByteOff + l._by > _zEbwtByteOff) ||\n\t\t\t   (l._sideByteOff + l._by == _zEbwtByteOff && l._bp > _zEbwtBpOff))\n\t\t\t{\n\t\t\t\tarrs[0]--; // Adjust for '$' looking like an 'A'\n\t\t\t}\n\t\t}\n\t\tWITHIN_FCHR(arrs);\n\t\tWITHIN_BWT_LEN(arrs);\n\t\t// Now factor in the occ[] count at the side break\n\t\tconst uint8_t *side = l.side(this->ebwt());\n\t\tconst uint8_t *acgt16 = side + this->_eh._sideSz - sizeof(index_t) * 4;\n\t\tconst index_t *acgt = reinterpret_cast<const index_t*>(acgt16);\n\t\tassert_leq(acgt[0], this->fchr()[1] + this->_eh.sideBwtLen());\n\t\tassert_leq(acgt[1], this->fchr()[2]-this->fchr()[1]);\n\t\tassert_leq(acgt[2], this->fchr()[3]-this->fchr()[2]);\n\t\tassert_leq(acgt[3], this->fchr()[4]-this->fchr()[3]);\n\t\tassert_leq(acgt[0], this->_eh._len + this->_eh.sideBwtLen());\n\t\tassert_leq(acgt[1], this->_eh._len);\n\t\tassert_leq(acgt[2], this->_eh._len);\n\t\tassert_leq(acgt[3], this->_eh._len);\n\t\tarrs[0] += (acgt[0] + this->fchr()[0]);\n\t\tarrs[1] += (acgt[1] + this->fchr()[1]);\n\t\tarrs[2] += (acgt[2] + this->fchr()[2]);\n\t\tarrs[3] += (acgt[3] + this->fchr()[3]);\n\t\tWITHIN_FCHR(arrs);\n\t}\n\n    /**\n\t * Counts the number of occurrences of character 'c' in the given Ebwt\n\t * side up to (but not including) the given byte/bitpair (by/bp).\n\t *\n\t * This is a performance-critical function.  This is the top search-\n\t * related hit in the time profile.\n\t *\n\t * Function gets 11.09% in profile\n\t */\n\tinline index_t countUpTo(const SideLocus<index_t>& l, int c) const {\n\t\t// Count occurrences of c in each 64-bit (using bit trickery);\n\t\t// Someday countInU64() and pop() functions should be\n\t\t// vectorized/SSE-ized in case that helps.\n        bool usePOPCNT = false;\n\t\tindex_t cCnt = 0;\n\t\tconst uint8_t *side = l.side(this->ebwt());\n\t\tint i = 0;\n#ifdef POPCNT_CAPABILITY\n        if(_usePOPCNTinstruction) {\n            usePOPCNT = true;\n            int by = l._by + (l._bp > 0 ? 
1 : 0);\n            for(; i < by; i += 8) {\n                if(i + 8 < by) {\n                    cCnt += countInU64<USE_POPCNT_INSTRUCTION>(c, *(uint64_t*)&side[i]);\n                } else {\n                    index_t by_shift = 8 - (by - i);\n                    index_t bp_shift = (l._bp > 0 ? 4 - l._bp : 0);\n                    index_t shift = (by_shift << 3) + (bp_shift << 1);\n                    uint64_t side_i = *(uint64_t*)&side[i];\n                    side_i = (_toBigEndian ? side_i >> shift : side_i << shift);\n                    index_t cCnt_add = countInU64<USE_POPCNT_INSTRUCTION>(c, side_i);\n                    if(c == 0) cCnt_add -= (shift >> 1);\n#ifndef NDEBUG\n                    index_t cCnt_temp = 0;\n                    for(int j = i; j < l._by; j++) {\n                        cCnt_temp += cCntLUT_4[0][c][side[j]];\n                    }\n                    if(l._bp > 0) {\n                        cCnt_temp += cCntLUT_4[(int)l._bp][c][side[l._by]];\n                    }\n                    assert_eq(cCnt_add, cCnt_temp);\n#endif\n                    cCnt += cCnt_add;\n                    break;\n                }\n            }\n        } else {\n            for(; i + 7 < l._by; i += 8) {\n                cCnt += countInU64<USE_POPCNT_GENERIC>(c, *(uint64_t*)&side[i]);\n            }\n        }\n#else\n        for(; i + 7 < l._by; i += 8) {\n            cCnt += countInU64(c, *(uint64_t*)&side[i]);\n        }\n#endif\n        \n        if(!usePOPCNT) {\n            // Count occurrences of c in the rest of the side (using LUT)\n            for(; i < l._by; i++) {\n                cCnt += cCntLUT_4[0][c][side[i]];\n            }\n            \n            // Count occurrences of c in the rest of the byte\n            if(l._bp > 0) {\n                cCnt += cCntLUT_4[(int)l._bp][c][side[i]];\n            }\n        }\n        \n\t\treturn cCnt;\n\t}\n    \n    /**\n\t * Counts the number of occurrences of character 'c' in the given 
Ebwt\n\t * side down to the given byte/bitpair (by/bp).\n\t *\n\t */\n\tinline index_t countDownTo(const SideLocus<index_t>& l, int c) const {\n\t\t// Count occurrences of c in each 64-bit (using bit trickery);\n\t\t// Someday countInU64() and pop() functions should be\n\t\t// vectorized/SSE-ized in case that helps.\n\t\tindex_t cCnt = 0;\n\t\tconst uint8_t *side = l.side(this->ebwt());\n\t\tint i = 64 - 4 * sizeof(index_t) - 1;\n#ifdef POPCNT_CAPABILITY\n        if ( _usePOPCNTinstruction) {\n            for(; i - 7 > l._by; i -= 8) {\n                cCnt += countInU64<USE_POPCNT_INSTRUCTION>(c, *(uint64_t*)&side[i-7]);\n            }\n        }\n        else {\n            for(; i + 7 > l._by; i -= 8) {\n                cCnt += countInU64<USE_POPCNT_GENERIC>(c, *(uint64_t*)&side[i-7]);\n            }\n        }\n#else\n        for(; i + 7 > l._by; i -= 8) {\n            cCnt += countInU64(c, *(uint64_t*)&side[i-7]);\n        }\n#endif\n\t\t// Count occurrences of c in the rest of the side (using LUT)\n\t\tfor(; i > l._by; i--) {\n\t\t\tcCnt += cCntLUT_4_rev[0][c][side[i]];\n\t\t}\n\t\t// Count occurrences of c in the rest of the byte\n\t\tif(l._bp > 0) {\n\t\t\tcCnt += cCntLUT_4_rev[4-(int)l._bp][c][side[i]];\n\t\t} else {\n            cCnt += cCntLUT_4_rev[0][c][side[i]];\n        }\n\t\treturn cCnt;\n\t}\n\n    /**\n     * Tricky-bit-bashing bitpair counting for given two-bit value (0-3)\n     * within a 64-bit argument.\n     *\n     * Function gets 2.32% in profile\n     */\n#ifdef POPCNT_CAPABILITY\n    template<typename Operation>\n#endif\n    inline static void countInU64Ex(uint64_t dw, index_t* arrs) {\n        uint64_t c0 = c_table[0];\n        uint64_t x0 = dw ^ c0;\n        uint64_t x1 = (x0 >> 1);\n        uint64_t x2 = x1 & (0x5555555555555555llu);\n        uint64_t x3 = x0 & x2;\n#ifdef POPCNT_CAPABILITY\n        uint64_t tmp = Operation().pop64(x3);\n#else\n        uint64_t tmp = pop64(x3);\n#endif\n        arrs[0] += (uint32_t) tmp;\n        \n     
   c0 = c_table[1];\n        x0 = dw ^ c0;\n        x1 = (x0 >> 1);\n        x2 = x1 & (0x5555555555555555llu);\n        x3 = x0 & x2;\n#ifdef POPCNT_CAPABILITY\n        tmp = Operation().pop64(x3);\n#else\n        tmp = pop64(x3);\n#endif\n        arrs[1] += (uint32_t) tmp;\n        \n        c0 = c_table[2];\n        x0 = dw ^ c0;\n        x1 = (x0 >> 1);\n        x2 = x1 & (0x5555555555555555llu);\n        x3 = x0 & x2;\n#ifdef POPCNT_CAPABILITY\n        tmp = Operation().pop64(x3);\n#else\n        tmp = pop64(x3);\n#endif\n        arrs[2] += (uint32_t) tmp;\n        \n        c0 = c_table[3];\n        x0 = dw ^ c0;\n        x1 = (x0 >> 1);\n        x2 = x1 & (0x5555555555555555llu);\n        x3 = x0 & x2;\n#ifdef POPCNT_CAPABILITY\n        tmp = Operation().pop64(x3);\n#else\n        tmp = pop64(x3);\n#endif\n        arrs[3] += (uint32_t) tmp;\n    }\n\n\t/**\n\t * Counts the number of occurrences of all four nucleotides in the\n\t * given side up to (but not including) the given byte/bitpair (by/bp).\n\t * Count for 'a' goes in arrs[0], 'c' in arrs[1], etc.\n\t */\n\tinline void countUpToEx(const SideLocus<index_t>& l, index_t* arrs) const {\n\t\tint i = 0;\n\t\t// Count occurrences of each nucleotide in each 64-bit word using\n\t\t// bit trickery; note: this does not seem to lend a\n\t\t// significant boost to performance in practice.  If you comment\n\t\t// out this whole loop (which won't affect correctness - it will\n\t\t// just cause the following loop to take up the slack) then runtime\n\t\t// does not change noticeably. 
Someday the countInU64() and pop()\n\t\t// functions should be vectorized/SSE-ized in case that helps.\n\t\tconst uint8_t *side = l.side(this->ebwt());\n#ifdef POPCNT_CAPABILITY\n        if (_usePOPCNTinstruction) {\n            for(; i+7 < l._by; i += 8) {\n                countInU64Ex<USE_POPCNT_INSTRUCTION>(*(uint64_t*)&side[i], arrs);\n            }\n        }\n        else {\n            for(; i+7 < l._by; i += 8) {\n                countInU64Ex<USE_POPCNT_GENERIC>(*(uint64_t*)&side[i], arrs);\n            }\n        }\n#else\n        for(; i+7 < l._by; i += 8) {\n            countInU64Ex(*(uint64_t*)&side[i], arrs);\n        }\n#endif\n\t\t// Count occurrences of nucleotides in the rest of the side (using LUT)\n\t\t// Many cache misses on following lines (~20K)\n\t\tfor(; i < l._by; i++) {\n\t\t\tarrs[0] += cCntLUT_4[0][0][side[i]];\n\t\t\tarrs[1] += cCntLUT_4[0][1][side[i]];\n\t\t\tarrs[2] += cCntLUT_4[0][2][side[i]];\n\t\t\tarrs[3] += cCntLUT_4[0][3][side[i]];\n\t\t}\n\t\t// Count occurrences of nucleotides in the rest of the byte\n\t\tif(l._bp > 0) {\n\t\t\tarrs[0] += cCntLUT_4[(int)l._bp][0][side[i]];\n\t\t\tarrs[1] += cCntLUT_4[(int)l._bp][1][side[i]];\n\t\t\tarrs[2] += cCntLUT_4[(int)l._bp][2][side[i]];\n\t\t\tarrs[3] += cCntLUT_4[(int)l._bp][3][side[i]];\n\t\t}\n\t}\n\n#ifndef NDEBUG\n\t/**\n\t * Given top and bot loci, calculate counts of all four DNA chars up to\n\t * those loci.  
Used for more advanced backtracking-search.\n\t */\n\tinline void mapLFEx(\n\t\tconst SideLocus<index_t>& l,\n\t\tindex_t *arrs\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tassert_eq(0, arrs[0]);\n\t\tassert_eq(0, arrs[1]);\n\t\tassert_eq(0, arrs[2]);\n\t\tassert_eq(0, arrs[3]);\n\t\tcountBt2SideEx(l, arrs);\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with individual calls to mapLF;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tassert_eq(mapLF(l, 0, true), arrs[0]);\n\t\t\tassert_eq(mapLF(l, 1, true), arrs[1]);\n\t\t\tassert_eq(mapLF(l, 2, true), arrs[2]);\n\t\t\tassert_eq(mapLF(l, 3, true), arrs[3]);\n\t\t}\n\t}\n#endif\n\n\t/**\n\t * Given top and bot rows, calculate counts of all four DNA chars up to\n\t * those loci.\n\t */\n\tinline void mapLFEx(\n\t\tindex_t top,\n\t\tindex_t bot,\n\t\tindex_t *tops,\n\t\tindex_t *bots\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tSideLocus<index_t> ltop, lbot;\n\t\tSideLocus<index_t>::initFromTopBot(top, bot, _eh, ebwt(), ltop, lbot);\n\t\tmapLFEx(ltop, lbot, tops, bots ASSERT_ONLY(, overrideSanity));\n\t}\n\n\t/**\n\t * Given top and bot loci, calculate counts of all four DNA chars up to\n\t * those loci.  
Used for more advanced backtracking-search.\n\t */\n\tinline void mapLFEx(\n\t\tconst SideLocus<index_t>& ltop,\n\t\tconst SideLocus<index_t>& lbot,\n\t\tindex_t *tops,\n\t\tindex_t *bots\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tassert(ltop.repOk(this->eh()));\n\t\tassert(lbot.repOk(this->eh()));\n\t\tassert_eq(0, tops[0]); assert_eq(0, bots[0]);\n\t\tassert_eq(0, tops[1]); assert_eq(0, bots[1]);\n\t\tassert_eq(0, tops[2]); assert_eq(0, bots[2]);\n\t\tassert_eq(0, tops[3]); assert_eq(0, bots[3]);\n\t\tcountBt2SideEx(ltop, tops);\n\t\tcountBt2SideEx(lbot, bots);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with individual calls to mapLF;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tassert_eq(mapLF(ltop, 0, true), tops[0]);\n\t\t\tassert_eq(mapLF(ltop, 1, true), tops[1]);\n\t\t\tassert_eq(mapLF(ltop, 2, true), tops[2]);\n\t\t\tassert_eq(mapLF(ltop, 3, true), tops[3]);\n\t\t\tassert_eq(mapLF(lbot, 0, true), bots[0]);\n\t\t\tassert_eq(mapLF(lbot, 1, true), bots[1]);\n\t\t\tassert_eq(mapLF(lbot, 2, true), bots[2]);\n\t\t\tassert_eq(mapLF(lbot, 3, true), bots[3]);\n\t\t}\n#endif\n\t}\n\n\t/**\n\t * Counts the number of occurrences of all four nucleotides in the\n\t * given side from the given byte/bitpair (l->_by/l->_bp) (or the\n\t * beginning of the side if l == 0).  
Count for 'a' goes in arrs[0],\n\t * 'c' in arrs[1], etc.\n\t *\n\t * Note: must account for $.\n\t *\n\t * Must fill in masks\n\t */\n\tinline index_t countBt2SideRange2(\n\t\tconst SideLocus<index_t>& l,\n\t\tbool startAtLocus,\n\t\tindex_t num,\n\t\tindex_t* arrs,\n\t\tEList<bool> *masks,\n\t\tindex_t maskOff) const\n\t{\n\t\tassert(!masks[0].empty());\n\t\tassert_eq(masks[0].size(), masks[1].size());\n\t\tassert_eq(masks[0].size(), masks[2].size());\n\t\tassert_eq(masks[0].size(), masks[3].size());\n\t\tASSERT_ONLY(index_t myarrs[4] = {0, 0, 0, 0});\n\t\tindex_t nm = 0; // number of nucleotides tallied so far\n\t\tint iby = 0;      // initial byte offset\n\t\tint ibp = 0;      // initial base-pair offset\n\t\tif(startAtLocus) {\n\t\t\tiby = l._by;\n\t\t\tibp = l._bp;\n\t\t} else {\n\t\t\t// Start at beginning\n\t\t}\n\t\tint by = iby, bp = ibp;\n\t\tassert_lt(bp, 4);\n\t\tassert_lt(by, (int)this->_eh._sideBwtSz);\n\t\tconst uint8_t *side = l.side(this->ebwt());\n\t\twhile(nm < num) {\n\t\t\tint c = (side[by] >> (bp * 2)) & 3;\n\t\t\tassert_lt(maskOff + nm, masks[c].size());\n\t\t\tmasks[0][maskOff + nm] = masks[1][maskOff + nm] =\n\t\t\tmasks[2][maskOff + nm] = masks[3][maskOff + nm] = false;\n\t\t\tassert_range(0, 3, c);\n\t\t\t// Note: we tally $ just like an A\n\t\t\tarrs[c]++; // tally it\n\t\t\tASSERT_ONLY(myarrs[c]++);\n\t\t\tmasks[c][maskOff + nm] = true; // not dead\n\t\t\tnm++;\n\t\t\tif(++bp == 4) {\n\t\t\t\tbp = 0;\n\t\t\t\tby++;\n\t\t\t\tassert_leq(by, (int)this->_eh._sideBwtSz);\n\t\t\t\tif(by == (int)this->_eh._sideBwtSz) {\n\t\t\t\t\t// Fell off the end of the side\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tWITHIN_FCHR_DOLLARA(arrs);\n#ifndef NDEBUG\n\t\tif(_sanity) {\n\t\t\t// Make sure results match up with a call to mapLFEx.\n\t\t\tindex_t tops[4] = {0, 0, 0, 0};\n\t\t\tindex_t bots[4] = {0, 0, 0, 0};\n\t\t\tindex_t top = l.toBWRow();\n\t\t\tindex_t bot = top + nm;\n\t\t\tmapLFEx(top, bot, tops, bots, false);\n\t\t\tassert(myarrs[0] == 
(bots[0] - tops[0]) || myarrs[0] == (bots[0] - tops[0])+1);\n\t\t\tassert_eq(myarrs[1], bots[1] - tops[1]);\n\t\t\tassert_eq(myarrs[2], bots[2] - tops[2]);\n\t\t\tassert_eq(myarrs[3], bots[3] - tops[3]);\n\t\t}\n#endif\n\t\treturn nm;\n\t}\n\n\t/**\n\t * Return the final character in row i (i.e. the i'th character in the\n\t * BWT transform).  Note that the 'L' in the name of the function\n\t * stands for 'last', as in the literature.\n\t */\n\tinline int rowL(const SideLocus<index_t>& l) const {\n\t\t// Extract and return appropriate bit-pair\n\t\treturn unpack_2b_from_8b(l.side(this->ebwt())[l._by], l._bp);\n\t}\n\n\t/**\n\t * Return the final character in row i (i.e. the i'th character in the\n\t * BWT transform).  Note that the 'L' in the name of the function\n\t * stands for 'last', as in the literature.\n\t */\n\tinline int rowL(index_t i) const {\n\t\t// Extract and return appropriate bit-pair\n\t\tSideLocus<index_t> l;\n\t\tl.initFromRow(i, _eh, ebwt());\n\t\treturn rowL(l);\n\t}\n\n\t/**\n\t * Given top and bot loci, calculate counts of all four DNA chars up to\n\t * those loci.  
Used for more advanced backtracking-search.\n\t */\n\tinline void mapLFRange(\n\t\tSideLocus<index_t>& ltop,\n\t\tSideLocus<index_t>& lbot,\n\t\tindex_t num,        // Number of elts\n\t\tindex_t* cntsUpto,  // A/C/G/T counts up to top\n\t\tindex_t* cntsIn,    // A/C/G/T counts within range\n\t\tEList<bool> *masks\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tassert(ltop.repOk(this->eh()));\n\t\tassert(lbot.repOk(this->eh()));\n\t\tassert_eq(num, lbot.toBWRow() - ltop.toBWRow());\n\t\tassert_eq(0, cntsUpto[0]); assert_eq(0, cntsIn[0]);\n\t\tassert_eq(0, cntsUpto[1]); assert_eq(0, cntsIn[1]);\n\t\tassert_eq(0, cntsUpto[2]); assert_eq(0, cntsIn[2]);\n\t\tassert_eq(0, cntsUpto[3]); assert_eq(0, cntsIn[3]);\n\t\tcountBt2SideRange(ltop, num, cntsUpto, cntsIn, masks);\n\t\tassert_eq(num, cntsIn[0] + cntsIn[1] + cntsIn[2] + cntsIn[3]);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with individual calls to mapLF;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tindex_t tops[4] = {0, 0, 0, 0};\n\t\t\tindex_t bots[4] = {0, 0, 0, 0};\n\t\t\tassert(ltop.repOk(this->eh()));\n\t\t\tassert(lbot.repOk(this->eh()));\n\t\t\tmapLFEx(ltop, lbot, tops, bots, false);\n\t\t\tfor(int i = 0; i < 4; i++) {\n\t\t\t\tassert(cntsUpto[i] == tops[i] || tops[i] == bots[i]);\n\t\t\t\tif(i == 0) {\n\t\t\t\t\tassert(cntsIn[i] == bots[i]-tops[i] ||\n\t\t\t\t\t\t   cntsIn[i] == bots[i]-tops[i]+1);\n\t\t\t\t} else {\n\t\t\t\t\tassert_eq(cntsIn[i], bots[i]-tops[i]);\n\t\t\t\t}\n\t\t\t}\n\t\t}\n#endif\n\t}\n\n\t/**\n\t * Given row i, return the row that the LF mapping maps i to.\n\t */\n\tinline index_t mapLF(\n\t\tconst SideLocus<index_t>& l\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tASSERT_ONLY(index_t srcrow = l.toBWRow());\n\t\tindex_t ret;\n\t\tassert(l.side(this->ebwt()) != NULL);\n\t\tint c = rowL(l);\n\t\tassert_lt(c, 
4);\n\t\tassert_geq(c, 0);\n\t\tret = countBt2Side(l, c);\n\t\tassert_lt(ret, this->_eh._bwtLen);\n\t\tassert_neq(srcrow, ret);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with results from mapLFEx;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tindex_t arrs[] = { 0, 0, 0, 0 };\n\t\t\tmapLFEx(l, arrs, true);\n\t\t\tassert_eq(arrs[c], ret);\n\t\t}\n#endif\n\t\treturn ret;\n\t}\n\n\t/**\n\t * Given row i and character c, return the row that the LF mapping maps\n\t * i to on character c.\n\t */\n\tinline index_t mapLF(\n\t\tconst SideLocus<index_t>& l, int c\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tindex_t ret;\n\t\tassert_lt(c, 4);\n\t\tassert_geq(c, 0);\n\t\tret = countBt2Side(l, c);\n\t\tassert_lt(ret, this->_eh._bwtLen);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with results from mapLFEx;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tindex_t arrs[] = { 0, 0, 0, 0 };\n\t\t\tmapLFEx(l, arrs, true);\n\t\t\tassert_eq(arrs[c], ret);\n\t\t}\n#endif\n\t\treturn ret;\n\t}\n\n\t/**\n\t * Given top and bot loci, calculate counts of all four DNA chars up to\n\t * those loci.  
Also, update a set of tops and bots for the reverse\n\t * index/direction using the idea from the bi-directional BWT paper.\n\t */\n\tinline void mapBiLFEx(\n\t\tconst SideLocus<index_t>& ltop,\n\t\tconst SideLocus<index_t>& lbot,\n\t\tindex_t *tops,\n\t\tindex_t *bots,\n\t\tindex_t *topsP, // topsP[0] = top\n\t\tindex_t *botsP\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n#ifndef NDEBUG\n\t\t// Callers must pass zeroed count arrays; check every entry\n\t\tfor(int i = 0; i < 4; i++) {\n\t\t\tassert_eq(0, tops[i]);  assert_eq(0, bots[i]);\n\t\t}\n#endif\n\t\tcountBt2SideEx(ltop, tops);\n\t\tcountBt2SideEx(lbot, bots);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with individual calls to mapLF;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tassert_eq(mapLF(ltop, 0, true), tops[0]);\n\t\t\tassert_eq(mapLF(ltop, 1, true), tops[1]);\n\t\t\tassert_eq(mapLF(ltop, 2, true), tops[2]);\n\t\t\tassert_eq(mapLF(ltop, 3, true), tops[3]);\n\t\t\tassert_eq(mapLF(lbot, 0, true), bots[0]);\n\t\t\tassert_eq(mapLF(lbot, 1, true), bots[1]);\n\t\t\tassert_eq(mapLF(lbot, 2, true), bots[2]);\n\t\t\tassert_eq(mapLF(lbot, 3, true), bots[3]);\n\t\t}\n#endif\n\t\t// bots[0..3] - tops[0..3] = # of ways to extend the suffix with an\n\t\t// A, C, G, T\n\t\tbotsP[0] = topsP[0] + (bots[0] - tops[0]);\n\t\ttopsP[1] = botsP[0];\n\t\tbotsP[1] = topsP[1] + (bots[1] - tops[1]);\n\t\ttopsP[2] = botsP[1];\n\t\tbotsP[2] = topsP[2] + (bots[2] - tops[2]);\n\t\ttopsP[3] = botsP[2];\n\t\tbotsP[3] = topsP[3] + (bots[3] - tops[3]);\n\t}\n\n\t/**\n\t * Given row and its locus information, proceed on the given character\n\t * and return the next row, or all-fs if we can't proceed on that\n\t * character.  
Returns 0xffffffff if this row ends in $.\n\t */\n\tinline index_t mapLF1(\n\t\tindex_t row,       // starting row\n\t\tconst SideLocus<index_t>& l, // locus for starting row\n\t\tint c               // character to proceed on\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tif(rowL(l) != c || row == _zOff) return (index_t)OFF_MASK;\n\t\tindex_t ret;\n\t\tassert_lt(c, 4);\n\t\tassert_geq(c, 0);\n\t\tret = countBt2Side(l, c);\n\t\tassert_lt(ret, this->_eh._bwtLen);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with results from mapLFEx;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tindex_t arrs[] = { 0, 0, 0, 0 };\n\t\t\tmapLFEx(l, arrs, true);\n\t\t\tassert_eq(arrs[c], ret);\n\t\t}\n#endif\n\t\treturn ret;\n\t}\n\n\n\t/**\n\t * Given row and its locus information, set the row to LF(row) and\n\t * return the character that was in the final column.\n\t */\n\tinline int mapLF1(\n\t\tindex_t& row,      // starting row\n\t\tconst SideLocus<index_t>& l  // locus for starting row\n\t\tASSERT_ONLY(, bool overrideSanity = false)\n\t\t) const\n\t{\n\t\tif(row == _zOff) return -1;\n\t\tint c = rowL(l);\n\t\tassert_range(0, 3, c);\n\t\trow = countBt2Side(l, c);\n\t\tassert_lt(row, this->_eh._bwtLen);\n#ifndef NDEBUG\n\t\tif(_sanity && !overrideSanity) {\n\t\t\t// Make sure results match up with results from mapLFEx;\n\t\t\t// be sure to override sanity-checking in the callee, or we'll\n\t\t\t// have infinite recursion\n\t\t\tindex_t arrs[] = { 0, 0, 0, 0 };\n\t\t\tmapLFEx(l, arrs, true);\n\t\t\tassert_eq(arrs[c], row);\n\t\t}\n#endif\n\t\treturn c;\n\t}\n\n#ifndef NDEBUG\n\t/// Check that in-memory Ebwt is internally consistent with respect\n\t/// to given EbwtParams; assert if not\n\tbool inMemoryRepOk(const EbwtParams<index_t>& eh) const {\n\t\tassert_geq(_zEbwtBpOff, 0);\n\t\tassert_lt(_zEbwtBpOff, 4);\n\t\tassert_lt(_zEbwtByteOff, 
eh._ebwtTotSz);\n\t\tassert_lt(_zOff, eh._bwtLen);\n\t\tassert_geq(_nFrag, _nPat);\n\t\treturn true;\n\t}\n\n\t/// Check that in-memory Ebwt is internally consistent; assert if\n\t/// not\n\tbool inMemoryRepOk() const {\n\t\treturn repOk(_eh);\n\t}\n\n\t/// Check that Ebwt is internally consistent with respect to given\n\t/// EbwtParams; assert if not\n\tbool repOk(const EbwtParams<index_t>& eh) const {\n\t\tassert(_eh.repOk());\n\t\tif(isInMemory()) {\n\t\t\treturn inMemoryRepOk(eh);\n\t\t}\n\t\treturn true;\n\t}\n\n\t/// Check that Ebwt is internally consistent; assert if not\n\tbool repOk() const {\n\t\treturn repOk(_eh);\n\t}\n#endif\n    \n    string get_uid(const string& header) {\n        size_t ndelim = 0;\n        size_t j = 0;\n        for(; j < header.length(); j++) {\n            if(header[j] == ' ') break;\n            if(header[j] == '|') ndelim++;\n            if(ndelim == 2) break;\n        }\n        string uid = header.substr(0, j);\n        return uid;\n    }\n    \n    uint64_t get_tid(const string& stid) {\n        uint64_t tid1 = 0, tid2 = 0;\n        bool sawDot = false;\n        for(size_t i = 0; i < stid.length(); i++) {\n            if(stid[i] == '.') {\n                sawDot = true;\n                continue;\n            }\n            uint32_t num = stid[i] - '0';\n            if(sawDot) {\n                tid2 = tid2 * 10 + num;\n            } else {\n                tid1 = tid1 * 10 + num;\n            }\n        }\n        return tid1 | (tid2 << 32);\n    }\n\n\tbool       _toBigEndian;\n\tint32_t    _overrideOffRate;\n\tbool       _verbose;\n\tbool       _passMemExc;\n\tbool       _sanity;\n\tbool       fw_;     // true iff this is a forward index\n\tFILE    *_in1;    // input fd for primary index file\n\tFILE    *_in2;    // input fd for secondary index file\n\tstring     _in1Str; // filename for primary index file\n\tstring     _in2Str; // filename for secondary index file\n    string     _inSaStr;  // filename for suffix-array 
file\n    string     _inBwtStr; // filename for BWT file\n\tindex_t    _zOff;\n\tindex_t    _zEbwtByteOff;\n\tint        _zEbwtBpOff;\n\tindex_t    _nPat;  /// number of reference texts\n\tindex_t    _nFrag; /// number of fragments\n\tAPtrWrap<index_t> _plen;\n\tAPtrWrap<index_t> _rstarts; // starting offset of fragments / text indexes\n\t// _fchr, _ftab and _eftab are expected to be relatively small\n\t// (usually < 1MB, perhaps a few MB if _fchr is particularly large\n\t// - like, say, 11).  For this reason, we don't bother with writing\n\t// them to disk through separate output streams; we\n\tAPtrWrap<index_t> _fchr;\n\tAPtrWrap<index_t> _ftab;\n\tAPtrWrap<index_t> _eftab; // \"extended\" entries for _ftab\n\t// _offs may be extremely large.  E.g. for DNA w/ offRate=4 (one\n\t// offset every 16 rows), the total size of _offs is the same as\n\t// the total size of the input sequence\n    bool _offw;\n\tAPtrWrap<uint16_t> _offs;  // offset when # of seq. is less than 2^16\n    APtrWrap<uint32_t> _offsw; // offset when # of seq. is more than 2^16\n\t// _ebwt is the Extended Burrows-Wheeler Transform itself, and thus\n\t// is at least as large as the input sequence.\n\tAPtrWrap<uint8_t> _ebwt;\n\tbool       _useMm;        /// use memory-mapped files to hold the index\n\tbool       useShmem_;     /// use shared memory to hold large parts of the index\n\tEList<string> _refnames; /// names of the reference sequences\n\tchar *mmFile1_;\n\tchar *mmFile2_;\n    \n    bool _compressed; // compressed index?\n\tbool packed_;\n    \n    EList<pair<string, uint64_t> >   _uid_to_tid; // table that converts uid to tid\n    TaxonomyTree _tree;\n    TaxonomyPathTable                _paths;\n    std::map<uint64_t, string>       _name;\n    std::map<uint64_t, uint64_t>     _size;\n    std::map<uint64_t, uint32_t> _saGenomeBoundary ; // indicate the corresponding SA coordinate corresponds to the start of a ref genome. 
\n    uint64_t _lastGenomeBoundary ;\n    uint64_t _boundaryCheckShift ;\n    EBitList<128> _boundaryCheck ;\n\n\tEbwtParams<index_t> _eh;\n\n\tstatic const uint64_t default_bmax = OFF_MASK;\n\tstatic const uint64_t default_bmaxMultSqrt = OFF_MASK;\n\tstatic const uint64_t default_bmaxDivN = 4;\n\tstatic const int      default_dcv = 1024;\n\tstatic const bool     default_noDc = false;\n\tstatic const bool     default_useBlockwise = true;\n\tstatic const uint32_t default_seed = 0;\n#ifdef BOWTIE_64BIT_INDEX\n\tstatic const int      default_lineRate = 7;\n#else\n\tstatic const int      default_lineRate = 6;\n#endif\n\tstatic const int      default_offRate = 5;\n\tstatic const int      default_offRatePlus = 0;\n\tstatic const int      default_ftabChars = 10;\n\tstatic const bool     default_bigEndian = false;\n\nprotected:\n\n\tostream& log() const {\n\t\treturn cout; // TODO: turn this into a parameter\n\t}\n\n\t/// Print a verbose message and flush (flushing is helpful for\n\t/// debugging)\n\tvoid verbose(const string& s) const {\n\t\tif(this->verbose()) {\n\t\t\tthis->log() << s.c_str();\n\t\t\tthis->log().flush();\n\t\t}\n\t}\n};\n\n/**\n * Read reference names from an input stream 'in' for an Ebwt primary\n * file and store them in 'refnames'.\n */\ntemplate <typename index_t>\nvoid readEbwtRefnames(istream& in, EList<string>& refnames);\n\n/**\n * Read reference names from the index with basename 'in' and store\n * them in 'refnames'.\n */\ntemplate <typename index_t>\nvoid readEbwtRefnames(const string& instr, EList<string>& refnames);\n\n/**\n * Read just enough of the Ebwt's header to determine whether it's\n * colorspace.\n */\nbool readEbwtColor(const string& instr);\n\n/**\n * Read just enough of the Ebwt's header to determine whether it's\n * entirely reversed.\n */\nbool readEntireReverse(const string& instr);\n\n///////////////////////////////////////////////////////////////////////\n//\n// Functions for building 
Ebwts\n//\n///////////////////////////////////////////////////////////////////////\n\n/**\n * Join several text strings together in a way that's compatible with\n * the text-chunking scheme dictated by the chunkRate parameter.\n *\n * The non-static member Ebwt::join additionally builds auxiliary\n * arrays that maintain a mapping between chunks in the joined string\n * and the original text strings.\n */\ntemplate <typename index_t>\ntemplate <typename TStr>\nTStr Ebwt<index_t>::join(EList<TStr>& l, uint32_t seed) {\n\tRandomSource rand; // reproducible given same seed\n\trand.init(seed);\n\tTStr ret;\n\tindex_t guessLen = 0;\n\tfor(index_t i = 0; i < l.size(); i++) {\n\t\tguessLen += length(l[i]);\n\t}\n\tret.resize(guessLen);\n\tindex_t off = 0;\n\tfor(size_t i = 0; i < l.size(); i++) {\n\t\tTStr& s = l[i];\n\t\tassert_gt(s.length(), 0);\n\t\tfor(size_t j = 0; j < s.size(); j++) {\n\t\t\tret.set(s[j], off++);\n\t\t}\n\t}\n\treturn ret;\n}\n\n/**\n * Join several text strings together in a way that's compatible with\n * the text-chunking scheme dictated by the chunkRate parameter.\n *\n * The non-static member Ebwt::join additionally builds auxiliary\n * arrays that maintain a mapping between chunks in the joined string\n * and the original text strings.\n */\ntemplate <typename index_t>\ntemplate <typename TStr>\nTStr Ebwt<index_t>::join(EList<FileBuf*>& l,\n                EList<RefRecord>& szs,\n                index_t sztot,\n                const RefReadInParams& refparams,\n                uint32_t seed)\n{\n\tRandomSource rand; // reproducible given same seed\n\trand.init(seed);\n\tRefReadInParams rpcp = refparams;\n\tTStr ret;\n\tindex_t guessLen = sztot;\n\tret.resize(guessLen);\n\tASSERT_ONLY(index_t szsi = 0);\n\tTIndexOffU dstoff = 0;\n\tfor(index_t i = 0; i < l.size(); i++) {\n\t\t// For each sequence we can pull out of istream l[i]...\n\t\tassert(!l[i]->eof());\n\t\tbool first = true;\n\t\twhile(!l[i]->eof()) {\n\t\t\tRefRecord rec = 
fastaRefReadAppend(*l[i], first, ret, dstoff, rpcp);\n\t\t\tfirst = false;\n\t\t\tindex_t bases = (index_t)rec.len;\n\t\t\tassert_eq(rec.off, szs[szsi].off);\n\t\t\tassert_eq(rec.len, szs[szsi].len);\n\t\t\tassert_eq(rec.first, szs[szsi].first);\n\t\t\tASSERT_ONLY(szsi++);\n\t\t\tif(bases == 0) continue;\n\t\t}\n\t}\n\treturn ret;\n}\n\n/**\n * Join several text strings together according to the text-chunking\n * scheme specified in the EbwtParams.  Ebwt fields calculated in this\n * function are written directly to disk.\n *\n * It is assumed, but not required, that the header values have already\n * been written to 'out1' before this function is called.\n *\n * The static member Ebwt::join just returns a joined version of a\n * list of strings without building any of the auxiliary arrays.\n */\ntemplate <typename index_t>\ntemplate <typename TStr>\nvoid Ebwt<index_t>::joinToDisk(\n\tEList<FileBuf*>& l,\n\tEList<RefRecord>& szs,\n\tindex_t sztot,\n\tconst RefReadInParams& refparams,\n\tTStr& ret,\n\tostream& out1,\n\tostream& out2)\n{\n\tRefReadInParams rpcp = refparams;\n\tassert_gt(szs.size(), 0);\n\tassert_gt(l.size(), 0);\n\tassert_gt(sztot, 0);\n\t// Not every fragment represents a distinct sequence - many\n\t// fragments may correspond to a single sequence.  
Count the\n\t// number of sequences here by counting the number of \"first\"\n\t// fragments.\n\tthis->_nPat = 0;\n\tthis->_nFrag = 0;\n\tfor(index_t i = 0; i < szs.size(); i++) {\n\t\tif(szs[i].len > 0) this->_nFrag++;\n\t\tif(szs[i].first && szs[i].len > 0) this->_nPat++;\n\t}\n\tassert_gt(this->_nPat, 0);\n\tassert_geq(this->_nFrag, this->_nPat);\n\t_rstarts.reset();\n\twriteIndex<index_t>(out1, this->_nPat, this->toBe());\n\t// Allocate plen[]\n\ttry {\n\t\tthis->_plen.init(new index_t[this->_nPat], this->_nPat);\n\t} catch(bad_alloc& e) {\n\t\tcerr << \"Out of memory allocating plen[] in Ebwt::joinToDisk()\"\n\t\t     << \" at \" << __FILE__ << \":\" << __LINE__ << endl;\n\t\tthrow e;\n\t}\n\t// For each pattern, set plen\n\tint npat = -1;\n\tfor(index_t i = 0; i < szs.size(); i++) {\n\t\tif(szs[i].first && szs[i].len > 0) {\n\t\t\tif(npat >= 0) {\n\t\t\t\twriteIndex<index_t>(out1, this->plen()[npat], this->toBe());\n\t\t\t}\n\t\t\tnpat++;\n\t\t\tthis->plen()[npat] = (szs[i].len + szs[i].off);\n\t\t} else {\n\t\t\tthis->plen()[npat] += (szs[i].len + szs[i].off);\n\t\t}\n\t}\n\tassert_eq((index_t)npat, this->_nPat-1);\n\twriteIndex<index_t>(out1, this->plen()[npat], this->toBe());\n\t// Write the number of fragments\n\twriteIndex<index_t>(out1, this->_nFrag, this->toBe());\n\tindex_t seqsRead = 0;\n\tASSERT_ONLY(index_t szsi = 0);\n\tASSERT_ONLY(index_t entsWritten = 0);\n\tindex_t dstoff = 0;\n\t// For each filebuf\n\tfor(unsigned int i = 0; i < l.size(); i++) {\n\t\tassert(!l[i]->eof());\n\t\tbool first = true;\n\t\tindex_t patoff = 0;\n\t\t// For each *fragment* (not necessarily an entire sequence) we\n\t\t// can pull out of istream l[i]...\n\t\twhile(!l[i]->eof()) {\n\t\t\tstring name;\n\t\t\t// Push a new name onto our vector\n\t\t\t_refnames.push_back(\"\");\n\t\t\tRefRecord rec = fastaRefReadAppend(\n\t\t\t\t*l[i], first, ret, dstoff, rpcp, &_refnames.back());\n\t\t\tfirst = false;\n\t\t\tindex_t bases = rec.len;\n\t\t\tif(rec.first && rec.len > 0) 
{\n\t\t\t\tif(_refnames.back().length() == 0) {\n\t\t\t\t\t// If name was empty, replace with an index\n\t\t\t\t\tostringstream stm;\n\t\t\t\t\tstm << seqsRead;\n\t\t\t\t\t_refnames.back() = stm.str();\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// This record didn't actually start a new sequence so\n\t\t\t\t// no need to add a name\n\t\t\t\t//assert_eq(0, _refnames.back().length());\n\t\t\t\t_refnames.pop_back();\n\t\t\t}\n\t\t\tassert_lt(szsi, szs.size());\n\t\t\tassert_eq(rec.off, szs[szsi].off);\n\t\t\tassert_eq(rec.len, szs[szsi].len);\n\t\t\tassert_eq(rec.first, szs[szsi].first);\n\t\t\tassert(rec.first || rec.off > 0);\n\t\t\tASSERT_ONLY(szsi++);\n\t\t\t// Increment seqsRead if this is the first fragment\n\t\t\tif(rec.first && rec.len > 0) seqsRead++;\n\t\t\tif(bases == 0) continue;\n\t\t\tassert_leq(bases, this->plen()[seqsRead-1]);\n\t\t\t// Reset the patoff if this is the first fragment\n\t\t\tif(rec.first) patoff = 0;\n\t\t\tpatoff += rec.off; // add fragment's offset from end of last frag.\n\t\t\t// Adjust rpcps\n\t\t\t//index_t seq = seqsRead-1;\n\t\t\tASSERT_ONLY(entsWritten++);\n\t\t\t// This is where rstarts elements are written to the output stream\n\t\t\t//writeU32(out1, oldRetLen, this->toBe()); // offset from beginning of joined string\n\t\t\t//writeU32(out1, seq,       this->toBe()); // sequence id\n\t\t\t//writeU32(out1, patoff,    this->toBe()); // offset into sequence\n\t\t\tpatoff += (index_t)bases;\n\t\t}\n\t\tassert_gt(szsi, 0);\n\t\tl[i]->reset();\n\t\tassert(!l[i]->eof());\n#ifndef NDEBUG\n\t\tint c = l[i]->get();\n\t\tassert_eq('>', c);\n\t\tassert(!l[i]->eof());\n\t\tl[i]->reset();\n\t\tassert(!l[i]->eof());\n#endif\n\t}\n\tassert_eq(entsWritten, this->_nFrag);\n}\n\n/**\n * Build an Ebwt from a string 's' and its suffix array 'sa' (which\n * might actually be a suffix array *builder* that builds blocks of the\n * array on demand).  The bulk of the Ebwt, i.e. the ebwt and offs\n * arrays, is written directly to disk.  
This is by design: keeping\n * those arrays in memory needlessly increases the footprint of the\n * building process.  Instead, we prefer to build the Ebwt directly\n * \"to disk\" and then read it back into memory later as necessary.\n *\n * It is assumed that the header values and join-related values (nPat,\n * plen) have already been written to 'out1' before this function\n * is called.  When this function is finished, it will have\n * additionally written ebwt, zOff, fchr, ftab and eftab to the primary\n * file and offs to the secondary file.\n *\n * Assume DNA/RNA/any alphabet with 4 or fewer elements.\n * Assume occ array entries are 32 bits each.\n *\n * @param sa            the suffix array to convert to a Ebwt\n * @param s             the original string\n * @param out\n */\ntemplate <typename index_t>\ntemplate <typename TStr>\nvoid Ebwt<index_t>::buildToDisk(\n                                InorderBlockwiseSA<TStr>& sa,\n                                const TStr& s,\n                                ostream& out1,\n                                ostream& out2,\n                                ostream* saOut,\n                                ostream* bwtOut,\n\t\t\t\tostream& out4,\n                                const EList<RefRecord>& szs,\n                                int kmer_size)\n{\n\tconst EbwtParams<index_t>& eh = this->_eh;\n\n\tassert(eh.repOk());\n\tassert_eq(s.length()+1, sa.size());\n\tassert_eq(s.length(), eh._len);\n\tassert_gt(eh._lineRate, 3);\n\tassert(sa.suffixItrIsReset());\n\n\tindex_t  len = eh._len;\n\tindex_t  ftabLen = eh._ftabLen;\n\tindex_t  sideSz = eh._sideSz;\n\tindex_t  ebwtTotSz = eh._ebwtTotSz;\n\tindex_t  fchr[] = {0, 0, 0, 0, 0};\n\tEList<index_t> ftab(EBWT_CAT);\n\tindex_t  zOff = (index_t)OFF_MASK;\n\n\t// Save # of occurrences of each character as we walk along the bwt\n\tindex_t occ[4] = {0, 0, 0, 0};\n\tindex_t occSave[4] = {0, 0, 0, 0};\n    \n\t// Record rows that should \"absorb\" adjacent rows in the 
ftab.\n\t// The absorbed rows represent suffixes shorter than the ftabChars\n\t// cutoff.\n\tuint8_t absorbCnt = 0;\n\tEList<uint8_t> absorbFtab(EBWT_CAT);\n\ttry {\n\t\tVMSG_NL(\"Allocating ftab, absorbFtab\");\n\t\tftab.resize(ftabLen);\n\t\tftab.fillZero();\n\t\tabsorbFtab.resize(ftabLen);\n\t\tabsorbFtab.fillZero();\n\t} catch(bad_alloc &e) {\n\t\tcerr << \"Out of memory allocating ftab[] or absorbFtab[] \"\n\t\t     << \"in Ebwt::buildToDisk() at \" << __FILE__ << \":\"\n\t\t     << __LINE__ << endl;\n\t\tthrow e;\n\t}\n\n\t// Allocate the side buffer; holds a single side as it's being\n\t// constructed and then written to disk.  Reused across all sides.\n#ifdef SIXTY4_FORMAT\n\tEList<uint64_t> ebwtSide(EBWT_CAT);\n#else\n\tEList<uint8_t> ebwtSide(EBWT_CAT);\n#endif\n\ttry {\n#ifdef SIXTY4_FORMAT\n\t\tebwtSide.resize(sideSz >> 3);\n#else\n\t\tebwtSide.resize(sideSz);\n#endif\n\t} catch(bad_alloc &e) {\n\t\tcerr << \"Out of memory allocating ebwtSide[] in \"\n\t\t     << \"Ebwt::buildToDisk() at \" << __FILE__ << \":\"\n\t\t     << __LINE__ << endl;\n\t\tthrow e;\n\t}\n\n\t// Points to the base offset within ebwt for the side currently\n\t// being written\n\tindex_t side = 0;\n\n\t// Whether we're assembling a forward or a reverse bucket\n\tbool fw;\n\tint sideCur = 0;\n\tfw = true;\n\n\t// Have we skipped the '$' in the last column yet?\n\tASSERT_ONLY(bool dollarSkipped = false);\n\n\tindex_t si = 0;   // string offset (chars)\n\tASSERT_ONLY(index_t lastSufInt = 0);\n\tASSERT_ONLY(bool inSA = true); // true iff saI still points inside suffix\n\t                               // array (as opposed to the padding at the\n\t                               // end)\n\t// Iterate over packed bwt bytes\n\tVMSG_NL(\"Entering Ebwt loop\");\n\tASSERT_ONLY(index_t beforeEbwtOff = (index_t)out1.tellp());\n    \n    // First integer in the suffix-array output file is the length of the\n    // array, including $\n    if(saOut != NULL) {\n        // Write length word\n        
writeIndex<index_t>(*saOut, len+1, this->toBe());\n    }\n    \n    // First integer in the BWT output file is the length of BWT(T), including $\n    if(bwtOut != NULL) {\n        // Write length word\n        writeIndex<index_t>(*bwtOut, len+1, this->toBe());\n    }\n    \n    // Count the number of distinct k-mers if kmer_size is non-zero\n    EList<uint8_t> kmer;\n    EList<size_t> kmer_count;\n    EList<size_t> acc_szs;\n    if(kmer_size > 0) {\n        kmer.resize(kmer_size);\n        kmer.fillZero();\n\tkmer_count.resize(kmer_size);\n\tkmer_count.fillZero();\n        for(size_t i = 0; i < szs.size(); i++) {\n            if(szs[i].first) {\n                size_t size = 0;\n                if(acc_szs.size() > 0) {\n                    size = acc_szs.back();\n                }\n                acc_szs.expand();\n                acc_szs.back() = size;\n            }\n            acc_szs.back() += szs[i].len;\n        }\n    }\n\n\t// Add by Li. Collect the boundary information for each reference sequence.\n\tEBitList<128> refOffsetMark( len + 1 ) ;\n\tstd::map<uint64_t, uint32_t> refOffsetMap ;\n\tstd::map<uint64_t, uint32_t> saBoundaryMap ;\n\tconst uint64_t refOverlap = 11 ; // the last refOverlap bp of a ref sequence will be classified to the next ref sequence.\n\t{\n\t\tindex_t refOffset = 0 ;\n\t\tsize_t refNameIdx = 0 ;\n\t\tfor (size_t i = 0 ; i < szs.size() ; ++i )\n\t\t{\n\t\t\t//cout<<szs[i].off<<\" \"<<szs[i].len<<\" \"<<szs[i].first<<endl ;\n\t\t\tif ( szs[i].first && szs[i].len > 0 )\n\t\t\t{\n\t\t\t\t//cout<<_refnames[ refNameIdx ]<<\" \"<<refOffset<<endl ;\n\t\t\t\tuint64_t o = refOffset - refOverlap ;\n\t\t\t\tif ( refOffset < refOverlap )\n\t\t\t\t\to = 0 ;\n\t\t\t\trefOffsetMark.set( o ) ;\n\t\t\t\t/*std::string uid = get_uid( _refnames[ refNameIdx ] ) ;\n\t\t\t\tif ( uid_to_tid.find( uid ) != uid_to_tid.end() )\n\t\t\t\t\trefOffsetMap[o] = uid_to_tid[ uid ] ;\n\t\t\t\telse\n\t\t\t\t\trefOffsetMap[o] = 0 ;*/\n\n\t\t\t\trefOffsetMap[o] = 
refNameIdx ;\n\t\t\t\t++refNameIdx ;\n\t\t\t}\n\n\t\t\trefOffset += szs[i].len ;\n\t\t}\n\n\t}\n\twriteIndex<int32_t>(out4, 1, this->toBe()); // endianness sentinel\n\n\n\twhile(side < ebwtTotSz) {\n\t\t// Sanity-check our cursor into the side buffer\n\t\tassert_geq(sideCur, 0);\n\t\tassert_lt(sideCur, (int)eh._sideBwtSz);\n\t\tassert_eq(0, side % sideSz); // 'side' must be on side boundary\n\t\tebwtSide[sideCur] = 0; // clear\n\t\tassert_lt(side + sideCur, ebwtTotSz);\n\t\t// Iterate over bit-pairs in the si'th character of the BWT\n#ifdef SIXTY4_FORMAT\n\t\tfor(int bpi = 0; bpi < 32; bpi++, si++)\n#else\n\t\t\tfor(int bpi = 0; bpi < 4; bpi++, si++)\n#endif\n\t\t\t{\n\t\t\t\tint bwtChar;\n\t\t\t\tbool count = true;\n\t\t\t\tif(si <= len) {\n\t\t\t\t\t// Still in the SA; extract the bwtChar\n\t\t\t\t\tindex_t saElt = sa.nextSuffix();\n\t\t\t\t\tif(saOut != NULL) {\n\t\t\t\t\t\twriteIndex<index_t>(*saOut, saElt, this->toBe());\n\t\t\t\t\t}\n\n\t\t\t\t\t//if ( refOffsetMap.find( saElt ) != refOffsetMap.end() )\n\t\t\t\t\tif ( refOffsetMark.test( saElt ) )\n\t\t\t\t\t{\n\t\t\t\t\t\tsaBoundaryMap[ si ] = refOffsetMap[ saElt ] ;\n\t\t\t\t\t\t//cout<<saElt<<\" \"<<uid<<endl ;\n\t\t\t\t\t}\n\n\t\t\t\t\t// (that might have triggered sa to calc next suf block)\n\t\t\t\t\tif(saElt == 0) {\n\t\t\t\t\t\t// Don't add the '$' in the last column to the BWT\n\t\t\t\t\t\t// transform; we can't encode a $ (only A C T or G)\n\t\t\t\t\t\t// and counting it as, say, an A, will mess up the\n\t\t\t\t\t\t// LF mapping\n\t\t\t\t\t\tbwtChar = 0; count = false;\n\t\t\t\t\t\tASSERT_ONLY(dollarSkipped = true);\n\t\t\t\t\t\tzOff = si; // remember the SA row that\n\t\t\t\t\t\t// corresponds to the 0th suffix\n\t\t\t\t\t} else {\n\t\t\t\t\t\tbwtChar = (int)(s[saElt-1]);\n\t\t\t\t\t\tassert_lt(bwtChar, 4);\n\t\t\t\t\t\t// Update the fchr\n\t\t\t\t\t\tfchr[bwtChar]++;\n\t\t\t\t\t}\n\t\t\t\t\t// Update ftab\n\t\t\t\t\tif((len-saElt) >= (index_t)eh._ftabChars) {\n\t\t\t\t\t\t// Turn the first 
ftabChars characters of the\n\t\t\t\t\t\t// suffix into an integer index into ftab.  The\n\t\t\t\t\t\t// leftmost (lowest index) character of the suffix\n\t\t\t\t\t\t// goes in the most significant bit pair of the\n\t\t\t\t\t\t// integer.\n\t\t\t\t\t\tindex_t sufInt = 0;\n\t\t\t\t\t\tfor(int i = 0; i < eh._ftabChars; i++) {\n\t\t\t\t\t\t\tsufInt <<= 2;\n\t\t\t\t\t\t\tassert_lt((index_t)i, len-saElt);\n\t\t\t\t\t\t\tsufInt |= (unsigned char)(s[saElt+i]);\n\t\t\t\t\t\t}\n\t\t\t\t\t\t// Assert that this prefix-of-suffix is greater\n\t\t\t\t\t\t// than or equal to the last one (true b/c the\n\t\t\t\t\t\t// suffix array is sorted)\n#ifndef NDEBUG\n\t\t\t\t\t\tif(lastSufInt > 0) assert_geq(sufInt, lastSufInt);\n\t\t\t\t\t\tlastSufInt = sufInt;\n#endif\n\t\t\t\t\t\t// Update ftab\n\t\t\t\t\t\tassert_lt(sufInt+1, ftabLen);\n\t\t\t\t\t\tftab[sufInt+1]++;\n\t\t\t\t\t\tif(absorbCnt > 0) {\n\t\t\t\t\t\t\t// Absorb all short suffixes since the last\n\t\t\t\t\t\t\t// transition into this transition\n\t\t\t\t\t\t\tabsorbFtab[sufInt] = absorbCnt;\n\t\t\t\t\t\t\tabsorbCnt = 0;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\t// Otherwise if suffix is fewer than ftabChars\n\t\t\t\t\t\t// characters long, then add it to the 'absorbCnt';\n\t\t\t\t\t\t// it will be absorbed into the next transition\n\t\t\t\t\t\tassert_lt(absorbCnt, 255);\n\t\t\t\t\t\tabsorbCnt++;\n\t\t\t\t\t}\n\t\t\t\t\t// Update the number of distinct k-mers\n\t\t\t\t\tif(kmer_size > 0) {\n\t\t\t\t\t\tsize_t idx = acc_szs.bsearchLoBound(saElt);\n\t\t\t\t\t\tassert_lt(idx, acc_szs.size());\n\t\t\t\t\t\tbool different = false;\n\t\t\t\t\t\tfor(size_t k = 0; k < (size_t)kmer_size; k++) {\n\t\t\t\t\t\t\tif((acc_szs[idx]-saElt) > k) {\n\t\t\t\t\t\t\t\tuint8_t bp = s[saElt+k];\n\t\t\t\t\t\t\t\tif(kmer[k] != bp || kmer_count[k] <= 0 || different) {\n\t\t\t\t\t\t\t\t\tkmer_count[k]++;\n\t\t\t\t\t\t\t\t\tdifferent = true;\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tkmer[k] = bp;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\telse 
{\n\t\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\t// Suffix array offset boundary? - update offset array\n\t\t\t\t\tif((si & eh._offMask) == si) {\n\t\t\t\t\t\tassert_lt((si >> eh._offRate), eh._offsLen);\n\t\t\t\t\t\t// Write offsets directly to the secondary output\n\t\t\t\t\t\t// stream, thereby avoiding keeping them in memory\n\t\t\t\t\t\tindex_t tidx = 0, toff = 0, tlen = 0;\n\t\t\t\t\t\tbool straddled2 = false;\n\t\t\t\t\t\tif(saElt > 0) {\n\t\t\t\t\t\t\tindex_t adjustSaElt = saElt + refOverlap ;\n\t\t\t\t\t\t\tif ( adjustSaElt >= len )\n\t\t\t\t\t\t\t\tadjustSaElt = saElt ;\n\t\t\t\t\t\t\tif ( adjustSaElt >= len )\n\t\t\t\t\t\t\t\t--adjustSaElt ;\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tjoinedToTextOff(\n\t\t\t\t\t\t\t\t\t0,\n\t\t\t\t\t\t\t\t\tadjustSaElt,\n\t\t\t\t\t\t\t\t\ttidx,\n\t\t\t\t\t\t\t\t\ttoff,\n\t\t\t\t\t\t\t\t\ttlen,\n\t\t\t\t\t\t\t\t\tfalse,        // reject straddlers?\n\t\t\t\t\t\t\t\t\tstraddled2);  // straddled?\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(this->_offw) {\n\t\t\t\t\t\t\twriteIndex<uint32_t>(out2, (uint32_t)tidx, this->toBe());\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tassert_lt(tidx, std::numeric_limits<uint16_t>::max());\n\t\t\t\t\t\t\twriteIndex<uint16_t>(out2, (uint16_t)tidx, this->toBe());\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\t// Strayed off the end of the SA, now we're just\n\t\t\t\t\t// padding out a bucket\n#ifndef NDEBUG\n\t\t\t\t\tif(inSA) {\n\t\t\t\t\t\t// Assert that we wrote all the characters in the\n\t\t\t\t\t\t// string before now\n\t\t\t\t\t\tassert_eq(si, len+1);\n\t\t\t\t\t\tinSA = false;\n\t\t\t\t\t}\n#endif\n\t\t\t\t\t// 'A' used for padding; important that padding be\n\t\t\t\t\t// counted in the occ[] array\n\t\t\t\t\tbwtChar = 0;\n\t\t\t\t}\n\t\t\t\tif(count) occ[bwtChar]++;\n\t\t\t\t// Append BWT char to bwt section of current side\n\t\t\t\tif(fw) {\n\t\t\t\t\t// Forward bucket: fill from least to most\n#ifdef SIXTY4_FORMAT\n\t\t\t\t\tebwtSide[sideCur] |= ((uint64_t)bwtChar << 
(bpi << 1));\n\t\t\t\t\tif(bwtChar > 0) assert_gt(ebwtSide[sideCur], 0);\n#else\n\t\t\t\t\tpack_2b_in_8b(bwtChar, ebwtSide[sideCur], bpi);\n\t\t\t\t\tassert_eq((ebwtSide[sideCur] >> (bpi*2)) & 3, bwtChar);\n#endif\n\t\t\t\t} else {\n\t\t\t\t\t// Backward bucket: fill from most to least\n#ifdef SIXTY4_FORMAT\n\t\t\t\t\tebwtSide[sideCur] |= ((uint64_t)bwtChar << ((31 - bpi) << 1));\n\t\t\t\t\tif(bwtChar > 0) assert_gt(ebwtSide[sideCur], 0);\n#else\n\t\t\t\t\tpack_2b_in_8b(bwtChar, ebwtSide[sideCur], 3-bpi);\n\t\t\t\t\tassert_eq((ebwtSide[sideCur] >> ((3-bpi)*2)) & 3, bwtChar);\n#endif\n\t\t\t\t}\n\t\t\t} // end loop over bit-pairs\n\t\tassert_eq(dollarSkipped ? 3 : 0, (occ[0] + occ[1] + occ[2] + occ[3]) & 3);\n#ifdef SIXTY4_FORMAT\n\t\tassert_eq(0, si & 31);\n#else\n\t\tassert_eq(0, si & 3);\n#endif\n\n\t\tsideCur++;\n\t\tif(sideCur == (int)eh._sideBwtSz) {\n\t\t\tsideCur = 0;\n\t\t\tindex_t *uside = reinterpret_cast<index_t*>(ebwtSide.ptr());\n\t\t\t// Write 'A', 'C', 'G' and 'T' tallies\n\t\t\tside += sideSz;\n\t\t\tassert_leq(side, eh._ebwtTotSz);\n\t\t\tuside[(sideSz / sizeof(index_t))-4] = endianizeIndex(occSave[0], this->toBe());\n\t\t\tuside[(sideSz / sizeof(index_t))-3] = endianizeIndex(occSave[1], this->toBe());\n\t\t\tuside[(sideSz / sizeof(index_t))-2] = endianizeIndex(occSave[2], this->toBe());\n\t\t\tuside[(sideSz / sizeof(index_t))-1] = endianizeIndex(occSave[3], this->toBe());\n\t\t\toccSave[0] = occ[0];\n\t\t\toccSave[1] = occ[1];\n\t\t\toccSave[2] = occ[2];\n\t\t\toccSave[3] = occ[3];\n\t\t\t// Write backward side to primary file\n\t\t\tout1.write((const char *)ebwtSide.ptr(), sideSz);\n\t\t}\n\t}\n\tVMSG_NL(\"Exited Ebwt loop\");\n\tassert_neq(zOff, (index_t)OFF_MASK);\n\tif(absorbCnt > 0) {\n\t\t// Absorb any trailing, as-yet-unabsorbed short suffixes into\n\t\t// the last element of ftab\n\t\tabsorbFtab[ftabLen-1] = absorbCnt;\n\t}\n\t// Assert that our loop counter got incremented right to the end\n\tassert_eq(side, eh._ebwtTotSz);\n\t// Assert 
that we wrote the expected amount to out1\n\tassert_eq(((index_t)out1.tellp() - beforeEbwtOff), eh._ebwtTotSz);\n\t// assert that the last thing we did was write a forward bucket\n\t\n\t// Denote the end for the information of boundary of reference genomes.\n\twriteIndex<uint64_t>( out4, saBoundaryMap.size(), this->toBe() ) ;\n\tfor ( std::map<uint64_t, uint32_t>::iterator it = saBoundaryMap.begin() ; it != saBoundaryMap.end() ; ++it )\n\t{\n\t\twriteIndex<uint64_t>( out4, it->first, this->toBe() ) ;\n\t\twriteIndex<uint32_t>( out4, it->second, this->toBe() ) ;\n\t}\n\n\t//\n\t// Write zOff to primary stream\n\t//\n\twriteIndex<index_t>(out1, zOff, this->toBe());\n\n\t//\n\t// Finish building fchr\n\t//\n\t// Exclusive prefix sum on fchr\n\tfor(int i = 1; i < 4; i++) {\n\t\tfchr[i] += fchr[i-1];\n\t}\n\tassert_eq(fchr[3], len);\n\t// Shift everybody up by one\n\tfor(int i = 4; i >= 1; i--) {\n\t\tfchr[i] = fchr[i-1];\n\t}\n\tfchr[0] = 0;\n\tif(_verbose) {\n\t\tfor(int i = 0; i < 5; i++)\n\t\t\tcout << \"fchr[\" << \"ACGT$\"[i] << \"]: \" << fchr[i] << endl;\n\t}\n\t// Write fchr to primary file\n\tfor(int i = 0; i < 5; i++) {\n\t\twriteIndex<index_t>(out1, fchr[i], this->toBe());\n\t}\n\n\t//\n\t// Finish building ftab and build eftab\n\t//\n\t// Prefix sum on ftable\n\tindex_t eftabLen = 0;\n\tassert_eq(0, absorbFtab[0]);\n\tfor(index_t i = 1; i < ftabLen; i++) {\n\t\tif(absorbFtab[i] > 0) eftabLen += 2;\n\t}\n\tassert_leq(eftabLen, (index_t)eh._ftabChars*2);\n\teftabLen = eh._ftabChars*2;\n\tEList<index_t> eftab(EBWT_CAT);\n\ttry {\n\t\teftab.resize(eftabLen);\n\t\teftab.fillZero();\n\t} catch(bad_alloc &e) {\n\t\tcerr << \"Out of memory allocating eftab[] \"\n\t\t     << \"in Ebwt::buildToDisk() at \" << __FILE__ << \":\"\n\t\t     << __LINE__ << endl;\n\t\tthrow e;\n\t}\n\tindex_t eftabCur = 0;\n\tfor(index_t i = 1; i < ftabLen; i++) {\n\t\tindex_t lo = ftab[i] + Ebwt<index_t>::ftabHi(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, i-1);\n\t\tif(absorbFtab[i] 
> 0) {\n\t\t\t// Skip the number of short patterns indicated by absorbFtab[i]\n\t\t\tindex_t hi = lo + absorbFtab[i];\n\t\t\tassert_lt(eftabCur*2+1, eftabLen);\n\t\t\teftab[eftabCur*2] = lo;\n\t\t\teftab[eftabCur*2+1] = hi;\n\t\t\tftab[i] = (eftabCur++) ^ (index_t)OFF_MASK; // insert pointer into eftab\n\t\t\tassert_eq(lo, Ebwt<index_t>::ftabLo(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, i));\n\t\t\tassert_eq(hi, Ebwt<index_t>::ftabHi(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, i));\n\t\t} else {\n\t\t\tftab[i] = lo;\n\t\t}\n\t}\n\tassert_eq(Ebwt<index_t>::ftabHi(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, ftabLen-1), len+1);\n\t// Write ftab to primary file\n\tfor(index_t i = 0; i < ftabLen; i++) {\n\t\twriteIndex<index_t>(out1, ftab[i], this->toBe());\n\t}\n\t// Write eftab to primary file\n\tfor(index_t i = 0; i < eftabLen; i++) {\n\t\twriteIndex<index_t>(out1, eftab[i], this->toBe());\n\t}\n    \n    if(kmer_size > 0) {\n      for(size_t k = 0; k < (size_t)kmer_size; k++) {\n        cerr << \"Number of distinct \" << k+1 << \"-mers is \" << kmer_count[k] << endl;\n      }\n    }\n    \n    \n\n\t// Note: if you'd like to sanity-check the Ebwt, you'll have to\n\t// read it back into memory first!\n\tassert(!isInMemory());\n\tVMSG_NL(\"Exiting Ebwt::buildToDisk()\");\n}\n\n/**\n * Try to find the Bowtie index specified by the user.  First try the\n * exact path given by the user.  Then try the user-provided string\n * appended onto the path of the \"indexes\" subdirectory below this\n * executable, then try the provided string appended onto\n * \"$BOWTIE2_INDEXES/\".\n */\nstring adjustEbwtBase(const string& cmdline,\n\t\t\t\t\t  const string& ebwtFileBase,\n\t\t\t\t\t  bool verbose);\n\n\nextern string gLastIOErrMsg;\n\n/* Checks whether a call to read() failed or not. 
*/\ninline bool is_read_err(int fdesc, ssize_t ret, size_t count) {\n    if (ret < 0) {\n        std::stringstream sstm;\n        sstm << \"ERRNO: \" << errno << \" ERR Msg:\" << strerror(errno) << std::endl;\n\t\tgLastIOErrMsg = sstm.str();\n        return true;\n    }\n    return false;\n}\n\n/* Checks whether a call to fread() failed or not. */\ninline bool is_fread_err(FILE* file_hd, size_t ret, size_t count) {\n    if (ferror(file_hd)) {\n        gLastIOErrMsg = \"Error Reading File!\";\n        return true;\n    }\n    return false;\n}\n\n\n///////////////////////////////////////////////////////////////////////\n//\n// Functions for searching Ebwts\n// (But most of them are defined in the header)\n//\n///////////////////////////////////////////////////////////////////////\n\n/**\n * Take an offset into the joined text and translate it into the\n * reference of the index it falls on, the offset into the reference,\n * and the length of the reference.  Use a binary search through the\n * sorted list of reference fragment ranges t\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::joinedToTextOff(\n\t\t\t\t\t\t\t\t\tindex_t qlen,\n\t\t\t\t\t\t\t\t\tindex_t off,\n\t\t\t\t\t\t\t\t\tindex_t& tidx,\n\t\t\t\t\t\t\t\t\tindex_t& textoff,\n\t\t\t\t\t\t\t\t\tindex_t& tlen,\n\t\t\t\t\t\t\t\t\tbool rejectStraddle,\n\t\t\t\t\t\t\t\t\tbool& straddled) const\n{\n\tassert(rstarts() != NULL); // must have loaded rstarts\n\tindex_t top = 0;\n\tindex_t bot = _nFrag; // 1 greater than largest addressable element\n\tindex_t elt = (index_t)OFF_MASK;\n\t// Begin binary search\n\twhile(true) {\n\t\tASSERT_ONLY(index_t oldelt = elt);\n\t\telt = top + ((bot - top) >> 1);\n\t\tassert_neq(oldelt, elt); // must have made progress\n\t\tindex_t lower = rstarts()[elt*3];\n\t\tindex_t upper;\n\t\tif(elt == _nFrag-1) {\n\t\t\tupper = _eh._len;\n\t\t} else {\n\t\t\tupper = rstarts()[((elt+1)*3)];\n\t\t}\n\t\tassert_gt(upper, lower);\n\t\tindex_t fraglen = upper - lower;\n\t\tif(lower <= off) 
{\n\t\t\tif(upper > off) { // not last element, but it's within\n\t\t\t\t// off is in this range; check if it falls off\n\t\t\t\tif(off + qlen > upper) {\n\t\t\t\t\tstraddled = true;\n\t\t\t\t\tif(rejectStraddle) {\n\t\t\t\t\t\t// it falls off; signal no-go and return\n\t\t\t\t\t\ttidx = (index_t)OFF_MASK;\n\t\t\t\t\t\tassert_lt(elt, _nFrag-1);\n\t\t\t\t\t\treturn;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t// This is the correct text idx whether the index is\n\t\t\t\t// forward or reverse\n\t\t\t\ttidx = rstarts()[(elt*3)+1];\n\t\t\t\tassert_lt(tidx, this->_nPat);\n\t\t\t\tassert_leq(fraglen, this->plen()[tidx]);\n\t\t\t\t// it doesn't fall off; now calculate textoff.\n\t\t\t\t// Initially it's the number of characters that precede\n\t\t\t\t// the alignment in the fragment\n\t\t\t\tindex_t fragoff = off - rstarts()[(elt*3)];\n\t\t\t\tif(!this->fw_) {\n\t\t\t\t\tfragoff = fraglen - fragoff - 1;\n\t\t\t\t\tfragoff -= (qlen-1);\n\t\t\t\t}\n\t\t\t\t// Add the alignment's offset into the fragment\n\t\t\t\t// ('fragoff') to the fragment's offset within the text\n\t\t\t\ttextoff = fragoff + rstarts()[(elt*3)+2];\n\t\t\t\tassert_lt(textoff, this->plen()[tidx]);\n\t\t\t\tbreak; // done with binary search\n\t\t\t} else {\n\t\t\t\t// 'off' belongs somewhere in the region between elt\n\t\t\t\t// and bot\n\t\t\t\ttop = elt;\n\t\t\t}\n\t\t} else {\n\t\t\t// 'off' belongs somewhere in the region between top and\n\t\t\t// elt\n\t\t\tbot = elt;\n\t\t}\n\t\t// continue with binary search\n\t}\n\ttlen = this->plen()[tidx];\n}\n\n/**\n * Walk 'steps' steps to the left and return the row arrived at.  
If we\n * walk through the dollar sign, return (index_t)OFF_MASK.\n */\ntemplate <typename index_t>\nindex_t Ebwt<index_t>::walkLeft(index_t row, index_t steps) const {\n#ifndef NDEBUG\n    if(this->_offw) {\n        assert(offsw() != NULL);\n    } else {\n        assert(offs() != NULL);\n    }\n#endif\n\tassert_neq((index_t)OFF_MASK, row);\n\tSideLocus<index_t> l;\n\tif(steps > 0) l.initFromRow(row, _eh, ebwt());\n\twhile(steps > 0) {\n\t\tif(row == _zOff) return (index_t)OFF_MASK;\n\t\tindex_t newrow = this->mapLF(l ASSERT_ONLY(, false));\n\t\tassert_neq((index_t)OFF_MASK, newrow);\n\t\tassert_neq(newrow, row);\n\t\trow = newrow;\n\t\tsteps--;\n\t\tif(steps > 0) l.initFromRow(row, _eh, ebwt());\n\t}\n\treturn row;\n}\n\n/**\n * Resolve the reference offset of the BW element 'row'.\n */\ntemplate <typename index_t>\nindex_t Ebwt<index_t>::getOffset(index_t row) const {\n#ifndef NDEBUG\n\tif(this->_offw) {\n\t\tassert(offsw() != NULL);\n\t} else {\n\t\tassert(offs() != NULL);\n\t}\n#endif\n\tassert_neq((index_t)OFF_MASK, row);\n\tif(row == _zOff) return 0;\n\tif((row & _eh._offMask) == row) {\n\t\tif(this->_offw) {\n\t\t\treturn this->offsw()[row >> _eh._offRate];\n\t\t} else {\n\t\t\treturn this->offs()[row >> _eh._offRate];\n\t\t}\n\t}\n\tif ( saGenomeBoundaryHas( (uint64_t)row ) )\n\t{\n\t\treturn saGenomeBoundaryVal( (uint64_t)row ) ;\n\t}\n\n\t// Walk left via LF mapping until reaching a sampled row or the\n\t// dollar sign; the offset is the sampled value plus the steps taken\n\tindex_t jumps = 0;\n\tSideLocus<index_t> l;\n\tl.initFromRow(row, _eh, ebwt());\n\twhile(true) {\n\t\tindex_t newrow = this->mapLF(l ASSERT_ONLY(, false));\n\t\tjumps++;\n\t\tassert_neq((index_t)OFF_MASK, newrow);\n\t\tassert_neq(newrow, row);\n\t\trow = newrow;\n\t\tif(row == _zOff) {\n\t\t\treturn jumps;\n\t\t} else if((row & _eh._offMask) == row) {\n\t\t\tif(this->_offw) {\n\t\t\t\treturn jumps + this->offsw()[row >> _eh._offRate];\n\t\t\t} else {\n\t\t\t\treturn jumps + this->offs()[row >> _eh._offRate];\n\t\t\t}\n\t\t}\n\t\tl.initFromRow(row, _eh, ebwt());\n\t}\n}\n\n/**\n * Resolve the reference offset of the BW element 
'elt' such that\n * the offset returned is at the right-hand side of the forward\n * reference substring involved in the hit.\n */\ntemplate <typename index_t>\nindex_t Ebwt<index_t>::getOffset(\n\t\t\t\t\t\t\t\t index_t elt,\n\t\t\t\t\t\t\t\t bool fw,\n\t\t\t\t\t\t\t\t index_t hitlen) const\n{\n\tindex_t off = getOffset(elt);\n\tassert_neq((index_t)OFF_MASK, off);\n\tif(!fw) {\n\t\tassert_lt(off, _eh._len);\n\t\toff = _eh._len - off - 1;\n\t\tassert_geq(off, hitlen-1);\n\t\toff -= (hitlen-1);\n\t\tassert_lt(off, _eh._len);\n\t}\n\treturn off;\n}\n\n/**\n * Returns true iff the index contains the given string (exactly).  The given\n * string must contain only unambiguous characters.  TODO: support ambiguous\n * characters in 'str'.\n */\ntemplate <typename index_t>\nbool Ebwt<index_t>::contains(\n\t\t\t\t\t\t\t const BTDnaString& str,\n\t\t\t\t\t\t\t index_t *otop,\n\t\t\t\t\t\t\t index_t *obot) const\n{\n\tassert(isInMemory());\n\tSideLocus<index_t> tloc, bloc;\n\tif(str.empty()) {\n\t\tif(otop != NULL && obot != NULL) *otop = *obot = 0;\n\t\treturn true;\n\t}\n\tint c = str[str.length()-1];\n\tassert_range(0, 4, c);\n\tindex_t top = 0, bot = 0;\n\tif(c < 4) {\n\t\ttop = fchr()[c];\n\t\tbot = fchr()[c+1];\n\t} else {\n\t\t// Last character is ambiguous: proceed only if exactly one\n\t\t// nucleotide occurs in the index.  fchr() has 5 entries, so\n\t\t// index it with i (0..3) rather than c (== 4).\n\t\tbool set = false;\n\t\tfor(int i = 0; i < 4; i++) {\n\t\t\tif(fchr()[i] < fchr()[i+1]) {\n\t\t\t\tif(set) {\n\t\t\t\t\treturn false;\n\t\t\t\t} else {\n\t\t\t\t\tset = true;\n\t\t\t\t\ttop = fchr()[i];\n\t\t\t\t\tbot = fchr()[i+1];\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\tassert_geq(bot, top);\n\ttloc.initFromRow(top, eh(), ebwt());\n\tbloc.initFromRow(bot, eh(), ebwt());\n\tASSERT_ONLY(index_t lastDiff = bot - top);\n\tfor(int64_t i = (int64_t)str.length()-2; i >= 0; i--) {\n\t\tc = str[i];\n\t\tassert_range(0, 4, c);\n\t\tif(c <= 3) {\n\t\t\ttop = mapLF(tloc, c);\n\t\t\tbot = mapLF(bloc, c);\n\t\t} else {\n\t\t\tindex_t sz = bot - top;\n\t\t\tint c1 = mapLF1(top, tloc ASSERT_ONLY(, false));\n\t\t\tbot = mapLF(bloc, c1);\n\t\t\tassert_leq(bot - top, 
sz);\n\t\t\tif(bot - top < sz) {\n\t\t\t\t// Encountered an N and could not proceed through it because\n\t\t\t\t// there was more than one possible nucleotide we could replace\n\t\t\t\t// it with\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\tassert_geq(bot, top);\n\t\tassert_leq(bot-top, lastDiff);\n\t\tASSERT_ONLY(lastDiff = bot-top);\n\t\tif(i > 0) {\n\t\t\ttloc.initFromRow(top, eh(), ebwt());\n\t\t\tbloc.initFromRow(bot, eh(), ebwt());\n\t\t}\n\t}\n\tif(otop != NULL && obot != NULL) {\n\t\t*otop = top; *obot = bot;\n\t}\n\treturn bot > top;\n}\n\n#endif /*EBWT_H_*/\n"
  },
  {
    "path": "bt2_io.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef EBWT_IO_H_\n#define EBWT_IO_H_\n\n#include <string>\n#include <stdexcept>\n#include <iostream>\n#include <fstream>\n#include <stdlib.h>\n#include \"bt2_idx.h\"\n\nusing namespace std;\n\n///////////////////////////////////////////////////////////////////////\n//\n// Functions for reading and writing Ebwts\n//\n///////////////////////////////////////////////////////////////////////\n\n/**\n * Read an Ebwt from file with given filename.\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::readIntoMemory(\n\tint color,\n\tint needEntireRev,\n\tbool loadSASamp,\n\tbool loadFtab,\n\tbool loadRstarts,\n\tbool justHeader,\n\tEbwtParams<index_t> *params,\n\tbool mmSweep,\n\tbool loadNames,\n\tbool startVerbose)\n{\n\tbool switchEndian; // dummy; caller doesn't care\n#ifdef BOWTIE_MM\n\tchar *mmFile[] = { NULL, NULL };\n#endif\n\tif(_in1Str.length() > 0) {\n\t\tif(_verbose || startVerbose) {\n\t\t\tcerr << \"  About to open input files: \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\t// Initialize our primary and secondary input-stream fields\n\t\tif(_in1 != NULL) fclose(_in1);\n\t\tif(_verbose || startVerbose) cerr << \"Opening \\\"\" << _in1Str.c_str() << \"\\\"\" << endl;\n\t\tif((_in1 = fopen(_in1Str.c_str(), 
\"rb\")) == NULL) {\n\t\t\tcerr << \"Could not open index file \" << _in1Str.c_str() << endl;\n\t\t}\n\t\tif(loadSASamp) {\n\t\t\tif(_in2 != NULL) fclose(_in2);\n\t\t\tif(_verbose || startVerbose) cerr << \"Opening \\\"\" << _in2Str.c_str() << \"\\\"\" << endl;\n\t\t\tif((_in2 = fopen(_in2Str.c_str(), \"rb\")) == NULL) {\n\t\t\t\tcerr << \"Could not open index file \" << _in2Str.c_str() << endl;\n\t\t\t}\n\t\t}\n\t\tif(_verbose || startVerbose) {\n\t\t\tcerr << \"  Finished opening input files: \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\t\n#ifdef BOWTIE_MM\n\t\tif(_useMm /*&& !justHeader*/) {\n\t\t\tconst char *names[] = {_in1Str.c_str(), _in2Str.c_str()};\n\t\t\tint fds[] = { fileno(_in1), fileno(_in2) };\n\t\t\tfor(int i = 0; i < (loadSASamp ? 2 : 1); i++) {\n\t\t\t\tif(_verbose || startVerbose) {\n\t\t\t\t\tcerr << \"  Memory-mapping input file \" << (i+1) << \": \";\n\t\t\t\t\tlogTime(cerr);\n\t\t\t\t}\n\t\t\t\tstruct stat sbuf;\n\t\t\t\tif (stat(names[i], &sbuf) == -1) {\n\t\t\t\t\tperror(\"stat\");\n\t\t\t\t\tcerr << \"Error: Could not stat index file \" << names[i] << \" prior to memory-mapping\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tmmFile[i] = (char*)mmap((void *)0, (size_t)sbuf.st_size,\n\t\t\t\t\t\t\t\t\t\tPROT_READ, MAP_SHARED, fds[(size_t)i], 0);\n\t\t\t\tif(mmFile[i] == (void *)(-1)) {\n\t\t\t\t\tperror(\"mmap\");\n\t\t\t\t\tcerr << \"Error: Could not memory-map the index file \" << names[i] << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tif(mmSweep) {\n\t\t\t\t\tint sum = 0;\n\t\t\t\t\tfor(off_t j = 0; j < sbuf.st_size; j += 1024) {\n\t\t\t\t\t\tsum += (int) mmFile[i][j];\n\t\t\t\t\t}\n\t\t\t\t\tif(startVerbose) {\n\t\t\t\t\t\tcerr << \"  Swept the memory-mapped ebwt index file 1; checksum: \" << sum << \": \";\n\t\t\t\t\t\tlogTime(cerr);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tmmFile1_ = mmFile[0];\n\t\t\tmmFile2_ = loadSASamp ? 
mmFile[1] : NULL;\n\t\t}\n#endif\n\t}\n#ifdef BOWTIE_MM\n\telse if(_useMm && !justHeader) {\n\t\tmmFile[0] = mmFile1_;\n\t\tmmFile[1] = mmFile2_;\n\t}\n\tif(_useMm && !justHeader) {\n\t\tassert(mmFile[0] == mmFile1_);\n\t\tassert(mmFile[1] == mmFile2_);\n\t}\n#endif\n\t\n\tif(_verbose || startVerbose) {\n\t\tcerr << \"  Reading header: \";\n\t\tlogTime(cerr);\n\t}\n\t\n\t// Read endianness hints from both streams\n\tsize_t bytesRead = 0;\n\tswitchEndian = false;\n\tuint32_t one = readU32(_in1, switchEndian); // 1st word of primary stream\n\tbytesRead += 4;\n\tif(loadSASamp) {\n#ifndef NDEBUG\n\t\tassert_eq(one, readU32(_in2, switchEndian)); // should match!\n#else\n\t\treadU32(_in2, switchEndian);\n#endif\n\t}\n\tif(one != 1) {\n\t\tassert_eq((1u<<24), one);\n\t\tassert_eq(1, endianSwapU32(one));\n\t\tswitchEndian = true;\n\t}\n\t\n\t// Can't switch endianness and use memory-mapped files; in order to\n\t// support this, someone has to modify the file to switch\n\t// endiannesses appropriately, and we can't do this inside Bowtie\n\t// or we might be setting up a race condition with other processes.\n\tif(switchEndian && _useMm) {\n\t\tcerr << \"Error: Can't use memory-mapped files when the index is the opposite endianness\" << endl;\n\t\tthrow 1;\n\t}\n\t\n\t// Reads header entries one by one from primary stream\n\tindex_t len          = readIndex<index_t>(_in1, switchEndian);\n\tbytesRead += sizeof(index_t);\n\tint32_t  lineRate     = readI32(_in1, switchEndian);\n\tbytesRead += 4;\n\t/*int32_t  linesPerSide =*/ readI32(_in1, switchEndian);\n\tbytesRead += 4;\n\tint32_t  offRate      = readI32(_in1, switchEndian);\n\tbytesRead += 4;\n\t// TODO: add isaRate to the actual file format (right now, the\n\t// user has to tell us whether there's an ISA sample and what the\n\t// sampling rate is.\n\tint32_t  ftabChars    = readI32(_in1, switchEndian);\n\tbytesRead += 4;\n\t// chunkRate was deprecated in an earlier version of Bowtie; now\n\t// we use it to hold 
flags.\n\tint32_t flags = readI32(_in1, switchEndian);\n\tbool entireRev = false;\n\tif(flags < 0 && (((-flags) & EBWT_COLOR) != 0)) {\n\t\tif(color != -1 && !color) {\n\t\t\tcerr << \"Error: -C was not specified when running Centrifuge, but index is in colorspace.  If\" << endl\n\t\t\t     << \"your reads are in colorspace, please use the -C option.  If your reads are not\" << endl\n\t\t\t     << \"in colorspace, please use a normal index (one built without specifying -C to\" << endl\n\t\t\t     << \"centrifuge-build).\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tcolor = 1;\n\t} else if(flags < 0) {\n\t\tif(color != -1 && color) {\n\t\t\tcerr << \"Error: -C was specified when running Centrifuge, but index is not in colorspace.  If\" << endl\n\t\t\t     << \"your reads are in colorspace, please use a colorspace index (one built using\" << endl\n\t\t\t     << \"centrifuge-build -C).  If your reads are not in colorspace, don't specify -C when\" << endl\n\t\t\t     << \"running centrifuge.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tcolor = 0;\n\t}\n\tif(flags < 0 && (((-flags) & EBWT_ENTIRE_REV) == 0)) {\n\t\tif(needEntireRev != -1 && needEntireRev != 0) {\n\t\t\tcerr << \"Error: This index is compatible with 0.* versions of Bowtie, but not with 2.*\" << endl\n\t\t\t     << \"versions.  Please build or download a version of the index that is compatible\" << endl\n\t\t\t\t << \"with Bowtie 2.* (i.e. 
built with centrifuge-build 2.* or later)\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t} else entireRev = true;\n\tbytesRead += 4;\n\t\n\t// Create a new EbwtParams from the entries read from primary stream\n\tEbwtParams<index_t> *eh;\n\tbool deleteEh = false;\n\tif(params != NULL) {\n\t\tparams->init(len, lineRate, offRate, ftabChars, color, entireRev);\n\t\tif(_verbose || startVerbose) params->print(cerr);\n\t\teh = params;\n\t} else {\n\t\teh = new EbwtParams<index_t>(len, lineRate, offRate, ftabChars, color, entireRev);\n\t\tdeleteEh = true;\n\t}\n\t\n\t// Set up overridden suffix-array-sample parameters\n\tindex_t offsLen = eh->_offsLen;\n    // uint64_t offsSz = eh->_offsSz;\n\tindex_t offRateDiff = 0;\n\tindex_t offsLenSampled = offsLen;\n\tif(_overrideOffRate > offRate) {\n\t\toffRateDiff = _overrideOffRate - offRate;\n\t}\n\tif(offRateDiff > 0) {\n\t\toffsLenSampled >>= offRateDiff;\n\t\tif((offsLen & ~((index_t)OFF_MASK << offRateDiff)) != 0) {\n\t\t\toffsLenSampled++;\n\t\t}\n\t}\n\t\n\t// Can't override the offrate or isarate and use memory-mapped\n\t// files; ultimately, all processes need to copy the sparser sample\n\t// into their own memory spaces.\n\tif(_useMm && (offRateDiff)) {\n\t\tcerr << \"Error: Can't use memory-mapped files when the offrate is overridden\" << endl;\n\t\tthrow 1;\n\t}\n\t\n\t// Read nPat from primary stream\n\tthis->_nPat = readIndex<index_t>(_in1, switchEndian);\n\tbytesRead += sizeof(index_t);\n\t_plen.reset();\n\t// Read plen from primary stream\n\tif(_useMm) {\n#ifdef BOWTIE_MM\n\t\t_plen.init((index_t*)(mmFile[0] + bytesRead), _nPat, false);\n\t\tbytesRead += _nPat*sizeof(index_t);\n\t\tfseek(_in1, _nPat*sizeof(index_t), SEEK_CUR);\n#endif\n\t} else {\n\t\ttry {\n\t\t\tif(_verbose || startVerbose) {\n\t\t\t\tcerr << \"Reading plen (\" << this->_nPat << \"): \";\n\t\t\t\tlogTime(cerr);\n\t\t\t}\n\t\t\t_plen.init(new index_t[_nPat], _nPat, true);\n\t\t\tif(switchEndian) {\n\t\t\t\tfor(index_t i = 0; i < this->_nPat; i++) 
{\n\t\t\t\t\tplen()[i] = readIndex<index_t>(_in1, switchEndian);\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tsize_t r = MM_READ(_in1, (void*)(plen()), _nPat*sizeof(index_t));\n\t\t\t\tif(r != (size_t)(_nPat*sizeof(index_t))) {\n\t\t\t\t\tcerr << \"Error reading _plen[] array: \" << r << \", \" << _nPat*sizeof(index_t) << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t}\n\t\t} catch(bad_alloc& e) {\n\t\t\tcerr << \"Out of memory allocating plen[] in Ebwt::read()\"\n\t\t\t<< \" at \" << __FILE__ << \":\" << __LINE__ << endl;\n\t\t\tthrow e;\n\t\t}\n\t}\n    \n    this->_offw = this->_nPat > std::numeric_limits<uint16_t>::max();\n\t\n\tbool shmemLeader;\n    size_t OFFSET_SIZE;\n\t\n\t// TODO: I'm not consistent on what \"header\" means.  Here I'm using\n\t// \"header\" to mean everything that would exist in memory if we\n\t// started to build the Ebwt but stopped short of the build*() step\n\t// (i.e. everything up to and including join()).\n\tif(justHeader) goto done;\n\t\n\tthis->_nFrag = readIndex<index_t>(_in1, switchEndian);\n\tbytesRead += sizeof(index_t);\n\tif(_verbose || startVerbose) {\n\t\tcerr << \"Reading rstarts (\" << this->_nFrag*3 << \"): \";\n\t\tlogTime(cerr);\n\t}\n\tassert_geq(this->_nFrag, this->_nPat);\n\t_rstarts.reset();\n\tif(loadRstarts) {\n\t\tif(_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t_rstarts.init((index_t*)(mmFile[0] + bytesRead), _nFrag*3, false);\n\t\t\tbytesRead += this->_nFrag*sizeof(index_t)*3;\n\t\t\tfseek(_in1, this->_nFrag*sizeof(index_t)*3, SEEK_CUR);\n#endif\n\t\t} else {\n\t\t\t_rstarts.init(new index_t[_nFrag*3], _nFrag*3, true);\n\t\t\tif(switchEndian) {\n\t\t\t\tfor(size_t i = 0; i < (size_t)(this->_nFrag*3); i += 3) {\n\t\t\t\t\t// fragment starting position in joined reference\n\t\t\t\t\t// string, text id, and fragment offset within text\n\t\t\t\t\tthis->rstarts()[i]   = readIndex<index_t>(_in1, switchEndian);\n\t\t\t\t\tthis->rstarts()[i+1] = readIndex<index_t>(_in1, switchEndian);\n\t\t\t\t\tthis->rstarts()[i+2] = 
readIndex<index_t>(_in1, switchEndian);\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tsize_t r = MM_READ(_in1, (void *)rstarts(), this->_nFrag*sizeof(index_t)*3);\n\t\t\t\tif(r != (size_t)(this->_nFrag*sizeof(index_t)*3)) {\n\t\t\t\t\tcerr << \"Error reading _rstarts[] array: \" << r << \", \" << (this->_nFrag*sizeof(index_t)*3) << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t} else {\n\t\t// Skip em\n\t\tassert(rstarts() == NULL);\n\t\tbytesRead += this->_nFrag*sizeof(index_t)*3;\n\t\tfseek(_in1, this->_nFrag*sizeof(index_t)*3, SEEK_CUR);\n\t}\n\t\n\t_ebwt.reset();\n\tif(_useMm) {\n#ifdef BOWTIE_MM\n\t\t_ebwt.init((uint8_t*)(mmFile[0] + bytesRead), eh->_ebwtTotLen, false);\n\t\tbytesRead += eh->_ebwtTotLen;\n\t\tfseek(_in1, eh->_ebwtTotLen, SEEK_CUR);\n#endif\n\t} else {\n\t\t// Allocate ebwt (big allocation)\n\t\tif(_verbose || startVerbose) {\n\t\t\tcerr << \"Reading ebwt (\" << eh->_ebwtTotLen << \"): \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\tbool shmemLeader = true;\n\t\tif(useShmem_) {\n\t\t\tuint8_t *tmp = NULL;\n\t\t\tshmemLeader = ALLOC_SHARED_U8(\n\t\t\t\t(_in1Str + \"[ebwt]\"), eh->_ebwtTotLen, &tmp,\n\t\t\t\t\"ebwt[]\", (_verbose || startVerbose));\n\t\t\tassert(tmp != NULL);\n\t\t\t_ebwt.init(tmp, eh->_ebwtTotLen, false);\n\t\t\tif(_verbose || startVerbose) {\n\t\t\t\tcerr << \"  shared-mem \" << (shmemLeader ? \"leader\" : \"follower\") << endl;\n\t\t\t}\n\t\t} else {\n\t\t\ttry {\n\t\t\t\t_ebwt.init(new uint8_t[eh->_ebwtTotLen], eh->_ebwtTotLen, true);\n\t\t\t} catch(bad_alloc& e) {\n\t\t\t\tcerr << \"Out of memory allocating the ebwt[] array for the Bowtie index.  
Please try\" << endl\n\t\t\t\t<< \"again on a computer with more memory.\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\t\tif(shmemLeader) {\n\t\t\t// Read ebwt from primary stream\n\t\t\tuint64_t bytesLeft = eh->_ebwtTotLen;\n\t\t\tchar *pebwt = (char*)this->ebwt();\n            while (bytesLeft>0){\n\t\t\t\tsize_t r = MM_READ(this->_in1, (void *)pebwt, bytesLeft);\n\t\t\t\tif(MM_IS_IO_ERR(this->_in1, r, bytesLeft)) {\n\t\t\t\t\tcerr << \"Error reading _ebwt[] array: \" << r << \", \"\n                    << bytesLeft << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tpebwt += r;\n\t\t\t\tbytesLeft -= r;\n\t\t\t}\n\t\t\tif(switchEndian) {\n\t\t\t\tuint8_t *side = this->ebwt();\n\t\t\t\tfor(size_t i = 0; i < eh->_numSides; i++) {\n\t\t\t\t\tindex_t *cums = reinterpret_cast<index_t*>(side + eh->_sideSz - sizeof(index_t)*2);\n\t\t\t\t\tcums[0] = endianSwapIndex(cums[0]);\n\t\t\t\t\tcums[1] = endianSwapIndex(cums[1]);\n\t\t\t\t\tside += this->_eh._sideSz;\n\t\t\t\t}\n\t\t\t}\n#ifdef BOWTIE_SHARED_MEM\n\t\t\tif(useShmem_) NOTIFY_SHARED(ebwt(), eh->_ebwtTotLen);\n#endif\n\t\t} else {\n\t\t\t// Seek past the data and wait until master is finished\n\t\t\tfseek(_in1, eh->_ebwtTotLen, SEEK_CUR);\n#ifdef BOWTIE_SHARED_MEM\n\t\t\tif(useShmem_) WAIT_SHARED(ebwt(), eh->_ebwtTotLen);\n#endif\n\t\t}\n\t}\n\t\n\t// Read zOff from primary stream\n\t_zOff = readIndex<index_t>(_in1, switchEndian);\n\tbytesRead += sizeof(index_t);\n\tassert_lt(_zOff, len);\n\t\n\ttry {\n\t\t// Read fchr from primary stream\n\t\tif(_verbose || startVerbose) cerr << \"Reading fchr (5)\" << endl;\n\t\t_fchr.reset();\n\t\tif(_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t_fchr.init((index_t*)(mmFile[0] + bytesRead), 5, false);\n\t\t\tbytesRead += 5*sizeof(index_t);\n\t\t\tfseek(_in1, 5*sizeof(index_t), SEEK_CUR);\n#endif\n\t\t} else {\n\t\t\t_fchr.init(new index_t[5], 5, true);\n\t\t\tfor(int i = 0; i < 5; i++) {\n\t\t\t\tthis->fchr()[i] = readIndex<index_t>(_in1, switchEndian);\n\t\t\t\tassert_leq(this->fchr()[i], 
len);\n\t\t\t\tassert(i <= 0 || this->fchr()[i] >= this->fchr()[i-1]);\n\t\t\t}\n\t\t}\n\t\tassert_gt(this->fchr()[4], this->fchr()[0]);\n\t\t// Read ftab from primary stream\n\t\tif(_verbose || startVerbose) {\n\t\t\tif(loadFtab) {\n\t\t\t\tcerr << \"Reading ftab (\" << eh->_ftabLen << \"): \";\n\t\t\t\tlogTime(cerr);\n\t\t\t} else {\n\t\t\t\tcerr << \"Skipping ftab (\" << eh->_ftabLen << \"): \";\n\t\t\t}\n\t\t}\n\t\t_ftab.reset();\n\t\tif(loadFtab) {\n\t\t\tif(_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t\t_ftab.init((index_t*)(mmFile[0] + bytesRead), eh->_ftabLen, false);\n\t\t\t\tbytesRead += eh->_ftabLen*sizeof(index_t);\n\t\t\t\tfseek(_in1, eh->_ftabLen*sizeof(index_t), SEEK_CUR);\n#endif\n\t\t\t} else {\n\t\t\t\t_ftab.init(new index_t[eh->_ftabLen], eh->_ftabLen, true);\n\t\t\t\tif(switchEndian) {\n\t\t\t\t\tfor(size_t i = 0; i < eh->_ftabLen; i++)\n\t\t\t\t\t\tthis->ftab()[i] = readIndex<index_t>(_in1, switchEndian);\n\t\t\t\t} else {\n\t\t\t\t\tsize_t r = MM_READ(_in1, (void *)ftab(), eh->_ftabLen*sizeof(index_t));\n\t\t\t\t\tif(r != (size_t)(eh->_ftabLen*sizeof(index_t))) {\n\t\t\t\t\t\tcerr << \"Error reading _ftab[] array: \" << r << \", \" << (eh->_ftabLen*sizeof(index_t)) << endl;\n\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Read etab from primary stream\n\t\t\tif(_verbose || startVerbose) {\n\t\t\t\tif(loadFtab) {\n\t\t\t\t\tcerr << \"Reading eftab (\" << eh->_eftabLen << \"): \";\n\t\t\t\t\tlogTime(cerr);\n\t\t\t\t} else {\n\t\t\t\t\tcerr << \"Skipping eftab (\" << eh->_eftabLen << \"): \";\n\t\t\t\t}\n\n\t\t\t}\n\t\t\t_eftab.reset();\n\t\t\tif(_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t\t_eftab.init((index_t*)(mmFile[0] + bytesRead), eh->_eftabLen, false);\n\t\t\t\tbytesRead += eh->_eftabLen*sizeof(index_t);\n\t\t\t\tfseek(_in1, eh->_eftabLen*sizeof(index_t), SEEK_CUR);\n#endif\n\t\t\t} else {\n\t\t\t\t_eftab.init(new index_t[eh->_eftabLen], eh->_eftabLen, true);\n\t\t\t\tif(switchEndian) {\n\t\t\t\t\tfor(size_t i = 0; i < eh->_eftabLen; 
i++)\n\t\t\t\t\t\tthis->eftab()[i] = readIndex<index_t>(_in1, switchEndian);\n\t\t\t\t} else {\n\t\t\t\t\tsize_t r = MM_READ(_in1, (void *)this->eftab(), eh->_eftabLen*sizeof(index_t));\n\t\t\t\t\tif(r != (size_t)(eh->_eftabLen*sizeof(index_t))) {\n\t\t\t\t\t\tcerr << \"Error reading _eftab[] array: \" << r << \", \" << (eh->_eftabLen*sizeof(index_t)) << endl;\n\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tfor(index_t i = 0; i < eh->_eftabLen; i++) {\n\t\t\t\tif(i > 0 && this->eftab()[i] > 0) {\n\t\t\t\t\tassert_geq(this->eftab()[i], this->eftab()[i-1]);\n\t\t\t\t} else if(i > 0 && this->eftab()[i-1] == 0) {\n\t\t\t\t\tassert_eq(0, this->eftab()[i]);\n\t\t\t\t}\n\t\t\t}\n\t\t} else {\n\t\t\tassert(ftab() == NULL);\n\t\t\tassert(eftab() == NULL);\n\t\t\t// Skip ftab\n\t\t\tbytesRead += eh->_ftabLen*sizeof(index_t);\n\t\t\tfseek(_in1, eh->_ftabLen*sizeof(index_t), SEEK_CUR);\n\t\t\t// Skip eftab\n\t\t\tbytesRead += eh->_eftabLen*sizeof(index_t);\n\t\t\tfseek(_in1, eh->_eftabLen*sizeof(index_t), SEEK_CUR);\n\t\t}\n\t} catch(bad_alloc& e) {\n\t\tcerr << \"Out of memory allocating fchr[], ftab[] or eftab[] arrays for the Bowtie index.\" << endl\n\t\t<< \"Please try again on a computer with more memory.\" << endl;\n\t\tthrow 1;\n\t}\n\t\n\t// Read reference sequence names from primary index file (or not,\n\t// if --refidx is specified)\n\tif(loadNames) {\n\t\twhile(true) {\n\t\t\tchar c = '\\0';\n\t\t\tif(MM_READ(_in1, (void *)(&c), (size_t)1) != (size_t)1) break;\n\t\t\tbytesRead++;\n\t\t\tif(c == '\\0') break;\n\t\t\telse if(c == '\\n') {\n\t\t\t\tthis->_refnames.push_back(\"\");\n\t\t\t} else {\n\t\t\t\tif(this->_refnames.size() == 0) {\n\t\t\t\t\tthis->_refnames.push_back(\"\");\n\t\t\t\t}\n\t\t\t\tthis->_refnames.back().push_back(c);\n\t\t\t}\n\t\t}\n        if(this->_refnames.back().empty()) {\n            this->_refnames.pop_back();\n        }\n\t}\n\t\n    OFFSET_SIZE = (this->_offw ? 
4 : 2);\n\t_offs.reset();\n    _offsw.reset();\n    if(loadSASamp) {\n        bytesRead = 4; // reset for secondary index file (already read 1-sentinel)\n        \n        shmemLeader = true;\n        if(_verbose || startVerbose) {\n            cerr << \"Reading offs (\" << offsLenSampled << \" \" << std::setw(2) << sizeof(index_t)*8 << \"-bit words): \";\n            logTime(cerr);\n        }\n        \n        if(!_useMm) {\n            if(!useShmem_) {\n                // Allocate offs_\n                try {\n                    if(this->_offw) {\n                        _offsw.init(new uint32_t[offsLenSampled], offsLenSampled, true);\n                    } else {\n                        _offs.init(new uint16_t[offsLenSampled], offsLenSampled, true);\n                    }\n                } catch(bad_alloc& e) {\n                    cerr << \"Out of memory allocating the offs[] array  for the Bowtie index.\" << endl\n\t\t\t\t\t<< \"Please try again on a computer with more memory.\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t} else {\n                if(this->_offw) {\n                    uint32_t *tmp = NULL;\n                    shmemLeader = ALLOC_SHARED_U32(\n                                                   (_in2Str + \"[offs]\"), offsLenSampled*OFFSET_SIZE, &tmp,\n                                                   \"offs\", (_verbose || startVerbose));\n                    _offsw.init((uint32_t*)tmp, offsLenSampled, false);\n                } else {\n                    uint16_t *tmp = NULL;\n                    shmemLeader = ALLOC_SHARED_U32(\n                                                   (_in2Str + \"[offs]\"), offsLenSampled*OFFSET_SIZE, &tmp,\n                                                   \"offs\", (_verbose || startVerbose));\n                    _offs.init((uint16_t*)tmp, offsLenSampled, false);\n                }\n\t\t\t}\n\t\t}\n        \n        if(_overrideOffRate < 32) {\n            if(shmemLeader) {\n                // 
Allocate offs (big allocation)\n                if(switchEndian || offRateDiff > 0) {\n                    assert(!_useMm);\n                    const index_t blockMaxSz = (index_t)(2 * 1024 * 1024); // 2 MB block size\n                    const index_t blockMaxSzU = (blockMaxSz / OFFSET_SIZE); // # offset entries per block\n                    char *buf;\n                    try {\n                        buf = new char[blockMaxSz];\n                    } catch(std::bad_alloc& e) {\n                        cerr << \"Error: Out of memory allocating part of _offs array: '\" << e.what() << \"'\" << endl;\n                        throw e;\n                    }\n                    for(index_t i = 0; i < offsLen; i += blockMaxSzU) {\n                        index_t block = min<index_t>((index_t)blockMaxSzU, (index_t)(offsLen - i));\n                        size_t r = MM_READ(_in2, (void *)buf, block * OFFSET_SIZE);\n                        if(r != (size_t)(block * OFFSET_SIZE)) {\n                            cerr << \"Error reading block of _offs[] array: \" << r << \", \" << (block * OFFSET_SIZE) << endl;\n                            throw 1;\n                        }\n                        index_t idx = i >> 1;\n                        for(index_t j = 0; j < block; j += OFFSET_SIZE) {\n                            assert_lt(idx, offsLenSampled);\n                            if(this->_offw) {\n                                this->offsw()[idx] = ((uint32_t*)buf)[j];\n                                if(switchEndian) {\n                                    // Swap the 32-bit entry just stored; offs() is not allocated in wide mode\n                                    this->offsw()[idx] = endianSwapIndex((uint32_t)this->offsw()[idx]);\n                                }\n                            } else {\n                                this->offs()[idx] = ((uint16_t*)buf)[j];\n                                if(switchEndian) {\n                                    this->offs()[idx] = endianSwapIndex((uint16_t)this->offs()[idx]);\n                                }\n                         
   }\n                            idx++;\n                        }\n                    }\n                    delete[] buf;\n                } else {\n                    if(_useMm) {\n#ifdef BOWTIE_MM\n                        if(this->_offw) {\n                            _offsw.init((uint32_t*)(mmFile[1] + bytesRead), offsLen, false);\n                        } else {\n                            _offs.init((uint16_t*)(mmFile[1] + bytesRead), offsLen, false);\n                        }\n                        bytesRead += (offsLen * OFFSET_SIZE);\n                        fseek(_in2, (offsLen * OFFSET_SIZE), SEEK_CUR);\n#endif\n                    } else {\n                        // Workaround for small-index mode where MM_READ may\n                        // not be able to handle read amounts greater than 2^32\n                        // bytes.\n                        uint64_t bytesLeft = (offsLen * OFFSET_SIZE);\n                        char *offs = NULL;\n                        if(this->_offw) {\n                            offs = (char *)this->offsw();\n                        } else {\n                            offs = (char *)this->offs();\n                        }\n                        while(bytesLeft > 0) {\n                            size_t r = MM_READ(_in2, (void*)offs, bytesLeft);\n                            if(MM_IS_IO_ERR(_in2, r, bytesLeft)) {\n                                cerr << \"Error reading block of _offs[] array: \"\n                                << r << \", \" << bytesLeft << gLastIOErrMsg << endl;\n                                throw 1;\n                            }\n                            offs += r;\n                            bytesLeft -= r;\n                        }\n                    }\n                }\n#ifdef BOWTIE_SHARED_MEM\n                if(useShmem_) {\n                    if(this->_offw) {\n                        NOTIFY_SHARED(offsw(), offsLenSampled*OFFSET_SIZE);\n                    } else {\n  
                      NOTIFY_SHARED(offs(), offsLenSampled*OFFSET_SIZE);\n                    }\n                }\n#endif\n            } else {\n                // Not the shmem leader\n\t\t\t\tfseek(_in2, offsLenSampled*OFFSET_SIZE, SEEK_CUR);\n#ifdef BOWTIE_SHARED_MEM\n                if(this->_offw) {\n                    NOTIFY_SHARED(offsw(), offsLenSampled*OFFSET_SIZE);\n                } else {                    \n                    NOTIFY_SHARED(offs(), offsLenSampled*OFFSET_SIZE);\n                }\n#endif\n            }\n        }\n    }\n    \n    this->postReadInit(*eh); // Initialize fields of Ebwt not read from file\n    if(_verbose || startVerbose) print(cerr, *eh);\n    \n    // The fact that _ebwt and friends actually point to something\n    // (other than NULL) now signals to other member functions that the\n    // Ebwt is loaded into memory.\n    \ndone: // Exit hatch for both justHeader and !justHeader\n\t\n\t// Be kind\n\tif(deleteEh) delete eh;\n#ifdef BOWTIE_MM\n\tif(_in1 != NULL) fseek(_in1, 0, SEEK_SET);\n\tif(_in2 != NULL) fseek(_in2, 0, SEEK_SET);\n#else\n\tif(_in1 != NULL) rewind(_in1);\n    if(_in2 != NULL) rewind(_in2);\n#endif\n}\n\n/**\n * Read reference names from an input stream 'in' for an Ebwt primary\n * file and store them in 'refnames'.\n */\ntemplate <typename index_t>\nvoid readEbwtRefnames(istream& in, EList<string>& refnames) {\n\t// _in1 must already be open with the get cursor at the\n\t// beginning and no error flags set.\n\tassert(in.good());\n\tassert_eq((streamoff)in.tellg(), ios::beg);\n\t\n\t// Read endianness hints from both streams\n\tbool switchEndian = false;\n\tuint32_t one = readU32(in, switchEndian); // 1st word of primary stream\n\tif(one != 1) {\n\t\tassert_eq((1u<<24), one);\n\t\tswitchEndian = true;\n\t}\n\t\n\t// Reads header entries one by one from primary stream\n\tindex_t len          = readIndex<index_t>(in, switchEndian);\n\tint32_t  lineRate     = readI32(in, switchEndian);\n\t/*int32_t  
linesPerSide =*/ readI32(in, switchEndian);\n\tint32_t  offRate      = readI32(in, switchEndian);\n\tint32_t  ftabChars    = readI32(in, switchEndian);\n\t// BTL: chunkRate is now deprecated\n\tint32_t flags = readI32(in, switchEndian);\n\tbool color = false;\n\tbool entireReverse = false;\n\tif(flags < 0) {\n\t\tcolor = (((-flags) & EBWT_COLOR) != 0);\n\t\tentireReverse = (((-flags) & EBWT_ENTIRE_REV) != 0);\n\t}\n\t\n\t// Create a new EbwtParams from the entries read from primary stream\n\tEbwtParams<index_t> eh(len, lineRate, offRate, ftabChars, color, entireReverse);\n\t\n\tindex_t nPat = readIndex<index_t>(in, switchEndian); // nPat\n\tin.seekg(nPat*sizeof(index_t), ios_base::cur); // skip plen\n\t\n\t// Skip rstarts\n\tindex_t nFrag = readIndex<index_t>(in, switchEndian);\n\tin.seekg(nFrag*sizeof(index_t)*3, ios_base::cur);\n\t\n\t// Skip ebwt\n\tin.seekg(eh._ebwtTotLen, ios_base::cur);\n\t\n\t// Skip zOff from primary stream\n\treadIndex<index_t>(in, switchEndian);\n\t\n\t// Skip fchr\n\tin.seekg(5 * sizeof(index_t), ios_base::cur);\n\t\n\t// Skip ftab\n\tin.seekg(eh._ftabLen*sizeof(index_t), ios_base::cur);\n\t\n\t// Skip eftab\n\tin.seekg(eh._eftabLen*sizeof(index_t), ios_base::cur);\n\t\n\t// Read reference sequence names from primary index file\n\twhile(true) {\n\t\tchar c = '\\0';\n\t\tin.read(&c, 1);\n\t\tif(in.eof()) break;\n\t\tif(c == '\\0') break;\n\t\telse if(c == '\\n') {\n\t\t\trefnames.push_back(\"\");\n\t\t} else {\n\t\t\tif(refnames.size() == 0) {\n\t\t\t\trefnames.push_back(\"\");\n\t\t\t}\n\t\t\trefnames.back().push_back(c);\n\t\t}\n\t}\n\tif(refnames.back().empty()) {\n\t\trefnames.pop_back();\n\t}\n\t\n\t// Be kind\n\tin.clear(); in.seekg(0, ios::beg);\n\tassert(in.good());\n}\n\n/**\n * Read reference names from the index with basename 'in' and store\n * them in 'refnames'.\n */\ntemplate <typename index_t>\nvoid readEbwtRefnames(const string& instr, EList<string>& refnames) {\n\tifstream in;\n\t// Initialize our primary and secondary 
input-stream fields\n\tin.open((instr + \".1.\" + gEbwt_ext).c_str(), ios_base::in | ios::binary);\n\tif(!in.is_open()) {\n\t\tthrow EbwtFileOpenException(\"Cannot open file \" + instr);\n\t}\n\tassert(in.is_open());\n\tassert(in.good());\n\tassert_eq((streamoff)in.tellg(), ios::beg);\n\treadEbwtRefnames<index_t>(in, refnames);\n}\n\n/**\n * Read just enough of the Ebwt's header to get its flags\n */\ntemplate <typename index_t>\nint32_t Ebwt<index_t>::readFlags(const string& instr) {\n\tifstream in;\n\t// Initialize our primary and secondary input-stream fields\n\tin.open((instr + \".1.\" + gEbwt_ext).c_str(), ios_base::in | ios::binary);\n\tif(!in.is_open()) {\n\t\tthrow EbwtFileOpenException(\"Cannot open file \" + instr);\n\t}\n\tassert(in.is_open());\n\tassert(in.good());\n\tbool switchEndian = false;\n\tuint32_t one = readU32(in, switchEndian); // 1st word of primary stream\n\tif(one != 1) {\n\t\tassert_eq((1u<<24), one);\n\t\tassert_eq(1, endianSwapU32(one));\n\t\tswitchEndian = true;\n\t}\n\treadIndex<index_t>(in, switchEndian);\n\treadI32(in, switchEndian);\n\treadI32(in, switchEndian);\n\treadI32(in, switchEndian);\n\treadI32(in, switchEndian);\n\tint32_t flags = readI32(in, switchEndian);\n\treturn flags;\n}\n\n/**\n * Read just enough of the Ebwt's header to determine whether it's\n * colorspace.\n */\nbool\nreadEbwtColor(const string& instr) {\n\tint32_t flags = Ebwt<>::readFlags(instr);\n\tif(flags < 0 && (((-flags) & EBWT_COLOR) != 0)) {\n\t\treturn true;\n\t} else {\n\t\treturn false;\n\t}\n}\n\n/**\n * Read just enough of the Ebwt's header to determine whether it's\n * entirely reversed.\n */\nbool\nreadEntireReverse(const string& instr) {\n\tint32_t flags = Ebwt<>::readFlags(instr);\n\tif(flags < 0 && (((-flags) & EBWT_ENTIRE_REV) != 0)) {\n\t\treturn true;\n\t} else {\n\t\treturn false;\n\t}\n}\n\n/**\n * Write an extended Burrows-Wheeler transform to a pair of output\n * streams.\n *\n * @param out1 output stream to primary file\n * @param out2 
output stream to secondary file\n * @param be   write in big endian?\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::writeFromMemory(bool justHeader,\n                           ostream& out1,\n                           ostream& out2) const\n{\n\tconst EbwtParams<index_t>& eh = this->_eh;\n\tassert(eh.repOk());\n\tuint32_t be = this->toBe();\n\tassert(out1.good());\n\tassert(out2.good());\n\t\n\t// When building an Ebwt, these header parameters are known\n\t// \"up-front\", i.e., they can be written to disk immediately,\n\t// before we join() or buildToDisk()\n\twriteI32(out1, 1, be); // endian hint for primary stream\n\twriteI32(out2, 1, be); // endian hint for secondary stream\n\twriteIndex<index_t>(out1, eh._len,          be); // length of string (and bwt and suffix array)\n\twriteI32(out1, eh._lineRate,     be); // 2^lineRate = size in bytes of 1 line\n\twriteI32(out1, 2,                be); // not used\n\twriteI32(out1, eh._offRate,      be); // every 2^offRate chars is \"marked\"\n\twriteI32(out1, eh._ftabChars,    be); // number of 2-bit chars used to address ftab\n\tint32_t flags = 1;\n\tif(eh._color) flags |= EBWT_COLOR;\n\tif(eh._entireReverse) flags |= EBWT_ENTIRE_REV;\n\twriteI32(out1, -flags, be); // BTL: chunkRate is now deprecated\n\t\n\tif(!justHeader) {\n\t\tassert(rstarts() != NULL);\n\t\tassert(offs() != NULL);\n\t\tassert(ftab() != NULL);\n\t\tassert(eftab() != NULL);\n\t\tassert(isInMemory());\n\t\t// These Ebwt parameters are known after the input strings have\n\t\t// been joined() but before they have been built().  
These can be\n\t\t// written to the disk next and then discarded from memory.\n\t\twriteIndex<index_t>(out1, this->_nPat,      be);\n\t\tfor(index_t i = 0; i < this->_nPat; i++)\n\t\t\twriteIndex<index_t>(out1, this->plen()[i], be);\n\t\tassert_geq(this->_nFrag, this->_nPat);\n\t\twriteIndex<index_t>(out1, this->_nFrag, be);\n\t\tfor(size_t i = 0; i < this->_nFrag*3; i++)\n\t\t\twriteIndex<index_t>(out1, this->rstarts()[i], be);\n\t\t\n\t\t// These Ebwt parameters are discovered only as the Ebwt is being\n\t\t// built (in buildToDisk()).  Of these, only 'offs' and 'ebwt' are\n\t\t// terribly large.  'ebwt' is written to the primary file and then\n\t\t// discarded from memory as it is built; 'offs' is similarly\n\t\t// written to the secondary file and discarded.\n\t\tout1.write((const char *)this->ebwt(), eh._ebwtTotLen);\n\t\twriteIndex<index_t>(out1, this->zOff(), be);\n\t\tindex_t offsLen = eh._offsLen;\n\t\tfor(index_t i = 0; i < offsLen; i++)\n\t\t\twriteIndex<index_t>(out2, this->offs()[i], be);\n\t\t\n\t\t// 'fchr', 'ftab' and 'eftab' are not fully determined until the\n\t\t// loop is finished, so they are written to the primary file after\n\t\t// all of 'ebwt' has already been written and only then discarded\n\t\t// from memory.\n\t\tfor(int i = 0; i < 5; i++)\n\t\t\twriteIndex<index_t>(out1, this->fchr()[i], be);\n\t\tfor(index_t i = 0; i < eh._ftabLen; i++)\n\t\t\twriteIndex<index_t>(out1, this->ftab()[i], be);\n\t\tfor(index_t i = 0; i < eh._eftabLen; i++)\n\t\t\twriteIndex<index_t>(out1, this->eftab()[i], be);\n\t}\n}\n\n/**\n * Given a pair of strings representing output filenames, and assuming\n * this Ebwt object is currently in memory, write out this Ebwt to the\n * specified files.\n *\n * If sanity-checking is enabled, then once the streams have been\n * fully written and closed, we reopen them and read them into a\n * (hopefully) exact copy of this Ebwt.  
We then assert that the\n * current Ebwt and the copy match in all of their fields.\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::writeFromMemory(bool justHeader,\n                           const string& out1,\n                           const string& out2) const\n{\n\tASSERT_ONLY(const EbwtParams<index_t>& eh = this->_eh);\n\tassert(isInMemory());\n\tassert(eh.repOk());\n\t\n\tofstream fout1(out1.c_str(), ios::binary);\n\tofstream fout2(out2.c_str(), ios::binary);\n\twriteFromMemory(justHeader, fout1, fout2);\n\tfout1.close();\n\tfout2.close();\n\t\n\t// Read the file back in and assert that all components match\n\tif(_sanity) {\n#if 0\n\t\tif(_verbose)\n\t\t\tcout << \"Re-reading \\\"\" << out1 << \"\\\"/\\\"\" << out2 << \"\\\" for sanity check\" << endl;\n\t\tEbwt copy(out1, out2, _verbose, _sanity);\n\t\tassert(!isInMemory());\n\t\tcopy.loadIntoMemory(eh._color ? 1 : 0, true, false, false);\n\t\tassert(isInMemory());\n\t    assert_eq(eh._lineRate,     copy.eh()._lineRate);\n\t    assert_eq(eh._offRate,      copy.eh()._offRate);\n\t    assert_eq(eh._ftabChars,    copy.eh()._ftabChars);\n\t    assert_eq(eh._len,          copy.eh()._len);\n\t    assert_eq(_zOff,             copy.zOff());\n\t    assert_eq(_zEbwtBpOff,       copy.zEbwtBpOff());\n\t    assert_eq(_zEbwtByteOff,     copy.zEbwtByteOff());\n\t\tassert_eq(_nPat,             copy.nPat());\n\t\tfor(index_t i = 0; i < _nPat; i++)\n\t\t\tassert_eq(this->_plen[i], copy.plen()[i]);\n\t\tassert_eq(this->_nFrag, copy.nFrag());\n\t\tfor(size_t i = 0; i < this->nFrag*3; i++) {\n\t\t\tassert_eq(this->_rstarts[i], copy.rstarts()[i]);\n\t\t}\n\t\tfor(index_t i = 0; i < 5; i++)\n\t\t\tassert_eq(this->_fchr[i], copy.fchr()[i]);\n\t\tfor(size_t i = 0; i < eh._ftabLen; i++)\n\t\t\tassert_eq(this->ftab()[i], copy.ftab()[i]);\n\t\tfor(size_t i = 0; i < eh._eftabLen; i++)\n\t\t\tassert_eq(this->eftab()[i], copy.eftab()[i]);\n\t\tfor(index_t i = 0; i < eh._offsLen; i++)\n\t\t\tassert_eq(this->_offs[i], 
copy.offs()[i]);\n\t\tfor(index_t i = 0; i < eh._ebwtTotLen; i++)\n\t\t\tassert_eq(this->ebwt()[i], copy.ebwt()[i]);\n\t\tcopy.sanityCheckAll();\n\t\tif(_verbose)\n\t\t\tcout << \"Read-in check passed for \\\"\" << out1 << \"\\\"/\\\"\" << out2 << \"\\\"\" << endl;\n#endif\n\t}\n}\n\n/**\n * Write the rstarts array given the szs array for the reference.\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::szsToDisk(const EList<RefRecord>& szs, ostream& os, int reverse) {\n#ifdef CENTRIFUGE\n    if(rstarts() == NULL) {\n        _rstarts.init(new index_t[this->_nFrag*3], this->_nFrag*3, true);\n    }\n#endif\n    \n\tsize_t seq = 0;\n\tindex_t off = 0;\n\tindex_t totlen = 0;\n    index_t rstarts_idx = 0;\n\tfor(size_t i = 0; i < szs.size(); i++) {\n\t\tif(szs[i].len == 0) continue;\n\t\tif(szs[i].first) off = 0;\n\t\toff += szs[i].off;\n\t\tif(szs[i].first && szs[i].len > 0) seq++;\n\t\tindex_t seqm1 = seq-1;\n\t\tassert_lt(seqm1, _nPat);\n\t\tindex_t fwoff = off;\n\t\tif(reverse == REF_READ_REVERSE) {\n\t\t\t// Invert pattern idxs\n\t\t\tseqm1 = _nPat - seqm1 - 1;\n\t\t\t// Invert pattern idxs\n\t\t\tassert_leq(off + szs[i].len, plen()[seqm1]);\n\t\t\tfwoff = plen()[seqm1] - (off + szs[i].len);\n\t\t}\n\t\twriteIndex<index_t>(os, totlen, this->toBe()); // offset from beginning of joined string\n\t\twriteIndex<index_t>(os, (index_t)seqm1,  this->toBe()); // sequence id\n\t\twriteIndex<index_t>(os, (index_t)fwoff,  this->toBe()); // offset into sequence\n#ifdef CENTRIFUGE\n        this->rstarts()[rstarts_idx*3]   = totlen;\n        this->rstarts()[rstarts_idx*3+1] = (index_t)seqm1;\n        this->rstarts()[rstarts_idx*3+2] = (index_t)fwoff;\n        rstarts_idx++;\n#endif\n\t\ttotlen += szs[i].len;\n\t\toff += szs[i].len;\n\t}\n}\n\n#endif /*EBWT_IO_H_*/\n\n"
  },
  {
    "path": "bt2_util.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef EBWT_UTIL_H_\n#define EBWT_UTIL_H_\n\n#include <string>\n#include <stdexcept>\n#include <iostream>\n#include <fstream>\n#include <stdlib.h>\n#include <string.h>\n#include \"bt2_idx.h\"\n\n///////////////////////////////////////////////////////////////////////\n//\n// Functions for printing and sanity-checking Ebwts\n//\n///////////////////////////////////////////////////////////////////////\n\n/**\n * Check that the ebwt array is internally consistent up to (and not\n * including) the given side index by re-counting the chars and\n * comparing against the embedded occ[] arrays.\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::sanityCheckUpToSide(int upToSide) const {\n\tassert(isInMemory());\n\tindex_t occ[] = {0, 0, 0, 0};\n\tASSERT_ONLY(index_t occ_save[] = {0, 0, 0, 0});\n\tindex_t cur = 0; // byte pointer\n\tconst EbwtParams<index_t>& eh = this->_eh;\n\tbool fw = false;\n\twhile(cur < (upToSide * eh._sideSz)) {\n\t\tassert_leq(cur + eh._sideSz, eh._ebwtTotLen);\n\t\tfor(index_t i = 0; i < eh._sideBwtSz; i++) {\n\t\t\tuint8_t by = this->ebwt()[cur + (fw ? 
i : eh._sideBwtSz-i-1)];\n\t\t\tfor(int j = 0; j < 4; j++) {\n\t\t\t\t// Unpack from lowest to highest bit pair\n\t\t\t\tint twoBit = unpack_2b_from_8b(by, fw ? j : 3-j);\n\t\t\t\tocc[twoBit]++;\n\t\t\t}\n\t\t\tassert_eq(0, (occ[0] + occ[1] + occ[2] + occ[3]) % 4);\n\t\t}\n\t\tassert_eq(0, (occ[0] + occ[1] + occ[2] + occ[3]) % eh._sideBwtLen);\n\t\t// Finished forward bucket; check saved [A], [C], [G] and [T]\n\t\t// against the index_ts encoded here\n\t\tASSERT_ONLY(const index_t *uebwt = reinterpret_cast<const index_t*>(&ebwt()[cur + eh._sideBwtSz]));\n\t\tASSERT_ONLY(index_t as = uebwt[0]);\n\t\tASSERT_ONLY(index_t cs = uebwt[1]);\n\t\tASSERT_ONLY(index_t gs = uebwt[2]);\n\t\tASSERT_ONLY(index_t ts = uebwt[3]);\n\t\tassert(as == occ_save[0] || as == occ_save[0]-1);\n\t\tassert_eq(cs, occ_save[1]);\n\t\tassert_eq(gs, occ_save[2]);\n\t\tassert_eq(ts, occ_save[3]);\n#ifndef NDEBUG\n\t\tocc_save[0] = occ[0];\n\t\tocc_save[1] = occ[1];\n\t\tocc_save[2] = occ[2];\n\t\tocc_save[3] = occ[3];\n#endif\n\t\tcur += eh._sideSz;\n\t}\n}\n\n/**\n * Sanity-check various pieces of the Ebwt\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::sanityCheckAll(int reverse) const {\n\tconst EbwtParams<index_t>& eh = this->_eh;\n\tassert(isInMemory());\n\t// Check ftab\n\tfor(index_t i = 1; i < eh._ftabLen; i++) {\n\t\tassert_geq(this->ftabHi(i), this->ftabLo(i-1));\n\t\tassert_geq(this->ftabLo(i), this->ftabHi(i-1));\n\t\tassert_leq(this->ftabHi(i), eh._bwtLen+1);\n\t}\n\tassert_eq(this->ftabHi(eh._ftabLen-1), eh._bwtLen);\n\t\n\t// Check offs\n\tint seenLen = (eh._bwtLen + 31) >> ((index_t)5);\n\tuint32_t *seen;\n\ttry {\n\t\tseen = new uint32_t[seenLen]; // bitvector marking seen offsets\n\t} catch(bad_alloc& e) {\n\t\tcerr << \"Out of memory allocating seen[] at \" << __FILE__ << \":\" << __LINE__ << endl;\n\t\tthrow e;\n\t}\n\tmemset(seen, 0, 4 * seenLen);\n\tindex_t offsLen = eh._offsLen;\n\tfor(index_t i = 0; i < offsLen; i++) {\n\t\tassert_lt(this->offs()[i], 
eh._bwtLen);\n\t\tint w = this->offs()[i] >> 5;\n\t\tint r = this->offs()[i] & 31;\n\t\tassert_eq(0, (seen[w] >> r) & 1); // shouldn't have been seen before\n\t\tseen[w] |= (1 << r);\n\t}\n\tdelete[] seen;\n\t\n\t// Check nPat\n\tassert_gt(this->_nPat, 0);\n    \n    // Check plen, flen\n\tfor(index_t i = 0; i < this->_nPat; i++) {\n\t\tassert_geq(this->plen()[i], 0);\n\t}\n    \n\t// Check rstarts\n\tif(this->rstarts() != NULL) {\n\t\tfor(index_t i = 0; i < this->_nFrag-1; i++) {\n\t\t\tassert_gt(this->rstarts()[(i+1)*3], this->rstarts()[i*3]);\n\t\t\tif(reverse == REF_READ_REVERSE) {\n\t\t\t\tassert(this->rstarts()[(i*3)+1] >= this->rstarts()[((i+1)*3)+1]);\n\t\t\t} else {\n\t\t\t\tassert(this->rstarts()[(i*3)+1] <= this->rstarts()[((i+1)*3)+1]);\n\t\t\t}\n\t\t}\n\t}\n\t\n\t// Check ebwt\n\tsanityCheckUpToSide(eh._numSides);\n\tVMSG_NL(\"Ebwt::sanityCheck passed\");\n}\n\n/**\n * Transform this Ebwt into the original string in linear time by using\n * the LF mapping to walk backwards starting at the row corresponding\n * to the end of the string.  The result is written to s.  The Ebwt\n * must be in memory.\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::restore(SString<char>& s) const {\n\tassert(isInMemory());\n\ts.resize(this->_eh._len);\n\tindex_t jumps = 0;\n\tindex_t i = this->_eh._len; // should point to final SA elt (starting with '$')\n\tSideLocus<index_t> l(i, this->_eh, this->ebwt());\n\twhile(i != _zOff) {\n\t\tassert_lt(jumps, this->_eh._len);\n\t\t//if(_verbose) cout << \"restore: i: \" << i << endl;\n\t\t// Not a marked row; go back a char in the original string\n\t\tindex_t newi = mapLF(l ASSERT_ONLY(, false));\n\t\tassert_neq(newi, i);\n\t\ts[this->_eh._len - jumps - 1] = rowL(l);\n\t\ti = newi;\n\t\tl.initFromRow(i, this->_eh, this->ebwt());\n\t\tjumps++;\n\t}\n\tassert_eq(jumps, this->_eh._len);\n}\n\n/**\n * Check that this Ebwt, when restored via restore(), matches up with\n * the given array of reference sequences.  
For sanity checking.\n */\ntemplate <typename index_t>\nvoid Ebwt<index_t>::checkOrigs(\n\tconst EList<SString<char> >& os,\n\tbool color,\n\tbool mirror) const\n{\n\tSString<char> rest;\n\trestore(rest);\n\tindex_t restOff = 0;\n\tsize_t i = 0, j = 0;\n\tif(mirror) {\n\t\t// TODO: FIXME\n\t\treturn;\n\t}\n\twhile(i < os.size()) {\n\t\tsize_t olen = os[i].length();\n\t\tint lastorig = -1;\n\t\tfor(; j < olen; j++) {\n\t\t\tsize_t joff = j;\n\t\t\tif(mirror) joff = olen - j - 1;\n\t\t\tif((int)os[i][joff] == 4) {\n\t\t\t\t// Skip over Ns\n\t\t\t\tlastorig = -1;\n\t\t\t\tif(!mirror) {\n\t\t\t\t\twhile(j < olen && (int)os[i][j] == 4) j++;\n\t\t\t\t} else {\n\t\t\t\t\twhile(j < olen && (int)os[i][olen-j-1] == 4) j++;\n\t\t\t\t}\n\t\t\t\tj--;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tif(lastorig == -1 && color) {\n\t\t\t\tlastorig = os[i][joff];\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tif(color) {\n\t\t\t\tassert_neq(-1, lastorig);\n\t\t\t\tassert_eq(dinuc2color[(int)os[i][joff]][lastorig], rest[restOff]);\n\t\t\t} else {\n\t\t\t\tassert_eq(os[i][joff], rest[restOff]);\n\t\t\t}\n\t\t\tlastorig = (int)os[i][joff];\n\t\t\trestOff++;\n\t\t}\n\t\tif(j == os[i].length()) {\n\t\t\t// Moved to next sequence\n\t\t\ti++;\n\t\t\tj = 0;\n\t\t} else {\n\t\t\t// Just jumped over a gap\n\t\t}\n\t}\n}\n\n#endif /*EBWT_UTIL_H_*/\n\n"
  },
  {
    "path": "btypes.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n\n#ifndef BOWTIE_INDEX_TYPES_H\n#define\tBOWTIE_INDEX_TYPES_H\n\n#ifdef BOWTIE_64BIT_INDEX\n#define OFF_MASK 0xffffffffffffffff\n#define OFF_LEN_MASK 0xc000000000000000\n#define LS_SIZE 0x100000000000000\n#define OFF_SIZE 8\n\ntypedef uint64_t TIndexOffU;\ntypedef int64_t TIndexOff;\n    \n#else\n#define OFF_MASK 0xffffffff\n#define OFF_LEN_MASK 0xc0000000\n#define LS_SIZE 0x10000000\n#define OFF_SIZE 4\n\ntypedef uint32_t TIndexOffU;\ntypedef int TIndexOff;\n\n#endif /* BOWTIE_64BIT_INDEX */\n\nextern const std::string gEbwt_ext;\n\n#endif\t/* BOWTIE_INDEX_TYPES_H */\n\n"
  },
  {
    "path": "ccnt_lut.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <stdint.h>\n\n/* Generated by gen_lookup_tables.pl */\n\nuint8_t cCntLUT_4[4][4][256];\nuint8_t cCntLUT_4_rev[4][4][256];\n\nint countCnt(int by, int c, uint8_t str) {\n    int count = 0;\n    if(by == 0) by = 4;\n    while(by-- > 0) {\n        int c2 = str & 3;\n        str >>= 2;\n        if(c == c2) count++;\n    }\n    \n    return count;\n}\n\nint countCnt_rev(int by, int c, uint8_t str) {\n    int count = 0;\n    if(by == 0) by = 4;\n    while(by-- > 0) {\n        int c2 = (str >> 6) & 3;\n        str <<= 2;\n        if(c == c2) count++;\n    }\n    \n    return count;\n}\n\nvoid initializeCntLut() {\n    for(int by = 0; by < 4; by++) {\n        for(int c = 0; c < 4; c++) {\n            for(int str = 0; str < 256; str++) {\n                cCntLUT_4[by][c][str] = countCnt(by, c, str);\n                cCntLUT_4_rev[by][c][str] = countCnt_rev(by, c, str);\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "centrifuge",
    "content": "#!/usr/bin/env perl\n\n#\n# Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n#\n# This file is part of Centrifuge, which is copied and modified from bowtie2 in\n# the Bowtie2 package.\n#\n# Centrifuge is free software: you can redistribute it and/or modify\n# it under the terms of the GNU General Public License as published by\n# the Free Software Foundation, either version 3 of the License, or\n# (at your option) any later version.\n#\n# Centrifuge is distributed in the hope that it will be useful,\n# but WITHOUT ANY WARRANTY; without even the implied warranty of\n# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n# GNU General Public License for more details.\n#\n# You should have received a copy of the GNU General Public License\n# along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.\n#\n\n# centrifuge:\n#\n# A wrapper script for centrifuge.  Provides various advantages over running\n# centrifuge directly, including:\n#\n# 1. Handling compressed inputs\n# 2. Redirecting output to various files\n# 3. Output directly to bam (not currently supported)\n\nuse strict;\nuse warnings;\nuse Getopt::Long qw(GetOptions);\nuse File::Spec;\nuse POSIX;\n\n\nmy ($vol,$script_path,$prog);\n$prog = File::Spec->rel2abs( __FILE__ );\n\nwhile (-f $prog && -l $prog){\n    my (undef, $dir, undef) = File::Spec->splitpath($prog);\n    $prog = File::Spec->rel2abs(readlink($prog), $dir);\n}\n\n($vol,$script_path,$prog) \n                = File::Spec->splitpath($prog);\nmy $os_is_nix   = ($^O eq \"linux\") || ($^O eq \"darwin\");\nmy $align_bin_s = $os_is_nix ? 'centrifuge-class' : 'centrifuge-class.exe'; \nmy $build_bin   = $os_is_nix ? 
'centrifuge-build' : 'centrifuge-build.exe';               \nmy $align_prog  = File::Spec->catpath($vol,$script_path,'centrifuge-class');\nmy $idx_ext       = 'hc'; \nmy %signo       = ();\nmy @signame     = ();\n\n{\n\t# Get signal info\n\tuse Config;\n\tmy $i = 0;\n\tfor my $name (split(' ', $Config{sig_name})) {\n\t\t$signo{$name} = $i;\n\t\t$signame[$i] = $name;\n\t\t$i++;\n\t}\n}\n\n(-x \"$align_prog\") ||\n\tFail(\"Expected centrifuge to be in same directory with centrifuge-class:\\n$script_path\\n\");\n\n# Get description of arguments from Centrifuge so that we can distinguish Centrifuge\n# args from wrapper args\nsub getBt2Desc($) {\n\tmy $d = shift;\n\tmy $cmd = \"$align_prog --wrapper basic-0 --arg-desc\";\n\topen(my $fh, \"$cmd |\") || Fail(\"Failed to run command '$cmd'\\n\");\n\twhile(readline $fh) {\n\t\tchomp;\n\t\tnext if /^\\s*$/;\n\t\tmy @ts = split(/\\t/);\n\t\t$d->{$ts[0]} = $ts[1];\n\t}\n\tclose($fh);\n\t$? == 0 || Fail(\"Description of arguments failed!\\n\");\n}\n\nmy %desc = ();\nmy %wrapped = (\"1\" => 1, \"2\" => 1);\ngetBt2Desc(\\%desc);\n\n# Given an option like -1, determine whether it's wrapped (i.e. 
should be\n# handled by this script rather than being passed along to Centrifuge)\nsub isWrapped($) { return defined($wrapped{$_[0]}); }\n\nmy @orig_argv = @ARGV;\n\nmy @bt2w_args = (); # options for wrapper\nmy @bt2_args  = (); # options for Centrifuge\nmy $saw_dd = 0;\nfor(0..$#ARGV) {\n\tif($ARGV[$_] eq \"--\") {\n\t\t$saw_dd = 1;\n\t\tnext;\n\t}\n\tpush @bt2w_args, $ARGV[$_] if !$saw_dd;\n\tpush @bt2_args,  $ARGV[$_] if  $saw_dd;\n}\nif(!$saw_dd) {\n\t@bt2_args = @bt2w_args;\n\t@bt2w_args= ();\n}\n\nmy $debug = 0;\nmy %read_fns = ();\nmy %read_compress = ();\nmy $cap_out = undef;       # Filename for passthrough\nmy $no_unal = 0;\nmy $large_idx = 0;\n\n# Variables handling the output format\nmy $outputFmtSam = 0 ;\nmy $tabFmtOptIdx = 0 ;\nmy $needReadSeq = 0 ;\nmy $removeSeqCols = 0 ;\n\n# Variables handling multiple samples\nmy $sampleSheetFile = undef ;\nmy $otherInput = undef ;\nmy @sampleSheet ;\nmy @sampleSheetOrder ; # maps to the order in which centrifuge processes the samples.\n\n# Remove whitespace\nfor my $i (0..$#bt2_args) {\n\t$bt2_args[$i]=~ s/^\\s+//; $bt2_args[$i] =~ s/\\s+$//;\n}\n\n# We've handled arguments that the user has explicitly directed either to the\n# wrapper or to centrifuge, now we capture some of the centrifuge arguments that\n# ought to be handled in the wrapper\nfor(my $i = 0; $i < scalar(@bt2_args); $i++) {\n\tnext unless defined($bt2_args[$i]);\n\tmy $arg = $bt2_args[$i];\n\tmy @args = split(/=/, $arg);\n\tif(scalar(@args) > 2) {\n\t\t$args[1] = join(\"=\", @args[1..$#args]);\n\t}\n\t$arg = $args[0];\n\n\tif ( $arg eq \"--sample-sheet\" )\n\t{\n\t\t$bt2_args[$i] = undef ;\n\t\t++$i ;\n\t\t$sampleSheetFile = $bt2_args[ $i ] ;\n\t\t$bt2_args[$i] = undef ;\n\t}\n\n\tif($arg eq \"-U\" || $arg eq \"--unpaired\") {\n\t\t$bt2_args[$i] = undef;\n\t\t$arg =~ s/^-U//; $arg =~ s/^--unpaired//;\n\t\tif($arg ne \"\") {\n\t\t\t# Argument was part of this token\n\t\t\tmy @args = split(/,/, $arg);\n\t\t\tfor my $a (@args) { push @bt2w_args, (\"-U\", $a); 
}\n\t\t} else {\n\t\t\t# Argument is in the next token\n\t\t\t$i < scalar(@bt2_args)-1 || Fail(\"Argument expected in next token!\\n\");\n\t\t\t$i++;\n\t\t\tmy @args = split(/,/, $bt2_args[$i]);\n\t\t\tfor my $a (@args) { push @bt2w_args, (\"-U\", $a); }\n\t\t\t$bt2_args[$i] = undef;\n\t\t}\n\t\t$otherInput = 1 ;\n\t}\n\tif($arg =~ /^--?([12])/ && $arg !~ /^--?12/) {\n\t\tmy $mate = $1;\n\t\t$bt2_args[$i] = undef;\n\t\t$arg =~ s/^--?[12]//;\n\t\tif($arg ne \"\") {\n\t\t\t# Argument was part of this token\n\t\t\tmy @args = split(/,/, $arg);\n\t\t\tfor my $a (@args) { push @bt2w_args, (\"-$mate\", $a); }\n\t\t} else {\n\t\t\t# Argument is in the next token\n\t\t\t$i < scalar(@bt2_args)-1 || Fail(\"Argument expected in next token!\\n\");\n\t\t\t$i++;\n\t\t\tmy @args = split(/,/, $bt2_args[$i]);\n\t\t\tfor my $a (@args) { push @bt2w_args, (\"-$mate\", $a); }\n\t\t\t$bt2_args[$i] = undef;\n\t\t}\n\t\t$otherInput = 1 ;\n\t}\n\tif($arg eq \"--debug\") {\n\t\t$debug = 1;\n\t\t$bt2_args[$i] = undef;\n\t}\n\tif($arg eq \"--no-unal\") {\n\t\t$no_unal = 1;\n\t\t$bt2_args[$i] = undef;\n\t}\n\tif($arg eq \"--large-index\") {\n\t\t$large_idx = 1;\n\t\t$bt2_args[$i] = undef;\n\t}\n\tfor my $rarg (\"un-conc\", \"al-conc\", \"un\", \"al\") {\n\t\tif($arg =~ /^--${rarg}$/ || $arg =~ /^--${rarg}-gz$/ || $arg =~ /^--${rarg}-bz2$/) {\n\t\t\t$needReadSeq = 1 ;\n\t\t\t$bt2_args[$i] = undef;\n\t\t\tif(scalar(@args) > 1 && $args[1] ne \"\") {\n\t\t\t\t$read_fns{$rarg} = $args[1];\n\t\t\t} else {\n\t\t\t\t$i < scalar(@bt2_args)-1 || Fail(\"--${rarg}* option takes an argument.\\n\");\n\t\t\t\t$read_fns{$rarg} = $bt2_args[$i+1];\n\t\t\t\t$bt2_args[$i+1] = undef;\n\t\t\t}\n\t\t\t$read_compress{$rarg} = \"\";\n\t\t\t$read_compress{$rarg} = \"gzip\"  if $arg eq \"--${rarg}-gz\";\n\t\t\t$read_compress{$rarg} = \"bzip2\" if $arg eq \"--${rarg}-bz2\";\n\t\t\tlast;\n\t\t}\n\t}\n\tif ($arg eq \"--out-fmt\" )\n\t{\n\t\t$i < scalar(@bt2_args)-1 || Fail(\"Argument expected in next 
token!\\n\");\n\t\t$i++;\n\t\tif ( $bt2_args[$i] eq \"sam\" ) \n\t\t{\n\t\t\t$outputFmtSam = 1 ;\n\t\t}\n\t\t#$bt2_args[$i] = undef;\n\n\t}\n\n\tif ( $arg eq \"--tab-fmt-cols\" )\n\t{\n\t\t$i < scalar(@bt2_args)-1 || Fail(\"Argument expected in next token!\\n\");\n\t\t$tabFmtOptIdx = $i + 1 ;\n\t}\n}\n\n# process the sample sheet\nif ( defined $sampleSheetFile )\n{\n\tFail( \"Can not specify other read files by -U,-1,-2 when using --sample-sheet!\\n\" ) if ( defined $otherInput ) ;\n\tpush @bt2_args, \"--separator\" ;\n\n\topen FP1, $sampleSheetFile or Fail( \"Could not open sample sheet file \".$sampleSheetFile.\"\\n\" ) ;\n\tmy $singleCnt = 0 ;\n\tmy $mateCnt = 0 ;\n\twhile ( <FP1> )\n\t{\n\t\tchomp ;\n\t\tmy @cols = split /\\t/, $_ ;\n\t\tif ( scalar( @cols ) != 5 ) \n\t\t{\n\t\t\tFail( \"The line in sample sheet file has wrong format: \".$_.\"!\\n\" ) ;\n\t\t}\n\n\t\t++$singleCnt if ( $cols[0] == 1 ) ;\n\t\t++$mateCnt if ( $cols[0] == 2 ) ;\n\n\t\tpush @sampleSheet, \\@cols ;\n\t}\n\t\n\t# push the files to centrifuge arguments\n\tmy $i ;\n\tmy $ssCnt = scalar( @sampleSheet ) ;\n\tfor ( $i = 0 ; $i < $ssCnt ; ++$i )\n\t{\n\t\tmy @cols = @{ $sampleSheet[$i] } ;\n\t\tif ( $cols[0] == 1 )\n\t\t{\n\t\t\tpush @bt2w_args, ( \"-U\", $cols[1] ) ;\n\t\t}\n\t\telsif ( $cols[0] == 2 )\n\t\t{\n\t\t\tpush @bt2w_args, ( \"-1\", $cols[1], \"-2\", $cols[2] ) ;\n\t\t}\n\t\t\n\t}\n\t\n\tmy @fileType = ( 2, 1 ) ;\n\tmy $idx = 0 ;\n\tfor my $f (@fileType )\n\t{\n\t\tfor ( $i = 0 ; $i < scalar( $ssCnt ) ; ++$i )\n\t\t{\n\t\t\tif ( @{ $sampleSheet[ $i ] }[0] == $f )\n\t\t\t{\n\t\t\t\t$sampleSheetOrder[ $idx ] = $i ;\n\t\t\t\t++$idx ;\n\t\t\t}\n\t\t}\n\t}\n\n\tclose FP1 ;\n}\n\n# Determine whether we need to add two extra columns for seq and qual to out-fmt\nif ( $needReadSeq == 1 ) #&& ( $tabFmtOptIdx == 0 || $outputFmtSam == 1 ) )\n{\n\tmy $i ;\n\tmy $needAdd = 1 ;\n\tif ( $tabFmtOptIdx != 0 )\n\t{\n\t\tmy @cols = split /,/, $bt2_args[ $tabFmtOptIdx ] ;\n\t\tforeach my $f 
(@cols)\n\t\t{\n\t\t\tif ( $f eq \"readSeq\" )\n\t\t\t{\n\t\t\t\t$needAdd = 0 ;\t\t\n\t\t\t\tlast ;\n\t\t\t}\n\t\t}\n\n\t}\n\telse\n\t{\n\t\tpush @bt2_args, \"--tab-fmt-cols\" ;\n\t\tpush @bt2_args, \"readID,seqID,taxID,score,2ndBestScore,hitLength,queryLength,numMatches\" ;\n\t\t$tabFmtOptIdx = scalar( @bt2_args ) - 1 ;\n\t}\n\t\n\tif ( $needAdd )\n\t{\n\t\t$removeSeqCols = 1 ;\n\t\t$bt2_args[ $tabFmtOptIdx ] .= \",readSeq,readQual\" ; \n\t}\n}\n\n# If the user asked us to redirect some reads to files, or to suppress\n# unaligned reads, then we need to capture the output from Centrifuge and pass it\n# through this wrapper.\nmy $passthru = 0;\nif(scalar(keys %read_fns) > 0 || $no_unal || defined $sampleSheetFile ) {\n\t$passthru = 1;\n\tpush @bt2_args, \"--passthrough\";\n\t$cap_out = \"-\";\n\tfor(my $i = 0; $i < scalar(@bt2_args); $i++) {\n\t\tnext unless defined($bt2_args[$i]);\n\t\tmy $arg = $bt2_args[$i];\n\t\tif($arg eq \"-S\" || $arg eq \"--output\") {\n\t\t\t$i < scalar(@bt2_args)-1 || Fail(\"-S/--output takes an argument.\\n\");\n\t\t\t$cap_out = $bt2_args[$i+1];\n\t\t\t$bt2_args[$i] = undef;\n\t\t\t$bt2_args[$i+1] = undef;\n\t\t}\n\t}\n}\nmy @tmp = ();\nfor (@bt2_args) { push(@tmp, $_) if defined($_); }\n@bt2_args = @tmp;\n\nmy @unps = ();\nmy @mate1s = ();\nmy @mate2s = ();\nmy @to_delete = ();\nmy $temp_dir = \"/tmp\";\nmy $bam_out = 0;\nmy $ref_str = undef;\nmy $no_pipes = 0;\nmy $keep = 0;\nmy $verbose = 0;\nmy $readpipe = undef;\nmy $log_fName = undef;\nmy $help = 0;\nmy $createNewPipe = 0  ;\n\nmy @bt2w_args_cp = (@bt2w_args>0) ? 
@bt2w_args : @bt2_args;\nGetopt::Long::Configure(\"pass_through\",\"no_ignore_case\");\n\nmy @old_ARGV = @ARGV;\n@ARGV = @bt2w_args_cp;\n\nGetOptions(\n\t\"1=s\"                           => \\@mate1s,\n\t\"2=s\"                           => \\@mate2s,\n\t\"reads|U=s\"                     => \\@unps,\n\t\"temp-directory=s\"              => \\$temp_dir,\n\t\"bam\"                           => \\$bam_out,\n\t\"no-named-pipes\"                => \\$no_pipes,\n\t\"ref-string|reference-string=s\" => \\$ref_str,\n\t\"keep\"                          => \\$keep,\n\t\"verbose\"                       => \\$verbose,\n\t\"log-file=s\"                    => \\$log_fName,\n\t\"help|h\"                        => \\$help\n);\n\n@ARGV = @old_ARGV;\n\nmy $old_stderr;\n\nif ($log_fName) {\n    open($old_stderr, \">&STDERR\") or Fail(\"Cannot dup STDERR!\\n\");\n    open(STDERR, \">\", $log_fName) or Fail(\"Cannot redirect to log file $log_fName.\\n\");\n}\n\nInfo(\"Before arg handling:\\n\");\nInfo(\"  Wrapper args:\\n[ @bt2w_args ]\\n\");\nInfo(\"  Binary args:\\n[ @bt2_args ]\\n\");\n\nsub cat_file($$) {\n\tmy ($ifn, $ofh) = @_;\n\tmy $ifh = undef;\n\tif($ifn =~ /\\.gz$/) {\n\t\topen($ifh, \"gzip -dc $ifn |\") ||\n\t\t\t Fail(\"Could not open gzipped read file: $ifn \\n\");\n\t} elsif($ifn =~ /\\.bz2/) {\n\t\topen($ifh, \"bzip2 -dc $ifn |\") ||\n\t\t\tFail(\"Could not open bzip2ed read file: $ifn \\n\");\n\t} else {\n\t\topen($ifh, $ifn) || Fail(\"Could not open read file: $ifn \\n\");\n\t}\n\twhile(readline $ifh) { print {$ofh} $_; }\n\tclose($ifh);\n}\n\n# Return non-zero if and only if the input should be wrapped (i.e. 
because\n# it's compressed).\nsub wrapInput($$$) {\n\tmy ($unps, $mate1s, $mate2s) = @_;\n\tfor my $fn (@$unps, @$mate1s, @$mate2s) {\n\t\treturn 1 if $fn =~ /\\.gz$/ || $fn =~ /\\.bz2$/;\n\t}\n\treturn 0;\n}\n\nsub Info {\n    if ($verbose) {\n        print STDERR \"(INFO): \" ,@_;\n    }\n}\n\nsub Error {\n    my @msg = @_;\n    $msg[0] = \"(ERR): \".$msg[0];\n    printf STDERR @msg;\n}\n\nsub Fail {\n    Error(@_);\n    die(\"Exiting now ...\\n\");    \n}\n\nsub Extract_IndexName_From {\n    my $index_opt = $ref_str ? '--index' : '-x';\n    for (my $i=0; $i<@_; $i++) {\n        if ($_[$i] eq $index_opt){\n            return $_[$i+1];\n        }\n    }\n    Info(\"Cannot find any index option (--reference-string, --ref-string or -x) in the given command line.\\n\");    \n}\n\nif(wrapInput(\\@unps, \\@mate1s, \\@mate2s)) {\n\tif ( defined $sampleSheetFile )\n\t{\n\t\t$createNewPipe = 1 ;\n\t\tif(scalar(@mate2s) > 0) {\n\t\t\t#\n\t\t\t# Wrap paired-end inputs\n\t\t\t#\n\t\t\t# Put reads into temporary files or fork off processes to feed named pipes\n\t\t\tscalar(@mate2s) == scalar(@mate1s) ||\n\t\t\t\tFail(\"Different number of files specified with --reads/-1 as with -2\\n\");\n\t\t\tfor ( my $i = 0 ; $i < scalar( @mate2s ) ; ++$i )\n\t\t\t{\n\t\t\t\t# Make a named pipe for delivering mate #1s\n\t\t\t\tmy $m1fn = \"$temp_dir/$$\".\"_$i\".\"inpipe1\";\n\t\t\t\tpush @to_delete, $m1fn;\n\t\t\t\tpush @bt2_args, \"-1 $m1fn\";\n\t\t\t\tif(!$no_pipes) {\n\t\t\t\t\tmkfifo($m1fn, 0700) || Fail(\"mkfifo($m1fn) failed.\\n\");\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\t# Make a named pipe for delivering mate #2s\n\t\t\t\tmy $m2fn = \"$temp_dir/$$.\".\"_$i\".\"inpipe2\";\n\t\t\t\tpush @to_delete, $m2fn;\n\t\t\t\tpush @bt2_args, \"-2 $m2fn\";\n\t\t\t\tif(!$no_pipes) {\n\t\t\t\t\tmkfifo($m2fn, 0700) || Fail(\"mkfifo($m2fn) failed.\\n\");\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(scalar(@unps) > 0) {\n\t\t\t#\n\t\t\t# Wrap unpaired inputs.\n\t\t\t#\n\n\t\t\tfor ( my $i = 0 ; $i < scalar( @unps ) ; 
++$i )\n\t\t\t{\n\t\t\t\t# Make a named pipe for delivering unpaired reads\n\t\t\t\tmy $ufn = \"$temp_dir/$$\".\"_$i\".\".unp\";\n\t\t\t\tpush @to_delete, $ufn;\n\t\t\t\tpush @bt2_args, \"-U $ufn\";\n\t\t\t\tif(!$no_pipes) {\n\t\t\t\t\tmkfifo($ufn, 0700) || Fail(\"mkfifo($ufn) failed.\\n\");\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\telse\n\t{\n\t\tif(scalar(@mate2s) > 0) {\n\t\t\t#\n\t\t\t# Wrap paired-end inputs\n\t\t\t#\n\t\t\t# Put reads into temporary files or fork off processes to feed named pipes\n\t\t\tscalar(@mate2s) == scalar(@mate1s) ||\n\t\t\t\tFail(\"Different number of files specified with --reads/-1 as with -2\\n\");\n\t\t\t# Make a named pipe for delivering mate #1s\n\t\t\tmy $m1fn = \"$temp_dir/$$.inpipe1\";\n\t\t\tpush @to_delete, $m1fn;\n\t\t\tpush @bt2_args, \"-1 $m1fn\";\n\t\t\t# Create named pipe 1 for writing\n\t\t\tif(!$no_pipes) {\n\t\t\t\tmkfifo($m1fn, 0700) || Fail(\"mkfifo($m1fn) failed.\\n\");\n\t\t\t}\n\t\t\tmy $pid = 0;\n\t\t\t$pid = fork() unless $no_pipes;\n\t\t\tif($pid == 0) {\n\t\t\t\t# Open named pipe 1 for writing\n\t\t\t\topen(my $ofh, \">$m1fn\") || Fail(\"Can't open '$m1fn' for writing\\n\");\n\t\t\t\tfor my $ifn (@mate1s) { cat_file($ifn, $ofh); }\n\t\t\t\tclose($ofh);\n\t\t\t\texit 0 unless $no_pipes;\n\t\t\t}\n\t\t\t# Make a named pipe for delivering mate #2s\n\t\t\tmy $m2fn = \"$temp_dir/$$.inpipe2\";\n\t\t\tpush @to_delete, $m2fn;\n\t\t\tpush @bt2_args, \"-2 $m2fn\";\n\t\t\t# Create named pipe 2 for writing\n\t\t\tif(!$no_pipes) {\n\t\t\t\tmkfifo($m2fn, 0700) || Fail(\"mkfifo($m2fn) failed.\\n\");\n\t\t\t}\n\t\t\t$pid = 0;\n\t\t\t$pid = fork() unless $no_pipes;\n\t\t\tif($pid == 0) {\n\t\t\t\t# Open named pipe 2 for writing\n\t\t\t\topen(my $ofh, \">$m2fn\") || Fail(\"Can't open '$m2fn' for writing.\\n\");\n\t\t\t\tfor my $ifn (@mate2s) { cat_file($ifn, $ofh); }\n\t\t\t\tclose($ofh);\n\t\t\t\texit 0 unless $no_pipes;\n\t\t\t}\n\t\t}\n\t\tif(scalar(@unps) > 0) {\n\t\t\t#\n\t\t\t# Wrap unpaired inputs.\n\t\t\t#\n\t\t\t# Make a 
named pipe for delivering unpaired reads\n\t\t\tmy $ufn = \"$temp_dir/$$.unp\";\n\t\t\tpush @to_delete, $ufn;\n\t\t\tpush @bt2_args, \"-U $ufn\";\n\t\t\t# Create named pipe 2 for writing\n\t\t\tif(!$no_pipes) {\n\t\t\t\tmkfifo($ufn, 0700) || Fail(\"mkfifo($ufn) failed.\\n\");\n\t\t\t}\n\t\t\tmy $pid = 0;\n\t\t\t$pid = fork() unless $no_pipes;\n\t\t\tif($pid == 0) {\n\t\t\t\t# Open named pipe 2 for writing\n\t\t\t\topen(my $ofh, \">$ufn\") || Fail(\"Can't open '$ufn' for writing.\\n\");\n\t\t\t\tfor my $ifn (@unps) { cat_file($ifn, $ofh); }\n\t\t\t\tclose($ofh);\n\t\t\t\texit 0 unless $no_pipes;\n\t\t\t}\n\t\t}\n\t}\n} else {\n\tif(scalar(@mate2s) > 0) {\n\t\t# Just pass all the mate arguments along to the binary\n\t\tpush @bt2_args, (\"-1\", join(\",\", @mate1s));\n\t\tpush @bt2_args, (\"-2\", join(\",\", @mate2s));\n\t}\n\tif(scalar(@unps) > 0) {\n\t\tpush @bt2_args, (\"-U\", join(\",\", @unps));\n\t}\n}\n\nif(defined($ref_str)) {\n\tmy $ofn = \"$temp_dir/$$.ref_str.fa\";\n\topen(my $ofh, \">$ofn\") ||\n\t\tFail(\"could not open temporary fasta file '$ofn' for writing.\\n\");\n\tprint {$ofh} \">1\\n$ref_str\\n\";\n\tclose($ofh);\n\tpush @to_delete, $ofn;\n\tsystem(\"$build_bin $ofn $ofn\") == 0 ||\n\t\tFail(\"centrifuge-build returned non-0 exit level.\\n\");\n\tpush @bt2_args, (\"--index\", \"$ofn\");\n\tpush @to_delete, (\"$ofn.1.\".$idx_ext, \"$ofn.2.\".$idx_ext, \n\t                  \"$ofn.3.\".$idx_ext, \"$ofn.4.\".$idx_ext,\n\t\t\t  \"$ofn.5.\".$idx_ext, \"$ofn.6.\".$idx_ext,\n\t                  \"$ofn.rev.1.\".$idx_ext, \"$ofn.rev.2.\".$idx_ext,\n\t\t\t  \"$ofn.rev.5.\".$idx_ext, \"$ofn.rev.6.\".$idx_ext);\n}\n\nInfo(\"After arg handling:\\n\");\nInfo(\"  Binary args:\\n[ @bt2_args ]\\n\");\n\nmy $index_name = Extract_IndexName_From(@bt2_args);\n\nmy $debug_str = ($debug ? 
\"-debug\" : \"\");\n\n# Construct command invoking centrifuge-class\nmy $cmd = \"$align_prog$debug_str --wrapper basic-0 \".join(\" \", @bt2_args);\n\n# Possibly add read input on an anonymous pipe\n$cmd = \"$readpipe $cmd\" if defined($readpipe);\n\n# The function removes the two extra columns that we added to get the read seq and qual\nsub RemoveSeqCols\n{\n\tmy $line = $_[0] ;\n\tmy @cols = split /\\t/, $line ;\n\tpop @cols ;\n\tpop @cols ;\n\tmy $tab = \"\\t\" ;\n\treturn join( $tab, @cols ) ;\n}\n\nInfo(\"$cmd\\n\");\nmy $ret;\n\nsub GetSampleSheetOutputFileName\n{\n\tmy $sampleIdx = $_[0] ;\n\treturn @{ $sampleSheet[ $sampleSheetOrder[ $sampleIdx ] ] }[3] ;\n\n}\n\nsub GetSampleSheetReportFileName\n{\n\tmy $sampleIdx = $_[0] ;\n\treturn @{ $sampleSheet[ $sampleSheetOrder[ $sampleIdx ] ] }[4] ;\n}\n\nsub WritePipe\n{\n\tmy $i = $_[0] ;\n\tif ( $i < scalar( @mate2s ) )\n\t{\n\t\tmy $m1fn = \"$temp_dir/$$\".\"_$i\".\"inpipe1\";\n\n\t\t# Create named pipe 1 for writing\n\t\t#if(!$no_pipes) {\n\t\t#\tmkfifo($m1fn, 0700) || Fail(\"mkfifo($m1fn) failed.\\n\");\n\t\t#}\n\t\tmy $pid = 0;\n\t\t$pid = fork() unless $no_pipes;\n\t\tif($pid == 0) {\n\t\t# Open named pipe 1 for writing\n\t\t\topen(my $ofh, \">$m1fn\") || Fail(\"Can't open '$m1fn' for writing\\n\");\n\t\t\tcat_file($mate1s[$i], $ofh); \n\t\t\tclose($ofh);\n\t\t\texit 0 unless $no_pipes;\n\t\t}\n\n\t\tmy $m2fn = \"$temp_dir/$$.\".\"_$i\".\"inpipe2\";\n\n\t\t#if(!$no_pipes) {\n\t\t#\tmkfifo($m2fn, 0700) || Fail(\"mkfifo($m2fn) failed.\\n\");\n\t\t#}\n\t\t$pid = 0;\n\t\t$pid = fork() unless $no_pipes;\n\t\tif($pid == 0) {\n\t\t\t# Open named pipe 2 for writing\n\t\t\topen(my $ofh, \">$m2fn\") || Fail(\"Can't open '$m2fn' for writing.\\n\");\n\t\t\tcat_file( $mate2s[$i], $ofh); \n\t\t\tclose($ofh);\n\t\t\texit 0 unless $no_pipes;\n\t\t}\n\t}\n\telsif ( $i < scalar( @mate2s ) + scalar( @unps ) )\n\t{\n\t\t$i -= scalar( @mate2s ) ;\n\t\tmy $ufn = \"$temp_dir/$$\".\"_$i\".\".unp\";\n\t\t#if(!$no_pipes) 
{\n\t\t#\tmkfifo($ufn, 0700) || Fail(\"mkfifo($ufn) failed.\\n\");\n\t\t#}\n\t\tmy $pid = 0;\n\t\t$pid = fork() unless $no_pipes;\n\t\tif($pid == 0) {\n\t\t\t# Open named pipe 2 for writing\n\t\t\topen(my $ofh, \">$ufn\") || Fail(\"Can't open '$ufn' for writing.\\n\");\n\t\t\tcat_file($unps[$i], $ofh); \n\t\t\tclose($ofh);\n\t\t\texit 0 unless $no_pipes;\n\t\t}\n\n\t}\n}\n\nWritePipe( 0 ) if ( $createNewPipe ) ;\nif(defined($cap_out)) {\n\t# Open Centrifuge pipe\n\topen(BT, \"$cmd |\") || Fail(\"Could not open Centrifuge pipe: '$cmd |'\\n\");\n\t# Open output pipe\n\tmy $ofh = *STDOUT;\n\t\n\tmy $sampleIdx = 0 ;\n\tif ( defined $sampleSheetFile )\n\t{\n\t\tmy $file = GetSampleSheetOutputFileName( $sampleIdx ) ;\n\t\topen ( $ofh, \">$file\" ) || Fail( \"Could not open output file '$file' for writing.\\n\" ) ;\n\t}\n\telse\n\t{\n\t\tif($cap_out ne \"-\") {\n\t\t\topen($ofh, \">$cap_out\") ||\n\t\t\t\tFail(\"Could not open output file '$cap_out' for writing.\\n\");\n\t\t}\n\t}\n\n\tmy @fhs_to_close = ();\n\tmy %read_fhs = ();\n\tfor my $i (\"al\", \"un\", \"al-conc\", \"un-conc\") {\n\t\tif(defined($read_fns{$i})) {\n\t\t\tmy ($vol, $base_spec_dir, $base_fname) = File::Spec->splitpath($read_fns{$i});\n\t\t\tif (-d $read_fns{$i}) {\n\t\t\t\t$base_spec_dir = $read_fns{$i};\n\t\t\t\t$base_fname = undef;\n\t\t\t}\n\t\t\tif($i =~ /-conc$/) {\n# Open 2 output files, one for mate 1, one for mate 2\n\t\t\t\tmy ($fn1, $fn2);\n\t\t\t\tif ($base_fname) {\n\t\t\t\t\t($fn1, $fn2) = ($base_fname,$base_fname);\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\t($fn1, $fn2) = ($i.'-mate',$i.'-mate');\n\t\t\t\t}\n\t\t\t\tif($fn1 =~ /%/) {\n\t\t\t\t\t$fn1 =~ s/%/1/g; $fn2 =~ s/%/2/g;\n\t\t\t\t} elsif($fn1 =~ /\\.[^.]*$/) {\n\t\t\t\t\t$fn1 =~ s/\\.([^.]*)$/.1.$1/;\n\t\t\t\t\t$fn2 =~ s/\\.([^.]*)$/.2.$1/;\n\t\t\t\t} else {\n\t\t\t\t\t$fn1 .= \".1\";\n\t\t\t\t\t$fn2 .= \".2\";\n\t\t\t\t}\n\t\t\t\t$fn1 = File::Spec->catpath($vol,$base_spec_dir,$fn1);\n\t\t\t\t$fn2 = 
File::Spec->catpath($vol,$base_spec_dir,$fn2);\n\t\t\t\t$fn1 ne $fn2 || Fail(\"$fn1\\n$fn2\\n\");\n\t\t\t\tmy ($redir1, $redir2) = (\">$fn1\", \">$fn2\");\n\t\t\t\t$redir1 = \"| gzip -c $redir1\"  if $read_compress{$i} eq \"gzip\";\n\t\t\t\t$redir1 = \"| bzip2 -c $redir1\" if $read_compress{$i} eq \"bzip2\";\n\t\t\t\t$redir2 = \"| gzip -c $redir2\"  if $read_compress{$i} eq \"gzip\";\n\t\t\t\t$redir2 = \"| bzip2 -c $redir2\" if $read_compress{$i} eq \"bzip2\";\n\t\t\t\topen($read_fhs{$i}{1}, $redir1) || Fail(\"Could not open --$i mate-1 output file '$fn1'\\n\");\n\t\t\t\topen($read_fhs{$i}{2}, $redir2) || Fail(\"Could not open --$i mate-2 output file '$fn2'\\n\");\n\t\t\t\tpush @fhs_to_close, $read_fhs{$i}{1};\n\t\t\t\tpush @fhs_to_close, $read_fhs{$i}{2};\n\t\t\t} else {\n\t\t\t\tmy $redir = \">\".File::Spec->catpath($vol,$base_spec_dir,$i.\"-seqs\");\n\t\t\t\tif ($base_fname) {\n\t\t\t\t\t$redir = \">$read_fns{$i}\";\n\t\t\t\t}\n\t\t\t\t$redir = \"| gzip -c $redir\"  if $read_compress{$i} eq \"gzip\";\n\t\t\t\t$redir = \"| bzip2 -c $redir\" if $read_compress{$i} eq \"bzip2\";\n\t\t\t\topen($read_fhs{$i}, $redir) || Fail(\"Could not open --$i output file '$read_fns{$i}'\\n\");\n\t\t\t\tpush @fhs_to_close, $read_fhs{$i};\n\t\t\t}\n\t\t}\n\t}\n\n\tmy $seqIndex = -1 ;\n\tmy $qualIndex = -1 ;\n\tmy $readIdIndex = -1 ;\n\tmy $outputHeader = \"\" ;\n\n\tif ( scalar(keys %read_fhs) != 0 )\n\t{\n\t\tif ( $outputFmtSam == 0 )\n\t\t{\n\t\t\twhile ( 1 )\n\t\t\t{\n\t\t\t\t$outputHeader = <BT> ;\n\t\t\t\tlast if ( $outputHeader =~ /readID/ || $outputFmtSam ) ;\n\t\t\t}\n\t\t\tmy @cols = split /\\t/, $outputHeader ;\n\t\t\tfor ( my $i = 0 ; $i < scalar( @cols ) ; ++$i )\n\t\t\t{\n\t\t\t\tif ( $cols[$i] =~ /readSeq/ )\n\t\t\t\t{\n\t\t\t\t\t$seqIndex = $i ;\n\t\t\t\t}\n\t\t\t\telsif ( $cols[$i] =~ /readQual/ )\n\t\t\t\t{\n\t\t\t\t\t$qualIndex = $i ;\n\t\t\t\t}\n\t\t\t\telsif ( $cols[$i] =~ /readID/ )\n\t\t\t\t{\n\t\t\t\t\t$readIdIndex = $i ;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif ( 
$seqIndex == -1 )\n\t\t\t{\n\t\t\t\tError( \"Must use readSeq in --tab-fmt-cols in order to output unaligned reads.\\n\" )  ;\n\t\t\t}\n\n\t\t\t$outputHeader = RemoveSeqCols( $outputHeader ).\"\\n\" if ( $removeSeqCols == 1 ) ;\n\t\t\tprint {$ofh} $outputHeader ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\t$seqIndex = 9 ;\n\t\t\t$qualIndex = 10 ;\n\t\t\t$readIdIndex = 0 ;\n\t\t}\n\t}\n\telse\n\t{\n\t\t$outputHeader = <BT> ;\n\t\tprint {$ofh} $outputHeader ;\n\t}\n\n\twhile(<BT>) {\n\t\tchomp;\n\t\tmy $filt = 0;\n\n\t\tif ( defined $sampleSheetFile )\n\t\t{\t\n\t\t\tif ( $_ eq \"#File_End_Here\" )\n\t\t\t{\n\t\t\t\tclose( $ofh ) ;\n\t\t\t\trename( \"centrifuge_report_$sampleIdx.tsv\", GetSampleSheetReportFileName( $sampleIdx ) ) ;\n\t\t\t\t\n\t\t\t\t++$sampleIdx ;\n\t\t\t\tif ( $sampleIdx < scalar( @sampleSheet ) )\n\t\t\t\t{\n\t\t\t\t\tmy $file = GetSampleSheetOutputFileName( $sampleIdx ) ;\n\t\t\t\t\topen ( $ofh, \">$file\" ) || Fail( \"Could not open output file '$file' for writing.\\n\" ) ;\n\t\t\t\t\tprint {$ofh} $outputHeader ;\n\t\t\t\t\t\n\t\t\t\t\tWritePipe( $sampleIdx) if ( $createNewPipe ) ;\n\t\t\t\t}\n\t\t\t\tnext ;\n\t\t\t}\n\t\t}\n\n\t\t# Bowtie 2 checked for the @ sign of SAM headers here; the TSV output uses # for comment lines.\n\t\tunless(substr($_, 0, 1) eq \"#\" ) {\n\t\t\t# If we are supposed to output certain reads to files...\n\t\t\t#my $tab1_i = index($_, \"\\t\") + 1;\n\t\t\t#my $tab2_i = index($_, \"\\t\", $tab1_i);\n\t\t\t#my $fl = substr($_, $tab1_i, $tab2_i - $tab1_i);\n\t\t\tmy $unal = 0 ;\n\t\t\tif ( /unclassified/ )\n\t\t\t{\n\t\t\t\t$unal = 1 ;\n\t\t\t}\n\t\t\t$filt = 1 if $no_unal && $unal;\n\t\t\tif($passthru) {\n\t\t\t\tif(scalar(keys %read_fhs) == 0 || $seqIndex == -1 ) {\n\t\t\t\t\t# Next line is read with some whitespace escaped\n\t\t\t\t\t# my $l = <BT>;\n\t\t\t\t} else {\n\t\t\t\t\tmy @cols = split /\\t/ ;\n\t\t\t\t\tmy $isPaired = 0 ;\n\t\t\t\t\tmy $pair = 0 ;\n\t\t\t\t\tif (  $cols[$seqIndex] =~ /_/ 
)\n\t\t\t\t\t{\n\t\t\t\t\t\t$pair = 1 ;\n\t\t\t\t\t}\n\t\t\t\t\tmy $unp = !$pair ;\n\n\t\t\t\t\t# Next line is read with some whitespace escaped\n\t\t\t\t\t#my $l = <BT>;\n\t\t\t\t\t#chomp($l);\n\t\t\t\t\t#$l =~ s/%(..)/chr(hex($1))/eg;\n\n\t\t\t\t\tif((defined($read_fhs{un}) || defined($read_fhs{al})) && $unp) {\n\t\t\t\t\t\tmy $l ;\n\t\t\t\t\t\tif ( $qualIndex != -1 )\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t$l = \"@\".$cols[ $readIdIndex ].\"\\n\".$cols[$seqIndex].\"\\n+\\n\".$cols[$qualIndex].\"\\n\" ;\n\t\t\t\t\t\t}\n\t\t\t\t\t\telse \n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t$l = \">\".$cols[ $readIdIndex ].\"\\n\".$cols[$seqIndex].\"\\n\" ;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif($unal) {\n\t\t\t\t\t\t\t# Failed to align\n\t\t\t\t\t\t\tprint {$read_fhs{un}} $l if defined($read_fhs{un});\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t# Aligned\n\t\t\t\t\t\t\tprint {$read_fhs{al}} $l if defined($read_fhs{al});\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tmy $warnedAboutLength = 0 ;\n\t\t\t\t\tif((defined($read_fhs{\"un-conc\"}) || defined($read_fhs{\"al-conc\"})) && $pair) {\n\t\t\t\t\t\tmy @seq = split /_/, $cols[$seqIndex] ;\n\t\t\t\t\t\tmy @qual = ( substr( $cols[$qualIndex], 0, length( $seq[0] ) ), substr( $cols[$qualIndex], length( $seq[0] ) + 1 ) ) ;\n\t\t\t\t\t\t\n\t\t\t\t\t\tmy $l1 ;\n\t\t\t\t\t\tmy $l2 ;\n\t\t\t\t\t\tif ( $qualIndex != -1 )\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t$l1 = \"@\".$cols[ $readIdIndex ].\"\\n\".$seq[0].\"\\n+\\n\".$qual[0].\"\\n\" ;\n\t\t\t\t\t\t}\n\t\t\t\t\t\telse \n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t$l1 = \">\".$cols[ $readIdIndex ].\"\\n\".$seq[0].\"\\n\" ;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif ( $qualIndex != -1 )\n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t$l2 = \"@\".$cols[ $readIdIndex ].\"\\n\".$seq[1].\"\\n+\\n\".$qual[1].\"\\n\" ;\n\t\t\t\t\t\t}\n\t\t\t\t\t\telse \n\t\t\t\t\t\t{\n\t\t\t\t\t\t\t$l2 = \">\".$cols[ $readIdIndex ].\"\\n\".$seq[1].\"\\n\" ;\n\t\t\t\t\t\t}\n\n\t\t\t\t\t\tif     ( !$unal) {\n\t\t\t\t\t\t\tprint {$read_fhs{\"al-conc\"}{1}} $l1 if 
defined($read_fhs{\"al-conc\"});\n\t\t\t\t\t\t\tprint {$read_fhs{\"al-conc\"}{2}} $l2 if defined($read_fhs{\"al-conc\"});\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tprint {$read_fhs{\"un-conc\"}{1}} $l1 if defined($read_fhs{\"un-conc\"});\n\t\t\t\t\t\t\tprint {$read_fhs{\"un-conc\"}{2}} $l2 if defined($read_fhs{\"un-conc\"});\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\t$_ = RemoveSeqCols( $_ ) if ( $removeSeqCols == 1 ) ;\n\t\tprint {$ofh} \"$_\\n\" if !$filt;\n\t}\n\tfor my $k (@fhs_to_close) { close($k); }\n\tclose($ofh);\n\tclose(BT);\n\t$ret = $?;\n} else {\n\t$ret = system($cmd);\n}\nif(!$keep) { for(@to_delete) { unlink($_); } }\n\nif ($ret == -1) {\n    Error(\"Failed to execute centrifuge-class: $!\\n\");\n\texit 1;\n} elsif ($ret & 127) {\n\tmy $signm = \"(unknown)\";\n\t$signm = $signame[$ret & 127] if defined($signame[$ret & 127]);\n\tmy $ad = \"\";\n\t$ad = \"(core dumped)\" if (($ret & 128) != 0);\n    Error(\"centrifuge-class died with signal %d (%s) $ad\\n\", ($ret & 127), $signm);\n\texit 1;\n} elsif($ret != 0) {\n\tError(\"centrifuge-class exited with value %d\\n\", ($ret >> 8));\n}\nexit ($ret >> 8);\n"
  },
  {
    "path": "centrifuge-BuildSharedSequence.pl",
    "content": "#!/usr/bin/env perl\n\nuse strict ;\nuse Getopt::Long;\nuse File::Basename;\n\nmy $usage = \"perl \".basename($0).\" file_list [-prefix tmp -kmerSize 53 -kmerPortion 0.01 -nucmerIdy 99 -overlap 250 [-fragment] ]\" ;\n\nmy @fileNames ; # Assume the genes for each subspecies are concatenated.\nmy @used ;\nmy $i ;\nmy $j ;\nmy $k ;\nmy %globalKmer ;\nmy $kmerSize = 53 ;\nmy $useKmerPortion = 0.01 ;\nmy %localKmer ;\nmy $id = 0 ;\nmy $kmer ;\nmy %chroms ;\nmy %shared ;\nmy $id = \"\" ;\nmy $seq = \"\" ;\nmy $index ;\nmy $prefix = \"tmp\";\nmy %sharedKmerCnt ;\nmy $nucmerIdy = 99 ;\nmy $overlap = 250 ;\nmy $fragment = 0 ;\nmy $jellyfish = \"jellyfish\";\n\n#print `jellyfish --version`;\n\nGetOptions (\n\t\"prefix=s\" => \\$prefix,\n\t\"kmerSize=s\" => \\$kmerSize,\n\t\"kmerPortion=s\" => \\$useKmerPortion,\n\t\"nucmerIdy=s\" => \\$nucmerIdy,\n\t\"overlap=s\" => \\$overlap,\n\t\"fragment\" => \\$fragment,\n\t\"jellyfish=s\" => \\$jellyfish)\nor die(\"Error in command line arguments. 
\\n\\n$usage\");\n\ndie \"$usage\\n\" if ( scalar( @ARGV ) == 0 ) ;\nopen FP1, $ARGV[0] ; \n\n# Create the temporary files, while making sure the header id is unique within each file\nmy $listSize = 0 ;\nwhile ( <FP1> )\n{\n\tchomp ;\n\topen FP2, $_ ;\n\tmy $fileName = $prefix.\"_\".$listSize.\".fa\" ;\n\topen FPtmp, \">$fileName\" ;\n\tmy $chromCnt = 0 ;\n\twhile ( <FP2> )\n\t{\n\t\tif ( /^>/ )\n\t\t{\n\t\t\ts/\\s/\\|$chromCnt / ;\n\t\t\tprint FPtmp ;\n\t\t\t++$chromCnt ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\tprint FPtmp ;\n\t\t}\n\t}\n\t++$listSize ;\n}\n\n# Read in the file names\n#while ( <FP1> )\n#{\n#\tchomp ;\n#\tpush @fileNames, $_ ;\n#\tpush @used, 0 ;\n#\t++$index ;\n#}\n#close( FP1 ) ;\nfor ( my $i = 0 ; $i < $listSize ; ++$i )\n{\n\tmy $fileName = $prefix.\"_\".$i.\".fa\" ;\n\tpush @fileNames, $fileName ;\n\tpush @used, 1 ;\n\t++$index ;\n}\n\n\n\n# Count and select kmers\nprint \"Find the kmers for testing\\n\" ;\nsystem_call(\"$jellyfish count -o tmp_$prefix.jf -m $kmerSize -s 5000000 -C -t 8 @fileNames\") ;\nsystem_call(\"$jellyfish dump tmp_$prefix.jf > tmp_$prefix.jf_dump\") ;\n\nsrand( 17 ) ;\nopen FP1, \"tmp_$prefix.jf_dump\" ;\nopen FP2, \">tmp_$prefix.testingKmer\" ;\nwhile ( <FP1> )\n{\n\tmy $line = $_ ;\n\t$kmer = <FP1> ;\n\t#chomp $kmer ;\n\t\n\t#++$testingKmer{ $kmer} if ( rand() < $useKmerPortion ) ;\n\tprint FP2 \"$line$kmer\" if ( rand() < $useKmerPortion ) ;\n\t\t\n}\nclose FP1 ;\nclose FP2 ;\n\nprint \"Get the kmer profile for each input file\\n\" ;\nfor ( $i = 0 ; $i < scalar( @fileNames )  ; ++$i )\n{\n\tmy $fileName = $fileNames[ $i ] ;\n\tsystem_call(\"$jellyfish count --if tmp_$prefix.testingKmer -o tmp_$prefix.jf -m $kmerSize -s 5000000 -C -t 8 $fileName\") ;\n\tsystem_call(\"$jellyfish dump tmp_$prefix.jf > tmp_$prefix.jf_dump\") ;\n\topen FP1, \"tmp_$prefix.jf_dump\";\n\n\twhile ( <FP1> )\n\t{\n\t\tchomp ;\n\t\tmy $cnt = substr( $_, 1 )  ;\n\t\tif ( $cnt eq \"0\" ) \n\t\t{\n\t\t\t$kmer = <FP1> ;\n\t\t\tnext ;\n\t\t}\n\t\t#print 
\"$cnt\\n\" ;\n\t\t$kmer = <FP1> ;\n\t\tchomp $kmer ;\n\t\t$localKmer{$i}->{$kmer} = 1 ; \n\t}\n\tclose FP1 ;\n}\n\n# Get the genome sizes\nsub GetGenomeSize\n{\n\topen FPfa, $_[0] ;\n\tmy $size = 0 ;\n\twhile ( <FPfa> )\n\t{\n\t\tnext if ( /^>/ ) ;\n\t\t$size += length( $_ ) - 1 ;\n\t}\n\tclose FPfa ;\n\treturn $size ;\n}\nmy $longestGenome = 0 ;\nfor ( $i = 0 ;  $i < scalar( @fileNames ) ; ++$i )\n{\n\tmy $size = GetGenomeSize( $fileNames[$i] ) ;\n\tif ( $size > $longestGenome )\n\t{\n\t\t$longestGenome = $size ;\n\t}\n}\n\n#for ( $i = 0 ; $i < scalar( @fileNames ) ; ++$i )\nprint \"Begin merge files\\n\" ;\nmy $maxSharedKmerCnt = -1 ;\nwhile ( 1 ) \n{\n\t# Find the suitable files\n\tmy @maxPair ;\n\tmy $max = 0 ;\n\tprint \"Selecting two genomes to merge.\\n\" ;\n\tforeach $i (keys %localKmer )\n\t{\n\t\tforeach $j ( keys %localKmer )\n\t\t{\n\t\t\tnext if ( $i <= $j ) ;\n\n\t\t\tmy $cnt = 0 ;\n\n\t\t\tif ( defined $sharedKmerCnt{ \"$i $j\" } )\n\t\t\t{\n\t\t\t\t$cnt = $sharedKmerCnt{ \"$i $j\" }\n\t\t\t}\n\t\t\telse\n\t\t\t{\n\t\t\t\tforeach $kmer ( keys %{$localKmer{ $i } } )\n\t\t\t\t{\n\t\t\t\t\t#print $kmer, \"\\n\" ;\n\t\t\t\t\tif ( defined $localKmer{ $j }->{ $kmer} ) \n\t\t\t\t\t{\n\t\t\t\t\t\t++$cnt ;\t\t\t\t\n\t\t\t\t\t}\t\n\t\t\t\t}\n\t\t\t\t$sharedKmerCnt{ \"$i $j\" } = $cnt ;\n\t\t\t}\n\n\t\t\tif ( $cnt > $max )\n\t\t\t{\n\t\t\t\t$max = $cnt ;\n\t\t\t\t$maxPair[0] = $i ;\n\t\t\t\t$maxPair[1] = $j ;\n\t\t\t}\n\t\t}\n\t}\n\n\t$maxSharedKmerCnt = $max if ( $maxSharedKmerCnt == -1 ) ;\n\tlast if ( $max == 0 || $max < $maxSharedKmerCnt * 0.01 ) ;\n\n\tmy @commonRegion ;\n\tmy $fileNameA ;\n\tmy $fileNameB ;\n\t$i = $maxPair[0] ;\n\t$j = $maxPair[1] ;\n\tif ( $i < scalar( @fileNames ) )\n\t{\n\t\t$fileNameA = $fileNames[ $i ] ;\n\t}\n\telse\n\t{\n\t\t$fileNameA = $prefix.\"_\".$i.\".fa\" ;\n\t}\n\tif ( $j < scalar( @fileNames ) )\n\t{\n\t\t$fileNameB = $fileNames[ $j ] ;\n\t}\n\telse\n\t{\n\t\t$fileNameB = $prefix.\"_\".$j.\".fa\" ;\n\t}\n\n\tif ( 
GetGenomeSize( $fileNameA ) < 0.01 * $longestGenome || GetGenomeSize( $fileNameB ) < 0.01 * $longestGenome )\n\t{\n\t\tlast ;\n\t}\n\n\tmy $nucmerC = 3 * $overlap ;\n\tmy $nucmerG = 10 ;\n\tmy $nucmerB = 10 ;\n\tif ( $i >= scalar( @fileNames ) && $j >= scalar( @fileNames ) )\n\t{\n\t\t$nucmerG = 5 ;\n\t\t$nucmerB = 5 ;\n\t}\n\tprint \"nucmer --maxmatch -l $kmerSize -g $nucmerG -b $nucmerB -c $nucmerC -p nucmer_$prefix $fileNameA $fileNameB\\n\" ; \n\tmy $nucRet = system(\"nucmer --maxmatch -l $kmerSize -g $nucmerG -b $nucmerB -c $nucmerC -p nucmer_$prefix $fileNameA $fileNameB\") ; # if the call to nucmer failed, we just not compress at all. \n\tprint \"show-coords nucmer_$prefix.delta > nucmer_$prefix.coords\\n\" ; \n\t$nucRet = system( \"show-coords nucmer_$prefix.delta > nucmer_$prefix.coords\\n\" )  ; # if the call to nucmer failed, we just not compress at all. \n\n\topen FPCoords, \"nucmer_$prefix.coords\" ;\n\tmy $line ;\n\t$line = <FPCoords> ;\n\t$line = <FPCoords> ;\n\t$line = <FPCoords> ;\n\t$line = <FPCoords> ;\n\t$line = <FPCoords> ;\n\t\n\t#1     5195  |        1     5195  |     5195     5195  |    99.98  | gi|385223048|ref|NC_017374.1|\tgi|385230889|ref|NC_017381.1|\n\tprint \"Merging $fileNameA $fileNameB\\n\" ;\n\n\tmy $cnt = 0 ;\t\n\twhile ( <FPCoords> )\n\t{\n\t\tchomp ;\n\t\t$line = $_ ;\n\t\tmy @cols = split /\\s+/, $line ;\n\n\t\tshift @cols if ( $cols[0] eq \"\" ) ;\n\n\t\tnext if ( $cols[6] <= 3 * $overlap || $cols[9] < $nucmerIdy ) ;\n\t\t\n\t\t++$cnt ;\n\t\tmy $ind = scalar( @commonRegion ) ;\n\t\tpush @{ $commonRegion[$ind] }, ( $cols[11], $cols[0] + $overlap, $cols[1] - $overlap )  ;\n\t\tif ( $cols[3] < $cols[4] )\n\t\t{\n\t\t\tpush @{ $commonRegion[ $ind ] }, ( $cols[12], $cols[3] + $overlap, $cols[4] - $overlap ) ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\tpush @{ $commonRegion[ $ind ] }, ( $cols[12], $cols[4] + $overlap, $cols[3] - $overlap ) ;\n\t\t}\n\t}\n\tlast if ( $cnt == 0 ) ;\n\tclose FPCoords ;\n\n\tmy $outputSeq = \"\" ;\n\tmy 
$outputHeader = \">${prefix}_${index}\" ;\n\t# Use fileNameA to represent the shared sequences\n\tif ( $fragment == 0 )\n\t{\n\t\t# Just a big chunk.\n\t\topen FP1, $fileNameA ;\n\t\twhile ( <FP1> )\n\t\t{\n\t\t\tchomp ;\n\t\t\tnext if ( /^>/ ) ;\n\t\t\t$outputSeq .= $_ ;\n\t\t\t#print \"$_\\n$outputSeq\\n\" ;\n\t\t}\n\t\tclose FP1 ;\n\t}\n\telse\n\t{\n\t\topen FP1, $fileNameA ;\n\t\t$id = \"\" ;\n\t\t$seq = \"\" ;\n\t\tundef %chroms ;\n\t\tundef %shared ;\n\t\twhile ( <FP1> )\n\t\t{\n\t\t\tchomp ;\n\t\t\tif ( /^>/ )\n\t\t\t{\n\t\t\t\t$chroms{ $id } = $seq if ( $id ne \"\" ) ;\n\t\t\t\t$id = ( split /\\s+/, substr( $_, 1 ) )[0] ;\n\t\t\t\t#print $id, \"\\n\" ;\n\t\t\t\t$seq = \"\" ;\n\t\t\t\t@#{ $shared{ $id } } = length( $seq ) + 1 ; \n\t\t\t}\n\t\t\telse\n\t\t\t{\n\t\t\t\t$seq .= $_ ;\n\t\t\t}\n\t\t}\n\t\t$chroms{ $id } = $seq if ( $id ne \"\" ) ;\n\t\t@#{ $shared{ $id } } = length( $seq ) + 1 ; \n\t\tclose FP1 ;\n\n\t\tfor ( $i = 0 ; $i < scalar( @commonRegion) ; ++$i )\n\t\t{\n\t\t\t#print \"hi $i $commonRegion[$i]->[1] $commonRegion[$i]->[2]\\n\" ;\t\n\t\t\tmy $tmpArray = \\@{ $shared{ $commonRegion[$i]->[0] } } ;\n\t\t\t$commonRegion[$i]->[1] -= $overlap if ( $commonRegion[$i]->[1] == $overlap + 1 ) ;\n\t\t\t$commonRegion[$i]->[2] += $overlap if ( $commonRegion[$i]->[2] + $overlap == length( $chroms{ $commonRegion[$i]->[0] } ) ) ;\n\t\t\tfor ( $j = $commonRegion[$i]->[1] - 1 ; $j < $commonRegion[$i]->[2] ; ++$j ) # Shift the coordinates\n\t\t\t{\n\t\t\t\t$tmpArray->[$j] = 1 ;\n\t\t\t}\n\t\t}\n\n\t\t# Print the information of genome A, including the shared part\n\t\tmy $fileName = $prefix.\"_\".$index.\".fa\" ;\n\t\topen fpOut, \">$fileName\" ;\n\t\tforeach $i (keys %chroms )\n\t\t{\n\t\t\tmy $tmpArray = \\@{ $shared{ $i } } ;\n\t\t\tmy $len = length( $chroms{ $i } ) ;\n\t\t\tmy $header = ( split /\\|Range:/, $i )[0] ;\n\t\t\tmy $origStart = ( split /-/, ( ( split /\\|Range:/, $i )[1] ) )[0] ;\n\t\t\tfor ( $j = 0 ; $j < $len ;  )\n\t\t\t{\n\t\t\t\tif ( 
$tmpArray->[$j] == 1 )\n\t\t\t\t{\n\t\t\t\t\tmy $start = $j ;\n\t\t\t\t\tmy $end ;\n\t\t\t\t\tfor ( $end = $j + 1 ; $end < $len && $tmpArray->[$end] == 1 ; ++$end )\n\t\t\t\t\t{\n\t\t\t\t\t\t;\n\t\t\t\t\t}\n\t\t\t\t\t--$end ;\n\t\t\t\t\tmy $rangeStart = $origStart + $start ;\n\t\t\t\t\tmy $rangeEnd = $origStart + $end ;\n\t\t\t\t\tprint fpOut \">$header|Range:$rangeStart-$rangeEnd shared\\n\" ;\n\t\t\t\t\tprint fpOut substr( $chroms{$i}, $start, $end - $start + 1 ), \"\\n\" ;\n\n\t\t\t\t\t$j = $end + 1 ;\n\t\t\t\t}\n\t\t\t\telse\n\t\t\t\t{\n\t\t\t\t\tmy $start = $j ;\n\t\t\t\t\tmy $end ;\n\n\t\t\t\t\tfor ( $end = $j + 1 ; $end < $len && $tmpArray->[$end] == 0 ; ++$end )\n\t\t\t\t\t{\n\t\t\t\t\t\t;\n\t\t\t\t\t}\n\t\t\t\t\t--$end ;\n\t\t\t\t\tmy $rangeStart = $origStart + $start ;\n\t\t\t\t\tmy $rangeEnd = $origStart + $end ;\n\n\t\t\t\t\tprint fpOut \">$header|Range:$rangeStart-$rangeEnd non-shared\\n\" ;\n\t\t\t\t\tprint fpOut substr( $chroms{$i}, $start, $end - $start + 1 ), \"\\n\" ;\n\n\t\t\t\t\t$j = $end + 1 ;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t} # end if fragment. 
There might be bugs in the fragment mode\n\n\t# Print the sequence from genome B, only including the non-shared part\n\topen FP1, $fileNameB ;\n\t$id = \"\" ;\n\t$seq = \"\" ;\n\tundef %chroms ;\n\tundef %shared ;\n\twhile ( <FP1> )\n\t{\n\t\tchomp ;\n\t\tif ( /^>/ )\n\t\t{\n\t\t\t$chroms{ $id } = $seq if ( $id ne \"\" ) ;\n\t\t\t$id = ( split /\\s+/, substr( $_, 1 ) )[0] ;\n\t\t\t$seq = \"\" ;\n\t\t\t@#{ $shared{ $id } } = length( $seq ) + 1 ; \n\t\t}\n\t\telse\n\t\t{\n\t\t\t$seq .= $_ ;\n\t\t}\n\t}\n\t$chroms{ $id } = $seq if ( $id ne \"\" ) ;\n\t@#{ $shared{ $id } } = length( $seq ) + 1 ; \n\tclose FP1 ;\n\n\tfor ( $i = 0 ; $i < scalar( @commonRegion) ; ++$i )\n\t{\n\t\tmy $tmpArray = \\@{ $shared{ $commonRegion[$i]->[3] } } ;\n\t\t$commonRegion[$i]->[4] -= $overlap if ( $commonRegion[$i]->[4] == $overlap + 1 ) ;\n\t\t$commonRegion[$i]->[5] += $overlap if ( $commonRegion[$i]->[5] + $overlap == length( $chroms{ $commonRegion[$i]->[3] } ) ) ;\n\t\tfor ( $j = $commonRegion[$i]->[4] - 1 ; $j < $commonRegion[$i]->[5] ; ++$j )\t\n\t\t{\n\t\t\t$tmpArray->[$j] = 1 ;\n\t\t}\n\t}\n\n\tforeach $i (keys %chroms )\n\t{\n\t\tmy $tmpArray = \\@{ $shared{ $i } } ;\n\t\tmy $len = length( $chroms{ $i } ) ;\n\t\tmy $header = ( split /\\|Range:/, $i )[0] ;\n\t\tmy $origStart = ( split /-/, ( ( split /\\|Range:/, $i )[1] ) )[0] ;\n\t\tfor ( $j = 0 ; $j < $len ;  )\n\t\t{\n\t\t\tif ( $tmpArray->[$j] == 1 ) \n\t\t\t{\n\t\t\t\t++$j ;\n\t\t\t\tnext ;\n\t\t\t}\n\n\t\t\tmy $start = $j ;\n\t\t\tmy $end ;\n\t\t\tfor ( $end = $j + 1 ; $end < $len && $tmpArray->[$end] == 0 ; ++$end )\n\t\t\t{\n\t\t\t\t;\n\t\t\t}\n\t\t\t--$end ;\n\t\t\t$j = $end + 1 ;\n\t\t\t\n\t\t\tnext if ( $end - $start < $overlap ) ;\n\n\t\t\tmy $rangeStart = $origStart + $start ;\n\t\t\tmy $rangeEnd = $origStart + $end ;\n\n\t\t\tif ( $fragment == 0 )\n\t\t\t{\n\t\t\t\t#print length( $outputSeq ), \"\\n\" ;\n\t\t\t\t$outputSeq .= substr( $chroms{$i}, $start, $end - $start + 1 ) ; \t\n\t\t\t\t#print length( $outputSeq ), 
\"\\n\" ;\n\t\t\t}\n\t\t\telse\n\t\t\t{\n\t\t\t\tmy $fileName = $prefix.\"_\".$index.\".fa\" ;\n\t\t\t\t#open fpOut, \">>$fileName\" ;\n\t\t\t\tprint fpOut \">$header|Range:$rangeStart-$rangeEnd non-shared\\n\" ;\n\t\t\t\tprint fpOut substr( $chroms{$i}, $start, $end - $start + 1 ), \"\\n\" ;\n\t\t\t\t#close fpOut ;\n\t\t\t}\n\t\t}\n\t}\n\n\tdelete $localKmer{ $maxPair[0] } ;\n\tdelete $localKmer{ $maxPair[1] } ;\n\n\t#print defined( $localKmer{ $maxPair[0] }) ;\n\tunlink glob(\"$fileNameA\") ; #if ( $maxPair[0] >= scalar( @fileNames ) ) ;\n\tunlink glob(\"$fileNameB\") ; #if ( $maxPair[1] >= scalar( @fileNames ) ) ;\n\n\t# Count the kmer for the new genome\n\tmy $fileName = $prefix.\"_\".$index.\".fa\" ;\n\tif ( $fragment == 0 )\n\t{\n\t\topen fpOut, \">$fileName\" ;\n\t\tprint fpOut \"$outputHeader\\n$outputSeq\\n\" ;\n\t}\n\tclose fpOut ;\n\n\tsystem_call(\"$jellyfish count --if tmp_$prefix.testingKmer -o tmp_$prefix.jf -m $kmerSize -s 5000000 -C -t 4 $fileName\") ;\n\tsystem_call(\"$jellyfish dump tmp_$prefix.jf > tmp_$prefix.jf_dump\") ;\n\topen FP1, \"tmp_$prefix.jf_dump\";\n\n\twhile ( <FP1> )\n\t{\n\t\tchomp ;\n\t\tmy $cnt = substr( $_, 1 )  ;\n\t\tif ( $cnt eq \"0\" )\n\t\t{\n\t\t\t$kmer = <FP1> ;\n\t\t\tnext ;\n\t\t}\n\t\t$kmer = <FP1> ;\n\t\tchomp $kmer ;\n\t\t$localKmer{$index}->{$kmer} = 1 ;\n\t}\n\tclose FP1 ;\n\t\n\t++$index ;\n} # while 1\n\n#foreach $i (keys %localKmer )\n#{\n#\tif ( $i < scalar( @fileNames ) ) \n#\t{\n#\t\tmy $path = $fileNames[ $i ] ;\n#\t\tmy $fileName = $prefix.\"_\".$i.\".fa\" ;\n#\t\tsystem_call(\"cp $path $fileName\") ;\n#\t}\n#}\n\n# clean up\nunlink glob(\"tmp_$prefix.jf tmp_$prefix.jf_dump tmp_$prefix.testingKmer\") ;\nunlink glob(\"nucmer_$prefix*\") ;\nprint \"Finish.\\n\" ;\n\nsub system_call {\n    print STDERR \"SYSTEM CALL: \".join(\" \",@_).\"\\n\" ;\n\tsystem(@_) == 0\n\t  or die \"system @_ failed: $?\";\n    print STDERR \" finished\\n\";\n}\n\n\n"
  },
  {
    "path": "centrifuge-RemoveEmptySequence.pl",
    "content": "#!/usr/bin/env perl\n\n# remove the headers with empty sequences. possible introduced by dustmask\n\nuse strict ;\n\n# die \"usage: a.pl < in.fa >out.fa\"\n\nmy $tag ;\nmy @lines ;\n\n$lines[0] = <> ;\n$tag = 0 ;\nwhile ( <> )\n{\n\t$lines[1-$tag] = $_ ;\n\tif ( /^>/ && $lines[$tag] =~ /^>/ ) \n\t{\n\t\t$tag = 1 - $tag ;\n\t\tnext ;\n\t}\n\tprint $lines[$tag] ;\n\t$tag = 1- $tag ;\n}\nif ( !( $lines[$tag] =~ /^>/ ) )\n{\n\tprint $lines[$tag] ;\n}\n"
  },
  {
    "path": "centrifuge-RemoveN.pl",
    "content": "#!/usr/bin/env perl\n\nuse strict ;\nuse warnings ;\n\ndie \"usage: a.pl xxx.fa > output.fa\" if ( @ARGV == 0 ) ;\n\nmy $LINE_WIDTH = 80 ;\nmy $BUFFER_SIZE = 100000 ;\nopen FP1, $ARGV[0] ;\nmy $buffer = \"\" ;\n\nwhile ( <FP1> )\n{\n\tif ( /^>/ )\n\t{\n\t\tmy $line = $_ ;\n\t\tif ( $buffer ne \"\" )\n\t\t{\n\t\t\tmy $len = length( $buffer ) ;\n\t\t\tfor ( my $i = 0 ; $i < $len ; $i += $LINE_WIDTH )\n\t\t\t{\n\t\t\t\tprint substr( $buffer, $i, $LINE_WIDTH ), \"\\n\" ;\n\t\t\t}\n\t\t\t$buffer = \"\" ;\n\t\t}\n\t\tprint $line ;\n\t}\n\telse\n\t{\n\t\tchomp ;\n\t\ttr/nN//d ;\n\t\t#my $line = uc( $_ ) ;\n\t\tmy $line = $_ ;\n\t\t$buffer .= $line ;\n\n\t\tif ( length( $buffer ) > $BUFFER_SIZE )\n\t\t{\n\t\t\tmy $len = length( $buffer ) ;\n\t\t\tmy $i = 0 ;\n\t\t\tfor ( $i = 0 ; $i + $LINE_WIDTH - 1 < $len ; $i += $LINE_WIDTH )\n\t\t\t{\n\t\t\t\tprint substr( $buffer, $i, $LINE_WIDTH ), \"\\n\" ;\n\t\t\t}\n\t\t\tsubstr( $buffer, 0, $i, \"\" ) ;\n\t\t}\n\t}\n}\nif ( $buffer ne \"\" )\n{\n\tmy $len = length( $buffer ) ;\n\tfor ( my $i = 0 ; $i < $len ; $i += $LINE_WIDTH )\n\t{\n\t\tprint substr( $buffer, $i, $LINE_WIDTH ), \"\\n\" ;\n\t}\n\t$buffer = \"\" ;\n}\n"
  },
  {
    "path": "centrifuge-build",
    "content": "#!/usr/bin/env python\n\n\"\"\"\n Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n\n This file is part of Centrifuge, which is copied and modified from bowtie2 in the Bowtie2 package.\n\n Centrifuge is free software: you can redistribute it and/or modify\n it under the terms of the GNU General Public License as published by\n the Free Software Foundation, either version 3 of the License, or\n (at your option) any later version.\n\n Centrifuge is distributed in the hope that it will be useful,\n but WITHOUT ANY WARRANTY; without even the implied warranty of\n MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n GNU General Public License for more details.\n\n You should have received a copy of the GNU General Public License\n along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.\n\"\"\"\n\n\nimport os\nimport sys\nimport inspect\nimport logging\n\n\ndef build_args():\n    \"\"\"\n    Parse the wrapper arguments. Returns the options,<programm arguments> tuple.\n    \"\"\"\n\n    parsed_args = {}\n    to_remove = []\n    argv = sys.argv[:]\n    for i, arg in enumerate(argv):\n        if arg == '--debug':\n            parsed_args[arg] = \"\"\n            to_remove.append(i)\n        elif arg == '--verbose':\n            parsed_args[arg] = \"\"\n            to_remove.append(i)\n\n    for i in reversed(to_remove):\n        del argv[i]\n\n    return parsed_args, argv\n\n\ndef main():\n    logging.basicConfig(level=logging.ERROR,\n                        format='%(levelname)s: %(message)s'\n                        )\n    delta               = 200\n    small_index_max_size= 4 * 1024**3 - delta\n    build_bin_name      = \"centrifuge-build-bin\"\n    build_bin_s         = \"centrifuge-build-bin\"\n    curr_script         = os.path.realpath(inspect.getsourcefile(main))\n    ex_path             = os.path.dirname(curr_script)\n    build_bin_spec      = os.path.join(ex_path,build_bin_s)\n\n    options, argv = build_args()\n\n    if 
'--verbose' in options:\n        logging.getLogger().setLevel(logging.INFO)\n        \n    if '--debug' in options:\n        build_bin_spec += '-debug'\n\n    argv[0] = build_bin_name\n    argv.insert(1, 'basic-0')\n    argv.insert(1, '--wrapper')\n    logging.info('Command: %s %s' % (build_bin_spec, ' '.join(argv[1:])))\n    os.execv(build_bin_spec, argv)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "centrifuge-compress.pl",
    "content": "#!/usr/bin/env perl\n\n# Read and merge the sequence for the chosen level\n\nuse strict ;\nuse warnings ;\n\nuse threads ;\nuse threads::shared ;\nuse FindBin qw($Bin);\nuse File::Basename;\nuse File::Find;\nuse Getopt::Long;\nuse Cwd;\nuse Cwd 'cwd' ;\nuse Cwd 'abs_path' ;\n\nmy $CWD = dirname( abs_path( $0 ) ) ;\nmy $PWD = abs_path( \"./\" ) ;\n\nmy $usage = \"USAGE: perl \".basename($0).\" path_to_download_files path_to_taxnonomy [-map header_to_taxid_map -o compressed -noCompress -t 1 -maxG 50000000 -noDustmasker]\\n\" ;\n\nmy $level = \"species\" ;\nmy $output = \"compressed\" ;\nmy $bssPath = $CWD ; # take path of binary as script directory\nmy $numOfThreads = 1 ;\nmy $noCompress = 0 ;\nmy $noDustmasker = 0 ;\nmy $verbose = 0;\nmy $maxGenomeSizeForCompression = 50000000 ;\nmy $mapFile = \"\" ;\n\nGetOptions (\"level|l=s\" => \\$level,\n\t\t\t\"output|o=s\" => \\$output,\n\t\t\t\"bss=s\" => \\$bssPath,\n\t\t\t\"threads|t=i\" => \\$numOfThreads,\n\t\t\t\"maxG=i\" => \\$maxGenomeSizeForCompression, \n\t\t\t\"map=s\" => \\$mapFile, \n            \"verbose|v\" => \\$verbose,\n\t\t\t\"noCompress\" => \\$noCompress,\n\t\t\t\"noDustmasker\" => \\$noDustmasker)\nor die(\"Error in command line arguments. 
\\n\\n$usage\");\n\ndie $usage unless @ARGV == 2;\n\nmy $path = $ARGV[0] ;\nmy $taxPath = $ARGV[1] ;\n\nmy $i ;\n\nmy %gidToFile ;\nmy %gidUsed ;\nmy %tidToGid ; # each tid can correspond to several gids\nmy %gidToTid ;\nmy %speciesList ; # hold the tid in the species\nmy %species ; # hold the species tid\nmy %genus ; # hold the genus tid\nmy %speciesToGenus ;\nmy %fileUsed : shared ;\nmy $fileUsedLock : shared ;\nmy %taxTree ;\n\nmy @speciesListKey ;\nmy $speciesUsed : shared ;\nmy $debug: shared ;\nmy %speciesIdToName : shared ;\n\nmy %idToTaxId : shared ; \nmy %newIdToTaxId : shared ;\nmy %idToGenomeSize : shared ;\nmy $idMapLock : shared ;\n\nmy $step = 1 ;\n\n# check for the required software\nif ( `which jellyfish` eq \"\" )\n{\n\tdie \"Could not find Jellyfish.\\n\" ;\n}\nelse\n{\n\tmy $version = `jellyfish --version` ;\n\tchomp $version ;\n\tif ( ( split /\\s+/, $version )[1] lt 2 )\n\t{\n\t\tdie \"Jellyfish on your system is \", $version, \" , centrifuge-compress.pl requires at least 2.0.0\\n\" ;\n\t}\n}\n\nif ( `which nucmer` eq \"\" )\n{\n\tdie \"Could not find Nucmer.\\n\"\n}\n\nif ( `which dustmasker` eq \"\" ) \n{\n\tprint STDERR \"Could not find dustmasker. 
And will turn on -noDustmasker option.\\n\" ;\n\t$noDustmasker = 1 ;\n}\n\n#Extract the gid we met in the file\nif ( $mapFile ne \"\" )\n{\n\tprint STDERR \"Step $step: Reading in the provided mapping list of ids to taxionomy ids.\\n\";\n\t++$step ;\n\n\topen FP1, $mapFile ;\n\twhile ( <FP1> )\n\t{\n\t\tchomp ;\n\t\tmy @cols = split ;\n\t\t$idToTaxId{ $cols[0] } = $cols[1] ;\n\n\t\tif ( $noCompress == 1 )\n\t\t{\n\t\t\t$newIdToTaxId{ $cols[0] } = $cols[1] ;\n\t\t}\n\t}\n}\n\nprint STDERR \"Step $step: Collecting all .fna files in $path and getting gids\\n\";\n++$step ;\nif ( $noCompress == 1 )\n{\n\t`rm tmp_output.fa` ;\n}\n\nfind ( { wanted=>sub {\n    return unless -f  ;        # Must be a file\n    return unless -s;        # Must be non-zero\n    if ( !( /\\.f[nf]?a$/ || /\\.ffn$/ || /\\.fasta$/ ) )\n    {\n    \treturn ;\n    }\n\n    my $fullfile = $File::Find::name; ## file with full path, but the CWD is actually the file's path\n    my $file = $_; ## file name\n\topen FP2, $file or die \"Error opening $file: $!\";\n\tmy $head = <FP2> ;\n\tclose FP2 ;\n\n\tchomp $head ;\n\tmy $headId = substr( ( split /\\s+/, $head )[0], 1 ) ;  \n\n\tif ( $noCompress == 1 )\n\t{\n\t\t# it seems the find will change the working directory\n\t\tsystem_call( \"cat $PWD/$fullfile >> $PWD/tmp_output.fa\" ) ;\n\t\tif ( defined $idToTaxId{ $headId } )\n\t\t{\n\t\t\t$newIdToTaxId{ $headId } = $idToTaxId{ $headId } ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\t$newIdToTaxId{ $headId } = -1 ;\n\t\t\tmy @cols = split /\\|/, $headId ;\n\t\t\tmy $subHeader = $cols[0].\"\\|\".$cols[1] ; \n\t\t\tif ( $headId =~ /gi\\|([0-9]+)?\\|/ )\n\t\t\t{\n\t\t\t\t$newIdToTaxId{ $subHeader } = -1 ;\n\t\t\t\t#print STDERR \"$headId $subHeader\\n\" if $verbose ;\n\t\t\t}\n\t\t\telsif ( $headId =~ /taxid\\|([0-9]+)?[\\|\\s]/ )\n\t\t\t{\n\t\t\t\t$newIdToTaxId{ $headId } = $1 ;\n\t\t\t\t$newIdToTaxId{ $subHeader } = $1 ;\n\t\t\t}\n\t\t\telsif ( scalar( @cols ) > 2 )\n\t\t\t{\n\t\t\t\t$newIdToTaxId{ $subHeader } = -1 
;\n\t\t\t}\n\t\t}\n\t}\n\n\n\tif ( defined $idToTaxId{ $headId } )\n\t{\n\t\tmy $tid = $idToTaxId{ $headId } ;\n\t\tmy $dummyGid = \"centrifuge_gid_\".$fullfile.\"_$tid\" ;\n\t\t$gidUsed{ $dummyGid } = 1 ;\n\t\t$gidToFile{ $dummyGid } = $fullfile ;\n\t\t$fileUsed{ $fullfile } = 0 ;\n\t\tpush @{ $tidToGid{ $tid } }, $dummyGid ;\t\n\n\t\tprint STDERR \"tid=$tid $file\\n\" if $verbose;\n\t}\n\telsif ( $head =~ /^>gi\\|([0-9]+)?\\|/ ) {\n\t\tmy $gid = $1 ;\n\t\tprint STDERR \"gid=$gid $file\\n\" if $verbose;\n\t\tif ( defined $gidUsed{ $gid } )\n\t\t{\n\t\t\tprint \"Repeated gid $gid\\n\" if $verbose ;\n\t\t\t$fileUsed{ $fullfile } = 1 ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\t$fileUsed{ $fullfile } = 0 ;\n\t\t\t$gidToFile{ $gid } = $fullfile ;\n\t\t}\n\n\t\t$gidUsed{ $gid } = 1 ;\n\t} elsif ( $head =~ /taxid\\|([0-9]+)?[\\|\\s]/ ) {\n\t\tmy $tid = $1 ;\n\t\tmy $dummyGid = \"centrifuge_gid_\".$fullfile.\"_$1\" ;\n\t\t$gidUsed{ $dummyGid } = 1 ;\n\t\t$gidToFile{ $dummyGid } = $fullfile ;\n\t\t$fileUsed{ $fullfile } = 0 ;\n\t\tpush @{ $tidToGid{ $tid } }, $dummyGid ;\t\n\t\t\n\t\tprint STDERR \"tid=$tid $file\\n\" if $verbose;\n\t} else {\n\t\tprint STDERR \"Excluding $fullfile: Wrong header.\\n\";\n\t}\n\n}, follow=>1 }, $path );\n\nif ( $noCompress == 1 ) \n{\n# Remove the Ns from the file\n\tif ( $noDustmasker == 1 )\n\t{\n\t\tsystem_call(\"perl $bssPath/centrifuge-RemoveN.pl tmp_output.fa | perl $bssPath/centrifuge-RemoveEmptySequence.pl > $output.fa\") ;\n\t}\n\telse\n\t{\n\t\tsystem_call(\"perl $bssPath/centrifuge-RemoveN.pl tmp_output.fa > tmp_output_fmt.fa\") ;\n\t\tsystem_call( \"dustmasker -infmt fasta -in tmp_output_fmt.fa -level 20 -outfmt fasta | sed '/^>/! 
s/[^AGCT]//g' > tmp_output_dustmasker.fa\" ) ;\n\t\tsystem_call(\"perl $bssPath/centrifuge-RemoveN.pl tmp_output_dustmasker.fa | perl $bssPath/centrifuge-RemoveEmptySequence.pl > $output.fa\") ;\n\t}\n}\n\n# Extract the tid that are associated with the gids\nprint STDERR \"Step $step: Extract the taxonomy ids that are associated with the gids\\n\";\n++$step ;\nopen FP1, \"$taxPath/gi_taxid_nucl.dmp\" ;\nwhile ( <FP1> )\n{\n\tchomp ;\n\tmy @cols = split ;\t\t\n#print $cols[0], \"\\n\" ;\t\n\tnext if ( @ARGV < 2 ) ;\n\tif ( defined( $gidUsed{ $cols[0] } ) )\n\t{\n\t\tpush @{ $tidToGid{ $cols[1] } }, $cols[0] ;\n\t\t$gidToTid{ $cols[0] } = $cols[1] ;\n\t\tprint STDERR \"gid=\", $cols[0], \" tid=\", $cols[1], \" \", $gidToFile{ $cols[0] }, \"\\n\" if $verbose;\n\t}\n}\nclose FP1 ;\n\nif ( $noCompress == 1 )\n{\n\topen FP1, \">tmp_$output.map\" ;\n\tforeach my $key (keys %newIdToTaxId )\n\t{\n\t\tif ( $newIdToTaxId{$key} != -1 )\n\t\t{\n\t\t\tprint FP1 \"$key\\t\", $newIdToTaxId{$key}, \"\\n\" ; \n\t\t}\n\t\telsif ( $key =~ /gi\\|([0-9]+)?/ )\n\t\t{\n\t\t\t#if ( defined $gidToTid{ $1 } )\n\t\t\t#{\n\t\t\t#\t$newIdToTaxId{ $key } = $gidToTid{ $1 } ;\n\t\t\t#}\n\t\t\tmy $taxId = 1 ;\n\t\t\tif (defined $gidToTid{ $1 } )\n\t\t\t{\n\t\t\t\t$taxId = $gidToTid{ $1 } ;\n\t\t\t}\n\t\t\tprint FP1 \"$key\\t\", $taxId, \"\\n\" ; \n\t\t}\n\t\telse\n\t\t{\n\t\t\tprint FP1 \"$key\\t1\\n\" ;\n\t\t}\n\t}\n\tclose FP1 ;\n\tsystem_call( \"sort tmp_$output.map | uniq > $output.map\" ) ;\n\texit ;\n}\n\n# Organize the tree\nprint STDERR \"Step $step: Organizing the taxonomical tree\\n\";\n++$step ;\nopen FP1, \"$taxPath/nodes.dmp\" or die \"Couldn't open $taxPath/nodes.dmp: $!\";\nwhile ( <FP1> ) {\n\tchomp ;\n\n\tmy @cols = split ;\n#next if ( !defined $tidToGid{ $cols[0] } ) ;\n\n\tmy $tid = $cols[0] ;\n\tmy $parentTid = $cols[2] ;\n\tmy $rank = $cols[4] ;\n#print \"subspecies: $tid $parentTid\\n\" ;\n#push @{ $species{ $parentTid } }, $tid ;\n#$tidToSpecies{ $tid } = $parentTid 
;\n\n\t$taxTree{ $cols[0] } = $cols[2] ;\n#print $cols[0], \"=>\", $cols[2], \"\\n\" ;\n\tif ( $rank eq \"species\" )\t\n\t{\n\t\t$species{ $cols[0] } = 1 ;\n\t}\n\telsif ( $rank eq \"genus\" )\n\t{\n\t\t$genus{ $cols[0] } = 1 ;\n\t}\n}\nclose FP1 ;\n\n# Put the sub-species taxonomy id into the corresponding species.\nprint STDERR \"Step $step: Putting the sub-species taxonomy id into the corresponding species\\n\";\n++$step ;\nfor $i ( keys %tidToGid )\n{\n\tmy $p = $i ;\n\tmy $found = 0 ;\n\twhile ( 1 )\n\t{\n\t\tlast if ( $p <= 1 ) ;\n\t\tif ( defined $species{ $p } ) \n\t\t{\n\t\t\t$found = 1 ;\n\t\t\tlast ;\n\t\t}\n\t\tif ( defined $taxTree{ $p } )\n\t\t{\n\t\t\t$p = $taxTree{ $p } ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\tlast ;\n\t\t}\n\t}\n\n\tif ( $found == 1 )\n\t{\n\t\tpush @{ $speciesList{ $p } }, $i ;\n\t}\n}\n\nprint STDERR \"Step $step: Reading the name of the species\\n\";\n++$step ;\nopen FP1, \"$taxPath/names.dmp\" or die \"Could not open $taxPath/names.dmp: $!\";\nwhile ( <FP1> )\n{\n\tnext if (!/scientific name/ ) ;\n\tmy @cols = split /\\t/ ;\n\n\tif ( defined $species{ $cols[0] } )\n\t{\n\t\t$speciesIdToName{ $cols[0] } = $cols[2] ;\n\t}\n}\nclose FP1 ;\n#exit ;\n\nsub GetGenomeSize\n{\n\topen FPfa, $_[0] ;\n\tmy $size = 0 ;\n\twhile ( <FPfa> )\n\t{\n\t\tnext if ( /^>/ ) ;\n\t\t$size += length( $_ ) - 1 ;\n\t}\n\tclose FPfa ;\n\treturn $size ;\n}\n\n# Compress one species.\nsub solve\n{\n# Extracts the files\n#print \"find $pwd -name *.fna > tmp.list\\n\" ;\n\tmy $tid = threads->tid() - 1 ;\n#system_call(\"find $pwd -name *.fna > tmp_$tid.list\") ;\n\n#system_call(\"find $pwd -name *.fa >> tmp.list\") ; # Just in case\n# Build the header\n\tmy $genusId ;\n\tmy $speciesId = $_[0] ;\n\tmy $speciesName ;\n\tmy $i ;\n\tmy $file ;\n\tmy @subspeciesList = @{ $speciesList{ $speciesId } } ;\n\t$genusId = $taxTree{ $speciesId } ;\n\twhile ( 1 )\n\t{\n\t\tif ( !defined $genusId || $genusId <= 1 )\n\t\t{\n\t\t\t$genusId = 0 ;\n\t\t\tlast ;\n\t\t}\n\n\t\tif ( 
defined $genus{ $genusId } )\n\t\t{\n\t\t\tlast ;\n\t\t}\n\t\telse\n\t\t{\n\t\t\t$genusId = $taxTree{ $genusId } ;\n\t\t}\n\t}\n\n\tmy $FP1 ;\n\topen FP1, \">tmp_$tid.list\" ;\n\tmy $genomeSize = 0 ;\n\tmy $avgGenomeSize = 0 ;\n\tforeach $i ( @subspeciesList )\n\t{\n\t\tforeach my $j ( @{$tidToGid{ $i } } )\n\t\t{\n#{\n#\tlock( $debug ) ;\n#\tprint \"Merge \", $gidToFile{ $j }, \"\\n\" ;\n#}\n\t\t\t$file =  $gidToFile{ $j } ;\n\t\t\t{\n\t\t\t\tlock( $fileUsedLock ) ;\n\t\t\t\t$fileUsed{ $file } = 1 ;\n\t\t\t}\n\t\t\tprint FP1 $file, \"\\n\" ;\n\n\t\t\tmy $tmp = GetGenomeSize( $file ) ;\n\t\t\tif ( $tmp > $genomeSize )\n\t\t\t{\n\t\t\t\t$genomeSize = $tmp ;\n\t\t\t}\n\t\t\t$avgGenomeSize += $tmp ;\n\t\t}\n\t}\n\tclose FP1 ;\n\n#$genomeSize = int( $genomeSize / scalar( @subspeciesList ) ) ;\n\t$avgGenomeSize = int( $avgGenomeSize / scalar( @subspeciesList ) ) ;\n\n#print $file, \"\\n\" ;\t\n#if ( $file =~ /\\/(\\w*?)uid/ )\n#{\n#\t$speciesName = ( split /_/, $1 )[0].\"_\". ( split /_/, $1 )[1] ;\n#}\n\tif ( defined $speciesIdToName{ $speciesId } )\n\t{\n\t\t$speciesName = $speciesIdToName{ $speciesId } ;\n\t\t$speciesName =~ s/ /_/g ;\n\t}\n\telse\n\t{\n\t\t$speciesName = \"Unknown_species_name\" ;\n\t}\n\tmy $id = $speciesId ;#( $speciesId << 32 ) | $genusId ;\n\tmy $header = \">cid|$id $speciesName $avgGenomeSize \".scalar( @subspeciesList ) ;\n\tprint STDERR \"$header\\n\" ;\n\t{\n\t\tlock( $idMapLock ) ;\n\t\t$newIdToTaxId{ \"cid|$id\" } = $speciesId ;\n\t\t$idToGenomeSize{ \"cid|$id\" } = $avgGenomeSize ;\n\t}\n\n#return ;\n# Build the sequence\n\tmy $seq = \"\" ;\n\tif ( $noCompress == 0 &&  ( $maxGenomeSizeForCompression < 0 || $genomeSize <= $maxGenomeSizeForCompression ) ) #$genomeSize < 50000000 )\n\t{\n\t\tsystem_call(\"perl $bssPath/centrifuge-BuildSharedSequence.pl tmp_$tid.list -prefix tmp_${tid}_$id\" ) ;\n\n# Merge all the fragmented sequence into one big chunk.\n\t\tsystem_call(\"cat tmp_${tid}_${id}_*.fa > tmp_${tid}_$id.fa\");\n\n\t\topen FP1, 
\"tmp_${tid}_$id.fa\" ;\n\t\twhile ( <FP1> )\n\t\t{\n\t\t\tchomp ;\n\t\t\tnext if ( /^>/ ) ;\n\t\t\t$seq .= $_ ;\n\t\t}\n\t\tclose FP1 ;\n\t}\n\telse\n\t{\n\t\tforeach $i ( @subspeciesList )\n\t\t{\n\t\t\tforeach my $j ( @{$tidToGid{ $i } } )\n\t\t\t{\n\t\t\t\t$file =  $gidToFile{ $j } ;\n\t\t\t\topen FP1, $file ;\n\t\t\t\twhile ( <FP1> )\n\t\t\t\t{\n\t\t\t\t\t#chomp ;\n\t\t\t\t\tnext if ( /^>/ ) ;\n\t\t\t\t\t$seq .= $_ ;\n\t\t\t\t}\n\t\t\t\tclose FP1 ;\n\t\t\t}\n\t\t}\n\t}\n\topen fpOut, \">>${output}_${tid}\" ;\n\tprint fpOut \"$header\\n$seq\\n\" ;\n\tclose fpOut ;\n\n\tunlink glob(\"tmp_${tid}_*\");\t\n}\n\nsub system_call {\n\tprint STDERR \"SYSTEM CALL: \".join(\" \",@_);\n\tsystem(@_) == 0\n\t\tor die \"system @_ failed: $?\";\n\tprint STDERR \" finished\\n\";\n}\n\nsub threadWrapper\n{\n\tmy $tid = threads->tid() - 1 ;\n\tunlink(\"${output}_${tid}\");\n\n\twhile ( 1 )\n\t{\n\t\tmy $u ;\n\t\t{\n\t\t\tlock $speciesUsed ;\n\t\t\t$u = $speciesUsed ;\n\t\t\t++$speciesUsed ;\n\t\t}\n\t\tlast if ( $u >= scalar( @speciesListKey ) ) ;\n\t\tsolve( $speciesListKey[ $u ] ) ;\n\t}\n}\n\n\nprint STDERR \"Step $step: Merging sub-species\\n\";\n++$step ;\nmy @threads ;\n@speciesListKey = keys %speciesList ; \n$speciesUsed = 0 ;\nfor ( $i = 0 ; $i < $numOfThreads ; ++$i )\n{\n\tpush @threads, $i ;\n}\n\nforeach (@threads)\n{\n\t$_ = threads->create( \\&threadWrapper ) ;\n}\n\nforeach (@threads)\n{\n\t$_->join() ;\n}\n\n# merge the files generated from each threads\nsystem_call(\"cat ${output}_* > tmp_output.fa\");\nunlink glob(\"${output}_*\");\n\n#print unused files\nforeach $i ( keys %fileUsed )\n{\n\tif ( $fileUsed{ $i } == 0 )\n\t{\n#print $i, \"\\n\" ;\n#`cat $i >> tmp_output.fa` ;\n\t\tprint \"Unused file: $i\\n\" ;\n\t}\n}\n\n# Remove the Ns from the file\nif ( $noDustmasker == 1 )\n{\n\tsystem_call(\"perl $bssPath/centrifuge-RemoveN.pl tmp_output.fa | perl $bssPath/centrifuge-RemoveEmptySequence.pl > $output.fa\") ;\n}\nelse\n{\n\tsystem_call(\"perl 
$bssPath/centrifuge-RemoveN.pl tmp_output.fa > tmp_output_fmt.fa\") ;\n\tsystem_call( \"dustmasker -infmt fasta -in tmp_output_fmt.fa -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]//g' > tmp_output_dustmasker.fa\" ) ;\n\tsystem_call(\"perl $bssPath/centrifuge-RemoveN.pl tmp_output_dustmasker.fa | perl $bssPath/centrifuge-RemoveEmptySequence.pl > $output.fa\") ;\n}\n\n# Output the mapping of the ids to species\nopen FP1, \">$output.map\" ;\nforeach my $key (keys %newIdToTaxId )\n{\n\tprint FP1 \"$key\\t\", $newIdToTaxId{ $key }, \"\\n\" ;\n}\nclose FP1 ;\n\n# Output the genome size map\nopen FP1, \">$output.size\" ;\nforeach my $key ( keys %newIdToTaxId )\n{\n\tprint FP1 $newIdToTaxId{ $key }, \"\\t\", $idToGenomeSize{ $key }, \"\\n\" ;\n}\nclose FP1 ;\nunlink glob(\"tmp_*\") ;\n"
  },
  {
    "path": "centrifuge-download",
    "content": "#!/bin/bash\n\nset -eu -o pipefail\n\nexists() {\n  command -v \"$1\" >/dev/null 2>&1\n}\n\ncut_after_first_space_or_second_pipe() {\n    grep '^>' | sed 's/ .*//' | sed 's/\\([^|]*|[^|]*\\).*/\\1/'\n}\nexport -f cut_after_first_space_or_second_pipe\n\nmap_headers_to_taxid() {\n    grep '^>' | cut_after_first_space_or_second_pipe | sed -e \"s/^>//\" -e  \"s/\\$/    $1/\"\n}\nexport -f map_headers_to_taxid\n\n\n\n#########################################################\n## Functions\n\nfunction download_n_process() {\n    IFS=$'\\t' read -r TAXID FILEPATH <<< \"$1\"\n\n    NAME=`basename $FILEPATH .gz`\n    GZIPPED_FILE=\"$LIBDIR/$DOMAIN/$NAME.gz\"\n    UNZIPPED_FILE=\"$LIBDIR/$DOMAIN/$NAME\"\n    DUSTMASKED_FILE=\"$LIBDIR/$DOMAIN/${NAME%.fna}_dustmasked.fna\"\n    [[ \"$DO_DUST\" == \"1\" ]] && RES_FILE=$DUSTMASKED_FILE || RES_FILE=$UNZIPPED_FILE\n\n    if [[ ! -s \"$RES_FILE\" ]]; then\n        if [ $DL_MODE = \"rsync\" ]; then\n            FILEPATH=${FILEPATH/ftp/rsync}\n            $DL_CMD \"$FILEPATH\" \"$LIBDIR/$DOMAIN/$NAME.gz\" || \\\n            { printf \"\\nError downloading $FILEPATH!\\n\" >&2 && exit 1; }\n        else\n            $DL_CMD \"$LIBDIR/$DOMAIN/$NAME.gz\" \"$FILEPATH\" || \\\n            $DL_CMD \"$LIBDIR/$DOMAIN/$NAME.gz\" \"$FILEPATH\" || \\\n            $DL_CMD \"$LIBDIR/$DOMAIN/$NAME.gz\" \"$FILEPATH\" || \\\n\t    { printf \"\\nError downloading $FILEPATH!\\n\" >&2 && exit 1; }\n        fi\n    \n        [[ -s \"$LIBDIR/$DOMAIN/$NAME.gz\" ]] || return;\n        gunzip -f \"$LIBDIR/$DOMAIN/$NAME.gz\" ||{ printf \"\\nError gunzipping $LIBDIR/$DOMAIN/$NAME.gz [ downloaded from $FILEPATH ]!\\n\" >&2 &&  exit 255; }\n    \n        \n        if [[ \"$CHANGE_HEADER\" == \"1\" ]]; then\n            sed -i \"s/^>/>kraken:taxid|$TAXID /\" $LIBDIR/$DOMAIN/$NAME\n        fi\n    \n        if [[ \"$FILTER_UNPLACED\" == \"1\" ]]; then\n            echo TODO 2>&1\n            ##sed -n '1,/^>.*unplaced/p; /'\n        fi\n    \n        
if [[ \"$DO_DUST\" == \"1\" ]]; then\n          ## TODO: Consider hard-masking only low-complexity stretches with 10 or more bps\n          dustmasker -infmt fasta -in $LIBDIR/$DOMAIN/$NAME -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > $DUSTMASKED_FILE\n          rm $LIBDIR/$DOMAIN/$NAME\n        fi\n    fi\n\n    ## Output sequenceID to taxonomy ID map to STDOUT\n    cat $RES_FILE | map_headers_to_taxid $TAXID\n    echo done\n}\nexport -f download_n_process\n\n\n\nfunction download_n_process_nofail() {\n    download_n_process \"$@\" || true\n}\nexport -f download_n_process_nofail\n\n\nceol=`tput el || echo -n \"\"` # terminfo clr_eol\n\nfunction count {\n   typeset C=0\n   while read L; do\n      if [[ \"$L\" == \"done\" ]]; then\n        [[ \"$VERBOSE\" == 1 ]] && continue;\n        C=$(( C + 1 ))\n        _progress=$(( (${C}*100/${1}*100)/100 ))\n        _done=$(( (${_progress}*4)/10 ))\n        _left=$(( 40-$_done ))\n        # Build progressbar string lengths\n        _done=$(printf \"%${_done}s\")\n        _left=$(printf \"%${_left}s\")\n\n        printf \"\\rProgress : [${_done// /#}${_left// /-}] ${_progress}%% $C/$1\"  1>&2\n      else\n        echo \"$L\"\n      fi\n   done\n}\n\nfunction check_or_mkdir_no_fail {\n    #echo -n \"Creating $1 ... \" >&2\n    if [[ -d $1 && ! -n `find $1 -prune -empty -type d` ]]; then\n        echo \"Directory $1 exists.  
Continuing\" >&2\n        return `true`\n    else \n        #echo \"Done\" >&2\n        mkdir -p $1\n        return `true`\n    fi\n}\n\nfunction c_echo() {\n        printf \"\\033[34m$*\\033[0m\\n\"\n}\n\n\n\n## Check if GNU parallel exists\ncommand -v parallel >/dev/null 2>&1 && PARALLEL=1 || PARALLEL=0\n\n\nALL_GENOMES=\"bacteria viral archaea fungi protozoa invertebrate plant vertebrate_mammalian vertebrate_other\"\nALL_DATABASES=\"refseq genbank taxonomy contaminants\"\nALL_ASSEMBLY_LEVELS=\"Complete\\ Genome Chromosome Scaffold Contig\"\n\n## Option parsing\nDATABASE=\"refseq\"\nASSEMBLY_LEVEL=\"Complete Genome\"\nREFSEQ_CATEGORY=\"\"\nTAXID=\"\"\n\n\nDL_PROG=\"NA\"\nif hash curl 2>/dev/null; then\n  DL_PROG=\"curl\"\nelif hash wget 2>/dev/null; then\n  DL_PROG=\"wget\"\nelif hash rsync 2>/dev/null; then\n  DL_PROG=\"rsync\"\nfi\n\n\nBASE_DIR=\".\"\nN_PROC=1\nCHANGE_HEADER=0\nDOWNLOAD_RNA=0\nDO_DUST=0\nFILTER_UNPLACED=0\nVERBOSE=0\n\nUSAGE=\"\n`basename $0` [<options>] <database>\n\nARGUMENT\n <database>        One of refseq, genbank, contaminants or taxonomy:\n                     - use refseq or genbank for genomic sequences,\n                     - contaminants gets contaminant sequences from UniVec and EmVec,\n                     - taxonomy for taxonomy mappings.\n\nCOMMON OPTIONS\n -o <directory>         Folder to which the files are downloaded. Default: '$BASE_DIR'.\n -P <# of threads>      Number of processes when downloading (uses xargs). Default: '$N_PROC'\n\nWHEN USING database refseq OR genbank:\n -d <domain>            What domain to download. One or more of ${ALL_GENOMES// /, } (comma separated).\n -a <assembly level>    Only download genomes with the specified assembly level. Default: '$ASSEMBLY_LEVEL'. Use 'Any' for any assembly level.\n -c <refseq category>   Only download genomes in the specified refseq category. Default: any.\n -t <taxids>            Only download the specified taxonomy IDs, comma separated. 
Default: any.\n -g <program>           Download using program. Options: rsync, curl, wget. Default $DL_PROG (auto-detected).\n -r                     Download RNA sequences, too.\n -u                     Filter unplaced sequences.\n -m                     Mask low-complexity regions using dustmasker. Default: off.\n -l                     Modify header to include taxonomy ID. Default: off.\n -v                     Verbose mode\n\"\n\n# arguments: $OPTIND (current index), $OPTARG (argument for option), $OPTERR (bash-specific)\nwhile getopts \"o:P:d:a:c:t:g:urlmv\" OPT \"$@\"; do\n    case $OPT in\n        o) BASE_DIR=\"$OPTARG\" ;;\n        P) N_PROC=\"$OPTARG\" ;;\n        d) DOMAINS=${OPTARG//,/ } ;;\n        a) ASSEMBLY_LEVEL=\"$OPTARG\" ;;\n        c) REFSEQ_CATEGORY=\"$OPTARG\" ;;\n        g) DL_PROG=\"$OPTARG\" ;;\n        t) TAXID=\"$OPTARG\" ;;\n        r) DOWNLOAD_RNA=1 ;;\n        u) FILTER_UNPLACED=1 ;;\n        m) DO_DUST=1 ;;\n        v) VERBOSE=1 ;;\n        l) CHANGE_HEADER=1 ;;\n        \\?) 
echo \"Invalid option: -$OPTARG\" >&2 \n            exit 1 \n        ;;\n        :) echo \"Option -$OPTARG requires an argument.\" >&2\n           exit 1\n        ;;\n    esac\ndone\nshift $((OPTIND-1))\n\n[[ $# -eq 1 ]] || { printf \"$USAGE\" >&2 && exit 1; };\nDATABASE=$1\n\n\nif [[ \"$DL_PROG\" == \"rsync\" ]]; then\n    DL_CMD=\"rsync --no-motd\"\n        DL_MODE=\"rsync\"\nelif [[ \"$DL_PROG\" == \"wget\" ]]; then\n    DL_CMD=\"wget -N --reject=index.html -qO\"\n        DL_MODE=\"https\"\nelif [[ \"$DL_PROG\" == \"curl\" ]]; then\n    DL_CMD=\"curl -s -o\"\n        DL_MODE=\"https\"\nelse\n        echo \"Unknown download program - please install one of rsync, wget or curl, and specify it with the -g option\" >&2\n        exit 1\nfi\n\ncecho() {\n  echo $* 1>&2\n  $*\n}\n\nif [[ \"$VERBOSE\" == \"1\" ]]; then\n  export -f cecho\n  DL_CMD=\"cecho $DL_CMD\"\nfi\n\nexport DL_CMD DL_MODE VERBOSE\n\n#### TAXONOMY DOWNLOAD\nFTP=\"ftp://ftp.ncbi.nih.gov\"\nif [[ \"$DATABASE\" == \"taxonomy\" ]]; then \n  echo \"Downloading NCBI taxonomy ... \" >&2\n  if check_or_mkdir_no_fail \"$BASE_DIR\"; then\n    cd \"$BASE_DIR\" > /dev/null\n    if [ $DL_MODE = \"rsync\" ]; then\n        $DL_CMD ${FTP/ftp/rsync}/pub/taxonomy/taxdump.tar.gz taxdump.tar.gz\n    else\n        $DL_CMD taxdump.tar.gz $FTP/pub/taxonomy/taxdump.tar.gz\n    fi\n    tar -zxvf taxdump.tar.gz nodes.dmp\n    tar -zxvf taxdump.tar.gz names.dmp\n    rm taxdump.tar.gz\n    cd - > /dev/null\n  fi\n  exit 0\nfi\n\ndat_to_fna() {\n    grep -E '^DE|^ ' | awk '/^DE/ { sub(/DE */,\">\"); gsub(/[ |]/,\"_\") }; { print }' | awk '/^ / { gsub(/ /,\"\"); sub(/[0-9]*$/,\"\") }; { print }' \n}\n\n#### CONTAMINANT SEQ DOWNLOAD\nif [[ \"$DATABASE\" == \"contaminants\" ]]; then \n  echo \"Downloading contaminant databases ... 
\" >&2\n  CONTAMINANT_TAXID=32630\n  CONTAMINANT_DIR=\"$BASE_DIR/contaminants\"\n  if check_or_mkdir_no_fail \"$CONTAMINANT_DIR\"; then\n    cd \"$CONTAMINANT_DIR\" > /dev/null\n\n    # download UniVec and EmVec database\n    if [ $DL_MODE = \"rsync\" ]; then\n        $DL_CMD rsync://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec UniVec.fna\n        $DL_CMD rsync://ftp.ebi.ac.uk/pub/databases/emvec/emvec.dat.gz emvec.dat.gz\n    else\n        $DL_CMD UniVec.fna ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\n        $DL_CMD emvec.dat.gz ftp://ftp.ebi.ac.uk/pub/databases/emvec/emvec.dat.gz\n    fi\n    gunzip -c emvec.dat.gz | dat_to_fna > EmVec.fna\n \n    if [[ \"$CHANGE_HEADER\" == \"1\" ]]; then\n        sed -i \"s/^>/>taxid|$CONTAMINANT_TAXID /\" UniVec.fna\n        sed -i \"s/^>/>taxid|$CONTAMINANT_TAXID /\" EmVec.fna\n    else \n    cat UniVec.fna | map_headers_to_taxid $CONTAMINANT_TAXID\n    cat EmVec.fna | map_headers_to_taxid $CONTAMINANT_TAXID\n    fi\n\n   #sed ':a $!N; s/^>.*\\n>/>/; P; D' Contaminants/emvec.fa  > Contaminants/emvec.fa\n    rm emvec.dat.gz\n\n    cd - > /dev/null\n    exit 0;\n  fi\nfi\n\n\n\n#### REFSEQ/GENBANK DOWNLOAD\n\nexport LIBDIR=\"$BASE_DIR\"\nexport DO_DUST=\"$DO_DUST\"\nexport CHANGE_HEADER=\"$CHANGE_HEADER\"\n\n## Fields in the assembly_summary.txt file\nREFSEQ_CAT_FIELD=5\nTAXID_FIELD=6\nSPECIES_TAXID_FIELD=7\nVERSION_STATUS_FIELD=11\nASSEMBLY_LEVEL_FIELD=12\nFTP_PATH_FIELD=20\nFTP_PATH_FIELD2=21  ## Needed for wrongly formatted virus files - hopefully just a temporary fix\n\nAWK_QUERY=\"\\$$VERSION_STATUS_FIELD==\\\"latest\\\"\"\n[[ \"$ASSEMBLY_LEVEL\" != \"Any\" ]] && AWK_QUERY=\"$AWK_QUERY && \\$$ASSEMBLY_LEVEL_FIELD==\\\"$ASSEMBLY_LEVEL\\\"\"\n[[ \"$REFSEQ_CATEGORY\" != \"\" ]] && AWK_QUERY=\"$AWK_QUERY && \\$$REFSEQ_CAT_FIELD==\\\"$REFSEQ_CATEGORY\\\"\"\n\nTAXID=${TAXID//,/|}\n[[ \"$TAXID\" != \"\" ]] && AWK_QUERY=\"$AWK_QUERY && match(\\$$TAXID_FIELD,\\\"^($TAXID)\\$\\\")\"\n\n#echo \"$AWK_QUERY\" >&2\n\n#echo \"Downloading 
genomes for $DOMAINS at assembly level $ASSEMBLY_LEVEL\" >&2\nif exists wget; then\n    wget -qO- --no-remove-listing ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/ > /dev/null\nelse\n    curl --disable-epsv -s ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/ > .listing\nfi\n\nif [[ \"$CHANGE_HEADER\" == \"1\" ]]; then\n    echo \"Modifying header to include taxonomy ID\" >&2\nfi\n\n\nfor DOMAIN in $DOMAINS; do\n    if [[ -s .listing ]]; then\n        if [[ ! `grep \"^d.* $DOMAIN\" .listing` ]]; then\n            c_echo \"$DOMAIN is not a valid domain - use one of the following:\" >&2\n            grep '^d' .listing  | sed 's/.* //' | sed 's/^/  /' 1>&2\n            exit 1\n        fi\n    fi\n    \n    #if [[ \"$CHANGE_HEADER\" != \"1\" ]]; then\n        #echo \"Writing taxonomy ID to sequence ID map to STDOUT\" >&2\n        #[[ -s \"$LIBDIR/$DOMAIN.map\" ]] && rm \"$LIBDIR/$DOMAIN.map\"\n    #fi\n\n    export DOMAIN=$DOMAIN\n    check_or_mkdir_no_fail $LIBDIR/$DOMAIN\n\n    FULL_ASSEMBLY_SUMMARY_FILE=\"$LIBDIR/$DOMAIN/assembly_summary.txt\"\n    ASSEMBLY_SUMMARY_FILE=\"$LIBDIR/$DOMAIN/assembly_summary_filtered.txt\"\n\n    echo \"Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt ...\" >&2\n    if [ $DL_MODE = \"rsync\" ]; then\n        $DL_CMD rsync://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt \"$FULL_ASSEMBLY_SUMMARY_FILE\" || {\n            c_echo \"rsync Download failed! Have a look at valid domains at ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE .\" >&2\n            exit 1;\n        }\n    else\n        $DL_CMD \"$FULL_ASSEMBLY_SUMMARY_FILE\" ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE/$DOMAIN/assembly_summary.txt > \"$FULL_ASSEMBLY_SUMMARY_FILE\" || {\n            c_echo \"Download failed! 
Have a look at valid domains at ftp://ftp.ncbi.nlm.nih.gov/genomes/$DATABASE .\" >&2\n            exit 1;\n        }\n    fi\n\n    awk -F \"\\t\" \"BEGIN {OFS=\\\"\\t\\\"} $AWK_QUERY\" \"$FULL_ASSEMBLY_SUMMARY_FILE\" > \"$ASSEMBLY_SUMMARY_FILE\"\n\n    N_EXPECTED=`cat \"$ASSEMBLY_SUMMARY_FILE\" | wc -l`\n    [[ $N_EXPECTED -gt 0 ]] || { echo \"Domain $DOMAIN has no genomes with specified filter.\" >&2; exit 1; }\n\n    if [[ \"$DOMAIN\" == \"viral\" ]]; then\n      ## Wrong columns in viral assembly summary files - the path is sometimes in field 20, sometimes 21\n      cut -f \"$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2\" \"$ASSEMBLY_SUMMARY_FILE\" | \\\n       sed 's/^\\(.*\\)\\t\\(ftp:.*\\)\\t.*/\\1\\t\\2/;s/^\\(.*\\)\\t.*\\t\\(ftp:.*\\)/\\1\\t\\2/' | \\\n      sed 's#\\([^/]*\\)$#\\1/\\1_genomic.fna.gz#' |\\\n         tr '\\n' '\\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail \"$@\"' _ | count $N_EXPECTED\n    else\n      echo \"Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)\" >&2\n      cut -f \"$TAXID_FIELD,$FTP_PATH_FIELD\" \"$ASSEMBLY_SUMMARY_FILE\" | sed 's#\\([^/]*\\)$#\\1/\\1_genomic.fna.gz#' |\\\n         tr '\\n' '\\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail \"$@\"' _ | count $N_EXPECTED\n    fi\n    echo >&2\n\n\n\n    if [[ \"$DOWNLOAD_RNA\" == \"1\" && ! `echo $DOMAIN | egrep 'bacteria|viral|archaea'` ]]; then\n        echo \"Downloading RNA sequence files\" >&2\n        cut -f $TAXID_FIELD,$FTP_PATH_FIELD  \"$ASSEMBLY_SUMMARY_FILE\"| sed 's#\\([^/]*\\)$#\\1/\\1_rna.fna.gz#' |\\\n            tr '\\n' '\\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail \"$@\"' _ | count $N_EXPECTED\n        echo >&2\n    fi\ndone\n"
  },
  {
    "path": "centrifuge-inspect",
    "content": "#!/usr/bin/env python\n\n\"\"\"\n Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n\n This file is part of Centrifuge, which is copied and modified from bowtie2-inspect in the Bowtie2 package.\n\n Centrifuge is free software: you can redistribute it and/or modify\n it under the terms of the GNU General Public License as published by\n the Free Software Foundation, either version 3 of the License, or\n (at your option) any later version.\n\n Centrifuge is distributed in the hope that it will be useful,\n but WITHOUT ANY WARRANTY; without even the implied warranty of\n MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n GNU General Public License for more details.\n\n You should have received a copy of the GNU General Public License\n along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.\n\"\"\"\n\n\nimport os\nimport imp\nimport inspect\nimport logging\n\n\n\ndef main():\n    logging.basicConfig(level=logging.ERROR,\n                        format='%(levelname)s: %(message)s'\n                        )\n    inspect_bin_name      = \"centrifuge-inspect\"\n    curr_script           = os.path.realpath(inspect.getsourcefile(main))\n    ex_path               = os.path.dirname(curr_script)\n    inspect_bin_spec      = os.path.join(ex_path, \"centrifuge-inspect-bin\")\n    bld                   = imp.load_source('centrifuge-build', os.path.join(ex_path,'centrifuge-build'))\n    options,arguments     = bld.build_args()\n\n    if '--verbose' in options:\n        logging.getLogger().setLevel(logging.INFO)\n        \n    if '--debug' in options:\n        inspect_bin_spec += '-debug'\n    \n    arguments[0] = inspect_bin_name\n    arguments.insert(1, 'basic-0')\n    arguments.insert(1, '--wrapper')\n    logging.info('Command: %s %s' %  (inspect_bin_spec,' '.join(arguments[1:])))\n    os.execv(inspect_bin_spec, arguments)        \n        \n        \nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "centrifuge-kreport",
    "content": "#!/usr/bin/env perl\n\n# Give a Kraken-style report from a Centrifuge output\n#\n# Based on kraken-report by Derrick Wood\n# Copyright 2013-2016, Derrick Wood <dwood@cs.jhu.edu>\n#\n\nuse strict;\nuse warnings;\nuse Getopt::Long;\nuse File::Basename;\nuse Cwd;\nuse Cwd 'cwd' ;\nuse Cwd 'abs_path' ;\n\nmy ($centrifuge_index, $min_score, $min_length);\nmy $no_lca = 0;\nmy $show_zeros = 0;\nmy $is_cnts_table = 0;\nmy $PROG = \"centrifuge-kreport\";\nmy $CWD = dirname( abs_path( $0 ) ) ;\n\nGetOptions(\n  \"help\" => \\&display_help,\n  \"x=s\" => \\$centrifuge_index,\n  \"show-zeros\" => \\$show_zeros,\n  \"is-count-table\" => \\$is_cnts_table,\n  \"min-score=i\" => \\$min_score,\n  \"min-length=i\"=> \\$min_length,\n  \"no-lca\" => \\$no_lca\n) or usage();\n\nusage() unless defined $centrifuge_index;\nif (!defined $ARGV[0]) {\n    print STDERR \"Reading centrifuge out file from STDIN ... \\n\";\n}\n\nsub usage {\n  my $exit_code = @_ ? shift : 64;\n  print STDERR \"\nUsage: centrifuge-kreport -x <index name> OPTIONS <centrifuge output file(s)>\n\ncentrifuge-kreport creates Kraken-style reports from centrifuge out files.\n\nOptions:\n    -x INDEX            (REQUIRED) Centrifuge index\n\n    --no-lca             Do not report the LCA of multiple assignments, but report count fractions at the taxa.\n    --show-zeros         Show clades that have zero reads, too\n    --is-count-table     The format of the file is 'taxID<tab>COUNT' instead of the standard\n                         Centrifuge output format\n\n    --min-score SCORE    Require a minimum score for reads to be counted\n    --min-length LENGTH  Require a minimum alignment length to the read\n  \n  \";\n  exit $exit_code;\n}\n\nsub display_help {\n  usage(0);\n}\n\nmy (%child_lists, %name_map, %rank_map, %parent_map);\nprint STDERR \"Loading taxonomy ...\\n\";\nload_taxonomy();\n\nmy %taxo_counts;\nmy $seq_count = 0;\n$taxo_counts{0} = 0;\nif ($is_cnts_table) {\n  while (<>) {\n    my 
($taxID,$count) = split;\n    $taxo_counts{$taxID} = $count;\n    $seq_count += $count;\n  }\n} else {\n  chomp(my $header = <>);\n  my @cols = split /\\t/, $header ;\n  my %headerMap ;\n  for ( my $i = 0 ; $i < scalar( @cols ) ; ++$i )\n  {\n  \t$headerMap{ $cols[$i] } = $i ;\n  }\n\n  my $prevReadID;\n  my $prevTaxID;\n  while (<>) {\n    #my (undef,$seqID,$taxID,$score, undef, $hitLength, $queryLength, $numMatches) = split /\\t/;\n    my @cols = split /\\t/ ;\n    my $readID = $cols[ $headerMap{ \"readID\" } ] ;\n    my $seqID = $cols[ $headerMap{ \"seqID\" } ] ; \n    my $taxID = $cols[ $headerMap{ \"taxID\" } ] ; \n    my $score = $cols[ $headerMap{ \"score\" } ] ; \n    my $hitLength = $cols[ $headerMap{ \"hitLength\" } ] ; \n    my $queryLength = $cols[ $headerMap{ \"queryLength\" } ] ; \n    my $numMatches = $cols[ $headerMap{ \"numMatches\" } ] ; \n\n    $taxID = 1 if ( !isTaxIDInTree( $taxID ) ) ;\n\n    if ($no_lca) {\n      next if defined $min_length && $hitLength < $min_length;\n      next if defined $min_score && $score < $min_score;\n      $taxo_counts{$taxID} += 1/$numMatches;\n      $seq_count += 1/$numMatches;\n    } else {\n      if ( ( defined $prevReadID ) && ( $readID eq $prevReadID ) ) {\n        --$taxo_counts{$prevTaxID};\n        $prevTaxID = lca($prevTaxID, $taxID);\n        ++$taxo_counts{$prevTaxID};\n      } else {\n        ++$taxo_counts{$taxID};\n        ++$seq_count;\n        $prevTaxID = $taxID;\n      }\n    }\n    $prevReadID = $readID ;\n  }\n}\nmy $classified_count = $seq_count - $taxo_counts{0};\n\nmy %clade_counts = %taxo_counts;\ndfs_summation(1);\n\nfor (keys %name_map) {\n  $clade_counts{$_} ||= 0;\n}\n\ndie \"No sequence matches with given settings\" unless $seq_count > 0;\n\nprintf \"%6.2f\\t%d\\t%d\\t%s\\t%d\\t%s%s\\n\",\n  $clade_counts{0} * 100 / $seq_count,\n  $clade_counts{0}, $taxo_counts{0}, \"U\",\n  0, \"\", \"unclassified\";\ndfs_report(1, 0);\n\nsub dfs_report {\n  my $node = shift;\n  my $depth = shift;\n  
if (! $clade_counts{$node} && ! $show_zeros) {\n    return;\n  }\n  printf \"%6.2f\\t%d\\t%d\\t%s\\t%d\\t%s%s\\n\",\n    ($clade_counts{$node} || 0) * 100 / $seq_count,\n    ($clade_counts{$node} || 0),\n    ($taxo_counts{$node} || 0),\n    rank_code($rank_map{$node}),\n    $node,\n    \"  \" x $depth,\n    $name_map{$node};\n  my $children = $child_lists{$node};\n  if ($children) {\n    my @sorted_children = sort { $clade_counts{$b} <=> $clade_counts{$a} } @$children;\n    for my $child (@sorted_children) {\n      dfs_report($child, $depth + 1);\n    }\n  }\n}\n\nsub isTaxIDInTree {\n\tmy $a = $_[0] ;\n\n\twhile ( $a > 1 )\n\t{\n\t\tif ( !defined $parent_map{ $a } )\n\t\t{\n\t\t\tprint STDERR \"Couldn't find parent of taxID $a - directly assigned to root.\\n\";\n\t\t\treturn 0 ;\n\t\t}\n\t\tlast if ( $a eq $parent_map{$a} ) ;\n\t\t$a = $parent_map{ $a } ;\n\t}\n\treturn 1 ;\n}\n\nsub lca {\n  my ($a, $b) = @_;\n  return $b if $a eq 0;\n  return $a if $b eq 0;\n  return $a if $a eq $b;\n  my %a_path;\n  while ($a ge 1) {\n    $a_path{$a} = 1;\n    if (!defined $parent_map{$a}) {\n      print STDERR \"Couldn't find parent of taxID $a - directly assigned to root.\\n\";\n      last;\n    }\n    last if $a eq $parent_map{$a};\n    $a = $parent_map{$a};\n  }\n  while ($b > 1) {\n    return $b if (defined $a_path{$b});\n    if (!defined $parent_map{$b}) {\n      print STDERR \"Couldn't find parent of taxID $b - directly assigned to root.\\n\";\n      last;\n    }\n    last if $b eq $parent_map{$b};\n    $b = $parent_map{$b};\n  }\n  return 1;\n}\n\nsub rank_code {\n  my $rank = shift;\n  for ($rank) {\n    $_ eq \"species\" and return \"S\";\n    $_ eq \"genus\" and return \"G\";\n    $_ eq \"family\" and return \"F\";\n    $_ eq \"order\" and return \"O\";\n    $_ eq \"class\" and return \"C\";\n    $_ eq \"phylum\" and return \"P\";\n    $_ eq \"kingdom\" and return \"K\";\n    $_ eq \"superkingdom\" and return \"D\";\n  }\n  return \"-\";\n}\n\nsub dfs_summation {\n  
my $node = shift;\n  my $children = $child_lists{$node};\n  if ($children) {\n    for my $child (@$children) {\n      dfs_summation($child);\n      $clade_counts{$node} += ($clade_counts{$child} || 0);\n    }\n  }\n}\n\nsub load_taxonomy {\n\n  print STDERR \"Loading names file ...\\n\";\n  open NAMES, \"-|\", \"$CWD/centrifuge-inspect --name-table $centrifuge_index\"\n    or die \"$PROG: can't open names file: $!\\n\";\n  while (<NAMES>) {\n    chomp;\n    s/\\t\\|$//;\n    my @fields = split /\\t/;\n    my ($node_id, $name) = @fields[0,1];\n    $name_map{$node_id} = $name;\n  }\n  close NAMES;\n\n  print STDERR \"Loading nodes file ...\\n\";\n  open NODES, \"-|\", \"$CWD/centrifuge-inspect --taxonomy-tree $centrifuge_index\"\n    or die \"$PROG: can't open nodes file: $!\\n\";\n  while (<NODES>) {\n    chomp;\n    my @fields = split /\\t\\|\\t/;\n    my ($node_id, $parent_id, $rank) = @fields[0,1,2];\n    if ($node_id == 1) {\n      $parent_id = 0;\n    }\n    $child_lists{$parent_id} ||= [];\n    push @{ $child_lists{$parent_id} }, $node_id;\n    $rank_map{$node_id} = $rank;\n    $parent_map{$node_id} = $parent_id;\n  }\n  close NODES;\n}\n"
  },
  {
    "path": "centrifuge-promote",
    "content": "#!/usr/bin/env perl\n\nuse strict ;\nuse warnings ;\n\nuse File::Basename;\nuse Cwd;\nuse Cwd 'cwd' ;\nuse Cwd 'abs_path' ;\n\n\ndie \"Usage: centrifuge-promote centrifuge_index_name centrifuge_output level > output\\n\\n\".\n\t\"Promote the taxonomy id to specified level in Centrifuge output.\\n\".\n\t\"\\tIf level equals \\\"lca\\\", this will merge the multiassignment to their lowest common ancestor.\\n\" if ( @ARGV == 0 ) ;\n\nmy $CWD = dirname( abs_path( $0 ) ) ;\n# Go through the index to obtain the taxonomy tree\nmy %taxParent ; \nmy %taxIdToSeqId ;\nmy %taxLevel ;\n\nmy $centrifuge_index = $ARGV[0] ;\nopen FP1, \"-|\", \"$CWD/centrifuge-inspect --taxonomy-tree $centrifuge_index\" or die \"can't open $!\\n\" ;\nwhile ( <FP1> )\n{\n\tchomp ;\n\tmy @cols = split /\\t\\|\\t/;\n\t$taxParent{ $cols[0] } = $cols[1] ;\n\t$taxLevel{ $cols[0] } = $cols[2] ;\n}\nclose FP1 ;\nopen FP1, \"-|\", \"$CWD/centrifuge-inspect --conversion-table $centrifuge_index\" or die \"can't open $!\\n\" ;\nwhile ( <FP1> )\n{\n\tchomp ;\n\tmy @cols = split /\\t/ ;\n\t$taxIdToSeqId{ $cols[1] } = $cols[0] ;\n}\nclose FP1 ;\n\n# Go through the output of centrifuge\nmy $level = $ARGV[2] ;\nsub PromoteTaxId\n{\n\tmy $tid = $_[0] ;\n\treturn 0 if ( $tid <= 0 || !defined( $taxLevel{ $tid } ) ) ;\n\n\tif ( $taxLevel{ $tid } eq $level )\n\t{\n\t\treturn $tid ;\n\t}\n\telse\n\t{\n\t\treturn 0 if ( $tid <= 1 ) ;\n\t\treturn PromoteTaxId( $taxParent{ $tid } ) ;\n\t}\n}\n\nsub lca \n{\n\tmy ($a, $b) = @_;\n\treturn $b if $a eq 0;\n\treturn $a if $b eq 0;\n\treturn $a if $a eq $b;\n\tmy %a_path;\n\twhile ($a ge 1) \n\t{\n\t\t$a_path{$a} = 1;\n\t\tif (!defined $taxParent{$a}) {\n\t\t\tprint STDERR \"Couldn't find parent of taxID $a - directly assigned to root.\\n\";\n\t\t\tlast;\n\t\t}\n\t\tlast if $a eq $taxParent{$a};\n\t\t$a = $taxParent{$a};\n\t}\n\n\twhile ($b > 1) \n\t{\n\t\treturn $b if (defined $a_path{$b});\n\t\tif (!defined $taxParent{$b}) {\n\t\t\tprint STDERR \"Couldn't find 
parent of taxID $b - directly assigned to root.\\n\";\n\t\t\tlast;\n\t\t}\n\t\tlast if $b eq $taxParent{$b};\n\t\t$b = $taxParent{$b};\n\t}\n\treturn 1;\n}\n\nsub OutputPromotedLines\n{\n\tmy @lines = @{ $_[0] } ;\n\treturn if ( scalar( @lines ) <= 0 ) ;\n\n\tmy @newLines ;\n\tmy $i ;\n\tmy $numMatches = 0 ;\n\tmy %showedUpTaxId ;\n\tmy $tab = sprintf( \"\\t\" ) ;\n\n\tif ( $level ne \"lca\" )\n\t{\n\t\tfor ( $i = 0 ; $i < scalar( @lines ) ; ++$i )\n\t\t{\n\t\t\tmy @cols = split /\\t+/, $lines[ $i ] ;\n\t\t\tmy $newTid = PromoteTaxId( $cols[2] ) ;\n\t\t\tif ( $newTid <= 1 )\n\t\t\t{\n\t\t\t\t$newTid = $cols[2] ;\n\t\t\t}\n\t\t\tmy $newLevel = $cols[1] ;\n\t\t\t$newLevel = $taxLevel{ $newTid } if ( $newTid >= 1 && defined $taxLevel{ $newTid } ) ;\n\n\t\t\tnext if ( defined $showedUpTaxId{ $newTid } ) ;\n\n\t\t\t$showedUpTaxId{ $newTid } = 1 ;\t\n\t\t\t++$numMatches ;\n\n\t\t\t$cols[2] = $newTid ;\n\t\t\t$cols[1] = $newLevel ;\n\t\t\tpush @newLines, join( $tab, @cols ) ;\n\t\t}\n\t}\n\telse\n\t{\n\t\t$numMatches = 1 ;\n\t\tmy @cols = split /\\t+/, $lines[0] ;\n\t\tmy $l = $cols[2] ;\n\t\tfor ( $i = 1 ; $i < scalar( @lines ) ; ++$i )\n\t\t{\n\t\t\t@cols = split /\\t+/, $lines[ $i ] ;\n\t\t\t\n\t\t\t$l = lca( $l, $cols[2] ) ;\n\t\t}\n\t\t\n\t\t@cols = split /\\t+/, $lines[0] ;\n\t\t$cols[1] = $taxLevel{ $l } if ( $l ne $cols[2] ) ;\n\t\t$cols[2] = $l ;\n\t\tpush @newLines, join( $tab, @cols ) ;\n\t}\n\n\tfor ( $i = 0 ; $i < scalar( @newLines ) ; ++$i )\n\t{\n\t\tmy @cols = split /\\t+/, $newLines[$i] ;\n\t\t$cols[-1] = $numMatches ;\n\t\tprint join( $tab, @cols ), \"\\n\" ;\n\t}\n}\n\nopen FP1, $ARGV[1] ;\nmy $header = <FP1> ;\nmy $prevReadId = \"\" ;\nmy @lines ;\n\nprint $header ;\nwhile ( <FP1> )\n{\n\tchomp ;\n\tmy @cols = split /\\t/ ;\n\tif ( $cols[0] eq $prevReadId )\n\t{\n\t\tpush @lines, $_ ;\n\t}\n\telse\n\t{\n\t\t$prevReadId = $cols[0] ;\n\t\t\t\t\n\t\tOutputPromotedLines( \\@lines ) ;\n\n\t\tundef @lines ;\n\t\tpush @lines, $_ 
;\n\t}\n}\nOutputPromotedLines( \\@lines )  ;\nclose FP1 ;\n"
  },
  {
    "path": "centrifuge-sort-nt.pl",
    "content": "#! /usr/bin/env perl\n#\n# Sort nt file sequences according to their taxonomy ID\n# Uses the new mappinf file format available at\n# ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/\n#\n# Author fbreitwieser <fbreitwieser@sherman>\n# Version 0.2\n# Copyright (C) 2016 fbreitwieser <fbreitwieser@sherman>\n# Modified On 2016-09-26\n# Created  2016-02-28 12:56\n#\nuse strict;\nuse warnings;\nuse File::Basename;\nuse Getopt::Long;\n\nmy $new_map_file;\nmy $ac_wo_mapping_file;\nmy $opt_help;\n\nmy $USAGE = \n\"USAGE: \".basename($0).\" OPTIONS <sequence file> <mapping file> [<mapping file>*]\\n\n\nOPTIONS:\n  -m str      Output mappings that are present in sequence file to file str\n  -a str      Output ACs w/o mapping to file str\n  -h          This message\n\";\n\nGetOptions(\n\t\"m|map=s\" => \\$new_map_file,\n\t\"a=s\" => \\$ac_wo_mapping_file,\n\t\"h|help\" => \\$opt_help) or die \"Error in command line arguments\";\n\nscalar(@ARGV) >= 2 or die $USAGE;\nif (defined $opt_help) {\n\tprint STDERR $USAGE;\n\texit(0);\n}\n\nmy ($nt_file, @ac_taxid_files) = @ARGV;\n\nmy %ac_to_pos;\nmy %taxid_to_ac;\nmy %ac_to_taxid;\n\nprint STDERR \"Reading headers from $nt_file ... \";\nopen(my $NT, \"<\", $nt_file) or die $!;\nwhile (<$NT>) {\n    # get the headers with (!) 
the version number\n    if (/(^>([^ ]*).*)/) {\n        # record the position of this AC\n        $ac_to_pos{$2} = [tell($NT),$1];\n    }\n}\nprint STDERR \"found \", scalar(keys %ac_to_pos), \" ACs\\n\";\n\nforeach my $ac_taxid_file (@ac_taxid_files) {\nprint STDERR \"Reading ac to taxid mapping from $ac_taxid_file ...\\n\";\n    my $FP1;\n    if ($ac_taxid_file =~ /\\.gz$/) {\n        open($FP1, \"-|\", \"gunzip -c '$ac_taxid_file'\") or die $!;\n    } else {\n        open($FP1, \"<\", $ac_taxid_file) or die $!;\n    }\n\n    # format: accession <tab> accession.version <tab> taxid <tab> gi\n    # currently we look for a mapping with the version number\n    while ( <$FP1> ) {\n    \tmy (undef, $ac, $taxid) = split;\n        next unless defined $taxid;\n\t    if ( defined( $ac_to_pos{ $ac } ) ) {\n            push @{ $taxid_to_ac{ $taxid } }, $ac;\n\t\t    $ac_to_taxid{ $ac } = $taxid;\n        }\n    }\n    close $FP1;\n}\nprint STDERR \"Got taxonomy mappings for \", scalar(keys %ac_to_taxid), \" ACs\\n\";\nif (defined $ac_wo_mapping_file && scalar(keys %ac_to_taxid) < scalar(keys %ac_to_pos)) {\n    print STDERR \"Writing ACs without taxonomy mapping to $ac_wo_mapping_file\\n\";\n    open(my $FP2, \">\", $ac_wo_mapping_file) or die $!;\n    foreach my $ac (keys %ac_to_pos) {\n        next if defined $ac_to_taxid{$ac};\n        print $FP2 $ac, \"\\n\";\n    }\n    close($FP2);\n}\n\nif (defined $new_map_file) {\nprint STDERR \"Writing taxonomy ID mapping to $new_map_file\\n\";\nopen(my $FP3, \">\", $new_map_file) or die $!;\nforeach my $ac (keys %ac_to_taxid) {\n    print $FP3 $ac,\"\\t\",$ac_to_taxid{$ac},\"\\n\";\n}\nclose($FP3);\n}\n\n\nprint STDERR \"Outputting sorted FASTA ...\\n\";\nforeach my $taxid (sort {$a <=> $b} keys %taxid_to_ac) {\n    my @acs = @{$taxid_to_ac{$taxid}};\n    my @sorted_acs = sort { $ac_to_pos{$a}->[0] <=> $ac_to_pos{$b}->[0] } @acs;\n    foreach (@sorted_acs) {\n        print $ac_to_pos{$_}->[1],\"\\n\";\n        seek($NT, 
$ac_to_pos{$_}->[0], 0);\n        while (<$NT>) {\n            last if (/^>/);\n            print $_;\n        }\n    }\n}\nclose $NT;\n"
  },
  {
    "path": "centrifuge.cpp",
    "content": "/*\n * Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n *\n * This file is part of Centrifuge.\n *\n * Centrifuge is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Centrifuge is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <stdlib.h>\n#include <iostream>\n#include <fstream>\n#include <string>\n#include <cassert>\n#include <stdexcept>\n#include <getopt.h>\n#include <math.h>\n#include <utility>\n#include <limits>\n#include <map>\n#include \"alphabet.h\"\n#include \"assert_helpers.h\"\n#include \"endian_swap.h\"\n#include \"bt2_idx.h\"\n#include \"bt2_io.h\"\n#include \"bt2_util.h\"\n#include \"hier_idx.h\"\n#include \"formats.h\"\n#include \"sequence_io.h\"\n#include \"tokenize.h\"\n#include \"aln_sink.h\"\n#include \"pat.h\"\n#include \"threading.h\"\n#include \"ds.h\"\n#include \"aligner_metrics.h\"\n#include \"aligner_seed_policy.h\"\n#include \"classifier.h\"\n#include \"util.h\"\n#include \"pe.h\"\n#include \"simple_func.h\"\n#include \"presets.h\"\n#include \"opts.h\"\n#include \"outq.h\"\n\nusing namespace std;\n\nstatic EList<string> mates1;  // mated reads (first mate)\nstatic EList<string> mates2;  // mated reads (second mate)\nstatic EList<string> mates12; // mated reads (1st/2nd interleaved in 1 file)\nstatic string adjIdxBase;\nbool gColor;              // colorspace (not supported)\nint gVerbose;             // be talkative\nstatic bool startVerbose; // be talkative at startup\nint gQuiet;             
  // print nothing but the alignments\nstatic int sanityCheck;   // enable expensive sanity checks\nstatic int format;        // default read format is FASTQ\nstatic string origString; // reference text, or filename(s)\nstatic int seed;          // srandom() seed\nstatic int timing;        // whether to report basic timing data\nstatic int metricsIval;   // interval between alignment metrics messages (0 = no messages)\nstatic string metricsFile;// output file to put alignment metrics in\nstatic bool metricsStderr;// output file to put alignment metrics in\nstatic bool metricsPerRead; // report a metrics tuple for every read\nstatic bool allHits;      // for multihits, report just one\nstatic bool showVersion;  // just print version and quit?\nstatic int ipause;        // pause before maching?\nstatic uint32_t qUpto;    // max # of queries to read\nint gTrim5;               // amount to trim from 5' end\nint gTrim3;               // amount to trim from 3' end\nstatic int offRate;       // keep default offRate\nstatic bool solexaQuals;  // quality strings are solexa quals, not phred, and subtract 64 (not 33)\nstatic bool phred64Quals; // quality chars are phred, but must subtract 64 (not 33)\nstatic bool integerQuals; // quality strings are space-separated strings of integers, not ASCII\nstatic int nthreads;      // number of pthreads operating concurrently\nstatic int outType;       // style of output\nstatic bool noRefNames;   // true -> print reference indexes; not names\nstatic uint32_t khits;    // number of hits per read; >1 is much slower\nstatic uint32_t mhits;    // don't report any hits if there are > mhits\nstatic int partitionSz;   // output a partitioning key in first field\nstatic bool useSpinlock;  // false -> don't use of spinlocks even if they're #defines\nstatic bool fileParallel; // separate threads read separate input files in parallel\nstatic bool useShmem;     // use shared memory to hold the index\nstatic bool useMm;        // use memory-mapped 
files to hold the index\nstatic bool mmSweep;      // sweep through memory-mapped files immediately after mapping\nint gMinInsert;           // minimum insert size\nint gMaxInsert;           // maximum insert size\nbool gMate1fw;            // -1 mate aligns in fw orientation on fw strand\nbool gMate2fw;            // -2 mate aligns in rc orientation on fw strand\nbool gFlippedMatesOK;     // allow mates to be in wrong order\nbool gDovetailMatesOK;    // allow one mate to extend off the end of the other\nbool gContainMatesOK;     // allow one mate to contain the other in PE alignment\nbool gOlapMatesOK;        // allow mates to overlap in PE alignment\nbool gExpandToFrag;       // incr max frag length to =larger mate len if necessary\nbool gReportDiscordant;   // find and report discordant paired-end alignments\nbool gReportMixed;        // find and report unpaired alignments for paired reads\nstatic uint32_t cacheLimit;      // ranges w/ size > limit will be cached\nstatic uint32_t cacheSize;       // # words per range cache\nstatic uint32_t skipReads;       // # reads/read pairs to skip\nbool gNofw; // don't align fw orientation of read\nbool gNorc; // don't align rc orientation of read\nstatic uint32_t fastaContLen;\nstatic uint32_t fastaContFreq;\nstatic bool hadoopOut; // print Hadoop status and summary messages\nstatic bool fuzzy;\nstatic bool fullRef;\nstatic bool samTruncQname; // whether to truncate QNAME to 255 chars\nstatic bool samOmitSecSeqQual; // omit SEQ/QUAL for 2ndary alignments?\nstatic bool samNoUnal; // don't print records for unaligned reads\nstatic bool samNoHead; // don't print any header lines in SAM output\nstatic bool samNoSQ;   // don't print @SQ header lines\nstatic bool sam_print_as;\nstatic bool sam_print_xs;  // XS:i\nstatic bool sam_print_xss; // Xs:i and Ys:i\nstatic bool sam_print_yn;  // YN:i and Yn:i\nstatic bool sam_print_xn;\nstatic bool sam_print_cs;\nstatic bool sam_print_cq;\nstatic bool sam_print_x0;\nstatic bool 
sam_print_x1;\nstatic bool sam_print_xm;\nstatic bool sam_print_xo;\nstatic bool sam_print_xg;\nstatic bool sam_print_nm;\nstatic bool sam_print_md;\nstatic bool sam_print_yf;\nstatic bool sam_print_yi;\nstatic bool sam_print_ym;\nstatic bool sam_print_yp;\nstatic bool sam_print_yt;\nstatic bool sam_print_ys;\nstatic bool sam_print_zs;\nstatic bool sam_print_xr;\nstatic bool sam_print_xt;\nstatic bool sam_print_xd;\nstatic bool sam_print_xu;\nstatic bool sam_print_yl;\nstatic bool sam_print_ye;\nstatic bool sam_print_yu;\nstatic bool sam_print_xp;\nstatic bool sam_print_yr;\nstatic bool sam_print_zb;\nstatic bool sam_print_zr;\nstatic bool sam_print_zf;\nstatic bool sam_print_zm;\nstatic bool sam_print_zi;\nstatic bool sam_print_zp;\nstatic bool sam_print_zu;\nstatic bool sam_print_xs_a;\nstatic bool bwaSwLike;\nstatic float bwaSwLikeC;\nstatic float bwaSwLikeT;\nstatic bool qcFilter;\nstatic bool sortByScore;      // prioritize alignments to report by score?\nbool gReportOverhangs;        // false -> filter out alignments that fall off the end of a reference sequence\nstatic string rgid;           // ID: setting for @RG header line\nstatic string rgs;            // SAM outputs for @RG header line\nstatic string rgs_optflag;    // SAM optional flag to add corresponding to @RG ID\nstatic bool msample;          // whether to report a random alignment when maxed-out via -m/-M\nint      gGapBarrier;         // # diags on top/bot only to be entered diagonally\nstatic EList<string> qualities;\nstatic EList<string> qualities1;\nstatic EList<string> qualities2;\nstatic string polstr;         // temporary holder for policy string\nstatic bool  msNoCache;       // true -> disable local cache\nstatic int   bonusMatchType;  // how to reward matches\nstatic int   bonusMatch;      // constant reward if bonusMatchType=constant\nstatic int   penMmcType;      // how to penalize mismatches\nstatic int   penMmcMax;       // max mm penalty\nstatic int   penMmcMin;       // min mm 
penalty\nstatic int   penNType;        // how to penalize Ns in the read\nstatic int   penN;            // constant if N penalty is a constant\nstatic bool  penNCatPair;     // concatenate mates before N filtering?\nstatic bool  localAlign;      // do local alignment in DP steps\nstatic bool  noisyHpolymer;   // set to true if gap penalties should be reduced to be consistent with a sequencer that under- and overcalls homopolymers\nstatic int   penRdGapConst;   // constant cost of extending a gap in the read\nstatic int   penRfGapConst;   // constant cost of extending a gap in the reference\nstatic int   penRdGapLinear;  // coeff of linear term for cost of gap extension in read\nstatic int   penRfGapLinear;  // coeff of linear term for cost of gap extension in ref\nstatic SimpleFunc scoreMin;   // minimum valid score as function of read len\nstatic SimpleFunc nCeil;      // max # Ns allowed as function of read len\nstatic SimpleFunc msIval;     // interval between seeds as function of read len\nstatic double descConsExp;    // how to adjust score minimum as we descend further into index-assisted alignment\nstatic size_t descentLanding; // don't place a search root if it's within this many positions of end\nstatic SimpleFunc descentTotSz;    // maximum space a DescentDriver can use in bytes\nstatic SimpleFunc descentTotFmops; // maximum # FM ops a DescentDriver can perform\nstatic int    multiseedMms;   // mismatches permitted in a multiseed seed\nstatic int    multiseedLen;   // length of multiseed seeds\nstatic size_t multiseedOff;   // offset to begin extracting seeds\nstatic uint32_t seedCacheLocalMB;   // # MB to use for non-shared seed alignment cacheing\nstatic uint32_t seedCacheCurrentMB; // # MB to use for current-read seed hit cacheing\nstatic uint32_t exactCacheCurrentMB; // # MB to use for current-read seed hit cacheing\nstatic size_t maxhalf;        // max width on one side of DP table\nstatic bool seedSumm;         // print summary information about 
seed hits, not alignments\nstatic bool doUngapped;       // do ungapped alignment\nstatic size_t maxIters;       // stop after this many extend loop iterations\nstatic size_t maxUg;          // stop after this many ungap extends\nstatic size_t maxDp;          // stop after this many DPs\nstatic size_t maxItersIncr;   // amt to add to maxIters for each -k > 1\nstatic size_t maxEeStreak;    // stop after this many end-to-end fails in a row\nstatic size_t maxUgStreak;    // stop after this many ungap fails in a row\nstatic size_t maxDpStreak;    // stop after this many dp fails in a row\nstatic size_t maxStreakIncr;  // amt to add to streak for each -k > 1\nstatic size_t maxMateStreak;  // stop seed range after this many mate-find fails\nstatic bool doExtend;         // extend seed hits\nstatic bool enable8;          // use 8-bit SSE where possible?\nstatic size_t cminlen;        // longer reads use checkpointing\nstatic size_t cpow2;          // checkpoint interval log2\nstatic bool doTri;            // do triangular mini-fills?\nstatic string defaultPreset;  // default preset; applied immediately\nstatic bool ignoreQuals;      // all mms incur same penalty, regardless of qual\nstatic string wrapper;        // type of wrapper script, so we can print correct usage\nstatic EList<string> queries; // list of query files\nstatic string outfile;        // write SAM output to this file\nstatic int mapqv;             // MAPQ calculation version\nstatic int tighten;           // -M tighten mode (0=none, 1=best, 2=secbest+1)\nstatic bool doExactUpFront;   // do exact search up front if seeds seem good enough\nstatic bool do1mmUpFront;     // do 1mm search up front if seeds seem good enough\nstatic size_t do1mmMinLen;    // length below which we disable 1mm e2e search\nstatic int seedBoostThresh;   // if average non-zero position has more than this many elements\nstatic size_t nSeedRounds;    // # seed rounds\nstatic bool reorder;          // true -> reorder SAM recs in -p 
mode\nstatic float sampleFrac;      // only align random fraction of input reads\nstatic bool arbitraryRandom;  // pseudo-randoms no longer a function of read properties\nstatic bool bowtie2p5;\n\nstatic string bt2index;      // read Bowtie 2 index from files with this prefix\nstatic EList<pair<int, string> > extra_opts;\nstatic size_t extra_opts_cur;\n\nstatic EList<uint64_t> thread_rids;\nstatic MUTEX_T         thread_rids_mutex;\n\nstatic uint32_t minHitLen;   // minimum length of partial hits\nstatic string reportFile;    // file name of species report file\nstatic uint32_t minTotalLen; // minimum summed length of partial hits per read\nstatic bool abundance_analysis;\nstatic bool tree_traverse;\nstatic string classification_rank;\nstatic EList<uint64_t> host_taxIDs;\nstatic EList<uint64_t> excluded_taxIDs;\n\n\nstatic string tab_fmt_col_def;\nstatic bool sam_format;\nstatic EList<uint32_t> tab_fmt_cols;\nstatic EList<string> tab_fmt_cols_str;\nstatic map<string, uint32_t> col_name_map;\n\n#ifdef USE_SRA\nstatic EList<string> sra_accs;\n#endif\n\nstatic bool separator ; // whether we want to add separator between classification output and change the output of centrifuge_report\n\n#define DMAX std::numeric_limits<double>::max()\n\n\nstatic void parse_col_fmt(const string arg, EList<string>& tab_fmt_cols_str, EList<uint32_t>& tab_fmt_cols) {\n    tab_fmt_cols.clear();\n    tab_fmt_cols_str.clear();\n    tokenize(arg, \",\", tab_fmt_cols_str);\n    for(size_t i = 0; i < tab_fmt_cols_str.size(); i++) {\n        map<string, uint32_t>::iterator it = col_name_map.find(tab_fmt_cols_str[i]);\n        if (it == col_name_map.end()) {\n            cerr << \"Column definition \"  << tab_fmt_cols_str[i] << \" invalid.\" << endl;\n            exit(1);\n        }\n        tab_fmt_cols.push_back(it->second);\n    }\n\n}\n\n\n\nstatic void resetOptions() {\n\n#ifndef NDEBUG\n\tcerr << \"Setting standard options\" << 
endl;\n#endif\n\n\tmates1.clear();\n\tmates2.clear();\n\tmates12.clear();\n\tadjIdxBase\t            = \"\";\n\tgColor                  = false;\n\tgVerbose                = 0;\n\tstartVerbose\t\t\t= 0;\n\tgQuiet\t\t\t\t\t= false;\n\tsanityCheck\t\t\t\t= 0;  // enable expensive sanity checks\n\tformat\t\t\t\t\t= FASTQ; // default read format is FASTQ\n\torigString\t\t\t\t= \"\"; // reference text, or filename(s)\n\tseed\t\t\t\t\t= 0; // srandom() seed\n\ttiming\t\t\t\t\t= 0; // whether to report basic timing data\n\tmetricsIval\t\t\t\t= 1; // interval between alignment metrics messages (0 = no messages)\n\tmetricsFile             = \"\"; // output file to put alignment metrics in\n\tmetricsStderr           = false; // print metrics to stderr (in addition to --metrics-file if it's specified\n\tmetricsPerRead          = false; // report a metrics tuple for every read?\n\tallHits\t\t\t\t\t= false; // for multihits, report just one\n\tshowVersion\t\t\t\t= false; // just print version and quit?\n\tipause\t\t\t\t\t= 0; // pause before maching?\n\tqUpto\t\t\t\t\t= 0xffffffff; // max # of queries to read\n\tgTrim5\t\t\t\t\t= 0; // amount to trim from 5' end\n\tgTrim3\t\t\t\t\t= 0; // amount to trim from 3' end\n\toffRate\t\t\t\t\t= -1; // keep default offRate\n\tsolexaQuals\t\t\t\t= false; // quality strings are solexa quals, not phred, and subtract 64 (not 33)\n\tphred64Quals\t\t\t= false; // quality chars are phred, but must subtract 64 (not 33)\n\tintegerQuals\t\t\t= false; // quality strings are space-separated strings of integers, not ASCII\n\tnthreads\t\t\t\t= 1;     // number of pthreads operating concurrently\n\toutType\t\t\t\t\t= OUTPUT_SAM;  // style of output\n\tnoRefNames\t\t\t\t= false; // true -> print reference indexes; not names\n\tkhits\t\t\t\t\t= 5;     // number of hits per read; >1 is much slower\n\tmhits\t\t\t\t\t= 0;     // stop after finding this many alignments+1\n\tpartitionSz\t\t\t\t= 0;     // output a partitioning key in first 
field\n\tuseSpinlock\t\t\t\t= true;  // false -> don't use of spinlocks even if they're #defines\n\tfileParallel\t\t\t= false; // separate threads read separate input files in parallel\n\tuseShmem\t\t\t\t= false; // use shared memory to hold the index\n\tuseMm\t\t\t\t\t= false; // use memory-mapped files to hold the index\n\tmmSweep\t\t\t\t\t= false; // sweep through memory-mapped files immediately after mapping\n\tgMinInsert\t\t\t\t= 0;     // minimum insert size\n\tgMaxInsert\t\t\t\t= 500;   // maximum insert size\n\tgMate1fw\t\t\t\t= true;  // -1 mate aligns in fw orientation on fw strand\n\tgMate2fw\t\t\t\t= false; // -2 mate aligns in rc orientation on fw strand\n\tgFlippedMatesOK         = false; // allow mates to be in wrong order\n\tgDovetailMatesOK        = false; // allow one mate to extend off the end of the other\n\tgContainMatesOK         = true;  // allow one mate to contain the other in PE alignment\n\tgOlapMatesOK            = true;  // allow mates to overlap in PE alignment\n\tgExpandToFrag           = true;  // incr max frag length to =larger mate len if necessary\n\tgReportDiscordant       = true;  // find and report discordant paired-end alignments\n\tgReportMixed            = true;  // find and report unpaired alignments for paired reads\n\n\tcacheLimit\t\t\t\t= 5;     // ranges w/ size > limit will be cached\n\tcacheSize\t\t\t\t= 0;     // # words per range cache\n\tskipReads\t\t\t\t= 0;     // # reads/read pairs to skip\n\tgNofw\t\t\t\t\t= false; // don't align fw orientation of read\n\tgNorc\t\t\t\t\t= false; // don't align rc orientation of read\n\tfastaContLen\t\t\t= 0;\n\tfastaContFreq\t\t\t= 0;\n\thadoopOut\t\t\t\t= false; // print Hadoop status and summary messages\n\tfuzzy\t\t\t\t\t= false; // reads will have alternate basecalls w/ qualities\n\tfullRef\t\t\t\t\t= false; // print entire reference name instead of just up to 1st space\n\tsamTruncQname           = true;  // whether to truncate QNAME to 255 chars\n\tsamOmitSecSeqQual       
= false; // omit SEQ/QUAL for 2ndary alignments?\n\tsamNoUnal               = false; // omit SAM records for unaligned reads\n\tsamNoHead\t\t\t\t= true;  // don't print any header lines in SAM output\n\tsamNoSQ\t\t\t\t\t= false; // don't print @SQ header lines\n\tsam_print_as            = true;\n\tsam_print_xs            = true;\n\tsam_print_xss           = false; // Xs:i and Ys:i\n\tsam_print_yn            = false; // YN:i and Yn:i\n\tsam_print_xn            = true;\n\tsam_print_cs            = false;\n\tsam_print_cq            = false;\n\tsam_print_x0            = true;\n\tsam_print_x1            = true;\n\tsam_print_xm            = true;\n\tsam_print_xo            = true;\n\tsam_print_xg            = true;\n\tsam_print_nm            = true;\n\tsam_print_md            = true;\n\tsam_print_yf            = true;\n\tsam_print_yi            = false;\n\tsam_print_ym            = false;\n\tsam_print_yp            = false;\n\tsam_print_yt            = true;\n\tsam_print_ys            = true;\n\tsam_print_zs            = false;\n\tsam_print_xr            = false;\n\tsam_print_xt            = false;\n\tsam_print_xd            = false;\n\tsam_print_xu            = false;\n\tsam_print_yl            = false;\n\tsam_print_ye            = false;\n\tsam_print_yu            = false;\n\tsam_print_xp            = false;\n\tsam_print_yr            = false;\n\tsam_print_zb            = false;\n\tsam_print_zr            = false;\n\tsam_print_zf            = false;\n\tsam_print_zm            = false;\n\tsam_print_zi            = false;\n\tsam_print_zp            = false;\n\tsam_print_zu            = false;\n    sam_print_xs_a          = true;\n\tbwaSwLike               = false;\n\tbwaSwLikeC              = 5.5f;\n\tbwaSwLikeT              = 20.0f;\n\tqcFilter                = false; // don't believe upstream qc by default\n\tsortByScore             = true;  // prioritize alignments to report by score?\n\trgid\t\t\t\t\t= \"\";    // SAM outputs for @RG header line\n\trgs\t\t\t\t\t\t= 
\"\";    // SAM outputs for @RG header line\n\trgs_optflag\t\t\t\t= \"\";    // SAM optional flag to add corresponding to @RG ID\n\tmsample\t\t\t\t    = true;\n\tgGapBarrier\t\t\t\t= 4;     // disallow gaps within this many chars of either end of alignment\n\tqualities.clear();\n\tqualities1.clear();\n\tqualities2.clear();\n\tpolstr.clear();\n\tmsNoCache       = true; // true -> disable local cache\n\tbonusMatchType  = DEFAULT_MATCH_BONUS_TYPE;\n\tbonusMatch      = DEFAULT_MATCH_BONUS;\n\tpenMmcType      = DEFAULT_MM_PENALTY_TYPE;\n\tpenMmcMax       = DEFAULT_MM_PENALTY_MAX;\n\tpenMmcMin       = DEFAULT_MM_PENALTY_MIN;\n\tpenNType        = DEFAULT_N_PENALTY_TYPE;\n\tpenN            = DEFAULT_N_PENALTY;\n\tpenNCatPair     = DEFAULT_N_CAT_PAIR; // concatenate mates before N filtering?\n\tlocalAlign      = false;     // do local alignment in DP steps\n\tnoisyHpolymer   = false;\n\tpenRdGapConst   = DEFAULT_READ_GAP_CONST;\n\tpenRfGapConst   = DEFAULT_REF_GAP_CONST;\n\tpenRdGapLinear  = DEFAULT_READ_GAP_LINEAR;\n\tpenRfGapLinear  = DEFAULT_REF_GAP_LINEAR;\n\t// scoreMin.init  (SIMPLE_FUNC_LINEAR, DEFAULT_MIN_CONST,   DEFAULT_MIN_LINEAR);\n    scoreMin.init  (SIMPLE_FUNC_CONST, -18, 0);\n\tnCeil.init     (SIMPLE_FUNC_LINEAR, 0.0f, DMAX, 2.0f, 0.1f);\n\tmsIval.init    (SIMPLE_FUNC_LINEAR, 1.0f, DMAX, DEFAULT_IVAL_B, DEFAULT_IVAL_A);\n\tdescConsExp     = 2.0;\n\tdescentLanding  = 20;\n\tdescentTotSz.init(SIMPLE_FUNC_LINEAR, 1024.0, DMAX, 0.0, 1024.0);\n\tdescentTotFmops.init(SIMPLE_FUNC_LINEAR, 100.0, DMAX, 0.0, 10.0);\n\tmultiseedMms    = DEFAULT_SEEDMMS;\n\tmultiseedLen    = DEFAULT_SEEDLEN;\n\tmultiseedOff    = 0;\n\tseedCacheLocalMB   = 32; // # MB to use for non-shared seed alignment cacheing\n\tseedCacheCurrentMB = 20; // # MB to use for current-read seed hit cacheing\n\texactCacheCurrentMB = 20; // # MB to use for current-read seed hit cacheing\n\tmaxhalf            = 15; // max width on one side of DP table\n\tseedSumm           = false; // print summary 
information about seed hits, not alignments\n\tdoUngapped         = true;  // do ungapped alignment\n\tmaxIters           = 400;   // max iterations of extend loop\n\tmaxUg              = 300;   // stop after this many ungap extends\n\tmaxDp              = 300;   // stop after this many dp extends\n\tmaxItersIncr       = 20;    // amt to add to maxIters for each -k > 1\n\tmaxEeStreak        = 15;    // stop after this many end-to-end fails in a row\n\tmaxUgStreak        = 15;    // stop after this many ungap fails in a row\n\tmaxDpStreak        = 15;    // stop after this many dp fails in a row\n\tmaxStreakIncr      = 10;    // amt to add to streak for each -k > 1\n\tmaxMateStreak      = 10;    // in PE: abort seed range after N mate-find fails\n\tdoExtend           = true;  // do seed extensions\n\tenable8            = true;  // use 8-bit SSE where possible?\n\tcminlen            = 2000;  // longer reads use checkpointing\n\tcpow2              = 4;     // checkpoint interval log2\n\tdoTri              = false; // do triangular mini-fills?\n\tdefaultPreset      = \"sensitive%LOCAL%\"; // default preset; applied immediately\n\textra_opts.clear();\n\textra_opts_cur = 0;\n\tbt2index.clear();        // read Bowtie 2 index from files with this prefix\n\tignoreQuals = false;     // all mms incur same penalty, regardless of qual\n\twrapper.clear();         // type of wrapper script, so we can print correct usage\n\tqueries.clear();         // list of query files\n\toutfile.clear();         // write SAM output to this file\n\tmapqv = 2;               // MAPQ calculation version\n\ttighten = 3;             // -M tightening mode\n\tdoExactUpFront = true;   // do exact search up front if seeds seem good enough\n\tdo1mmUpFront = true;     // do 1mm search up front if seeds seem good enough\n\tseedBoostThresh = 300;   // if average non-zero position has more than this many elements\n\tnSeedRounds = 2;         // # rounds of seed searches to do for repetitive 
reads\n\tdo1mmMinLen = 60;        // length below which we disable 1mm search\n\treorder = false;         // reorder SAM records with -p > 1\n\tsampleFrac = 1.1f;       // align all reads\n\tarbitraryRandom = false; // let pseudo-random seeds be a function of read properties\n\tbowtie2p5 = false;\n    minHitLen = 22;\n    minTotalLen = 0;\n    reportFile = \"centrifuge_report.tsv\";\n    abundance_analysis = true;\n    tree_traverse = true;\n    host_taxIDs.clear();\n    classification_rank = \"strain\";\n    excluded_taxIDs.clear();\n\tsam_format = false;\n\n    col_name_map[\"readID\"] = READ_ID;\n    col_name_map[\"seqID\"] = SEQ_ID;\n    col_name_map[\"taxLevel\"] = TAX_RANK;\n    col_name_map[\"taxRank\"] = TAX_RANK;\n    col_name_map[\"taxID\"] = TAX_ID;\n    col_name_map[\"taxName\"] = TAX_NAME;\n    col_name_map[\"score\"] = SCORE;\n    col_name_map[\"2ndBestScore\"] = SCORE2;\n    col_name_map[\"hitLength\"] = HIT_LENGTH;\n    col_name_map[\"queryLength\"] = QUERY_LENGTH;\n    col_name_map[\"numMatches\"] = NUM_MATCHES;\n    col_name_map[\"readSeq\"] = SEQ;\n    col_name_map[\"readQual\"] = QUAL;\n\n    // SAM names\n    col_name_map[\"QNAME\"] = READ_ID;\n    col_name_map[\"FLAG\"] = PLACEHOLDER_ZERO;\n    col_name_map[\"RNAME\"] = TAX_ID;\n    col_name_map[\"POS\"] = PLACEHOLDER_ZERO;\n    col_name_map[\"MAPQ\"] = PLACEHOLDER_ZERO;\n    col_name_map[\"CIGAR\"] = PLACEHOLDER;\n    col_name_map[\"RNEXT\"] = SEQ_ID;\n    col_name_map[\"PNEXT\"] = PLACEHOLDER_ZERO;\n    col_name_map[\"TLEN\"] = QUERY_LENGTH ; //PLACEHOLDER_ZERO;\n    col_name_map[\"SEQ\"] = SEQ;\n    col_name_map[\"QUAL\"] = QUAL;\n\n    // further columns\n    col_name_map[\"SEQ1\"] = SEQ1;\n    col_name_map[\"SEQ2\"] = SEQ2;\n    col_name_map[\"QUAL1\"] = QUAL1;\n    col_name_map[\"QUAL2\"] = QUAL2;\n    col_name_map[\"readSeq1\"] = SEQ1;\n    col_name_map[\"readSeq2\"] = SEQ2;\n    col_name_map[\"readQual1\"] = QUAL1;\n    col_name_map[\"readQual2\"] = QUAL2;\n\n    tab_fmt_col_def = 
\"readID,seqID,taxID,score,2ndBestScore,hitLength,queryLength,numMatches\";\n    parse_col_fmt(tab_fmt_col_def, tab_fmt_cols_str, tab_fmt_cols);\n    \n#ifdef USE_SRA\n    sra_accs.clear();\n#endif\n\n    separator = false ;\n}\n\nstatic const char *short_options = \"fF:qbzhcu:rv:s:aP:t3:5:w:p:k:M:1:2:I:X:CQ:N:i:L:U:x:S:g:O:D:R:\";\n\nstatic struct option long_options[] = {\n\t{(char*)\"verbose\",      no_argument,       0,            ARG_VERBOSE},\n\t{(char*)\"startverbose\", no_argument,       0,            ARG_STARTVERBOSE},\n\t{(char*)\"quiet\",        no_argument,       0,            ARG_QUIET},\n\t{(char*)\"sanity\",       no_argument,       0,            ARG_SANITY},\n\t{(char*)\"pause\",        no_argument,       &ipause,      1},\n\t{(char*)\"orig\",         required_argument, 0,            ARG_ORIG},\n\t{(char*)\"all\",          no_argument,       0,            'a'},\n\t{(char*)\"solexa-quals\", no_argument,       0,            ARG_SOLEXA_QUALS},\n\t{(char*)\"integer-quals\",no_argument,       0,            ARG_INTEGER_QUALS},\n\t{(char*)\"int-quals\",    no_argument,       0,            ARG_INTEGER_QUALS},\n\t{(char*)\"metrics\",      required_argument, 0,            ARG_METRIC_IVAL},\n\t{(char*)\"metrics-file\", required_argument, 0,            ARG_METRIC_FILE},\n\t{(char*)\"metrics-stderr\",no_argument,      0,            ARG_METRIC_STDERR},\n\t{(char*)\"metrics-per-read\", no_argument,   0,            ARG_METRIC_PER_READ},\n\t{(char*)\"met-read\",     no_argument,       0,            ARG_METRIC_PER_READ},\n\t{(char*)\"met\",          required_argument, 0,            ARG_METRIC_IVAL},\n\t{(char*)\"met-file\",     required_argument, 0,            ARG_METRIC_FILE},\n\t{(char*)\"met-stderr\",   no_argument,       0,            ARG_METRIC_STDERR},\n\t{(char*)\"time\",         no_argument,       0,            't'},\n\t{(char*)\"trim3\",        required_argument, 0,            '3'},\n\t{(char*)\"trim5\",        required_argument, 0,            
'5'},\n\t{(char*)\"seed\",         required_argument, 0,            ARG_SEED},\n\t{(char*)\"qupto\",        required_argument, 0,            'u'},\n\t{(char*)\"upto\",         required_argument, 0,            'u'},\n\t{(char*)\"version\",      no_argument,       0,            ARG_VERSION},\n\t{(char*)\"filepar\",      no_argument,       0,            ARG_FILEPAR},\n\t{(char*)\"help\",         no_argument,       0,            'h'},\n\t{(char*)\"threads\",      required_argument, 0,            'p'},\n\t{(char*)\"khits\",        required_argument, 0,            'k'},\n\t{(char*)\"minins\",       required_argument, 0,            'I'},\n\t{(char*)\"maxins\",       required_argument, 0,            'X'},\n\t{(char*)\"quals\",        required_argument, 0,            'Q'},\n\t{(char*)\"Q1\",           required_argument, 0,            ARG_QUALS1},\n\t{(char*)\"Q2\",           required_argument, 0,            ARG_QUALS2},\n\t{(char*)\"refidx\",       no_argument,       0,            ARG_REFIDX},\n\t{(char*)\"partition\",    required_argument, 0,            ARG_PARTITION},\n\t{(char*)\"ff\",           no_argument,       0,            ARG_FF},\n\t{(char*)\"fr\",           no_argument,       0,            ARG_FR},\n\t{(char*)\"rf\",           no_argument,       0,            ARG_RF},\n\t{(char*)\"cachelim\",     required_argument, 0,            ARG_CACHE_LIM},\n\t{(char*)\"cachesz\",      required_argument, 0,            ARG_CACHE_SZ},\n\t{(char*)\"nofw\",         no_argument,       0,            ARG_NO_FW},\n\t{(char*)\"norc\",         no_argument,       0,            ARG_NO_RC},\n\t{(char*)\"skip\",         required_argument, 0,            's'},\n\t{(char*)\"12\",           required_argument, 0,            ARG_ONETWO},\n\t{(char*)\"tab5\",         required_argument, 0,            ARG_TAB5},\n\t{(char*)\"tab6\",         required_argument, 0,            ARG_TAB6},\n\t{(char*)\"phred33-quals\", no_argument,      0,            ARG_PHRED33},\n\t{(char*)\"phred64-quals\", 
no_argument,      0,            ARG_PHRED64},\n\t{(char*)\"phred33\",       no_argument,      0,            ARG_PHRED33},\n\t{(char*)\"phred64\",      no_argument,       0,            ARG_PHRED64},\n\t{(char*)\"solexa1.3-quals\", no_argument,    0,            ARG_PHRED64},\n\t{(char*)\"mm\",           no_argument,       0,            ARG_MM},\n\t{(char*)\"shmem\",        no_argument,       0,            ARG_SHMEM},\n\t{(char*)\"mmsweep\",      no_argument,       0,            ARG_MMSWEEP},\n\t{(char*)\"hadoopout\",    no_argument,       0,            ARG_HADOOPOUT},\n\t{(char*)\"fuzzy\",        no_argument,       0,            ARG_FUZZY},\n\t{(char*)\"fullref\",      no_argument,       0,            ARG_FULLREF},\n\t{(char*)\"usage\",        no_argument,       0,            ARG_USAGE},\n\t{(char*)\"omit-sec-seq\", no_argument,       0,            ARG_SAM_OMIT_SEC_SEQ},\n\t{(char*)\"gbar\",         required_argument, 0,            ARG_GAP_BAR},\n\t{(char*)\"qseq\",         no_argument,       0,            ARG_QSEQ},\n\t{(char*)\"policy\",       required_argument, 0,            ARG_ALIGN_POLICY},\n\t{(char*)\"preset\",       required_argument, 0,            'P'},\n\t{(char*)\"seed-summ\",    no_argument,       0,            ARG_SEED_SUMM},\n\t{(char*)\"seed-summary\", no_argument,       0,            ARG_SEED_SUMM},\n\t{(char*)\"overhang\",     no_argument,       0,            ARG_OVERHANG},\n\t{(char*)\"no-cache\",     no_argument,       0,            ARG_NO_CACHE},\n\t{(char*)\"cache\",        no_argument,       0,            ARG_USE_CACHE},\n\t{(char*)\"454\",          no_argument,       0,            ARG_NOISY_HPOLY},\n\t{(char*)\"ion-torrent\",  no_argument,       0,            ARG_NOISY_HPOLY},\n\t{(char*)\"no-mixed\",     no_argument,       0,            ARG_NO_MIXED},\n\t{(char*)\"no-discordant\",no_argument,       0,            ARG_NO_DISCORDANT},\n\t{(char*)\"local\",        no_argument,       0,            ARG_LOCAL},\n\t{(char*)\"end-to-end\",   
no_argument,       0,            ARG_END_TO_END},\n\t{(char*)\"ungapped\",     no_argument,       0,            ARG_UNGAPPED},\n\t{(char*)\"no-ungapped\",  no_argument,       0,            ARG_UNGAPPED_NO},\n\t{(char*)\"sse8\",         no_argument,       0,            ARG_SSE8},\n\t{(char*)\"no-sse8\",      no_argument,       0,            ARG_SSE8_NO},\n\t{(char*)\"scan-narrowed\",no_argument,       0,            ARG_SCAN_NARROWED},\n\t{(char*)\"qc-filter\",    no_argument,       0,            ARG_QC_FILTER},\n\t{(char*)\"bwa-sw-like\",  no_argument,       0,            ARG_BWA_SW_LIKE},\n\t{(char*)\"multiseed\",        required_argument, 0,        ARG_MULTISEED_IVAL},\n\t{(char*)\"ma\",               required_argument, 0,        ARG_SCORE_MA},\n\t{(char*)\"mp\",               required_argument, 0,        ARG_SCORE_MMP},\n\t{(char*)\"np\",               required_argument, 0,        ARG_SCORE_NP},\n\t{(char*)\"rdg\",              required_argument, 0,        ARG_SCORE_RDG},\n\t{(char*)\"rfg\",              required_argument, 0,        ARG_SCORE_RFG},\n\t{(char*)\"score-min\",        required_argument, 0,        ARG_SCORE_MIN},\n\t{(char*)\"min-score\",        required_argument, 0,        ARG_SCORE_MIN},\n\t{(char*)\"n-ceil\",           required_argument, 0,        ARG_N_CEIL},\n\t{(char*)\"dpad\",             required_argument, 0,        ARG_DPAD},\n\t{(char*)\"mapq-print-inputs\",no_argument,       0,        ARG_SAM_PRINT_YI},\n\t{(char*)\"no-score-priority\",no_argument,       0,        ARG_NO_SCORE_PRIORITY},\n\t{(char*)\"seedlen\",          required_argument, 0,        'L'},\n\t{(char*)\"seedmms\",          required_argument, 0,        'N'},\n\t{(char*)\"seedival\",         required_argument, 0,        'i'},\n\t{(char*)\"ignore-quals\",     no_argument,       0,        ARG_IGNORE_QUALS},\n\t{(char*)\"index\",            required_argument, 0,        'x'},\n\t{(char*)\"arg-desc\",         no_argument,       0,        ARG_DESC},\n\t{(char*)\"wrapper\",          
required_argument, 0,        ARG_WRAPPER},\n\t{(char*)\"unpaired\",         required_argument, 0,        'U'},\n\t{(char*)\"output\",           required_argument, 0,        'S'},\n\t{(char*)\"mapq-v\",           required_argument, 0,        ARG_MAPQ_V},\n\t{(char*)\"dovetail\",         no_argument,       0,        ARG_DOVETAIL},\n\t{(char*)\"no-dovetail\",      no_argument,       0,        ARG_NO_DOVETAIL},\n\t{(char*)\"contain\",          no_argument,       0,        ARG_CONTAIN},\n\t{(char*)\"no-contain\",       no_argument,       0,        ARG_NO_CONTAIN},\n\t{(char*)\"overlap\",          no_argument,       0,        ARG_OVERLAP},\n\t{(char*)\"no-overlap\",       no_argument,       0,        ARG_NO_OVERLAP},\n\t{(char*)\"tighten\",          required_argument, 0,        ARG_TIGHTEN},\n\t{(char*)\"exact-upfront\",    no_argument,       0,        ARG_EXACT_UPFRONT},\n\t{(char*)\"1mm-upfront\",      no_argument,       0,        ARG_1MM_UPFRONT},\n\t{(char*)\"no-exact-upfront\", no_argument,       0,        ARG_EXACT_UPFRONT_NO},\n\t{(char*)\"no-1mm-upfront\",   no_argument,       0,        ARG_1MM_UPFRONT_NO},\n\t{(char*)\"1mm-minlen\",       required_argument, 0,        ARG_1MM_MINLEN},\n\t{(char*)\"seed-off\",         required_argument, 0,        'O'},\n\t{(char*)\"seed-boost\",       required_argument, 0,        ARG_SEED_BOOST_THRESH},\n\t{(char*)\"read-times\",       no_argument,       0,        ARG_READ_TIMES},\n\t{(char*)\"show-rand-seed\",   no_argument,       0,        ARG_SHOW_RAND_SEED},\n\t{(char*)\"dp-fail-streak\",   required_argument, 0,        ARG_DP_FAIL_STREAK_THRESH},\n\t{(char*)\"ee-fail-streak\",   required_argument, 0,        ARG_EE_FAIL_STREAK_THRESH},\n\t{(char*)\"ug-fail-streak\",   required_argument, 0,        ARG_UG_FAIL_STREAK_THRESH},\n\t{(char*)\"fail-streak\",      required_argument, 0,        'D'},\n\t{(char*)\"dp-fails\",         required_argument, 0,        ARG_DP_FAIL_THRESH},\n\t{(char*)\"ug-fails\",         required_argument, 0,   
     ARG_UG_FAIL_THRESH},\n\t{(char*)\"extends\",          required_argument, 0,        ARG_EXTEND_ITERS},\n\t{(char*)\"no-extend\",        no_argument,       0,        ARG_NO_EXTEND},\n\t{(char*)\"mapq-extra\",       no_argument,       0,        ARG_MAPQ_EX},\n\t{(char*)\"seed-rounds\",      required_argument, 0,        'R'},\n\t{(char*)\"reorder\",          no_argument,       0,        ARG_REORDER},\n\t{(char*)\"passthrough\",      no_argument,       0,        ARG_READ_PASSTHRU},\n\t{(char*)\"sample\",           required_argument, 0,        ARG_SAMPLE},\n\t{(char*)\"cp-min\",           required_argument, 0,        ARG_CP_MIN},\n\t{(char*)\"cp-ival\",          required_argument, 0,        ARG_CP_IVAL},\n\t{(char*)\"tri\",              no_argument,       0,        ARG_TRI},\n\t{(char*)\"nondeterministic\", no_argument,       0,        ARG_NON_DETERMINISTIC},\n\t{(char*)\"non-deterministic\", no_argument,      0,        ARG_NON_DETERMINISTIC},\n\t{(char*)\"local-seed-cache-sz\", required_argument, 0,     ARG_LOCAL_SEED_CACHE_SZ},\n\t{(char*)\"seed-cache-sz\",       required_argument, 0,     ARG_CURRENT_SEED_CACHE_SZ},\n\t{(char*)\"no-unal\",          no_argument,       0,        ARG_SAM_NO_UNAL},\n\t{(char*)\"test-25\",          no_argument,       0,        ARG_TEST_25},\n\t// TODO: following should be a function of read length?\n\t{(char*)\"desc-kb\",          required_argument, 0,        ARG_DESC_KB},\n\t{(char*)\"desc-landing\",     required_argument, 0,        ARG_DESC_LANDING},\n\t{(char*)\"desc-exp\",         required_argument, 0,        ARG_DESC_EXP},\n\t{(char*)\"desc-fmops\",       required_argument, 0,        ARG_DESC_FMOPS},\n    {(char*)\"min-hitlen\",       required_argument, 0,        ARG_MIN_HITLEN},\n    {(char*)\"min-totallen\",     required_argument, 0,        ARG_MIN_TOTALLEN},\n    {(char*)\"host-taxids\",      required_argument, 0,        ARG_HOST_TAXIDS},\n\t{(char*)\"report-file\",      required_argument, 0,        ARG_REPORT_FILE},\n    
{(char*)\"no-abundance\",     no_argument,       0,        ARG_NO_ABUNDANCE},\n    {(char*)\"no-traverse\",      no_argument,       0,        ARG_NO_TRAVERSE},\n    {(char*)\"classification-rank\", required_argument,    0,  ARG_CLASSIFICATION_RANK},\n    {(char*)\"exclude-taxids\",   required_argument, 0,  ARG_EXCLUDE_TAXIDS},\n    {(char*)\"out-fmt\",          required_argument, 0,  ARG_OUT_FMT},\n    {(char*)\"tab-fmt-cols\",     required_argument, 0,  ARG_TAB_FMT_COLS},\n#ifdef USE_SRA\n    {(char*)\"sra-acc\",   required_argument, 0,        ARG_SRA_ACC},\n#endif\n\t{(char*)\"separator\",     no_argument, 0,  ARG_SEPARATOR },\n\t{(char*)0, 0, 0, 0} // terminator\n};\n\n/**\n * Print out a concise description of what options are taken and whether they\n * take an argument.\n */\nstatic void printArgDesc(ostream& out) {\n\t// struct option {\n\t//   const char *name;\n\t//   int has_arg;\n\t//   int *flag;\n\t//   int val;\n\t// };\n\tsize_t i = 0;\n\twhile(long_options[i].name != 0) {\n\t\tout << long_options[i].name << \"\\t\"\n\t\t    << (long_options[i].has_arg == no_argument ? 0 : 1)\n\t\t    << endl;\n\t\ti++;\n\t}\n\tsize_t solen = strlen(short_options);\n\tfor(i = 0; i < solen; i++) {\n\t\t// Has an option?  
Does if next char is :\n\t\tif(i == solen-1) {\n\t\t\tassert_neq(':', short_options[i]);\n\t\t\tout << (char)short_options[i] << \"\\t\" << 0 << endl;\n\t\t} else {\n\t\t\tif(short_options[i+1] == ':') {\n\t\t\t\t// Option with argument\n\t\t\t\tout << (char)short_options[i] << \"\\t\" << 1 << endl;\n\t\t\t\ti++; // skip the ':'\n\t\t\t} else {\n\t\t\t\t// Option with no argument\n\t\t\t\tout << (char)short_options[i] << \"\\t\" << 0 << endl;\n\t\t\t}\n\t\t}\n\t}\n}\n\n/**\n * Print a summary usage message to the provided output stream.\n */\nstatic void printUsage(ostream& out) {\n\tout << \"Centrifuge version \" << string(CENTRIFUGE_VERSION).c_str() << \" by the Centrifuge developer team (centrifuge.metagenomics@gmail.com)\" << endl;\n\tstring tool_name = \"centrifuge-class\";\n\tif(wrapper == \"basic-0\") {\n\t\ttool_name = \"centrifuge\";\n\t}\n\tout << \"Usage: \" << endl ;\n\tif ( wrapper == \"basic-0\" )\n\t{\n\t\tout\n#ifdef USE_SRA\n\t\t\t<< \"  \" << tool_name.c_str() << \" [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number> | --sample-sheet <s>} [-S <filename>] [--report-file <report>]\" << endl\n#else\n\t\t\t<< \"  \" << tool_name.c_str() << \" [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r> | --sample-sheet <s> } [-S <filename>] [--report-file <report>]\" << endl\n#endif\n\t\t\t<<endl ;\n\t}\n\telse\n\t{\n\t\tout\n#ifdef USE_SRA\n\t\t\t<< \"  \" << tool_name.c_str() << \" [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <filename>] [--report-file <report>]\" << endl\n#else\n\t\t\t<< \"  \" << tool_name.c_str() << \" [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <filename>] [--report-file <report>]\" << endl\n#endif\n\t\t\t<< endl ;\n\t}\n\n\tout\t<<     \"  <cf-idx>   Index filename prefix (minus trailing .X.\" << gEbwt_ext << \").\" << endl\n\t\t<<     \"  <m1>       Files with #1 mates, paired with files in <m2>.\" << endl;\n\tif(wrapper == \"basic-0\") {\n\t\tout << 
\"             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).\" << endl;\n\t}\n\tout <<     \"  <m2>       Files with #2 mates, paired with files in <m1>.\" << endl;\n\tif(wrapper == \"basic-0\") {\n\t\tout << \"             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).\" << endl;\n\t}\n\tout <<     \"  <r>        Files with unpaired reads.\" << endl;\n\tif(wrapper == \"basic-0\") {\n\t\tout << \"             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).\" << endl;\n\t}\n#ifdef USE_SRA\n\tout <<     \"  <SRA accession number>        Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.\" << endl;\n#endif \n\tif ( wrapper == \"basic-0\" )\n\t\tout <<\t\"  <s>        A TSV file where each line represents a sample.\" << endl ;\n\tout <<     \"  <filename>      File for classification output (default: stdout)\" << endl\n\t\t<<     \"  <report>   File for tabular report output (default: \" << reportFile << \")\" << endl\n\t\t<< endl\n\t\t<< \"  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be\" << endl\n\t\t<< \"  specified many times.  E.g. 
'-U file1.fq,file2.fq -U file3.fq'.\" << endl\n\t\t// Wrapper script should write <bam> line next\n\t\t<< endl\n\t\t<< \"Options (defaults in parentheses):\" << endl\n\t\t<< endl\n\t\t<< \" Input:\" << endl\n\t\t<< \"  -q                 query input files are FASTQ .fq/.fastq (default)\" << endl\n\t\t<< \"  --qseq             query input files are in Illumina's qseq format\" << endl\n\t\t<< \"  -f                 query input files are (multi-)FASTA .fa/.mfa\" << endl\n\t\t<< \"  -r                 query input files are raw one-sequence-per-line\" << endl\n\t\t<< \"  -c                 <m1>, <m2>, <r> are sequences themselves, not files\" << endl\n\t\t<< \"  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)\" << endl\n\t\t<< \"  -u/--upto <int>    stop after first <int> reads/pairs (no limit)\" << endl\n\t\t<< \"  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)\" << endl\n\t\t<< \"  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)\" << endl\n\t\t<< \"  --phred33          qualities are Phred+33 (default)\" << endl\n\t\t<< \"  --phred64          qualities are Phred+64\" << endl\n\t\t<< \"  --int-quals        qualities encoded as space-delimited integers\" << endl\n\t\t<< \"  --ignore-quals     treat all quality values as 30 on Phred scale (off)\" << endl\n\t\t<< \"  --nofw             do not align forward (original) version of read (off)\" << endl\n\t\t<< \"  --norc             do not align reverse-complement version of read (off)\" << endl\n#ifdef USE_SRA\n\t\t<< \"  --sra-acc          SRA accession ID\" << endl\n#endif\n\t\t<< endl\n\t\t<< \"Classification:\" << endl\n\t\t<< \"  --min-hitlen <int>    minimum length of partial hits (default \" << minHitLen << \", must be greater than 15)\" << endl\n\t\t<< \"  -k <int>              report upto <int> distinct, primary assignments for each read or pair\" << endl\n\t\t//<< \"  --min-totallen <int>  minimum summed length of partial hits per read (default \" << 
minTotalLen << \")\" << endl\n\t\t<< \"  --host-taxids <taxids> comma-separated list of taxonomic IDs that will be preferred in classification\" << endl\n\t\t<< \"  --exclude-taxids <taxids> comma-separated list of taxonomic IDs that will be excluded in classification\" << endl\n\t\t<< endl\n\t\t<< \" Output:\" << endl;\n\t//if(wrapper == \"basic-0\") {\n\t//\tout << \"  --bam              output directly to BAM (by piping through 'samtools view')\" << endl;\n\t//}\n\tout << \"  --out-fmt <str>       define output format, either 'tab' or 'sam' (tab)\" << endl\n\t\t<< \"  --tab-fmt-cols <str>  columns in tabular format, comma separated \" << endl \n\t\t<< \"                          default: \" << tab_fmt_col_def << endl;\n\tout << \"  -t/--time             print wall-clock time taken by search phases\" << endl;\n\tif(wrapper == \"basic-0\") {\n\t\tout << \"  --un <path>           write unpaired reads that didn't align to <path>\" << endl\n\t\t\t<< \"  --al <path>           write unpaired reads that aligned at least once to <path>\" << endl\n\t\t\t<< \"  --un-conc <path>      write pairs that didn't align concordantly to <path>\" << endl\n\t\t\t<< \"  --al-conc <path>      write pairs that aligned concordantly at least once to <path>\" << endl\n\t\t\t<< \"  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.\" << endl\n\t\t\t<< \"  --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)\" << endl;\n\t}\n\tout << \"  --quiet               print nothing to stderr except serious errors\" << endl\n\t\t//  << \"  --refidx              refer to ref. 
seqs by 0-based index rather than name\" << endl\n\t\t<< \"  --met-file <path>     send metrics to file at <path> (off)\" << endl\n\t\t<< \"  --met-stderr          send metrics to stderr (off)\" << endl\n\t\t<< \"  --met <int>           report internal counters & metrics every <int> secs (1)\" << endl\n\t\t<< endl\n\t\t<< \" Performance:\" << endl\n\t\t//<< \"  -o/--offrate <int> override offrate of index; must be >= index's offrate\" << endl\n\t\t<< \"  -p/--threads <int> number of alignment threads to launch (1)\" << endl\n#ifdef BOWTIE_MM\n\t\t<< \"  --mm               use memory-mapped I/O for index; many instances can share\" << endl\n#endif\n\t\t<< endl\n\t\t<< \" Other:\" << endl\n\t\t<< \"  --qc-filter        filter out reads that are bad according to QSEQ filter\" << endl\n\t\t<< \"  --seed <int>       seed for random number generator (0)\" << endl\n\t\t<< \"  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes\" << endl\n\t\t//  << \"  --verbose          verbose output for debugging\" << endl\n\t\t<< \"  --version          print version information and quit\" << endl\n\t\t<< \"  -h/--help          print this usage message\" << endl\n\t\t;\n\tif(wrapper.empty()) {\n\t\tcerr << endl\n\t\t\t<< \"*** Warning ***\" << endl\n\t\t\t<< \"'centrifuge-class' was run directly.  
It is recommended that you run the wrapper script 'centrifuge' instead.\" << endl\n\t\t\t<< endl;\n\t}\n}\n\n/**\n * Parse an int out of 'arg' and enforce that it lies in [lower, upper];\n * if it is outside that range, or no number could be parsed, then output\n * the given error message and exit with an error and a usage message.\n */\nstatic int parseInt(int lower, int upper, const char *errmsg, const char *arg) {\n\tlong l;\n\tchar *endPtr= NULL;\n\tl = strtol(arg, &endPtr, 10);\n\t// strtol leaves endPtr == arg when no digits were consumed\n\tif (endPtr != arg) {\n\t\tif (l < lower || l > upper) {\n\t\t\tcerr << errmsg << endl;\n\t\t\tprintUsage(cerr);\n\t\t\tthrow 1;\n\t\t}\n\t\treturn (int32_t)l;\n\t}\n\tcerr << errmsg << endl;\n\tprintUsage(cerr);\n\tthrow 1;\n\treturn -1;\n}\n\n/**\n * Upper is maximum int by default.\n */\nstatic int parseInt(int lower, const char *errmsg, const char *arg) {\n\treturn parseInt(lower, std::numeric_limits<int>::max(), errmsg, arg);\n}\n\n/**\n * Parse a T out of the string 's'.\n */\ntemplate<typename T>\nT parse(const char *s) {\n\tT tmp;\n\tstringstream ss(s);\n\tss >> tmp;\n\treturn tmp;\n}\n\n/**\n * Parse a pair of Ts from a string, 'str', delimited with 'delim'.\n */\ntemplate<typename T>\npair<T, T> parsePair(const char *str, char delim) {\n\tstring s(str);\n\tEList<string> ss;\n\ttokenize(s, delim, ss);\n\tpair<T, T> ret;\n\tret.first = parse<T>(ss[0].c_str());\n\tret.second = parse<T>(ss[1].c_str());\n\treturn ret;\n}\n\n/**\n * Parse a list of Ts from a string, 'str', delimited with 'delim', and\n * append each parsed value to 'ret'.\n */\ntemplate<typename T>\nvoid parseTuple(const char *str, char delim, EList<T>& ret) {\n\tstring s(str);\n\tEList<string> ss;\n\ttokenize(s, delim, ss);\n\tfor(size_t i = 0; i < ss.size(); i++) {\n\t\tret.push_back(parse<T>(ss[i].c_str()));\n\t}\n}\n\nstatic string applyPreset(const string& sorig, Presets& presets) {\n\tstring s = sorig;\n\tsize_t found = s.find(\"%LOCAL%\");\n\tif(found != string::npos) {\n\t\ts.replace(found, strlen(\"%LOCAL%\"), localAlign ? 
\"-local\" : \"\");\n\t}\n\tif(gVerbose) {\n\t\tcerr << \"Applying preset: '\" << s.c_str() << \"' using preset menu '\"\n\t\t\t << presets.name() << \"'\" << endl;\n\t}\n\tstring pol;\n\tpresets.apply(s, pol, extra_opts);\n\treturn pol;\n}\n\nstatic bool saw_M;\nstatic bool saw_a;\nstatic bool saw_k;\nstatic EList<string> presetList;\n\n/**\n * TODO: Argument parsing is very, very flawed.  The biggest problem is that\n * there are two separate worlds of arguments, the ones set via polstr, and\n * the ones set directly in variables.  This makes for nasty interactions,\n * e.g., with the -M option being resolved at an awkward time relative to\n * the -k and -a options.\n */\nstatic void parseOption(int next_option, const char *arg) {\n\tswitch (next_option) {\n\t\tcase ARG_TEST_25: bowtie2p5 = true; break;\n\t\tcase ARG_DESC_KB: descentTotSz = SimpleFunc::parse(arg, 0.0, 1024.0, 1024.0, DMAX); break;\n\t\tcase ARG_DESC_FMOPS: descentTotFmops = SimpleFunc::parse(arg, 0.0, 10.0, 100.0, DMAX); break;\n\t\tcase ARG_DESC_LANDING: descentLanding = parse<int>(arg); break;\n\t\tcase ARG_DESC_EXP: {\n\t\t\tdescConsExp = parse<double>(arg);\n\t\t\tif(descConsExp < 0.0) {\n\t\t\t\tcerr << \"Error: --desc-exp must be greater than or equal to 0\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase '1': tokenize(arg, \",\", mates1); break;\n\t\tcase '2': tokenize(arg, \",\", mates2); break;\n\t\tcase ARG_ONETWO: tokenize(arg, \",\", mates12); format = TAB_MATE5; break;\n\t\tcase ARG_TAB5:   tokenize(arg, \",\", mates12); format = TAB_MATE5; break;\n\t\tcase ARG_TAB6:   tokenize(arg, \",\", mates12); format = TAB_MATE6; break;\n\t\tcase 'f': format = FASTA; break;\n\t\tcase 'F': {\n\t\t\tformat = FASTA_CONT;\n\t\t\tpair<uint32_t, uint32_t> p = parsePair<uint32_t>(arg, ',');\n\t\t\tfastaContLen = p.first;\n\t\t\tfastaContFreq = p.second;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_BWA_SW_LIKE: {\n\t\t\tbwaSwLikeC = 5.5f;\n\t\t\tbwaSwLikeT = 30;\n\t\t\tbwaSwLike = 
true;\n\t\t\tlocalAlign = true;\n\t\t\t// -a INT   Score of a match [1]\n\t\t\t// -b INT   Mismatch penalty [3]\n\t\t\t// -q INT   Gap open penalty [5]\n\t\t\t// -r INT   Gap extension penalty. The penalty for a contiguous\n\t\t\t//          gap of size k is q+k*r. [2] \n\t\t\tpolstr += \";MA=1;MMP=C3;RDG=5,2;RFG=5,2\";\n\t\t\tbreak;\n\t\t}\n\t\tcase 'q': format = FASTQ; break;\n\t\tcase 'r': format = RAW; break;\n\t\tcase 'c': format = CMDLINE; break;\n\t\tcase ARG_QSEQ: format = QSEQ; break;\n\t\tcase 'C': {\n\t\t\tcerr << \"Error: -C specified but Bowtie 2 does not support colorspace input.\" << endl;\n\t\t\tthrow 1;\n\t\t\tbreak;\n\t\t}\n\t\tcase 'I':\n\t\t\tgMinInsert = parseInt(0, \"-I arg must be positive\", arg);\n\t\t\tbreak;\n\t\tcase 'X':\n\t\t\tgMaxInsert = parseInt(1, \"-X arg must be at least 1\", arg);\n\t\t\tbreak;\n\t\tcase ARG_NO_DISCORDANT: gReportDiscordant = false; break;\n\t\tcase ARG_NO_MIXED: gReportMixed = false; break;\n\t\tcase 's':\n\t\t\tskipReads = (uint32_t)parseInt(0, \"-s arg must be positive\", arg);\n\t\t\tbreak;\n\t\tcase ARG_FF: gMate1fw = true;  gMate2fw = true;  break;\n\t\tcase ARG_RF: gMate1fw = false; gMate2fw = true;  break;\n\t\tcase ARG_FR: gMate1fw = true;  gMate2fw = false; break;\n\t\tcase ARG_SHMEM: useShmem = true; break;\n\t\tcase ARG_SEED_SUMM: seedSumm = true; break;\n\t\tcase ARG_MM: {\n#ifdef BOWTIE_MM\n\t\t\tuseMm = true;\n\t\t\tbreak;\n#else\n\t\t\tcerr << \"Memory-mapped I/O mode is disabled because bowtie was not compiled with\" << endl\n\t\t\t\t << \"BOWTIE_MM defined.  Memory-mapped I/O is not supported under Windows.  
If you\" << endl\n\t\t\t\t << \"would like to use memory-mapped I/O on a platform that supports it, please\" << endl\n\t\t\t\t << \"refrain from specifying BOWTIE_MM=0 when compiling Bowtie.\" << endl;\n\t\t\tthrow 1;\n#endif\n\t\t}\n\t\tcase ARG_MMSWEEP: mmSweep = true; break;\n\t\tcase ARG_HADOOPOUT: hadoopOut = true; break;\n\t\tcase ARG_SOLEXA_QUALS: solexaQuals = true; break;\n\t\tcase ARG_INTEGER_QUALS: integerQuals = true; break;\n\t\tcase ARG_PHRED64: phred64Quals = true; break;\n\t\tcase ARG_PHRED33: solexaQuals = false; phred64Quals = false; break;\n\t\tcase ARG_OVERHANG: gReportOverhangs = true; break;\n\t\tcase ARG_NO_CACHE: msNoCache = true; break;\n\t\tcase ARG_USE_CACHE: msNoCache = false; break;\n\t\tcase ARG_LOCAL_SEED_CACHE_SZ:\n\t\t\tseedCacheLocalMB = (uint32_t)parseInt(1, \"--local-seed-cache-sz arg must be at least 1\", arg);\n\t\t\tbreak;\n\t\tcase ARG_CURRENT_SEED_CACHE_SZ:\n\t\t\tseedCacheCurrentMB = (uint32_t)parseInt(1, \"--seed-cache-sz arg must be at least 1\", arg);\n\t\t\tbreak;\n\t\tcase ARG_REFIDX: noRefNames = true; break;\n\t\tcase ARG_FUZZY: fuzzy = true; break;\n\t\tcase ARG_FULLREF: fullRef = true; break;\n\t\tcase ARG_GAP_BAR:\n\t\t\tgGapBarrier = parseInt(1, \"--gbar must be no less than 1\", arg);\n\t\t\tbreak;\n\t\tcase ARG_SEED:\n\t\t\tseed = parseInt(0, \"--seed arg must be at least 0\", arg);\n\t\t\tbreak;\n\t\tcase ARG_NON_DETERMINISTIC:\n\t\t\tarbitraryRandom = true;\n\t\t\tbreak;\n\t\tcase 'u':\n\t\t\tqUpto = (uint32_t)parseInt(1, \"-u/--qupto arg must be at least 1\", arg);\n\t\t\tbreak;\n\t\tcase 'Q':\n\t\t\ttokenize(arg, \",\", qualities);\n\t\t\tintegerQuals = true;\n\t\t\tbreak;\n\t\tcase ARG_QUALS1:\n\t\t\ttokenize(arg, \",\", qualities1);\n\t\t\tintegerQuals = true;\n\t\t\tbreak;\n\t\tcase ARG_QUALS2:\n\t\t\ttokenize(arg, \",\", qualities2);\n\t\t\tintegerQuals = true;\n\t\t\tbreak;\n\t\tcase ARG_CACHE_LIM:\n\t\t\tcacheLimit = (uint32_t)parseInt(1, \"--cachelim arg must be at least 1\", 
arg);\n\t\t\tbreak;\n\t\tcase ARG_CACHE_SZ:\n\t\t\tcacheSize = (uint32_t)parseInt(1, \"--cachesz arg must be at least 1\", arg);\n\t\t\tcacheSize *= (1024 * 1024); // convert from MB to B\n\t\t\tbreak;\n\t\tcase ARG_WRAPPER: wrapper = arg; break;\n        case 'x': bt2index = arg; break;\n\t\tcase 'p':\n\t\t\tnthreads = parseInt(1, \"-p/--threads arg must be at least 1\", arg);\n\t\t\tbreak;\n\t\tcase ARG_FILEPAR:\n\t\t\tfileParallel = true;\n\t\t\tbreak;\n\t\tcase '3': gTrim3 = parseInt(0, \"-3/--trim3 arg must be at least 0\", arg); break;\n\t\tcase '5': gTrim5 = parseInt(0, \"-5/--trim5 arg must be at least 0\", arg); break;\n\t\tcase 'h': printUsage(cout); throw 0; break;\n\t\tcase ARG_USAGE: printUsage(cout); throw 0; break;\n\t\t//\n\t\t// NOTE that unlike in Bowtie 1, -M, -a and -k are mutually\n\t\t// exclusive here.\n\t\t//\n\t\tcase 'M': {\n\t\t\tmsample = true;\n\t\t\tmhits = parse<uint32_t>(arg);\n\t\t\tif(saw_a || saw_k) {\n\t\t\t\tcerr << \"Warning: -M, -k and -a are mutually exclusive. \"\n\t\t\t\t\t << \"-M will override\" << endl;\n\t\t\t\tkhits = 1;\n\t\t\t}\n\t\t\tassert_eq(1, khits);\n\t\t\tsaw_M = true;\n\t\t\tcerr << \"Warning: -M is deprecated.  
Use -D and -R to adjust \" <<\n\t\t\t        \"effort instead.\" << endl;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_EXTEND_ITERS: {\n\t\t\tmaxIters = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_NO_EXTEND: {\n\t\t\tdoExtend = false;\n\t\t\tbreak;\n\t\t}\n\t\tcase 'R': { polstr += \";ROUNDS=\"; polstr += arg; break; }\n\t\tcase 'D': { polstr += \";DPS=\";    polstr += arg; break; }\n\t\tcase ARG_DP_MATE_STREAK_THRESH: {\n\t\t\tmaxMateStreak = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_DP_FAIL_STREAK_THRESH: {\n\t\t\tmaxDpStreak = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_EE_FAIL_STREAK_THRESH: {\n\t\t\tmaxEeStreak = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_UG_FAIL_STREAK_THRESH: {\n\t\t\tmaxUgStreak = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_DP_FAIL_THRESH: {\n\t\t\tmaxDp = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_UG_FAIL_THRESH: {\n\t\t\tmaxUg = parse<size_t>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_SEED_BOOST_THRESH: {\n\t\t\tseedBoostThresh = parse<int>(arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase 'a': {\n\t\t\tmsample = false;\n\t\t\tallHits = true;\n\t\t\tmhits = 0; // disable -M\n\t\t\tif(saw_M || saw_k) {\n\t\t\t\tcerr << \"Warning: -M, -k and -a are mutually exclusive. \"\n\t\t\t\t\t << \"-a will override\" << endl;\n\t\t\t}\n\t\t\tsaw_a = true;\n\t\t\tbreak;\n\t\t}\n\t\tcase 'k': {\n\t\t\tmsample = false;\n\t\t\tkhits = (uint32_t)parseInt(1, \"-k arg must be at least 1\", arg);\n\t\t\tmhits = 0; // disable -M\n\t\t\tif(saw_M || saw_a) {\n\t\t\t\tcerr << \"Warning: -M, -k and -a are mutually exclusive. 
\"\n\t\t\t\t\t << \"-k will override\" << endl;\n\t\t\t}\n\t\t\tsaw_k = true;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_VERBOSE: gVerbose = 1; break;\n\t\tcase ARG_STARTVERBOSE: startVerbose = true; break;\n\t\tcase ARG_QUIET: gQuiet = true; break;\n\t\tcase ARG_SANITY: sanityCheck = true; break;\n\t\tcase 't': timing = true; break;\n\t\tcase ARG_METRIC_IVAL: {\n\t\t\tmetricsIval = parseInt(1, \"--metrics arg must be at least 1\", arg);\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_METRIC_FILE: metricsFile = arg; break;\n\t\tcase ARG_METRIC_STDERR: metricsStderr = true; break;\n\t\tcase ARG_METRIC_PER_READ: metricsPerRead = true; break;\n\t\tcase ARG_NO_FW: gNofw = true; break;\n\t\tcase ARG_NO_RC: gNorc = true; break;\n\t\tcase ARG_SAM_NO_QNAME_TRUNC: samTruncQname = false; break;\n\t\tcase ARG_SAM_OMIT_SEC_SEQ: samOmitSecSeqQual = true; break;\n\t\tcase ARG_SAM_NO_UNAL: samNoUnal = true; break;\n\t\tcase ARG_SAM_NOHEAD: samNoHead = true; break;\n\t\tcase ARG_SAM_NOSQ: samNoSQ = true; break;\n\t\tcase ARG_SAM_PRINT_YI: sam_print_yi = true; break;\n\t\tcase ARG_REORDER: reorder = true; break;\n\t\tcase ARG_MAPQ_EX: {\n\t\t\tsam_print_zp = true;\n\t\t\tsam_print_zu = true;\n\t\t\tsam_print_xp = true;\n\t\t\tsam_print_xss = true;\n\t\t\tsam_print_yn = true;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_SHOW_RAND_SEED: {\n\t\t\tsam_print_zs = true;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_SAMPLE:\n\t\t\tsampleFrac = parse<float>(arg);\n\t\t\tbreak;\n\t\tcase ARG_CP_MIN:\n\t\t\tcminlen = parse<size_t>(arg);\n\t\t\tbreak;\n\t\tcase ARG_CP_IVAL:\n\t\t\tcpow2 = parse<size_t>(arg);\n\t\t\tbreak;\n\t\tcase ARG_TRI:\n\t\t\tdoTri = true;\n\t\t\tbreak;\n\t\tcase ARG_READ_PASSTHRU: {\n\t\t\tsam_print_xr = true;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_READ_TIMES: {\n\t\t\tsam_print_xt = true;\n\t\t\tsam_print_xd = true;\n\t\t\tsam_print_xu = true;\n\t\t\tsam_print_yl = true;\n\t\t\tsam_print_ye = true;\n\t\t\tsam_print_yu = true;\n\t\t\tsam_print_yr = true;\n\t\t\tsam_print_zb = true;\n\t\t\tsam_print_zr = 
true;\n\t\t\tsam_print_zf = true;\n\t\t\tsam_print_zm = true;\n\t\t\tsam_print_zi = true;\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_PARTITION: partitionSz = parse<int>(arg); break;\n\t\tcase ARG_DPAD:\n\t\t\tmaxhalf = parseInt(0, \"--dpad must be no less than 0\", arg);\n\t\t\tbreak;\n\t\tcase ARG_ORIG:\n\t\t\tif(arg == NULL || strlen(arg) == 0) {\n\t\t\t\tcerr << \"--orig arg must be followed by a string\" << endl;\n\t\t\t\tprintUsage(cerr);\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\torigString = arg;\n\t\t\tbreak;\n\t\tcase ARG_NO_DOVETAIL: gDovetailMatesOK = false; break;\n\t\tcase ARG_NO_CONTAIN:  gContainMatesOK  = false; break;\n\t\tcase ARG_NO_OVERLAP:  gOlapMatesOK     = false; break;\n\t\tcase ARG_DOVETAIL:    gDovetailMatesOK = true;  break;\n\t\tcase ARG_CONTAIN:     gContainMatesOK  = true;  break;\n\t\tcase ARG_OVERLAP:     gOlapMatesOK     = true;  break;\n\t\tcase ARG_QC_FILTER: qcFilter = true; break;\n\t\tcase ARG_NO_SCORE_PRIORITY: sortByScore = false; break;\n\t\tcase ARG_IGNORE_QUALS: ignoreQuals = true; break;\n\t\tcase ARG_MAPQ_V: mapqv = parse<int>(arg); break;\n\t\tcase 'P': { presetList.push_back(arg); break; }\n\t\tcase ARG_ALIGN_POLICY: {\n\t\t\tif(strlen(arg) > 0) {\n\t\t\t\tpolstr += \";\"; polstr += arg;\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase 'N': { polstr += \";SEED=\"; polstr += arg; break; }\n\t\tcase 'L': {\n\t\t\tint64_t len = parse<size_t>(arg);\n\t\t\tif(len < 0) {\n\t\t\t\tcerr << \"Error: -L argument must be >= 0; was \" << arg << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tif(len > 32) {\n\t\t\t\tcerr << \"Error: -L argument must be <= 32; was \" << arg << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tpolstr += \";SEEDLEN=\"; polstr += arg; break;\n\t\t}\n\t\tcase 'O':\n\t\t\tmultiseedOff = parse<size_t>(arg);\n\t\t\tbreak;\n\t\tcase 'i': {\n\t\t\tEList<string> args;\n\t\t\ttokenize(arg, \",\", args);\n\t\t\tif(args.size() > 3 || args.size() == 0) {\n\t\t\t\tcerr << \"Error: expected 3 or fewer comma-separated \"\n\t\t\t\t\t << \"arguments to -i 
option, got \"\n\t\t\t\t\t << args.size() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\t// Interval-settings arguments\n\t\t\tpolstr += (\";IVAL=\" + args[0]); // Function type\n\t\t\tif(args.size() > 1) {\n\t\t\t\tpolstr += (\",\" + args[1]);  // Constant term\n\t\t\t}\n\t\t\tif(args.size() > 2) {\n\t\t\t\tpolstr += (\",\" + args[2]);  // Coefficient\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_MULTISEED_IVAL: {\n\t\t\tpolstr += \";\";\n\t\t\t// Split argument by comma\n\t\t\tEList<string> args;\n\t\t\ttokenize(arg, \",\", args);\n\t\t\tif(args.size() > 5 || args.size() == 0) {\n\t\t\t\tcerr << \"Error: expected 5 or fewer comma-separated \"\n\t\t\t\t\t << \"arguments to --multiseed option, got \"\n\t\t\t\t\t << args.size() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\t// Seed mm and length arguments\n\t\t\tpolstr += \"SEED=\";\n\t\t\tpolstr += (args[0]); // # mismatches\n\t\t\tif(args.size() >  1) polstr += (\",\" + args[ 1]); // length\n\t\t\tif(args.size() >  2) polstr += (\";IVAL=\" + args[2]); // Func type\n\t\t\tif(args.size() >  3) polstr += (\",\" + args[ 3]); // Constant term\n\t\t\tif(args.size() >  4) polstr += (\",\" + args[ 4]); // Coefficient\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_N_CEIL: {\n\t\t\t// Split argument by comma\n\t\t\tEList<string> args;\n\t\t\ttokenize(arg, \",\", args);\n\t\t\tif(args.size() > 3) {\n\t\t\t\tcerr << \"Error: expected 3 or fewer comma-separated \"\n\t\t\t\t\t << \"arguments to --n-ceil option, got \"\n\t\t\t\t\t << args.size() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tif(args.size() == 0) {\n\t\t\t\tcerr << \"Error: expected at least one argument to --n-ceil option\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tpolstr += \";NCEIL=\";\n\t\t\tif(args.size() == 3) {\n\t\t\t\tpolstr += (args[0] + \",\" + args[1] + \",\" + args[2]);\n\t\t\t} else {\n                if(args.size() == 1) {\n                    polstr += (\"C,\" + args[0]);\n                } else {\n\t\t\t\t\tpolstr += (args[0] + \",\" + 
args[1]);\n\t\t\t\t}\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_SCORE_MA:  polstr += \";MA=\";    polstr += arg; break;\n\t\tcase ARG_SCORE_MMP: {\n\t\t\tEList<string> args;\n\t\t\ttokenize(arg, \",\", args);\n\t\t\tif(args.size() > 2 || args.size() == 0) {\n\t\t\t\tcerr << \"Error: expected 1 or 2 comma-separated \"\n\t\t\t\t\t << \"arguments to --mmp option, got \" << args.size() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tif(args.size() >= 1) {\n\t\t\t\tpolstr += \";MMP=Q,\";\n\t\t\t\tpolstr += args[0];\n\t\t\t\tif(args.size() >= 2) {\n\t\t\t\t\tpolstr += \",\";\n\t\t\t\t\tpolstr += args[1];\n\t\t\t\t}\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_SCORE_NP:  polstr += \";NP=C\";   polstr += arg; break;\n\t\tcase ARG_SCORE_RDG: polstr += \";RDG=\";   polstr += arg; break;\n\t\tcase ARG_SCORE_RFG: polstr += \";RFG=\";   polstr += arg; break;\n\t\tcase ARG_SCORE_MIN: {\n\t\t\tpolstr += \";\";\n\t\t\tEList<string> args;\n\t\t\ttokenize(arg, \",\", args);\n\t\t\t// Reject both too many and zero arguments; args[0] is read below\n\t\t\tif(args.size() > 3 || args.size() == 0) {\n\t\t\t\tcerr << \"Error: expected 3 or fewer comma-separated \"\n\t\t\t\t\t << \"arguments to --score-min option, got \"\n\t\t\t\t\t << args.size() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tpolstr += (\"MIN=\" + args[0]);\n\t\t\tif(args.size() > 1) {\n\t\t\t\tpolstr += (\",\" + args[1]);\n\t\t\t}\n\t\t\tif(args.size() > 2) {\n\t\t\t\tpolstr += (\",\" + args[2]);\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_DESC: printArgDesc(cout); throw 0;\n\t\tcase 'S': outfile = arg; break;\n\t\tcase 'U': {\n\t\t\tEList<string> args;\n\t\t\ttokenize(arg, \",\", args);\n\t\t\tfor(size_t i = 0; i < args.size(); i++) {\n\t\t\t\tqueries.push_back(args[i]);\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tcase ARG_VERSION: showVersion = 1; break;\n        case ARG_MIN_HITLEN: {\n            minHitLen = parseInt(15, \"--min-hitlen arg must be at least 15\", arg);\n            break;\n        }\n        case ARG_MIN_TOTALLEN: {\n        \tminTotalLen = parseInt(50, \"--min-totallen arg must be at least 
50\", arg);\n        \tbreak;\n        }\n        case ARG_HOST_TAXIDS: {\n            EList<string> args;\n            tokenize(arg, \",\", args);\n            for(size_t i = 0; i < args.size(); i++) {\n                istringstream ss(args[i]);\n                uint64_t tid;\n                ss >> tid;\n                host_taxIDs.push_back(tid);\n            }\n            break;\n        }\n        case ARG_REPORT_FILE: {\n        \treportFile = arg;\n        \tbreak;\n        }\n        case ARG_NO_ABUNDANCE: {\n            abundance_analysis = false;\n            break;\n        }\n        case ARG_NO_TRAVERSE: {\n            tree_traverse = false;\n            break;\n        }\n        case ARG_CLASSIFICATION_RANK: {\n            classification_rank = arg;\n            if(classification_rank != \"strain\" &&\n               classification_rank != \"species\" &&\n               classification_rank != \"genus\" &&\n               classification_rank != \"family\" &&\n               classification_rank != \"order\" &&\n               classification_rank != \"class\" &&\n               classification_rank != \"phylum\") {\n                cerr << \"Error: \" << classification_rank << \" (--classification-rank) should be one of strain, species, genus, family, order, class, and phylum\" << endl;\n                exit(1);\n            }\n            break;\n        }\n        case ARG_EXCLUDE_TAXIDS: {\n            EList<string> args;\n            tokenize(arg, \",\", args);\n            for(size_t i = 0; i < args.size(); i++) {\n                istringstream ss(args[i]);\n                uint64_t tid;\n                ss >> tid;\n                excluded_taxIDs.push_back(tid);\n            }\n            break;\n        }\n\t    case ARG_OUT_FMT: {\n            if (strcmp(arg, \"sam\") == 0) {\n                sam_format = true;\n                parse_col_fmt(\"QNAME,FLAG,RNAME,POS,MAPQ,CIGAR,RNEXT,PNEXT,TLEN,SEQ,QUAL\",\n                              
tab_fmt_cols_str, tab_fmt_cols);\n            } else if (strcmp(arg, \"default\") == 0 || strcmp(arg, \"tab\") == 0) {\n\n            } else {\n                cerr << \"Invalid output format \" << arg << \"!\" << endl;\n                exit(1);\n            }\n\t\t\tbreak;\n\t    }\n\t\tcase ARG_TAB_FMT_COLS: {\n            parse_col_fmt(arg, tab_fmt_cols_str, tab_fmt_cols);\n    \t\tbreak;\n\t\t}\n\n#ifdef USE_SRA\n        case ARG_SRA_ACC: {\n            tokenize(arg, \",\", sra_accs); format = SRA_FASTA;\n            break;\n        }\n#endif\n\tcase ARG_SEPARATOR: \n\t{\n\t\tseparator = true ;\n\t\tbreak ;\n\t} \n\t\tdefault:\n\t\t\tprintUsage(cerr);\n\t\t\tthrow 1;\n\t}\n}\n\n/**\n * Read command-line arguments\n */\nstatic void parseOptions(int argc, const char **argv) {\n\n\tint option_index = 0;\n\tint next_option;\n\tsaw_M = false;\n\tsaw_a = false;\n\tsaw_k = true;\n\tpresetList.clear();\n\tif(startVerbose) { cerr << \"Parsing options: \"; logTime(cerr, true); }\n\twhile(true) {\n\t\tnext_option = getopt_long(\n\t\t\targc, const_cast<char**>(argv),\n\t\t\tshort_options, long_options, &option_index);\n\t\tconst char * arg = optarg;\n\t\tif(next_option == EOF) {\n\t\t\tif(extra_opts_cur < extra_opts.size()) {\n\t\t\t\tnext_option = extra_opts[extra_opts_cur].first;\n\t\t\t\targ = extra_opts[extra_opts_cur].second.c_str();\n\t\t\t\textra_opts_cur++;\n\t\t\t} else {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t\tparseOption(next_option, arg);\n\t}\n\t// Now parse all the presets.  
Might want to pick which presets version to\n\t// use according to other parameters.\n\tunique_ptr<Presets> presets(new PresetsV0());\n\t// Apply default preset\n\tif(!defaultPreset.empty()) {\n\t\tpolstr = applyPreset(defaultPreset, *presets.get()) + polstr;\n\t}\n\t// Apply specified presets\n\tfor(size_t i = 0; i < presetList.size(); i++) {\n\t\tpolstr += applyPreset(presetList[i], *presets.get());\n\t}\n\tfor(size_t i = 0; i < extra_opts.size(); i++) {\n\t\tnext_option = extra_opts[extra_opts_cur].first;\n\t\tconst char *arg = extra_opts[extra_opts_cur].second.c_str();\n\t\tparseOption(next_option, arg);\n\t}\n\t// Remove initial semicolons\n\twhile(!polstr.empty() && polstr[0] == ';') {\n\t\tpolstr = polstr.substr(1);\n\t}\n\tif(gVerbose) {\n\t\tcerr << \"Final policy string: '\" << polstr.c_str() << \"'\" << endl;\n\t}\n\tsize_t failStreakTmp = 0;\n\tSeedAlignmentPolicy::parseString(\n\t\tpolstr,\n\t\tlocalAlign,\n\t\tnoisyHpolymer,\n\t\tignoreQuals,\n\t\tbonusMatchType,\n\t\tbonusMatch,\n\t\tpenMmcType,\n\t\tpenMmcMax,\n\t\tpenMmcMin,\n\t\tpenNType,\n\t\tpenN,\n\t\tpenRdGapConst,\n\t\tpenRfGapConst,\n\t\tpenRdGapLinear,\n\t\tpenRfGapLinear,\n\t\tscoreMin,\n\t\tnCeil,\n\t\tpenNCatPair,\n\t\tmultiseedMms,\n\t\tmultiseedLen,\n\t\tmsIval,\n\t\tfailStreakTmp,\n\t\tnSeedRounds);\n\tif(failStreakTmp > 0) {\n\t\tmaxEeStreak = failStreakTmp;\n\t\tmaxUgStreak = failStreakTmp;\n\t\tmaxDpStreak = failStreakTmp;\n\t}\n\tif(saw_a || saw_k) {\n\t\tmsample = false;\n\t\tmhits = 0;\n\t} else {\n\t\tassert_gt(mhits, 0);\n\t\tmsample = true;\n\t}\n\tif(mates1.size() != mates2.size()) {\n\t\tcerr << \"Error: \" << mates1.size() << \" mate files/sequences were specified with -1, but \" << mates2.size() << endl\n\t\t     << \"mate files/sequences were specified with -2.  
The same number of mate files/\" << endl\n\t\t     << \"sequences must be specified with -1 and -2.\" << endl;\n\t\tthrow 1;\n\t}\n\tif(qualities.size() && format != FASTA) {\n\t\tcerr << \"Error: one or more quality files were specified with -Q but -f was not\" << endl\n\t\t     << \"enabled.  -Q works only in combination with -f and -C.\" << endl;\n\t\tthrow 1;\n\t}\n\tif(qualities1.size() && format != FASTA) {\n\t\tcerr << \"Error: one or more quality files were specified with --Q1 but -f was not\" << endl\n\t\t     << \"enabled.  --Q1 works only in combination with -f and -C.\" << endl;\n\t\tthrow 1;\n\t}\n\tif(qualities2.size() && format != FASTA) {\n\t\tcerr << \"Error: one or more quality files were specified with --Q2 but -f was not\" << endl\n\t\t     << \"enabled.  --Q2 works only in combination with -f and -C.\" << endl;\n\t\tthrow 1;\n\t}\n\tif(qualities1.size() > 0 && mates1.size() != qualities1.size()) {\n\t\tcerr << \"Error: \" << mates1.size() << \" mate files/sequences were specified with -1, but \" << qualities1.size() << endl\n\t\t     << \"quality files were specified with --Q1.  The same number of mate and quality\" << endl\n\t\t     << \"files/sequences must be specified with -1 and --Q1.\" << endl;\n\t\tthrow 1;\n\t}\n\tif(qualities2.size() > 0 && mates2.size() != qualities2.size()) {\n\t\tcerr << \"Error: \" << mates2.size() << \" mate files/sequences were specified with -2, but \" << qualities2.size() << endl\n\t\t     << \"quality files were specified with --Q2.  The same number of mate and quality\" << endl\n\t\t     << \"files/sequences must be specified with -2 and --Q2.\" << endl;\n\t\tthrow 1;\n\t}\n\tif(!rgs.empty() && rgid.empty()) {\n\t\tcerr << \"Warning: --rg was specified without --rg-id also \"\n\t\t     << \"being specified.  
@RG line is not printed unless --rg-id \"\n\t\t\t << \"is specified.\" << endl;\n\t}\n\t// Check for duplicate mate input files\n\tif(format != CMDLINE) {\n\t\tfor(size_t i = 0; i < mates1.size(); i++) {\n\t\t\tfor(size_t j = 0; j < mates2.size(); j++) {\n\t\t\t\tif(mates1[i] == mates2[j] && !gQuiet) {\n\t\t\t\t\tcerr << \"Warning: Same mate file \\\"\" << mates1[i].c_str() << \"\\\" appears as argument to both -1 and -2\" << endl;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\t// If both -s and -u are used, we need to adjust qUpto accordingly\n\t// since it uses rdid to know if we've reached the -u limit (and\n\t// rdids are all shifted up by skipReads)\n\tif(qUpto + skipReads > qUpto) {\n\t\tqUpto += skipReads;\n\t}\n\tif(useShmem && useMm && !gQuiet) {\n\t\tcerr << \"Warning: --shmem overrides --mm...\" << endl;\n\t\tuseMm = false;\n\t}\n\tif(gGapBarrier < 1) {\n\t\tcerr << \"Warning: --gbar was set less than 1 (=\" << gGapBarrier\n\t\t     << \"); setting to 1 instead\" << endl;\n\t\tgGapBarrier = 1;\n\t}\n\tif(multiseedMms >= multiseedLen) {\n\t\tassert_gt(multiseedLen, 0);\n\t\tcerr << \"Warning: seed mismatches (\" << multiseedMms\n\t\t     << \") is not less than seed length (\" << multiseedLen\n\t\t\t << \"); setting mismatches to \" << (multiseedLen-1)\n\t\t\t << \" instead\" << endl;\n\t\tmultiseedMms = multiseedLen-1;\n\t}\n\tsam_print_zm = sam_print_zm && bowtie2p5;\n#ifndef NDEBUG\n\tif(!gQuiet) {\n\t\tcerr << \"Warning: Running in debug mode.  
Please use debug mode only \"\n\t\t\t << \"for diagnosing errors, and not for typical use of Centrifuge.\"\n\t\t\t << endl;\n\t}\n#endif\n}\n\nstatic const char *argv0 = NULL;\n\n/// Create a PatternSourcePerThreadFactory for the current thread according\n/// to the global params and return a pointer to it\nstatic PatternSourcePerThreadFactory*\ncreatePatsrcFactory(PairedPatternSource& _patsrc, int tid) {\n\tPatternSourcePerThreadFactory *patsrcFact;\n\tpatsrcFact = new WrappedPatternSourcePerThreadFactory(_patsrc);\n\tassert(patsrcFact != NULL);\n\treturn patsrcFact;\n}\n\n#define PTHREAD_ATTRS (PTHREAD_CREATE_JOINABLE | PTHREAD_CREATE_DETACHED)\n\ntypedef TIndexOffU index_t;\ntypedef uint16_t local_index_t;\nstatic PairedPatternSource*              multiseed_patsrc;\nstatic Ebwt<index_t>*                    multiseed_ebwtFw;\nstatic Ebwt<index_t>*                    multiseed_ebwtBw;\nstatic Scoring*                          multiseed_sc;\nstatic BitPairReference*                 multiseed_refs;\nstatic AlnSink<index_t>*                 multiseed_msink;\nstatic OutFileBuf*                       multiseed_metricsOfb;\nstatic EList<string>                     multiseed_refnames;\n\n/**\n * Metrics for measuring the work done by the outer read alignment\n * loop.\n */\nstruct OuterLoopMetrics {\n\n\tOuterLoopMetrics() {\n\t    reset();\n\t}\n\n\t/**\n\t * Set all counters to 0.\n\t */\n\tvoid reset() {\n\t\treads = bases = srreads = srbases =\n\t\tfreads = fbases = ureads = ubases = 0;\n\t}\n\n\t/**\n\t * Add the counters in m to the counters in this object.  
This\n\t * is the only safe way to update an OuterLoopMetrics that's shared\n\t * by multiple threads.\n\t */\n\tvoid merge(\n\t\tconst OuterLoopMetrics& m,\n\t\tbool getLock = false)\n\t{\n\t\tThreadSafe ts(&mutex_m, getLock);\n\t\treads += m.reads;\n\t\tbases += m.bases;\n\t\tsrreads += m.srreads;\n\t\tsrbases += m.srbases;\n\t\tfreads += m.freads;\n\t\tfbases += m.fbases;\n\t\tureads += m.ureads;\n\t\tubases += m.ubases;\n\t}\n\n\tuint64_t reads;   // total reads\n\tuint64_t bases;   // total bases\n\tuint64_t srreads; // same-read reads\n\tuint64_t srbases; // same-read bases\n\tuint64_t freads;  // filtered reads\n\tuint64_t fbases;  // filtered bases\n\tuint64_t ureads;  // unfiltered reads\n\tuint64_t ubases;  // unfiltered bases\n\tMUTEX_T mutex_m;\n};\n\n/**\n * Collection of all relevant performance metrics when aligning in\n * multiseed mode.\n */\nstruct PerfMetrics {\n\n\tPerfMetrics() : first(true) { reset(); }\n\n\t/**\n\t * Set all counters to 0.\n\t */\n\tvoid reset() {\n\t\tolm.reset();\n\t\twlm.reset();\n\t\trpm.reset();\n\t\tspm.reset();\n\t\tnbtfiltst = 0;\n\t\tnbtfiltsc = 0;\n\t\tnbtfiltdo = 0;\n\t\t\n\t\tolmu.reset();\n\t\twlmu.reset();\n\t\trpmu.reset();\n\t\tspmu.reset();\n\t\tnbtfiltst_u = 0;\n\t\tnbtfiltsc_u = 0;\n\t\tnbtfiltdo_u = 0;\n        \n        him.reset();\n\t}\n\n\t/**\n\t * Merge a set of specific metrics into this object.\n\t */\n\tvoid merge(\n\t\tconst OuterLoopMetrics *ol,\n\t\tconst WalkMetrics *wl,\n\t\tconst ReportingMetrics *rm,\n\t\tconst SpeciesMetrics *sm,\n\t\tuint64_t nbtfiltst_,\n\t\tuint64_t nbtfiltsc_,\n\t\tuint64_t nbtfiltdo_,\n        const HIMetrics *hi,\n\t\tbool getLock)\n\t{\n\n\t\tThreadSafe ts(&mutex_m, getLock);\n\t\tif(ol != NULL) {\n\t\t\tolmu.merge(*ol, false);\n\t\t}\n\t\tif(wl != NULL) {\n\t\t\twlmu.merge(*wl, false);\n\t\t}\n\t\tif(rm != NULL) {\n\t\t\trpmu.merge(*rm, false);\n\t\t}\n\t\tif (sm != NULL) {\n\t\t\tspmu.merge(*sm, false);\n\t\t}\n\t\tnbtfiltst_u += nbtfiltst_;\n\t\tnbtfiltsc_u += 
nbtfiltsc_;\n\t\tnbtfiltdo_u += nbtfiltdo_;\n        if(hi != NULL) {\n            him.merge(*hi, false);\n        }\n\t}\n\n\t/**\n\t * Reports a matrix of results, incl. column labels, to an OutFileBuf.\n\t * Optionally also sends results to stderr (unbuffered).  Can optionally\n\t * print a per-read record with the read name at the beginning.\n\t */\n\tvoid reportInterval(\n\t\tOutFileBuf* o,        // file to send output to\n\t\tbool metricsStderr,   // additionally output to stderr?\n\t\tbool total,           // true -> report total, otherwise incremental\n\t\tbool sync,            //  synchronize output\n\t\tconst BTString *name) // non-NULL name pointer if is per-read record\n\t{\n\t\tThreadSafe ts(&mutex_m, sync);\n\t\tostringstream stderrSs;\n\t\ttime_t curtime = time(0);\n\t\tchar buf[1024];\n\t\tif(first) {\n\t\t\tconst char *str =\n\t\t\t\t/*  1 */ \"Time\"           \"\\t\"\n\t\t\t\t/*  2 */ \"Read\"           \"\\t\"\n\t\t\t\t/*  3 */ \"Base\"           \"\\t\"\n\t\t\t\t/*  4 */ \"SameRead\"       \"\\t\"\n\t\t\t\t/*  5 */ \"SameReadBase\"   \"\\t\"\n\t\t\t\t/*  6 */ \"UnfilteredRead\" \"\\t\"\n\t\t\t\t/*  7 */ \"UnfilteredBase\" \"\\t\"\n\t\t\t\t\n\t\t\t\t/*  8 */ \"Paired\"         \"\\t\"\n\t\t\t\t/*  9 */ \"Unpaired\"       \"\\t\"\n\t\t\t\t/* 10 */ \"AlConUni\"       \"\\t\"\n\t\t\t\t/* 11 */ \"AlConRep\"       \"\\t\"\n\t\t\t\t/* 12 */ \"AlConFail\"      \"\\t\"\n\t\t\t\t/* 13 */ \"AlDis\"          \"\\t\"\n\t\t\t\t/* 14 */ \"AlConFailUni\"   \"\\t\"\n\t\t\t\t/* 15 */ \"AlConFailRep\"   \"\\t\"\n\t\t\t\t/* 16 */ \"AlConFailFail\"  \"\\t\"\n\t\t\t\t/* 17 */ \"AlConRepUni\"    \"\\t\"\n\t\t\t\t/* 18 */ \"AlConRepRep\"    \"\\t\"\n\t\t\t\t/* 19 */ \"AlConRepFail\"   \"\\t\"\n\t\t\t\t/* 20 */ \"AlUnpUni\"       \"\\t\"\n\t\t\t\t/* 21 */ \"AlUnpRep\"       \"\\t\"\n\t\t\t\t/* 22 */ \"AlUnpFail\"      \"\\t\"\n\t\t\t\t\n\t\t\t\t/* 23 */ \"SeedSearch\"     \"\\t\"\n\t\t\t\t/* 24 */ \"IntraSCacheHit\" \"\\t\"\n\t\t\t\t/* 25 */ \"InterSCacheHit\" 
\"\\t\"\n\t\t\t\t/* 26 */ \"OutOfMemory\"    \"\\t\"\n\t\t\t\t/* 27 */ \"AlBWOp\"         \"\\t\"\n\t\t\t\t/* 28 */ \"AlBWBranch\"     \"\\t\"\n\t\t\t\t/* 29 */ \"ResBWOp\"        \"\\t\"\n\t\t\t\t/* 30 */ \"ResBWBranch\"    \"\\t\"\n\t\t\t\t/* 31 */ \"ResResolve\"     \"\\t\"\n\t\t\t\t/* 34 */ \"ResReport\"      \"\\t\"\n\t\t\t\t/* 35 */ \"RedundantSHit\"  \"\\t\"\n\n\t\t\t\t/* 36 */ \"BestMinEdit0\"   \"\\t\"\n\t\t\t\t/* 37 */ \"BestMinEdit1\"   \"\\t\"\n\t\t\t\t/* 38 */ \"BestMinEdit2\"   \"\\t\"\n\n\t\t\t\t/* 39 */ \"ExactAttempts\"  \"\\t\"\n\t\t\t\t/* 40 */ \"ExactSucc\"      \"\\t\"\n\t\t\t\t/* 41 */ \"ExactRanges\"    \"\\t\"\n\t\t\t\t/* 42 */ \"ExactRows\"      \"\\t\"\n\t\t\t\t/* 43 */ \"ExactOOMs\"      \"\\t\"\n\n\t\t\t\t/* 44 */ \"1mmAttempts\"    \"\\t\"\n\t\t\t\t/* 45 */ \"1mmSucc\"        \"\\t\"\n\t\t\t\t/* 46 */ \"1mmRanges\"      \"\\t\"\n\t\t\t\t/* 47 */ \"1mmRows\"        \"\\t\"\n\t\t\t\t/* 48 */ \"1mmOOMs\"        \"\\t\"\n\n\t\t\t\t/* 49 */ \"UngappedSucc\"   \"\\t\"\n\t\t\t\t/* 50 */ \"UngappedFail\"   \"\\t\"\n\t\t\t\t/* 51 */ \"UngappedNoDec\"  \"\\t\"\n\n\t\t\t\t/* 52 */ \"DPExLt10Gaps\"   \"\\t\"\n\t\t\t\t/* 53 */ \"DPExLt5Gaps\"    \"\\t\"\n\t\t\t\t/* 54 */ \"DPExLt3Gaps\"    \"\\t\"\n\n\t\t\t\t/* 55 */ \"DPMateLt10Gaps\" \"\\t\"\n\t\t\t\t/* 56 */ \"DPMateLt5Gaps\"  \"\\t\"\n\t\t\t\t/* 57 */ \"DPMateLt3Gaps\"  \"\\t\"\n\n\t\t\t\t/* 58 */ \"DP16ExDps\"      \"\\t\"\n\t\t\t\t/* 59 */ \"DP16ExDpSat\"    \"\\t\"\n\t\t\t\t/* 60 */ \"DP16ExDpFail\"   \"\\t\"\n\t\t\t\t/* 61 */ \"DP16ExDpSucc\"   \"\\t\"\n\t\t\t\t/* 62 */ \"DP16ExCol\"      \"\\t\"\n\t\t\t\t/* 63 */ \"DP16ExCell\"     \"\\t\"\n\t\t\t\t/* 64 */ \"DP16ExInner\"    \"\\t\"\n\t\t\t\t/* 65 */ \"DP16ExFixup\"    \"\\t\"\n\t\t\t\t/* 66 */ \"DP16ExGathSol\"  \"\\t\"\n\t\t\t\t/* 67 */ \"DP16ExBt\"       \"\\t\"\n\t\t\t\t/* 68 */ \"DP16ExBtFail\"   \"\\t\"\n\t\t\t\t/* 69 */ \"DP16ExBtSucc\"   \"\\t\"\n\t\t\t\t/* 70 */ \"DP16ExBtCell\"   \"\\t\"\n\t\t\t\t/* 71 */ \"DP16ExCoreRej\"  
\"\\t\"\n\t\t\t\t/* 72 */ \"DP16ExNRej\"     \"\\t\"\n\n\t\t\t\t/* 73 */ \"DP8ExDps\"       \"\\t\"\n\t\t\t\t/* 74 */ \"DP8ExDpSat\"     \"\\t\"\n\t\t\t\t/* 75 */ \"DP8ExDpFail\"    \"\\t\"\n\t\t\t\t/* 76 */ \"DP8ExDpSucc\"    \"\\t\"\n\t\t\t\t/* 77 */ \"DP8ExCol\"       \"\\t\"\n\t\t\t\t/* 78 */ \"DP8ExCell\"      \"\\t\"\n\t\t\t\t/* 79 */ \"DP8ExInner\"     \"\\t\"\n\t\t\t\t/* 80 */ \"DP8ExFixup\"     \"\\t\"\n\t\t\t\t/* 81 */ \"DP8ExGathSol\"   \"\\t\"\n\t\t\t\t/* 82 */ \"DP8ExBt\"        \"\\t\"\n\t\t\t\t/* 83 */ \"DP8ExBtFail\"    \"\\t\"\n\t\t\t\t/* 84 */ \"DP8ExBtSucc\"    \"\\t\"\n\t\t\t\t/* 85 */ \"DP8ExBtCell\"    \"\\t\"\n\t\t\t\t/* 86 */ \"DP8ExCoreRej\"   \"\\t\"\n\t\t\t\t/* 87 */ \"DP8ExNRej\"      \"\\t\"\n\n\t\t\t\t/* 88 */ \"DP16MateDps\"     \"\\t\"\n\t\t\t\t/* 89 */ \"DP16MateDpSat\"   \"\\t\"\n\t\t\t\t/* 90 */ \"DP16MateDpFail\"  \"\\t\"\n\t\t\t\t/* 91 */ \"DP16MateDpSucc\"  \"\\t\"\n\t\t\t\t/* 92 */ \"DP16MateCol\"     \"\\t\"\n\t\t\t\t/* 93 */ \"DP16MateCell\"    \"\\t\"\n\t\t\t\t/* 94 */ \"DP16MateInner\"   \"\\t\"\n\t\t\t\t/* 95 */ \"DP16MateFixup\"   \"\\t\"\n\t\t\t\t/* 96 */ \"DP16MateGathSol\" \"\\t\"\n\t\t\t\t/* 97 */ \"DP16MateBt\"      \"\\t\"\n\t\t\t\t/* 98 */ \"DP16MateBtFail\"  \"\\t\"\n\t\t\t\t/* 99 */ \"DP16MateBtSucc\"  \"\\t\"\n\t\t\t\t/* 100 */ \"DP16MateBtCell\"  \"\\t\"\n\t\t\t\t/* 101 */ \"DP16MateCoreRej\" \"\\t\"\n\t\t\t\t/* 102 */ \"DP16MateNRej\"    \"\\t\"\n\n\t\t\t\t/* 103 */ \"DP8MateDps\"     \"\\t\"\n\t\t\t\t/* 104 */ \"DP8MateDpSat\"   \"\\t\"\n\t\t\t\t/* 105 */ \"DP8MateDpFail\"  \"\\t\"\n\t\t\t\t/* 106 */ \"DP8MateDpSucc\"  \"\\t\"\n\t\t\t\t/* 107 */ \"DP8MateCol\"     \"\\t\"\n\t\t\t\t/* 108 */ \"DP8MateCell\"    \"\\t\"\n\t\t\t\t/* 109 */ \"DP8MateInner\"   \"\\t\"\n\t\t\t\t/* 110 */ \"DP8MateFixup\"   \"\\t\"\n\t\t\t\t/* 111 */ \"DP8MateGathSol\" \"\\t\"\n\t\t\t\t/* 112 */ \"DP8MateBt\"      \"\\t\"\n\t\t\t\t/* 113 */ \"DP8MateBtFail\"  \"\\t\"\n\t\t\t\t/* 114 */ \"DP8MateBtSucc\"  \"\\t\"\n\t\t\t\t/* 115 */ 
\"DP8MateBtCell\"  \"\\t\"\n\t\t\t\t/* 116 */ \"DP8MateCoreRej\" \"\\t\"\n\t\t\t\t/* 117 */ \"DP8MateNRej\"    \"\\t\"\n\n\t\t\t\t/* 118 */ \"DPBtFiltStart\"  \"\\t\"\n\t\t\t\t/* 119 */ \"DPBtFiltScore\"  \"\\t\"\n\t\t\t\t/* 120 */ \"DpBtFiltDom\"    \"\\t\"\n\n\t\t\t\t/* 121 */ \"MemPeak\"        \"\\t\"\n\t\t\t\t/* 122 */ \"UncatMemPeak\"   \"\\t\" // 0\n\t\t\t\t/* 123 */ \"EbwtMemPeak\"    \"\\t\" // EBWT_CAT\n\t\t\t\t/* 124 */ \"CacheMemPeak\"   \"\\t\" // CA_CAT\n\t\t\t\t/* 125 */ \"ResolveMemPeak\" \"\\t\" // GW_CAT\n\t\t\t\t/* 126 */ \"AlignMemPeak\"   \"\\t\" // AL_CAT\n\t\t\t\t/* 127 */ \"DPMemPeak\"      \"\\t\" // DP_CAT\n\t\t\t\t/* 128 */ \"MiscMemPeak\"    \"\\t\" // MISC_CAT\n\t\t\t\t/* 129 */ \"DebugMemPeak\"   \"\\t\" // DEBUG_CAT\n            \n                /* 130 */ \"LocalSearch\"         \"\\t\"\n                /* 131 */ \"AnchorSearch\"        \"\\t\"\n                /* 132 */ \"LocalIndexSearch\"    \"\\t\"\n                /* 133 */ \"LocalExtSearch\"      \"\\t\"\n                /* 134 */ \"LocalSearchRecur\"    \"\\t\"\n                /* 135 */ \"GlobalGenomeCoords\"  \"\\t\"\n                /* 136 */ \"LocalGenomeCoords\"   \"\\t\"\n            \n            \n\t\t\t\t\"\\n\";\n\t\t\t\n\t\t\tif(name != NULL) {\n\t\t\t\tif(o != NULL) o->writeChars(\"Name\\t\");\n\t\t\t\tif(metricsStderr) stderrSs << \"Name\\t\";\n\t\t\t}\n\t\t\t\n\t\t\tif(o != NULL) o->writeChars(str);\n\t\t\tif(metricsStderr) stderrSs << str;\n\t\t\tfirst = false;\n\t\t}\n\t\t\n\t\tif(total) mergeIncrementals();\n\t\t\n\t\t// 0. Read name, if needed\n\t\tif(name != NULL) {\n\t\t\tif(o != NULL) {\n\t\t\t\to->writeChars(name->toZBuf());\n\t\t\t\to->write('\\t');\n\t\t\t}\n\t\t\tif(metricsStderr) {\n\t\t\t\tstderrSs << (*name) << '\\t';\n\t\t\t}\n\t\t}\n\t\t\t\n\t\t// 1. 
Current time in secs\n\t\titoa10<time_t>(curtime, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t\n\t\tconst OuterLoopMetrics& ol = total ? olm : olmu;\n\t\t\n\t\t// 2. Reads\n\t\titoa10<uint64_t>(ol.reads, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 3. Bases\n\t\titoa10<uint64_t>(ol.bases, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 4. Same-read reads\n\t\titoa10<uint64_t>(ol.srreads, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 5. Same-read bases\n\t\titoa10<uint64_t>(ol.srbases, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 6. Unfiltered reads\n\t\titoa10<uint64_t>(ol.ureads, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 7. Unfiltered bases\n\t\titoa10<uint64_t>(ol.ubases, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\n\t\tconst ReportingMetrics& rp = total ? rpm : rpmu;\n\t\t//const SpeciesMetrics& sp = total ? spm : spmu; // TODO: do something with sp\n\n\t\t// 8. Paired reads\n\t\titoa10<uint64_t>(rp.npaired, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 9. Unpaired reads\n\t\titoa10<uint64_t>(rp.nunpaired, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 10. Pairs with unique concordant alignments\n\t\titoa10<uint64_t>(rp.nconcord_uni, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n#if 0\n\t\t// 11. 
Pairs with repetitive concordant alignments\n\t\titoa10<uint64_t>(rp.nconcord_rep, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 12. Pairs with 0 concordant alignments\n\t\titoa10<uint64_t>(rp.nconcord_0, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 13. Pairs with 1 discordant alignment\n\t\titoa10<uint64_t>(rp.ndiscord, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 14. Mates from unaligned pairs that align uniquely\n\t\titoa10<uint64_t>(rp.nunp_0_uni, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 15. Mates from unaligned pairs that align repetitively\n\t\titoa10<uint64_t>(rp.nunp_0_rep, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 16. Mates from unaligned pairs that fail to align\n\t\titoa10<uint64_t>(rp.nunp_0_0, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 17. Mates from repetitive pairs that align uniquely\n\t\titoa10<uint64_t>(rp.nunp_rep_uni, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 18. Mates from repetitive pairs that align repetitively\n\t\titoa10<uint64_t>(rp.nunp_rep_rep, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 19. Mates from repetitive pairs that fail to align\n\t\titoa10<uint64_t>(rp.nunp_rep_0, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 20. 
Unpaired reads that align uniquely\n\t\titoa10<uint64_t>(rp.nunp_uni, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 21. Unpaired reads that align repetitively\n\t\titoa10<uint64_t>(rp.nunp_rep, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 22. Unpaired reads that fail to align\n\t\titoa10<uint64_t>(rp.nunp_0, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n#endif\n        \n\t\tconst WalkMetrics& wl = total ? wlm : wlmu;\n\t\t\n\t\t// 29. Burrows-Wheeler ops in resolver\n\t\titoa10<uint64_t>(wl.bwops, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 30. Burrows-Wheeler branches in resolver\n\t\titoa10<uint64_t>(wl.branches, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 31. Burrows-Wheeler offset resolutions\n\t\titoa10<uint64_t>(wl.resolves, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 34. Offset reports\n\t\titoa10<uint64_t>(wl.reports, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t\n\t\t// 121. Overall memory peak\n\t\titoa10<size_t>(gMemTally.peak() >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 122. Uncategorized memory peak\n\t\titoa10<size_t>(gMemTally.peak(0) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 123. 
Ebwt memory peak\n\t\titoa10<size_t>(gMemTally.peak(EBWT_CAT) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 124. Cache memory peak\n\t\titoa10<size_t>(gMemTally.peak(CA_CAT) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 125. Resolver memory peak\n\t\titoa10<size_t>(gMemTally.peak(GW_CAT) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 126. Seed aligner memory peak\n\t\titoa10<size_t>(gMemTally.peak(AL_CAT) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 127. Dynamic programming aligner memory peak\n\t\titoa10<size_t>(gMemTally.peak(DP_CAT) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 128. Miscellaneous memory peak\n\t\titoa10<size_t>(gMemTally.peak(MISC_CAT) >> 20, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n\t\t// 129. 
Debug memory peak\n\t\titoa10<size_t>(gMemTally.peak(DEBUG_CAT) >> 20, buf);\n        if(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        \n        // 130\n        itoa10<size_t>(him.localatts, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        // 131\n        itoa10<size_t>(him.anchoratts, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        // 132\n        itoa10<size_t>(him.localindexatts, buf);\n\t\tif(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        // 133\n        itoa10<size_t>(him.localextatts, buf);\n        if(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        // 134\n        itoa10<size_t>(him.localsearchrecur, buf);\n        if(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        // 135\n        itoa10<size_t>(him.globalgenomecoords, buf);\n        if(metricsStderr) stderrSs << buf << '\\t';\n\t\tif(o != NULL) { o->writeChars(buf); o->write('\\t'); }\n        // 136\n        itoa10<size_t>(him.localgenomecoords, buf);\n        if(metricsStderr) stderrSs << buf;\n\t\tif(o != NULL) { o->writeChars(buf); }\n\n\t\tif(o != NULL) { o->write('\\n'); }\n\t\tif(metricsStderr) cerr << stderrSs.str().c_str() << endl;\n\t\tif(!total) mergeIncrementals();\n\t}\n\t\n\tvoid mergeIncrementals() {\n\t\tolm.merge(olmu, false);\n\t\twlm.merge(wlmu, false);\n\t\tnbtfiltst_u += nbtfiltst;\n\t\tnbtfiltsc_u += nbtfiltsc;\n\t\tnbtfiltdo_u += nbtfiltdo;\n\n\t\tolmu.reset();\n\t\twlmu.reset();\n\t\trpmu.reset();\n\t\tspmu.reset();\n\t\tnbtfiltst_u = 0;\n\t\tnbtfiltsc_u = 0;\n\t\tnbtfiltdo_u = 0;\n\t}\n\n\t// Total over the whole job\n\tOuterLoopMetrics  olm;   // overall metrics\n\tWalkMetrics       wlm; 
  // metrics related to walking left (i.e. resolving reference offsets)\n\tReportingMetrics  rpm;   // metrics related to reporting\n\tSpeciesMetrics    spm;   // metrics related to species\n\tuint64_t          nbtfiltst;\n\tuint64_t          nbtfiltsc;\n\tuint64_t          nbtfiltdo;\n\n\t// Just since the last update\n\tOuterLoopMetrics  olmu;  // overall metrics\n\tWalkMetrics       wlmu;  // metrics related to walking left (i.e. resolving reference offsets)\n\tReportingMetrics  rpmu;  // metrics related to reporting\n\tSpeciesMetrics    spmu;  // metrics related to species counting\n\tuint64_t          nbtfiltst_u;\n\tuint64_t          nbtfiltsc_u;\n\tuint64_t          nbtfiltdo_u;\n    \n    //\n    HIMetrics         him;\n\n\tMUTEX_T           mutex_m;  // lock for when one ob\n\tbool              first; // yet to print first line?\n\ttime_t            lastElapsed; // used in reportInterval to measure time since last call\n};\n\nstatic PerfMetrics metrics;\n\n// Cyclic rotations\n#define ROTL(n, x) (((x) << (n)) | ((x) >> (32-n)))\n#define ROTR(n, x) (((x) >> (n)) | ((x) << (32-n)))\n\nstatic inline void printMmsSkipMsg(\n\tconst PatternSourcePerThread& ps,\n\tbool paired,\n\tbool mate1,\n\tint seedmms)\n{\n\tostringstream os;\n\tif(paired) {\n\t\tos << \"Warning: skipping mate #\" << (mate1 ? '1' : '2')\n\t\t   << \" of read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"' because length (\" << (mate1 ? ps.bufa().patFw.length() : ps.bufb().patFw.length())\n\t\t   << \") <= # seed mismatches (\" << seedmms << \")\" << endl;\n\t} else {\n\t\tos << \"Warning: skipping read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"' because length (\" << (mate1 ? 
ps.bufa().patFw.length() : ps.bufb().patFw.length())\n\t\t   << \") <= # seed mismatches (\" << seedmms << \")\" << endl;\n\t}\n\tcerr << os.str().c_str();\n}\n\nstatic inline void printLenSkipMsg(\n\tconst PatternSourcePerThread& ps,\n\tbool paired,\n\tbool mate1)\n{\n\tostringstream os;\n\tif(paired) {\n\t\tos << \"Warning: skipping mate #\" << (mate1 ? '1' : '2')\n\t\t   << \" of read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"' because it was < 2 characters long\" << endl;\n\t} else {\n\t\tos << \"Warning: skipping read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"' because it was < 2 characters long\" << endl;\n\t}\n\tcerr << os.str().c_str();\n}\n\nstatic inline void printLocalScoreMsg(\n\tconst PatternSourcePerThread& ps,\n\tbool paired,\n\tbool mate1)\n{\n\tostringstream os;\n\tif(paired) {\n\t\tos << \"Warning: minimum score function gave negative number in \"\n\t\t   << \"--local mode for mate #\" << (mate1 ? '1' : '2')\n\t\t   << \" of read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"; setting to 0 instead\" << endl;\n\t} else {\n\t\tos << \"Warning: minimum score function gave negative number in \"\n\t\t   << \"--local mode for read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"; setting to 0 instead\" << endl;\n\t}\n\tcerr << os.str().c_str();\n}\n\nstatic inline void printEEScoreMsg(\n\tconst PatternSourcePerThread& ps,\n\tbool paired,\n\tbool mate1)\n{\n\tostringstream os;\n\tif(paired) {\n\t\tos << \"Warning: minimum score function gave positive number in \"\n\t\t   << \"--end-to-end mode for mate #\" << (mate1 ? '1' : '2')\n\t\t   << \" of read '\" << (mate1 ? ps.bufa().name : ps.bufb().name)\n\t\t   << \"; setting to 0 instead\" << endl;\n\t} else {\n\t\tos << \"Warning: minimum score function gave positive number in \"\n\t\t   << \"--end-to-end mode for read '\" << (mate1 ? 
ps.bufa().name : ps.bufb().name)\n\t\t   << \"; setting to 0 instead\" << endl;\n\t}\n\tcerr << os.str().c_str();\n}\n\n\n#define MERGE_METRICS(met, sync) { \\\n\tmsink.mergeMetrics(rpm); \\\n\tmet.merge( \\\n\t\t&olm, \\\n\t\t&wlm, \\\n\t\t&rpm, \\\n\t\t&spm, \\\n\t\tnbtfiltst, \\\n\t\tnbtfiltsc, \\\n\t\tnbtfiltdo, \\\n        &him, \\\n\t\tsync); \\\n\tolm.reset(); \\\n\twlm.reset(); \\\n\trpm.reset(); \\\n\tspm.reset(); \\\n    him.reset(); \\\n}\n\n/**\n * Called once per thread.  Sets up per-thread pointers to the shared global\n * data structures, creates per-thread structures, then enters the alignment\n * loop.  The general flow of the alignment loop is:\n *\n * - If it's been a while and we're the master thread, report some alignment\n *   metrics\n * - Get the next read/pair\n * - Check if this read/pair is identical to the previous\n *   + If identical, check whether we can skip any or all alignment stages.  If\n *     we can skip all stages, report the result immediately and move to next\n *     read/pair\n *   + If not identical, continue\n * -\n */\nstatic void multiseedSearchWorker(void *vp) {\n\n\tint tid = *((int*)vp);\n\tassert(multiseed_ebwtFw != NULL);\n\tassert(multiseedMms == 0 || multiseed_ebwtBw != NULL);\n\tPairedPatternSource&             patsrc   = *multiseed_patsrc;\n\tconst Ebwt<index_t>&             ebwtFw   = *multiseed_ebwtFw;\n\tconst Ebwt<index_t>&             ebwtBw   = *multiseed_ebwtBw;\n\tconst Scoring&                   sc       = *multiseed_sc;\n\tconst BitPairReference&          ref      = *multiseed_refs;\n\tAlnSink<index_t>&                msink    = *multiseed_msink;\n\tOutFileBuf*                      metricsOfb = multiseed_metricsOfb;\n    \n\t// Sinks: these are so that we can print tables encoding counts for\n\t// events of interest on a per-read, per-seed, per-join, or per-SW\n\t// level.  
These in turn can be used to diagnose performance\n\t// problems, or generally characterize performance.\n\t\n\t//const BitPairReference& refs   = *multiseed_refs;\n\tunique_ptr<PatternSourcePerThreadFactory> patsrcFact(createPatsrcFactory(patsrc, tid));\n\tunique_ptr<PatternSourcePerThread> ps(patsrcFact->create());\n\t\n\t// Instantiate an object for holding reporting-related parameters.\n    ReportingParams rp((allHits ? std::numeric_limits<THitInt>::max() : khits),\n                       ebwtFw.compressed()); // -k\n\n\t// Make a per-thread wrapper for the global MHitSink object.\n\tAlnSinkWrap<index_t> msinkwrap(\n                                   msink,         // global sink\n                                   rp,            // reporting parameters\n                                   (size_t)tid);  // thread id\n    \n    Classifier<index_t, local_index_t> classifier(\n                                                  ebwtFw,\n                                                  multiseed_refnames,\n                                                  gMate1fw,\n                                                  gMate2fw,\n                                                  minHitLen,\n                                                  tree_traverse,\n                                                  classification_rank,\n                                                  host_taxIDs,\n                                                  excluded_taxIDs);\n\tOuterLoopMetrics olm;\n\tWalkMetrics wlm;\n\tReportingMetrics rpm;\n\tPerReadMetrics prm;\n\tSpeciesMetrics spm;\n\n\tRandomSource rnd, rndArb;\n\tuint64_t nbtfiltst = 0; // TODO: find a new home for these\n\tuint64_t nbtfiltsc = 0; // TODO: find a new home for these\n\tuint64_t nbtfiltdo = 0; // TODO: find a new home for these\n    HIMetrics him;\n    \n\tASSERT_ONLY(BTDnaString tmp);\n    \n\tint pepolFlag;\n\tif(gMate1fw && gMate2fw) {\n\t\tpepolFlag = PE_POLICY_FF;\n\t} else if(gMate1fw && !gMate2fw) 
{\n\t\tpepolFlag = PE_POLICY_FR;\n\t} else if(!gMate1fw && gMate2fw) {\n\t\tpepolFlag = PE_POLICY_RF;\n\t} else {\n\t\tpepolFlag = PE_POLICY_RR;\n\t}\n\tassert_geq(gMaxInsert, gMinInsert);\n\tassert_geq(gMinInsert, 0);\n\tPairedEndPolicy pepol(\n                          pepolFlag,\n                          gMaxInsert,\n                          gMinInsert,\n                          localAlign,\n                          gFlippedMatesOK,\n                          gDovetailMatesOK,\n                          gContainMatesOK,\n                          gOlapMatesOK,\n                          gExpandToFrag);\n    \n  \tPerfMetrics metricsPt; // per-thread metrics object; for read-level metrics\n\tBTString nametmp;\n\t\n\t// Used by thread with threadid == 1 to measure time elapsed\n\ttime_t iTime = time(0);\n    \n\t// Keep track of whether last search was exhaustive for mates 1 and 2\n\tbool exhaustive[2] = { false, false };\n\t// Keep track of whether mates 1/2 were filtered out last time through\n\tbool filt[2]    = { true, true };\n\t// Keep track of whether mates 1/2 were filtered out due to Ns last time\n\tbool nfilt[2]   = { true, true };\n\t// Keep track of whether mates 1/2 were filtered out due to not having\n\t// enough characters to rise above the score threshold.\n\tbool scfilt[2]  = { true, true };\n\t// Keep track of whether mates 1/2 were filtered out due to not having\n\t// more characters than the number of mismatches permitted in a seed.\n\tbool lenfilt[2] = { true, true };\n\t// Keep track of whether mates 1/2 were filtered out by upstream qc\n\tbool qcfilt[2]  = { true, true };\n    \n\trndArb.init((uint32_t)time(0));\n\tint mergei = 0;\n\tint mergeival = 16;\n\twhile(true) {\n\t\tbool success = false, done = false, paired = false;\n\t\tps->nextReadPair(success, done, paired, outType != OUTPUT_SAM);\n\t\tif(!success && done) {\n\t\t\tbreak;\n\t\t} else if(!success) {\n\t\t\tcontinue;\n\t\t}\n\t\tTReadId rdid = ps->rdid();\n\t\tbool sample = 
true;\n\t\tif(arbitraryRandom) {\n\t\t\tps->bufa().seed = rndArb.nextU32();\n\t\t\tps->bufb().seed = rndArb.nextU32();\n\t\t}\n\t\tif(sampleFrac < 1.0f) {\n\t\t\trnd.init(ROTL(ps->bufa().seed, 2));\n\t\t\tsample = rnd.nextFloat() < sampleFrac;\n\t\t}\n\t\tif(rdid >= skipReads && rdid < qUpto && sample) {\n\t\t\t// Align this read/pair\n\t\t\tbool retry = true;\n\t\t\t//\n\t\t\t// Check if there is metrics reporting for us to do.\n\t\t\t//\n\t\t\tif(metricsIval > 0 &&\n\t\t\t   (metricsOfb != NULL || metricsStderr) &&\n\t\t\t   !metricsPerRead &&\n\t\t\t   ++mergei == mergeival)\n\t\t\t{\n\t\t\t\t// Do a periodic merge.  Update global metrics, in a\n\t\t\t\t// synchronized manner if needed.\n\t\t\t\tMERGE_METRICS(metrics, nthreads > 1);\n\t\t\t\tmergei = 0;\n\t\t\t\t// Check if a progress message should be printed\n\t\t\t\tif(tid == 0) {\n\t\t\t\t\t// Only thread 1 prints progress messages\n\t\t\t\t\ttime_t curTime = time(0);\n\t\t\t\t\tif(curTime - iTime >= metricsIval) {\n\t\t\t\t\t\tmetrics.reportInterval(metricsOfb, metricsStderr, false, true, NULL);\n\t\t\t\t\t\tiTime = curTime;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tprm.reset(); // per-read metrics\n\t\t\tprm.doFmString = false;\n\t\t\tif(sam_print_xt) {\n\t\t\t\tgettimeofday(&prm.tv_beg, &prm.tz_beg);\n\t\t\t}\n\t\t\t// Try to align this read\n\t\t\twhile(retry) {\n\t\t\t\tretry = false;\n\t\t\t\tassert_eq(ps->bufa().color, false);\n\t\t\t\tolm.reads++;\n\t\t\t\tbool pair = paired;\n\t\t\t\tconst size_t rdlen1 = ps->bufa().length();\n\t\t\t\tconst size_t rdlen2 = pair ? ps->bufb().length() : 0;\n\t\t\t\tolm.bases += (rdlen1 + rdlen2);\n\t\t\t\tmsinkwrap.nextRead(\n                                   &ps->bufa(),\n                                   pair ? 
&ps->bufb() : NULL,\n                                   rdid,\n                                   sc.qualitiesMatter());\n\t\t\t\tassert(msinkwrap.inited());\n\t\t\t\tsize_t rdlens[2] = { rdlen1, rdlen2 };\n\t\t\t\t// Calculate the minimum valid score threshold for the read\n\t\t\t\tTAlScore minsc[2], maxpen[2];\n\t\t\t\tmaxpen[0] = maxpen[1] = 0;\n\t\t\t\tminsc[0] = minsc[1] = std::numeric_limits<TAlScore>::max();\n\t\t\t\tif(bwaSwLike) {\n\t\t\t\t\t// From BWA-SW manual: \"Given an l-long query, the\n\t\t\t\t\t// threshold for a hit to be retained is\n\t\t\t\t\t// a*max{T,c*log(l)}.\"  We try to recreate that here.\n\t\t\t\t\tfloat a = (float)sc.match(30);\n\t\t\t\t\tfloat T = bwaSwLikeT, c = bwaSwLikeC;\n\t\t\t\t\tminsc[0] = (TAlScore)max<float>(a*T, a*c*log(rdlens[0]));\n\t\t\t\t\tif(paired) {\n\t\t\t\t\t\tminsc[1] = (TAlScore)max<float>(a*T, a*c*log(rdlens[1]));\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tminsc[0] = scoreMin.f<TAlScore>(rdlens[0]);\n\t\t\t\t\tif(paired) minsc[1] = scoreMin.f<TAlScore>(rdlens[1]);\n\t\t\t\t\tif(localAlign) {\n\t\t\t\t\t\tif(minsc[0] < 0) {\n\t\t\t\t\t\t\tif(!gQuiet) printLocalScoreMsg(*ps, paired, true);\n\t\t\t\t\t\t\tminsc[0] = 0;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(paired && minsc[1] < 0) {\n\t\t\t\t\t\t\tif(!gQuiet) printLocalScoreMsg(*ps, paired, false);\n\t\t\t\t\t\t\tminsc[1] = 0;\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tif(minsc[0] > 0) {\n\t\t\t\t\t\t\tif(!gQuiet) printEEScoreMsg(*ps, paired, true);\n\t\t\t\t\t\t\tminsc[0] = 0;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(paired && minsc[1] > 0) {\n\t\t\t\t\t\t\tif(!gQuiet) printEEScoreMsg(*ps, paired, false);\n\t\t\t\t\t\t\tminsc[1] = 0;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n                \n\t\t\t\t// N filter; does the read have too many Ns?\n\t\t\t\tsize_t readns[2] = {0, 0};\n\t\t\t\tsc.nFilterPair(\n                               &ps->bufa().patFw,\n                               pair ? 
&ps->bufb().patFw : NULL,\n                               readns[0],\n                               readns[1],\n                               nfilt[0],\n                               nfilt[1]);\n\t\t\t\t// Score filter; does the read have enough characters to rise above\n\t\t\t\t// the score threshold?\n\t\t\t\tscfilt[0] = sc.scoreFilter(minsc[0], rdlens[0]);\n\t\t\t\tscfilt[1] = sc.scoreFilter(minsc[1], rdlens[1]);\n\t\t\t\tlenfilt[0] = lenfilt[1] = true;\n\t\t\t\tif(rdlens[0] <= (size_t)multiseedMms || rdlens[0] < 2) {\n\t\t\t\t\tif(!gQuiet) printMmsSkipMsg(*ps, paired, true, multiseedMms);\n\t\t\t\t\tlenfilt[0] = false;\n\t\t\t\t}\n\t\t\t\tif((rdlens[1] <= (size_t)multiseedMms || rdlens[1] < 2) && paired) {\n\t\t\t\t\tif(!gQuiet) printMmsSkipMsg(*ps, paired, false, multiseedMms);\n\t\t\t\t\tlenfilt[1] = false;\n\t\t\t\t}\n\t\t\t\tif(rdlens[0] < 2) {\n\t\t\t\t\tif(!gQuiet) printLenSkipMsg(*ps, paired, true);\n\t\t\t\t\tlenfilt[0] = false;\n\t\t\t\t}\n\t\t\t\tif(rdlens[1] < 2 && paired) {\n\t\t\t\t\tif(!gQuiet) printLenSkipMsg(*ps, paired, false);\n\t\t\t\t\tlenfilt[1] = false;\n\t\t\t\t}\n\t\t\t\tqcfilt[0] = qcfilt[1] = true;\n\t\t\t\tif(qcFilter) {\n\t\t\t\t\tqcfilt[0] = (ps->bufa().filter != '0');\n\t\t\t\t\tqcfilt[1] = (ps->bufb().filter != '0');\n\t\t\t\t}\n\t\t\t\tfilt[0] = (nfilt[0] && scfilt[0] && lenfilt[0] && qcfilt[0]);\n\t\t\t\tfilt[1] = (nfilt[1] && scfilt[1] && lenfilt[1] && qcfilt[1]);\n\t\t\t\tprm.nFilt += (filt[0] ? 0 : 1) + (filt[1] ? 0 : 1);\n\t\t\t\tRead* rds[2] = { &ps->bufa(), &ps->bufb() };\n\t\t\t\t// For each mate...\n\t\t\t\tassert(msinkwrap.empty());\n\t\t\t\t//size_t minedfw[2] = { 0, 0 };\n\t\t\t\t//size_t minedrc[2] = { 0, 0 };\n\t\t\t\t// Calculate nofw / norc\n\t\t\t\tbool nofw[2] = { false, false };\n\t\t\t\tbool norc[2] = { false, false };\n\t\t\t\tnofw[0] = paired ? (gMate1fw ? gNofw : gNorc) : gNofw;\n\t\t\t\tnorc[0] = paired ? (gMate1fw ? gNorc : gNofw) : gNorc;\n\t\t\t\tnofw[1] = paired ? (gMate2fw ? 
gNofw : gNorc) : gNofw;\n\t\t\t\tnorc[1] = paired ? (gMate2fw ? gNorc : gNofw) : gNorc;\n\t\t\t\t// Calculate nceil\n\t\t\t\tint nceil[2] = { 0, 0 };\n\t\t\t\tnceil[0] = nCeil.f<int>((double)rdlens[0]);\n\t\t\t\tnceil[0] = min(nceil[0], (int)rdlens[0]);\n\t\t\t\tif(paired) {\n\t\t\t\t\tnceil[1] = nCeil.f<int>((double)rdlens[1]);\n\t\t\t\t\tnceil[1] = min(nceil[1], (int)rdlens[1]);\n\t\t\t\t}\n\t\t\t\texhaustive[0] = exhaustive[1] = false;\n\t\t\t\t//size_t matemap[2] = { 0, 1 };\n\t\t\t\tbool pairPostFilt = filt[0] && filt[1];\n\t\t\t\tif(pairPostFilt) {\n\t\t\t\t\trnd.init(ps->bufa().seed ^ ps->bufb().seed);\n\t\t\t\t} else {\n\t\t\t\t\trnd.init(ps->bufa().seed);\n\t\t\t\t}\n\t\t\t\t// Calculate interval length for both mates\n\t\t\t\tint interval[2] = { 0, 0 };\n\t\t\t\tfor(size_t mate = 0; mate < (pair ? 2:1); mate++) {\n\t\t\t\t\tinterval[mate] = msIval.f<int>((double)rdlens[mate]);\n\t\t\t\t\tif(filt[0] && filt[1]) {\n\t\t\t\t\t\t// Boost interval length by 20% for paired-end reads\n\t\t\t\t\t\tinterval[mate] = (int)(interval[mate] * 1.2 + 0.5);\n\t\t\t\t\t}\n\t\t\t\t\tinterval[mate] = max(interval[mate], 1);\n\t\t\t\t}\n\t\t\t\t// Calculate streak length\n\t\t\t\tsize_t streak[2]    = { maxDpStreak,   maxDpStreak };\n\t\t\t\tsize_t mtStreak[2]  = { maxMateStreak, maxMateStreak };\n\t\t\t\tsize_t mxDp[2]      = { maxDp,         maxDp       };\n\t\t\t\tsize_t mxUg[2]      = { maxUg,         maxUg       };\n\t\t\t\tsize_t mxIter[2]    = { maxIters,      maxIters    };\n\t\t\t\tif(allHits) {\n\t\t\t\t\tstreak[0]   = streak[1]   = std::numeric_limits<size_t>::max();\n\t\t\t\t\tmtStreak[0] = mtStreak[1] = std::numeric_limits<size_t>::max();\n\t\t\t\t\tmxDp[0]     = mxDp[1]     = std::numeric_limits<size_t>::max();\n\t\t\t\t\tmxUg[0]     = mxUg[1]     = std::numeric_limits<size_t>::max();\n\t\t\t\t\tmxIter[0]   = mxIter[1]   = std::numeric_limits<size_t>::max();\n\t\t\t\t} else if(khits > 1) {\n\t\t\t\t\tfor(size_t mate = 0; mate < 2; mate++) 
{\n\t\t\t\t\t\tstreak[mate]   += (khits-1) * maxStreakIncr;\n\t\t\t\t\t\tmtStreak[mate] += (khits-1) * maxStreakIncr;\n\t\t\t\t\t\tmxDp[mate]     += (khits-1) * maxItersIncr;\n\t\t\t\t\t\tmxUg[mate]     += (khits-1) * maxItersIncr;\n\t\t\t\t\t\tmxIter[mate]   += (khits-1) * maxItersIncr;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif(filt[0] && filt[1]) {\n\t\t\t\t\tstreak[0] = (size_t)ceil((double)streak[0] / 2.0);\n\t\t\t\t\tstreak[1] = (size_t)ceil((double)streak[1] / 2.0);\n\t\t\t\t\tassert_gt(streak[1], 0);\n\t\t\t\t}\n\t\t\t\tassert_gt(streak[0], 0);\n\t\t\t\t// Calculate # seed rounds for each mate\n\n\t\t\t\tsize_t nrounds[2] = { nSeedRounds, nSeedRounds };\n\t\t\t\tif(filt[0] && filt[1]) {\n\t\t\t\t\tnrounds[0] = (size_t)ceil((double)nrounds[0] / 2.0);\n\t\t\t\t\tnrounds[1] = (size_t)ceil((double)nrounds[1] / 2.0);\n\t\t\t\t\tassert_gt(nrounds[1], 0);\n\t\t\t\t}\n\t\t\t\tassert_gt(nrounds[0], 0);\n\t\t\t\t// Increment counters according to what got filtered\n\t\t\t\tfor(size_t mate = 0; mate < (pair ? 2:1); mate++) {\n\t\t\t\t\tif(!filt[mate]) {\n\t\t\t\t\t\t// Mate was rejected by N filter\n\t\t\t\t\t\tolm.freads++;               // reads filtered out\n\t\t\t\t\t\tolm.fbases += rdlens[mate]; // bases filtered out\n\t\t\t\t\t} else {\n\t\t\t\t\t\t//shs[mate].clear();\n\t\t\t\t\t\t//shs[mate].nextRead(mate == 0 ? 
ps->bufa() : ps->bufb());\n\t\t\t\t\t\t//assert(shs[mate].empty());\n\t\t\t\t\t\tolm.ureads++;               // reads passing filter\n\t\t\t\t\t\tolm.ubases += rdlens[mate]; // bases passing filter\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\t//size_t eePeEeltLimit = std::numeric_limits<size_t>::max();\n\t\t\t\t// Whether we're done with mate1 / mate2\n                bool done[2] = { !filt[0], !filt[1] };\n\t\t\t\t// size_t nelt[2] = {0, 0};\n                if(filt[0] && filt[1]) {\n                    classifier.initReads(rds, nofw, norc, minsc, maxpen);\n                } \n                else if(filt[0]) {\n                    classifier.initRead(rds[0], nofw[0], norc[0], minsc[0], maxpen[0], filt[1]);\n                }\n                else if(filt[1]) {\n                    classifier.initRead(rds[1], nofw[1], norc[1], minsc[1], maxpen[1], filt[1]);\n                }\n\t\telse\n\t\t{\n\t\t    classifier.reportUnclassified( msinkwrap ) ;\n\t\t}\n\n                if(filt[0] || filt[1]) {\n                    classifier.go(sc, ebwtFw, ebwtBw, ref, wlm, prm, him, spm, rnd, msinkwrap);\n                    size_t mate = 0;\n                    if(!done[mate]) {\n                        TAlScore perfectScore = sc.perfectScore(rdlens[mate]);\n                        if(!done[mate] && minsc[mate] == perfectScore) {\n                            done[mate] = true;\n                        }\n                    }\n                }\n\n                for(size_t i = 0; i < 2; i++) {\n                    assert_leq(prm.nExIters, mxIter[i]);\n                    assert_leq(prm.nExDps,   mxDp[i]);\n                    assert_leq(prm.nMateDps, mxDp[i]);\n                    assert_leq(prm.nExUgs,   mxUg[i]);\n                    assert_leq(prm.nMateUgs, mxUg[i]);\n                    assert_leq(prm.nDpFail,  streak[i]);\n                    assert_leq(prm.nUgFail,  streak[i]);\n                    assert_leq(prm.nEeFail,  streak[i]);\n                }\n                \n          
      // Commit and report paired-end/unpaired alignments\n\t\t\t\tmsinkwrap.finishRead(\n                                     NULL,\n                                     NULL,\n                                     exhaustive[0],        // exhausted seed hits for mate 1?\n                                     exhaustive[1],        // exhausted seed hits for mate 2?\n                                     nfilt[0],\n                                     nfilt[1],\n                                     scfilt[0],\n                                     scfilt[1],\n                                     lenfilt[0],\n                                     lenfilt[1],\n                                     qcfilt[0],\n                                     qcfilt[1],\n                                     sortByScore,          // prioritize by alignment score\n                                     rnd,                  // pseudo-random generator\n                                     rpm,                  // reporting metrics\n\t\t\t\t\t\t\t\t\t spm,                  // species metrics\n                                     prm,                  // per-read metrics\n                                     !seedSumm,            // suppress seed summaries?\n                                     seedSumm);            // suppress alignments?\n\t\t\t\tassert(!retry || msinkwrap.empty());\n            } // while(retry)\n\t\t} // if(rdid >= skipReads && rdid < qUpto)\n\t\telse if(rdid >= qUpto) {\n\t\t\tbreak;\n\t\t}\n\n\t\tif(metricsPerRead) {\n\t\t\tMERGE_METRICS(metricsPt, nthreads > 1);\n\t\t\tnametmp = ps->bufa().name;\n\t\t\tmetricsPt.reportInterval(\n                                     metricsOfb, metricsStderr, true, true, &nametmp);\n\t\t\tmetricsPt.reset();\n\t\t}\n\t} // while(true)\n\t\n\t// One last metrics merge\n\tMERGE_METRICS(metrics, nthreads > 1);\n    \n\treturn;\n}\n\n/**\n * Called once per alignment job.  
Sets up global pointers to the\n * shared global data structures, creates per-thread structures, then\n * enters the search loop.\n */\nstatic void multiseedSearch(\n\tScoring& sc,\n\tPairedPatternSource& patsrc,  // pattern source\n\tAlnSink<index_t>& msink,      // hit sink\n\tEbwt<index_t>& ebwtFw,        // index of original text\n\tEbwt<index_t>& ebwtBw,        // index of mirror text\n    BitPairReference* refs,\n    const EList<string>& refnames,\n\tOutFileBuf *metricsOfb)\n{\n\n    multiseed_patsrc = &patsrc;\n\tmultiseed_msink  = &msink;\n\tmultiseed_ebwtFw = &ebwtFw;\n\tmultiseed_ebwtBw = &ebwtBw;\n\tmultiseed_sc     = &sc;\n\tmultiseed_metricsOfb      = metricsOfb;\n\tmultiseed_refs = refs;\n    multiseed_refnames = refnames;\n\tAutoArray<tthread::thread*> threads(nthreads);\n\tAutoArray<int> tids(nthreads);\n#if 0\n\tif(multiseedMms > 0 || do1mmUpFront) {\n\t\t// Load the other half of the index into memory\n\t\tassert(!ebwtBw.isInMemory());\n\t\tTimer _t(cerr, \"Time loading mirror index: \", timing);\n\t\tebwtBw.loadIntoMemory(\n\t\t\t0, // colorspace?\n\t\t\t// It's bidirectional search, so we need the reverse to be\n\t\t\t// constructed as the reverse of the concatenated strings.\n\t\t\t1,\n\t\t\ttrue,        // load SA samp in reverse index\n\t\t\ttrue,         // yes, need ftab in reverse index\n\t\t\ttrue,        // load rstarts in reverse index\n\t\t\t!noRefNames,  // load names?\n\t\t\tstartVerbose);\n\t}\n#endif\n\t// Start the metrics thread\n\t{\n\t\tTimer _t(cerr, \"Multiseed full-index search: \", timing);\n\n\t\tthread_rids.resize(nthreads);\n\t\tthread_rids.fill(0);\n\t\tfor(int i = 0; i < nthreads; i++) {\n\t\t\t// Thread IDs start at 1\n\t\t\ttids[i] = i+1;\n\t\t\tthreads[i] = new tthread::thread(multiseedSearchWorker, (void*)&tids[i]);\n\t\t}\n\n\t\tfor (int i = 0; i < nthreads; i++)\n\t\t\tthreads[i]->join();\n\n\t}\n\tif(!metricsPerRead && (metricsOfb != NULL || metricsStderr)) {\n\t\tmetrics.reportInterval(metricsOfb, 
metricsStderr, true, false, NULL);\n\t}\n}\n\nstatic string argstr;\n\nextern void initializeCntLut();\n\ntemplate<typename TStr>\nstatic void driver(\n\tconst char * type,\n\tconst string& bt2indexBase,\n\tconst string& outfile)\n{\n\tif(gVerbose || startVerbose)  {\n\t\tcerr << \"Entered driver(): \"; logTime(cerr, true);\n\t}\n\n\tinitializeCntLut();\n\n\t// Vector of the reference sequences; used for sanity-checking\n\tEList<SString<char> > names, os;\n\tEList<size_t> nameLens, seqLens;\n\t// Read reference sequences from the command-line or from a FASTA file\n\tif(!origString.empty()) {\n\t\t// Read fasta file(s)\n\t\tEList<string> origFiles;\n\t\ttokenize(origString, \",\", origFiles);\n\t\tparseFastas(origFiles, names, nameLens, os, seqLens);\n\t}\n\tPatternParams pp(\n\t\t\tformat,        // file format\n\t\t\tfileParallel,  // true -> wrap files with separate PairedPatternSources\n\t\t\tseed,          // pseudo-random seed\n\t\t\tuseSpinlock,   // use spin locks instead of pthreads\n\t\t\tsolexaQuals,   // true -> qualities are on solexa64 scale\n\t\t\tphred64Quals,  // true -> qualities are on phred64 scale\n\t\t\tintegerQuals,  // true -> qualities are space-separated numbers\n\t\t\tfuzzy,         // true -> try to parse fuzzy fastq\n\t\t\tfastaContLen,  // length of sampled reads for FastaContinuous...\n\t\t\tfastaContFreq, // frequency of sampled reads for FastaContinuous...\n\t\t\tskipReads      // skip the first 'skip' patterns\n\t\t\t);\n\tif(gVerbose || startVerbose) {\n\t\tcerr << \"Creating PatternSource: \"; logTime(cerr, true);\n\t}\n\t// Open hit output file\n\tif(gVerbose || startVerbose) {\n\t\tcerr << \"Opening hit output file: \"; logTime(cerr, true);\n\t}\n\tOutFileBuf *fout;\n\tif(!outfile.empty()) {\n\t\tfout = new OutFileBuf(outfile.c_str(), false);\n\t} else {\n\t\tfout = new OutFileBuf();\n\t}\n\t// Initialize Ebwt object and read in header\n\tif(gVerbose || startVerbose) {\n\t\tcerr << \"About to initialize fw Ebwt: \"; 
logTime(cerr, true);\n\t}\n\tadjIdxBase = adjustEbwtBase(argv0, bt2indexBase, gVerbose);\n\tEbwt<index_t> ebwt(\n\t\t\tadjIdxBase,\n\t\t\t0,        // index is colorspace\n\t\t\t-1,       // fw index\n\t\t\ttrue,     // index is for the forward direction\n\t\t\t/* overriding: */ offRate,\n\t\t\t0, // amount to add to index offrate or <= 0 to do nothing\n\t\t\tuseMm,    // whether to use memory-mapped files\n\t\t\tuseShmem, // whether to use shared memory\n\t\t\tmmSweep,  // sweep memory-mapped files\n\t\t\t!noRefNames, // load names?\n\t\t\ttrue,        // load SA sample?\n\t\t\ttrue,        // load ftab?\n\t\t\ttrue,        // load rstarts?\n\t\t\tgVerbose, // whether to be talkative\n\t\t\tstartVerbose, // talkative during initialization\n\t\t\tfalse /*passMemExc*/,\n\t\t\tsanityCheck);\n\tEbwt<index_t>* ebwtBw = NULL;\n#if 0\n\t// We need the mirror index if mismatches are allowed\n\tif(multiseedMms > 0 || do1mmUpFront) {\n\t\tif(gVerbose || startVerbose) {\n\t\t\tcerr << \"About to initialize rev Ebwt: \"; logTime(cerr, true);\n\t\t}\n\t\tebwtBw = new HierEbwt<index_t, local_index_t>(\n\t\t\t\tadjIdxBase + \".rev\",\n\t\t\t\t0,       // index is colorspace\n\t\t\t\t1,       // TODO: maybe not\n\t\t\t\tfalse, // index is for the reverse direction\n\t\t\t\t/* overriding: */ offRate,\n\t\t\t\t0, // amount to add to index offrate or <= 0 to do nothing\n\t\t\t\tuseMm,    // whether to use memory-mapped files\n\t\t\t\tuseShmem, // whether to use shared memory\n\t\t\t\tmmSweep,  // sweep memory-mapped files\n\t\t\t\t!noRefNames, // load names?\n\t\t\t\ttrue,        // load SA sample?\n\t\t\t\ttrue,        // load ftab?\n\t\t\t\ttrue,        // load rstarts?\n\t\t\t\tgVerbose,    // whether to be talkative\n\t\t\t\tstartVerbose, // talkative during initialization\n\t\t\t\tfalse /*passMemExc*/,\n\t\t\t\tsanityCheck);\n\t}\n#endif\n\tif(sanityCheck && !os.empty()) {\n\t\t// Sanity check number of patterns and pattern lengths in Ebwt\n\t\t// against original 
strings\n\t\tassert_eq(os.size(), ebwt.nPat());\n\t\tfor(size_t i = 0; i < os.size(); i++) {\n\t\t\tassert_eq(os[i].length(), ebwt.plen()[i]);\n\t\t}\n\t}\n\t// Sanity-check the restored version of the Ebwt\n\tif(sanityCheck && !os.empty()) {\n\t\tebwt.loadIntoMemory(\n\t\t\t\t0,\n\t\t\t\t-1, // fw index\n\t\t\t\ttrue, // load SA sample\n\t\t\t\ttrue, // load ftab\n\t\t\t\ttrue, // load rstarts\n\t\t\t\t!noRefNames,\n\t\t\t\tstartVerbose);\n\t\tebwt.checkOrigs(os, false, false);\n\t\tebwt.evictFromMemory();\n\t}\n\n\n\t{\n\t\t// Load the other half of the index into memory\n\t\tassert(!ebwt.isInMemory());\n\t\tTimer _t(cerr, \"Time loading forward index: \", timing);\n\t\tebwt.loadIntoMemory(\n\t\t\t0,  // colorspace?\n\t\t\t-1, // not the reverse index\n\t\t\ttrue,         // load SA samp? (yes, need forward index's SA samp)\n\t\t\ttrue,         // load ftab (in forward index)\n\t\t\ttrue,         // load rstarts (in forward index)\n\t\t\t!noRefNames,  // load names?\n\t\t\tstartVerbose);\n\t}\n\n\tOutputQueue oq(\n\t\t\t*fout,                   // out file buffer\n\t\t\treorder && nthreads > 1, // whether to reorder when there's >1 thread\n\t\t\tnthreads,                // # threads\n\t\t\tnthreads > 1,            // whether to be thread-safe\n\t\t\tskipReads);              // first read will have this rdid\n\tEList<string> refnames;\n\treadEbwtRefnames<index_t>(adjIdxBase, refnames);\n\tEList<size_t> reflens;\n\n\tAlnSink<index_t> *mssink = NULL;\n\tswitch(outType) {\n\t\tcase OUTPUT_SAM: {\n\t\t\t\t\t mssink = new AlnSinkSam<index_t>(\n\t\t\t\t\t\t\t &ebwt,\n\t\t\t\t\t\t\t oq,           // output queue\n\t\t\t\t\t\t\t refnames,     // reference names\n\t\t\t\t\t\t\t tab_fmt_cols, // columns in tab format\n\t\t\t\t\t\t\t gQuiet);      // don't print alignment summary at end\n\t\t\t\t\t if(!samNoHead) {\n\t\t\t\t\t\t BTString buf;\n\t\t\t\t\t\t // TODO: Write '@SQ\\tSN:AA\\tLN:length fields\n\t\t\t\t\t\t fout->writeString(buf);\n\t\t\t\t\t }\n\t\t\t\t\t // Write 
header for read-results file\n\t\t\t\t\t if (!sam_format) {\n\t\t\t\t\t\t fout->writeChars(tab_fmt_cols_str[0].c_str());\n\t\t\t\t\t\t for (size_t i = 1; i < tab_fmt_cols_str.size(); ++i) {\n\t\t\t\t\t\t\t fout->writeChars(\"\\t\");\n\t\t\t\t\t\t\t fout->writeChars(tab_fmt_cols_str[i].c_str());\n\t\t\t\t\t\t }\n\t\t\t\t\t\t fout->writeChars(\"\\n\");\n\t\t\t\t\t }\n\t\t\t\t\t break;\n\t\t\t\t }\n\t\tdefault:\n\t\t\t\t cerr << \"Invalid output type: \" << outType << endl;\n\t\t\t\t throw 1;\n\t}\n\t// Set up global constraint\n\tOutFileBuf *metricsOfb = NULL;\n\tif(!metricsFile.empty() && metricsIval > 0) {\n\t\tmetricsOfb = new OutFileBuf(metricsFile);\n\t}\n\n\n\n\tint fileCnt = mates1.size() + queries.size(); // the order should be consistent with the wrapper\n#ifdef USE_SRA\n\tfileCnt += sra_accs ;\n#endif\n\n\tfor ( int fileIdx = 0 ; fileIdx < fileCnt ; ++fileIdx )\n\t{\n\t\t// note the name is not plural here, we handle one by one.\n\t\tEList<string> query ;\n\t\tEList<string> mate1;  // mated reads (first mate)\n\t\tEList<string> mate2;  // mated reads (second mate)\n\t\tEList<string> mate12; //\n#ifdef USE_SRA\n\t\tEList<string >sra_acc ;    // SRA accession\n#endif\n\t\t\n\t\t\n\t\tif(gVerbose || startVerbose) {\n\t\t\tcerr << \"creating patternsource for \"<< fileIdx + 1 << \"-th input: \" ; logTime(cerr, true);\n\t\t}\n\t\tif ( fileIdx < mates1.size() )\t\n\t\t{\n\t\t\t// mate pair\n\t\t\tmate1.push_back( mates1[ fileIdx ] ) ; \n\t\t\tmate2.push_back( mates2[ fileIdx ] ) ;\n\t\t}\n\t\telse if ( fileIdx < mates1.size() + queries.size() )\n\t\t{\n\t\t\t// single end read\n\t\t\tquery.push_back( queries[ fileIdx - mates1.size() ] ) ;\n\t\t}\n#ifdef USE_SRA\n\t\telse\n\t\t{\n\t\t\t// only should reach here when sra is defined  \n\t\t\tsra_acc.push_back( sra_accs[ fileIdx - mates1.size() - queries.size() ] ) ;\n\t\t}\n#endif \n\n\t\tPairedPatternSource *patsrc = PairedPatternSource::setupPatternSources(\n\t\t\tquery,     // singles, from argv\n\t\t\tmate1,     
 // mate1's, from -1 arg\n\t\t\tmate2,      // mate2's, from -2 arg\n\t\t\tmate12,     // both mates on each line, from --12 arg\n#ifdef USE_SRA\n\t\t\tsra_acc,    // SRA accessions\n#endif\n\t\t\tqualities,   // qualities associated with singles\n\t\t\tqualities1,  // qualities associated with m1\n\t\t\tqualities2,  // qualities associated with m2\n\t\t\tpp,          // read read-in parameters\n\t\t\tnthreads,\n\t\t\tgVerbose || startVerbose); // be talkative\n\n\n\t\t{\n\t\t\tTimer _t(cerr, \"Time searching: \", timing);\n\t\t\t// Set up penalties\n\t\t\tif(bonusMatch > 0 && !localAlign) {\n\t\t\t\tcerr << \"Warning: Match bonus always = 0 in --end-to-end mode; ignoring user setting\" << endl;\n\t\t\t\tbonusMatch = 0;\n\t\t\t}\n\t\t\tScoring sc(\n\t\t\t\t\tbonusMatch,     // constant reward for match\n\t\t\t\t\tpenMmcType,     // how to penalize mismatches\n\t\t\t\t\tpenMmcMax,      // max mm penalty\n\t\t\t\t\tpenMmcMin,      // min mm penalty\n\t\t\t\t\tscoreMin,       // min score as function of read len\n\t\t\t\t\tnCeil,          // max # Ns as function of read len\n\t\t\t\t\tpenNType,       // how to penalize Ns in the read\n\t\t\t\t\tpenN,           // constant if N penalty is a constant\n\t\t\t\t\tpenNCatPair,    // whether to concat mates before N filtering\n\t\t\t\t\tpenRdGapConst,  // constant coeff for read gap cost\n\t\t\t\t\tpenRfGapConst,  // constant coeff for ref gap cost\n\t\t\t\t\tpenRdGapLinear, // linear coeff for read gap cost\n\t\t\t\t\tpenRfGapLinear, // linear coeff for ref gap cost\n\t\t\t\t\tgGapBarrier);    // # rows at top/bot only entered diagonally\n\n\n\t\t\t// Set up hit sink; if sanityCheck && !os.empty() is true,\n\t\t\t// then instruct the sink to \"retain\" hits in a vector in\n\t\t\t// memory so that we can easily sanity check them later on\n\t\t\tTimer *_tRef = new Timer(cerr, \"Time loading reference: \", timing);\n\t\t\tauto_ptr<BitPairReference> refs;\n\t\t\tdelete _tRef;\n\n\n\t\t\tif(gVerbose || startVerbose) 
{\n\t\t\t\tcerr << \"Dispatching to search driver: \"; logTime(cerr, true);\n\t\t\t}\n\n\n\t\t\t// Do the search for all input reads\n\t\t\tassert(patsrc != NULL);\n\t\t\tassert(mssink != NULL);\n\t\t\tmultiseedSearch(\n\t\t\t\t\tsc,      // scoring scheme\n\t\t\t\t\t*patsrc, // pattern source\n\t\t\t\t\t*mssink, // hit sink\n\t\t\t\t\tebwt,    // BWT\n\t\t\t\t\t*ebwtBw, // BWT'\n\t\t\t\t\trefs.get(),\n\t\t\t\t\trefnames,\n\t\t\t\t\tmetricsOfb);\n\t\t\tif(!gQuiet && !seedSumm) {\n\t\t\t\tsize_t repThresh = mhits;\n\t\t\t\tif(repThresh == 0) {\n\t\t\t\t\trepThresh = std::numeric_limits<size_t>::max();\n\t\t\t\t}\n\t\t\t\tmssink->finish(\n\t\t\t\t\t\trepThresh,\n\t\t\t\t\t\tgReportDiscordant,\n\t\t\t\t\t\tgReportMixed,\n\t\t\t\t\t\thadoopOut);\n\t\t\t}\n\n\n\t\t\toq.flush(true) ;\n\t\t\tassert_eq(oq.numStarted(), oq.numFinished());\n\t\t\tassert_eq(oq.numStarted(), oq.numFlushed());\n\t\t\t\n\t\t\tif ( separator )\n\t\t\t{\n\t\t\t\tfout->writeChars( \"#File_End_Here\\n\" );\n\t\t\t\tfout->flush() ;\n\n\t\t\t\tstd::ostringstream oss ;\n\t\t\t\toss << \"centrifuge_report_\" << fileIdx<<\".tsv\" ;\n\t\t\t\treportFile = oss.str() ;\n\t\t\t\tif (!reportFile.empty()) {\n\t\t\t\t\t// write the species report into the corresponding file\n\t\t\t\t\tcerr << \"report file \" << reportFile << endl;\n\t\t\t\t\tofstream reportOfb;\n\t\t\t\t\treportOfb.open(reportFile.c_str());\n\t\t\t\t\tSpeciesMetrics& spm = metrics.spmu;\n\t\t\t\t\tif(abundance_analysis) {\n\t\t\t\t\t\tuint8_t rank = get_tax_rank_id(classification_rank.c_str());\n\t\t\t\t\t\tTimer timer(cerr, \"Calculating abundance: \");\n\t\t\t\t\t\tspm.calculateAbundance(ebwt, rank);\n\t\t\t\t\t}\n\t\t\t\t\tconst std::map<uint64_t, TaxonomyNode>& tree = ebwt.tree();\n\t\t\t\t\tconst std::map<uint64_t, string>& name_map = ebwt.name();\n\t\t\t\t\tconst std::map<uint64_t, uint64_t>& size_map = ebwt.size();\n\t\t\t\t\tconst map<uint64_t, double>& abundance = spm.abundance;\n\t\t\t\t\tconst map<uint64_t, double>& abundance_len = 
spm.abundance_len;\n\t\t\t\t\treportOfb << \"name\" << '\\t' << \"taxID\" << '\\t' << \"taxRank\" << '\\t'\n\t\t\t\t\t\t<< \"genomeSize\" << '\\t' << \"numReads\" << '\\t' << \"numUniqueReads\" << '\\t';\n\t\t\t\t\tif(false) {\n\t\t\t\t\t\treportOfb << \"summedHitLen\" << '\\t' << \"numWeightedReads\" << '\\t' << \"numUniqueKmers\" << '\\t' << \"sumScore\" << '\\t';\n\t\t\t\t\t}\n\t\t\t\t\treportOfb << \"abundance\";\n\t\t\t\t\tif(false) {\n\t\t\t\t\t\treportOfb << '\\t' << \"abundance_normalized_by_genome_size\";\n\t\t\t\t\t}\n\t\t\t\t\treportOfb << endl;\n\t\t\t\t\tfor(map<uint64_t, ReadCounts>::const_iterator it = spm.species_counts.begin(); it != spm.species_counts.end(); ++it) {\n\n\t\t\t\t\t\tuint64_t taxid = it->first;\n\t\t\t\t\t\tif(taxid == 0) continue;\n\n\t\t\t\t\t\tstd::map<uint64_t, string>::const_iterator name_itr = name_map.find(taxid);\n\t\t\t\t\t\tif(name_itr != name_map.end()) {\n\t\t\t\t\t\t\treportOfb << name_itr->second;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\treportOfb << taxid;\n\t\t\t\t\t\t}\n\t\t\t\t\t\treportOfb << '\\t' << taxid << '\\t';\n\n\t\t\t\t\t\tuint8_t rank = 0;\n\t\t\t\t\t\tbool leaf = false;\n\t\t\t\t\t\tstd::map<uint64_t, TaxonomyNode>::const_iterator tree_itr = tree.find(taxid);\n\n\t\t\t\t\t\tif(tree_itr != tree.end()) {\n\t\t\t\t\t\t\trank = tree_itr->second.rank;\n\t\t\t\t\t\t\tleaf = tree_itr->second.leaf;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif(rank == RANK_UNKNOWN && leaf) {\n\t\t\t\t\t\t\treportOfb << \"leaf\";\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tstring rank_str = get_tax_rank_string(rank);\n\t\t\t\t\t\t\treportOfb << rank_str;\n\t\t\t\t\t\t}\n\t\t\t\t\t\treportOfb << '\\t';\n\n\t\t\t\t\t\tstd::map<uint64_t, uint64_t>::const_iterator size_itr = size_map.find(taxid);\n\t\t\t\t\t\tuint64_t genome_size = 0;\n\t\t\t\t\t\tif(size_itr != size_map.end()) {\n\t\t\t\t\t\t\tgenome_size = size_itr->second;\n\t\t\t\t\t\t}\n\n\t\t\t\t\t\treportOfb << genome_size << '\\t'\n\t\t\t\t\t\t\t<< it->second.n_reads << '\\t' << 
it->second.n_unique_reads << '\\t';\n\t\t\t\t\t\tif(false) {\n\t\t\t\t\t\t\treportOfb << it->second.summed_hit_len << '\\t' << it->second.weighted_reads << '\\t'\n\t\t\t\t\t\t\t\t<< spm.nDistinctKmers(taxid) << '\\t' << it->second.sum_score << '\\t';\n\t\t\t\t\t\t}\n\t\t\t\t\t\tmap<uint64_t, double>::const_iterator ab_len_itr = abundance_len.find(taxid);\n\t\t\t\t\t\tif(ab_len_itr != abundance_len.end()) {\n\t\t\t\t\t\t\treportOfb << ab_len_itr->second;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\treportOfb << \"0.0\";\n\t\t\t\t\t\t}\n\t\t\t\t\t\tmap<uint64_t, double>::const_iterator ab_itr = abundance.find(taxid);\n\t\t\t\t\t\tif(false) {\n\t\t\t\t\t\t\tif(ab_itr != abundance.end() && ab_len_itr != abundance_len.end()) {\n\t\t\t\t\t\t\t\treportOfb << '\\t' << ab_itr->second;\n\t\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t\treportOfb << \"\\t0.0\";\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t\treportOfb << endl;\n\n\t\t\t\t\t}\n\t\t\t\t\treportOfb.close();\n\t\t\t\t}\n\t\t\t\tmetrics.reset() ;\n\t\t\t}\n\n\n\t\t\tdelete patsrc;\n\t\t}\n\t} // end for-loop of fileCnt\n\n\t// Coalesced report here if we don't want separator.\n\tif ( !separator )\n\t{\n\t\tif (!reportFile.empty()) {\n\t\t\t// write the species report into the corresponding file\n\t\t\tcerr << \"report file \" << reportFile << endl;\n\t\t\tofstream reportOfb;\n\t\t\treportOfb.open(reportFile.c_str());\n\t\t\tSpeciesMetrics& spm = metrics.spmu;\n\t\t\tif(abundance_analysis) {\n\t\t\t\tuint8_t rank = get_tax_rank_id(classification_rank.c_str());\n\t\t\t\tTimer timer(cerr, \"Calculating abundance: \");\n\t\t\t\tspm.calculateAbundance(ebwt, rank);\n\t\t\t}\n\t\t\tconst std::map<uint64_t, TaxonomyNode>& tree = ebwt.tree();\n\t\t\tconst std::map<uint64_t, string>& name_map = ebwt.name();\n\t\t\tconst std::map<uint64_t, uint64_t>& size_map = ebwt.size();\n\t\t\tconst map<uint64_t, double>& abundance = spm.abundance;\n\t\t\tconst map<uint64_t, double>& abundance_len = spm.abundance_len;\n\t\t\treportOfb << \"name\" << '\\t' << 
\"taxID\" << '\\t' << \"taxRank\" << '\\t'\n\t\t\t\t<< \"genomeSize\" << '\\t' << \"numReads\" << '\\t' << \"numUniqueReads\" << '\\t';\n\t\t\tif(false) {\n\t\t\t\treportOfb << \"summedHitLen\" << '\\t' << \"numWeightedReads\" << '\\t' << \"numUniqueKmers\" << '\\t' << \"sumScore\" << '\\t';\n\t\t\t}\n\t\t\treportOfb << \"abundance\";\n\t\t\tif(false) {\n\t\t\t\treportOfb << '\\t' << \"abundance_normalized_by_genome_size\";\n\t\t\t}\n\t\t\treportOfb << endl;\n\t\t\tfor(map<uint64_t, ReadCounts>::const_iterator it = spm.species_counts.begin(); it != spm.species_counts.end(); ++it) {\n\n\t\t\t\tuint64_t taxid = it->first;\n\t\t\t\tif(taxid == 0) continue;\n\n\t\t\t\tstd::map<uint64_t, string>::const_iterator name_itr = name_map.find(taxid);\n\t\t\t\tif(name_itr != name_map.end()) {\n\t\t\t\t\treportOfb << name_itr->second;\n\t\t\t\t} else {\n\t\t\t\t\treportOfb << taxid;\n\t\t\t\t}\n\t\t\t\treportOfb << '\\t' << taxid << '\\t';\n\n\t\t\t\tuint8_t rank = 0;\n\t\t\t\tbool leaf = false;\n\t\t\t\tstd::map<uint64_t, TaxonomyNode>::const_iterator tree_itr = tree.find(taxid);\n\n\t\t\t\tif(tree_itr != tree.end()) {\n\t\t\t\t\trank = tree_itr->second.rank;\n\t\t\t\t\tleaf = tree_itr->second.leaf;\n\t\t\t\t}\n\t\t\t\tif(rank == RANK_UNKNOWN && leaf) {\n\t\t\t\t\treportOfb << \"leaf\";\n\t\t\t\t} else {\n\t\t\t\t\tstring rank_str = get_tax_rank_string(rank);\n\t\t\t\t\treportOfb << rank_str;\n\t\t\t\t}\n\t\t\t\treportOfb << '\\t';\n\n\t\t\t\tstd::map<uint64_t, uint64_t>::const_iterator size_itr = size_map.find(taxid);\n\t\t\t\tuint64_t genome_size = 0;\n\t\t\t\tif(size_itr != size_map.end()) {\n\t\t\t\t\tgenome_size = size_itr->second;\n\t\t\t\t}\n\n\t\t\t\treportOfb << genome_size << '\\t'\n\t\t\t\t\t<< it->second.n_reads << '\\t' << it->second.n_unique_reads << '\\t';\n\t\t\t\tif(false) {\n\t\t\t\t\treportOfb << it->second.summed_hit_len << '\\t' << it->second.weighted_reads << '\\t'\n\t\t\t\t\t\t<< spm.nDistinctKmers(taxid) << '\\t' << it->second.sum_score << 
'\\t';\n\t\t\t\t}\n\t\t\t\tmap<uint64_t, double>::const_iterator ab_len_itr = abundance_len.find(taxid);\n\t\t\t\tif(ab_len_itr != abundance_len.end()) {\n\t\t\t\t\treportOfb << ab_len_itr->second;\n\t\t\t\t} else {\n\t\t\t\t\treportOfb << \"0.0\";\n\t\t\t\t}\n\t\t\t\tmap<uint64_t, double>::const_iterator ab_itr = abundance.find(taxid);\n\t\t\t\tif(false) {\n\t\t\t\t\tif(ab_itr != abundance.end() && ab_len_itr != abundance_len.end()) {\n\t\t\t\t\t\treportOfb << '\\t' << ab_itr->second;\n\t\t\t\t\t} else {\n\t\t\t\t\t\treportOfb << \"\\t0.0\";\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\treportOfb << endl;\n\n\t\t\t}\n\t\t\treportOfb.close();\n\t\t}\n\t}\n\n\t// Evict any loaded indexes from memory\n\tif(ebwt.isInMemory()) {\n\t\tebwt.evictFromMemory();\n\t}\n\tif(ebwtBw != NULL) {\n\t\tdelete ebwtBw;\n\t}\n\n\tdelete mssink;\n\tdelete metricsOfb;\n\tif(fout != NULL) {\n\t\tdelete fout;\n\t}\n}\n\n// C++ name mangling is disabled for the bowtie() function to make it\n// easier to use Bowtie as a library.\nextern \"C\" {\n\n/**\n * Main bowtie entry function.  
Parses argc/argv style command-line\n * options, sets global configuration variables, and calls the driver()\n * function.\n */\nint centrifuge(int argc, const char **argv) {\n\ttry {\n\t\t// Reset all global state, including getopt state\n\t\topterr = optind = 1;\n\t\tresetOptions();\n\t\tfor(int i = 0; i < argc; i++) {\n\t\t\targstr += argv[i];\n\t\t\tif(i < argc-1) argstr += \" \";\n\t\t}\n\t\tif(startVerbose) { cerr << \"Entered main(): \"; logTime(cerr, true); }\n\n\t\tparseOptions(argc, argv);\n\n\t\targv0 = argv[0];\n\t\tif(showVersion) {\n\t\t\tcout << argv0 << \" version \" << CENTRIFUGE_VERSION << endl;\n\t\t\tif(sizeof(void*) == 4) {\n\t\t\t\tcout << \"32-bit\" << endl;\n\t\t\t} else if(sizeof(void*) == 8) {\n\t\t\t\tcout << \"64-bit\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"Neither 32- nor 64-bit: sizeof(void*) = \" << sizeof(void*) << endl;\n\t\t\t}\n\t\t\tcout << \"Built on \" << BUILD_HOST << endl;\n\t\t\tcout << BUILD_TIME << endl;\n\t\t\tcout << \"Compiler: \" << COMPILER_VERSION << endl;\n\t\t\tcout << \"Options: \" << COMPILER_OPTIONS << endl;\n\t\t\tcout << \"Sizeof {int, long, long long, void*, size_t, off_t}: {\"\n\t\t\t\t << sizeof(int)\n\t\t\t\t << \", \" << sizeof(long) << \", \" << sizeof(long long)\n\t\t\t\t << \", \" << sizeof(void *) << \", \" << sizeof(size_t)\n\t\t\t\t << \", \" << sizeof(off_t) << \"}\" << endl;\n\t\t\treturn 0;\n\t\t}\n\t\t{\n\t\t\tTimer _t(cerr, \"Overall time: \", timing);\n\t\t\tif(startVerbose) {\n\t\t\t\tcerr << \"Parsing index and read arguments: \"; logTime(cerr, true);\n\t\t\t}\n\n\t\t\t// Get index basename (but only if it wasn't specified via --index)\n\t\t\tif(bt2index.empty()) {\n\t\t\t\tif(optind >= argc) {\n\t\t\t\t\tcerr << \"No index, query, or output file specified!\" << endl;\n\t\t\t\t\tprintUsage(cerr);\n\t\t\t\t\treturn 1;\n\t\t\t\t}\n\t\t\t\tbt2index = argv[optind++];\n\t\t\t}\n\n\t\t\t// Get query filename\n\t\t\tbool got_reads = !queries.empty() || !mates1.empty() || 
!mates12.empty();\n#ifdef USE_SRA\n            got_reads = got_reads || !sra_accs.empty();\n#endif\n\t\t\tif(optind >= argc) {\n\t\t\t\tif(!got_reads) {\n\t\t\t\t\tprintUsage(cerr);\n\t\t\t\t\tcerr << \"***\" << endl\n#ifdef USE_SRA\n                    << \"Error: Must specify at least one read input with -U/-1/-2/--sra-acc\" << endl;\n#else\n                    << \"Error: Must specify at least one read input with -U/-1/-2\" << endl;\n#endif\n\t\t\t\t\treturn 1;\n\t\t\t\t}\n\t\t\t} else if(!got_reads) {\n\t\t\t\t// Tokenize the list of query files\n\t\t\t\ttokenize(argv[optind++], \",\", queries);\n\t\t\t\tif(queries.empty()) {\n\t\t\t\t\tcerr << \"Tokenized query file list was empty!\" << endl;\n\t\t\t\t\tprintUsage(cerr);\n\t\t\t\t\treturn 1;\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// Get output filename\n\t\t\tif(optind < argc && outfile.empty()) {\n\t\t\t\toutfile = argv[optind++];\n\t\t\t\tcerr << \"Warning: Output file '\" << outfile.c_str()\n\t\t\t\t     << \"' was specified without -S.  This will not work in \"\n\t\t\t\t\t << \"future Centrifuge versions.  Please use -S instead.\"\n\t\t\t\t\t << endl;\n\t\t\t}\n\n\t\t\t// Extra parameters?\n\t\t\tif(optind < argc) {\n\t\t\t\tcerr << \"Extra parameter(s) specified: \";\n\t\t\t\tfor(int i = optind; i < argc; i++) {\n\t\t\t\t\tcerr << \"\\\"\" << argv[i] << \"\\\"\";\n\t\t\t\t\tif(i < argc-1) cerr << \", \";\n\t\t\t\t}\n\t\t\t\tcerr << endl;\n\t\t\t\tif(mates1.size() > 0) {\n\t\t\t\t\tcerr << \"Note that if <mates> files are specified using -1/-2, a <singles> file cannot\" << endl\n\t\t\t\t\t\t << \"also be specified.  
Please run centrifuge separately for mates and singles.\" << endl;\n\t\t\t\t}\n\t\t\t\tthrow 1;\n\t\t\t}\n\n\t\t\t// Optionally summarize\n\t\t\tif(gVerbose) {\n\t\t\t\tcout << \"Input bt2 file: \\\"\" << bt2index.c_str() << \"\\\"\" << endl;\n\t\t\t\tcout << \"Query inputs (DNA, \" << file_format_names[format].c_str() << \"):\" << endl;\n\t\t\t\tfor(size_t i = 0; i < queries.size(); i++) {\n\t\t\t\t\tcout << \"  \" << queries[i].c_str() << endl;\n\t\t\t\t}\n\t\t\t\tcout << \"Quality inputs:\" << endl;\n\t\t\t\tfor(size_t i = 0; i < qualities.size(); i++) {\n\t\t\t\t\tcout << \"  \" << qualities[i].c_str() << endl;\n\t\t\t\t}\n\t\t\t\tcout << \"Output file: \\\"\" << outfile.c_str() << \"\\\"\" << endl;\n\t\t\t\tcout << \"Local endianness: \" << (currentlyBigEndian()? \"big\":\"little\") << endl;\n\t\t\t\tcout << \"Sanity checking: \" << (sanityCheck? \"enabled\":\"disabled\") << endl;\n\t\t\t#ifdef NDEBUG\n\t\t\t\tcout << \"Assertions: disabled\" << endl;\n\t\t\t#else\n\t\t\t\tcout << \"Assertions: enabled\" << endl;\n\t\t\t#endif\n\t\t\t}\n\t\t\tif(ipause) {\n\t\t\t\tcout << \"Press key to continue...\" << endl;\n\t\t\t\tgetchar();\n\t\t\t}\n\t\t\tdriver<SString<char> >(\"DNA\", bt2index, outfile);\n\t\t}\n\t\treturn 0;\n\t} catch(std::exception& e) {\n\t\tcerr << \"Error: Encountered exception: '\" << e.what() << \"'\" << endl;\n\t\tcerr << \"Command: \";\n\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\tcerr << endl;\n\t\treturn 1;\n\t} catch(int e) {\n\t\tif(e != 0) {\n\t\t\tcerr << \"Error: Encountered internal Centrifuge exception (#\" << e << \")\" << endl;\n\t\t\tcerr << \"Command: \";\n\t\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\t\tcerr << endl;\n\t\t}\n\t\treturn e;\n\t}\n} // centrifuge()\n} // extern \"C\"\n"
  },
  {
    "path": "centrifuge.xcodeproj/project.pbxproj",
    "content": "// !$*UTF8*$!\n{\n\tarchiveVersion = 1;\n\tclasses = {\n\t};\n\tobjectVersion = 46;\n\tobjects = {\n\n/* Begin PBXBuildFile section */\n\t\tE86143B81C20833200D5C240 /* alphabet.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438B1C20833200D5C240 /* alphabet.cpp */; };\n\t\tE86143B91C20833200D5C240 /* bt2_idx.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438C1C20833200D5C240 /* bt2_idx.cpp */; };\n\t\tE86143BA1C20833200D5C240 /* ccnt_lut.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438D1C20833200D5C240 /* ccnt_lut.cpp */; };\n\t\tE86143BB1C20833200D5C240 /* centrifuge_build_main.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438E1C20833200D5C240 /* centrifuge_build_main.cpp */; };\n\t\tE86143BC1C20833200D5C240 /* centrifuge_build.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438F1C20833200D5C240 /* centrifuge_build.cpp */; };\n\t\tE86143C41C20833200D5C240 /* ds.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143971C20833200D5C240 /* ds.cpp */; };\n\t\tE86143C71C20833200D5C240 /* limit.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861439A1C20833200D5C240 /* limit.cpp */; };\n\t\tE86143CF1C20833200D5C240 /* random_source.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A21C20833200D5C240 /* random_source.cpp */; };\n\t\tE86143D31C20833200D5C240 /* ref_read.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A61C20833200D5C240 /* ref_read.cpp */; };\n\t\tE86143D41C20833200D5C240 /* reference.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A71C20833200D5C240 /* reference.cpp */; };\n\t\tE86143D61C20833200D5C240 /* shmem.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A91C20833200D5C240 /* shmem.cpp */; };\n\t\tE86143DA1C20833200D5C240 /* tinythread.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143AD1C20833200D5C240 /* tinythread.cpp */; };\n\t\tE869A0671C209BCC007600C2 /* aligner_seed_policy.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 
E86143831C20833200D5C240 /* aligner_seed_policy.cpp */; };\n\t\tE869A0681C209BCC007600C2 /* mask.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861439C1C20833200D5C240 /* mask.cpp */; };\n\t\tE869A0691C209BCC007600C2 /* outq.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861439D1C20833200D5C240 /* outq.cpp */; };\n\t\tE869A06A1C209BCC007600C2 /* pat.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861439E1C20833200D5C240 /* pat.cpp */; };\n\t\tE869A06B1C209BCC007600C2 /* pe.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861439F1C20833200D5C240 /* pe.cpp */; };\n\t\tE869A06C1C209BCC007600C2 /* presets.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A01C20833200D5C240 /* presets.cpp */; };\n\t\tE869A06D1C209BCC007600C2 /* qual.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A11C20833200D5C240 /* qual.cpp */; };\n\t\tE869A06E1C209BCC007600C2 /* random_util.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A31C20833200D5C240 /* random_util.cpp */; };\n\t\tE869A06F1C209BCC007600C2 /* read_qseq.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A41C20833200D5C240 /* read_qseq.cpp */; };\n\t\tE869A0701C209BCC007600C2 /* ref_coord.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A51C20833200D5C240 /* ref_coord.cpp */; };\n\t\tE869A0711C209BCC007600C2 /* scoring.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A81C20833200D5C240 /* scoring.cpp */; };\n\t\tE869A0721C209BCC007600C2 /* simple_func.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143AA1C20833200D5C240 /* simple_func.cpp */; };\n\t\tE869A0731C209BE6007600C2 /* centrifuge_main.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143921C20833200D5C240 /* centrifuge_main.cpp */; };\n\t\tE869A0741C209BE6007600C2 /* centrifuge.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143941C20833200D5C240 /* centrifuge.cpp */; };\n\t\tE869A0751C20A308007600C2 /* limit.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861439A1C20833200D5C240 /* 
limit.cpp */; };\n\t\tE869A0761C20A425007600C2 /* alphabet.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438B1C20833200D5C240 /* alphabet.cpp */; };\n\t\tE869A0771C20A425007600C2 /* bt2_idx.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438C1C20833200D5C240 /* bt2_idx.cpp */; };\n\t\tE869A0781C20A425007600C2 /* ccnt_lut.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438D1C20833200D5C240 /* ccnt_lut.cpp */; };\n\t\tE869A0791C20A425007600C2 /* ds.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143971C20833200D5C240 /* ds.cpp */; };\n\t\tE869A07A1C20A425007600C2 /* edit.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143981C20833200D5C240 /* edit.cpp */; };\n\t\tE869A07B1C20A425007600C2 /* random_source.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A21C20833200D5C240 /* random_source.cpp */; };\n\t\tE869A07C1C20A425007600C2 /* ref_read.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A61C20833200D5C240 /* ref_read.cpp */; };\n\t\tE869A07D1C20A425007600C2 /* reference.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A71C20833200D5C240 /* reference.cpp */; };\n\t\tE869A07E1C20A425007600C2 /* shmem.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A91C20833200D5C240 /* shmem.cpp */; };\n\t\tE869A07F1C20A425007600C2 /* tinythread.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143AD1C20833200D5C240 /* tinythread.cpp */; };\n\t\tE869A0801C20A50B007600C2 /* alphabet.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438B1C20833200D5C240 /* alphabet.cpp */; };\n\t\tE869A0811C20A50B007600C2 /* bt2_idx.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438C1C20833200D5C240 /* bt2_idx.cpp */; };\n\t\tE869A0821C20A50B007600C2 /* ccnt_lut.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E861438D1C20833200D5C240 /* ccnt_lut.cpp */; };\n\t\tE869A0831C20A50B007600C2 /* centrifuge_inspect.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143911C20833200D5C240 /* centrifuge_inspect.cpp */; 
};\n\t\tE869A0841C20A50B007600C2 /* ds.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143971C20833200D5C240 /* ds.cpp */; };\n\t\tE869A0851C20A50B007600C2 /* edit.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143981C20833200D5C240 /* edit.cpp */; };\n\t\tE869A0861C20A50B007600C2 /* random_source.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A21C20833200D5C240 /* random_source.cpp */; };\n\t\tE869A0871C20A50B007600C2 /* ref_read.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A61C20833200D5C240 /* ref_read.cpp */; };\n\t\tE869A0881C20A50B007600C2 /* reference.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A71C20833200D5C240 /* reference.cpp */; };\n\t\tE869A0891C20A50B007600C2 /* shmem.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143A91C20833200D5C240 /* shmem.cpp */; };\n\t\tE869A08A1C20A50B007600C2 /* tinythread.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143AD1C20833200D5C240 /* tinythread.cpp */; };\n\t\tE8AB5A231C209232009138A6 /* diff_sample.cpp in Sources */ = {isa = PBXBuildFile; fileRef = E86143951C20833200D5C240 /* diff_sample.cpp */; };\n/* End PBXBuildFile section */\n\n/* Begin PBXCopyFilesBuildPhase section */\n\t\tE8485E111C207EF000F225FA /* CopyFiles */ = {\n\t\t\tisa = PBXCopyFilesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tdstPath = /usr/share/man/man1/;\n\t\t\tdstSubfolderSpec = 0;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 1;\n\t\t};\n\t\tE869A0481C2095A8007600C2 /* CopyFiles */ = {\n\t\t\tisa = PBXCopyFilesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tdstPath = /usr/share/man/man1/;\n\t\t\tdstSubfolderSpec = 0;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 1;\n\t\t};\n\t\tE869A0531C2095B5007600C2 /* CopyFiles */ = {\n\t\t\tisa = PBXCopyFilesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tdstPath = /usr/share/man/man1/;\n\t\t\tdstSubfolderSpec = 0;\n\t\t\tfiles = 
(\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 1;\n\t\t};\n\t\tE869A05E1C2095CA007600C2 /* CopyFiles */ = {\n\t\t\tisa = PBXCopyFilesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tdstPath = /usr/share/man/man1/;\n\t\t\tdstSubfolderSpec = 0;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 1;\n\t\t};\n/* End PBXCopyFilesBuildPhase section */\n\n/* Begin PBXFileReference section */\n\t\tE8485E131C207EF000F225FA /* centrifuge-buildx */ = {isa = PBXFileReference; explicitFileType = \"compiled.mach-o.executable\"; includeInIndex = 0; path = \"centrifuge-buildx\"; sourceTree = BUILT_PRODUCTS_DIR; };\n\t\tE861433C1C20833200D5C240 /* aligner_bt.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_bt.h; sourceTree = \"<group>\"; };\n\t\tE861433D1C20833200D5C240 /* aligner_cache.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_cache.h; sourceTree = \"<group>\"; };\n\t\tE861433E1C20833200D5C240 /* aligner_metrics.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_metrics.h; sourceTree = \"<group>\"; };\n\t\tE861433F1C20833200D5C240 /* aligner_result.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_result.h; sourceTree = \"<group>\"; };\n\t\tE86143401C20833200D5C240 /* aligner_seed_policy.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_seed_policy.h; sourceTree = \"<group>\"; };\n\t\tE86143411C20833200D5C240 /* aligner_seed.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_seed.h; sourceTree = \"<group>\"; };\n\t\tE86143421C20833200D5C240 /* aligner_sw_common.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_sw_common.h; sourceTree = \"<group>\"; };\n\t\tE86143431C20833200D5C240 /* 
aligner_sw_nuc.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_sw_nuc.h; sourceTree = \"<group>\"; };\n\t\tE86143441C20833200D5C240 /* aligner_sw.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_sw.h; sourceTree = \"<group>\"; };\n\t\tE86143451C20833200D5C240 /* aligner_swsse.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aligner_swsse.h; sourceTree = \"<group>\"; };\n\t\tE86143461C20833200D5C240 /* aln_sink.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = aln_sink.h; sourceTree = \"<group>\"; };\n\t\tE86143471C20833200D5C240 /* alphabet.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = alphabet.h; sourceTree = \"<group>\"; };\n\t\tE86143481C20833200D5C240 /* assert_helpers.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = assert_helpers.h; sourceTree = \"<group>\"; };\n\t\tE86143491C20833200D5C240 /* binary_sa_search.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = binary_sa_search.h; sourceTree = \"<group>\"; };\n\t\tE861434A1C20833200D5C240 /* bitpack.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = bitpack.h; sourceTree = \"<group>\"; };\n\t\tE861434B1C20833200D5C240 /* blockwise_sa.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = blockwise_sa.h; sourceTree = \"<group>\"; };\n\t\tE861434C1C20833200D5C240 /* bt2_idx.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = bt2_idx.h; sourceTree = \"<group>\"; };\n\t\tE861434D1C20833200D5C240 /* bt2_io.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = bt2_io.h; sourceTree = \"<group>\"; };\n\t\tE861434E1C20833200D5C240 /* 
bt2_util.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = bt2_util.h; sourceTree = \"<group>\"; };\n\t\tE861434F1C20833200D5C240 /* btypes.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = btypes.h; sourceTree = \"<group>\"; };\n\t\tE86143501C20833200D5C240 /* classifier.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = classifier.h; sourceTree = \"<group>\"; };\n\t\tE86143511C20833200D5C240 /* diff_sample.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = diff_sample.h; sourceTree = \"<group>\"; };\n\t\tE86143521C20833200D5C240 /* dp_framer.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = dp_framer.h; sourceTree = \"<group>\"; };\n\t\tE86143531C20833200D5C240 /* ds.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ds.h; sourceTree = \"<group>\"; };\n\t\tE86143541C20833200D5C240 /* edit.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = edit.h; sourceTree = \"<group>\"; };\n\t\tE86143551C20833200D5C240 /* endian_swap.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = endian_swap.h; sourceTree = \"<group>\"; };\n\t\tE86143561C20833200D5C240 /* fast_mutex.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = fast_mutex.h; sourceTree = \"<group>\"; };\n\t\tE86143571C20833200D5C240 /* filebuf.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = filebuf.h; sourceTree = \"<group>\"; };\n\t\tE86143581C20833200D5C240 /* formats.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = formats.h; sourceTree = \"<group>\"; };\n\t\tE86143591C20833200D5C240 /* group_walk.h */ = {isa = PBXFileReference; fileEncoding = 4; 
lastKnownFileType = sourcecode.c.h; path = group_walk.h; sourceTree = \"<group>\"; };\n\t\tE861435A1C20833200D5C240 /* hi_aligner.h */ = {isa = PBXFileReference; fileEncoding = 4; indentWidth = 4; lastKnownFileType = sourcecode.c.h; path = hi_aligner.h; sourceTree = \"<group>\"; };\n\t\tE861435B1C20833200D5C240 /* hier_idx_common.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = hier_idx_common.h; sourceTree = \"<group>\"; };\n\t\tE861435C1C20833200D5C240 /* hier_idx.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = hier_idx.h; sourceTree = \"<group>\"; };\n\t\tE861435D1C20833200D5C240 /* hyperloglogbias.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = hyperloglogbias.h; sourceTree = \"<group>\"; };\n\t\tE861435E1C20833200D5C240 /* hyperloglogplus.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = hyperloglogplus.h; sourceTree = \"<group>\"; };\n\t\tE861435F1C20833200D5C240 /* limit.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = limit.h; sourceTree = \"<group>\"; };\n\t\tE86143601C20833200D5C240 /* ls.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ls.h; sourceTree = \"<group>\"; };\n\t\tE86143611C20833200D5C240 /* mask.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = mask.h; sourceTree = \"<group>\"; };\n\t\tE86143621C20833200D5C240 /* mem_ids.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = mem_ids.h; sourceTree = \"<group>\"; };\n\t\tE86143631C20833200D5C240 /* mm.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = mm.h; sourceTree = \"<group>\"; };\n\t\tE86143641C20833200D5C240 /* multikey_qsort.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = 
sourcecode.c.h; path = multikey_qsort.h; sourceTree = \"<group>\"; };\n\t\tE86143651C20833200D5C240 /* opts.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = opts.h; sourceTree = \"<group>\"; };\n\t\tE86143661C20833200D5C240 /* outq.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = outq.h; sourceTree = \"<group>\"; };\n\t\tE86143671C20833200D5C240 /* pat.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = pat.h; sourceTree = \"<group>\"; };\n\t\tE86143681C20833200D5C240 /* pe.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = pe.h; sourceTree = \"<group>\"; };\n\t\tE86143691C20833200D5C240 /* presets.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = presets.h; sourceTree = \"<group>\"; };\n\t\tE861436A1C20833200D5C240 /* processor_support.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = processor_support.h; sourceTree = \"<group>\"; };\n\t\tE861436B1C20833200D5C240 /* qual.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = qual.h; sourceTree = \"<group>\"; };\n\t\tE861436C1C20833200D5C240 /* random_source.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = random_source.h; sourceTree = \"<group>\"; };\n\t\tE861436D1C20833200D5C240 /* random_util.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = random_util.h; sourceTree = \"<group>\"; };\n\t\tE861436E1C20833200D5C240 /* read.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = read.h; sourceTree = \"<group>\"; };\n\t\tE861436F1C20833200D5C240 /* ref_coord.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ref_coord.h; sourceTree = \"<group>\"; 
};\n\t\tE86143701C20833200D5C240 /* ref_read.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ref_read.h; sourceTree = \"<group>\"; };\n\t\tE86143711C20833200D5C240 /* reference.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = reference.h; sourceTree = \"<group>\"; };\n\t\tE86143721C20833200D5C240 /* scoring.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = scoring.h; sourceTree = \"<group>\"; };\n\t\tE86143731C20833200D5C240 /* search_globals.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = search_globals.h; sourceTree = \"<group>\"; };\n\t\tE86143741C20833200D5C240 /* sequence_io.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = sequence_io.h; sourceTree = \"<group>\"; };\n\t\tE86143751C20833200D5C240 /* shmem.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = shmem.h; sourceTree = \"<group>\"; };\n\t\tE86143761C20833200D5C240 /* simple_func.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = simple_func.h; sourceTree = \"<group>\"; };\n\t\tE86143771C20833200D5C240 /* sse_util.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = sse_util.h; sourceTree = \"<group>\"; };\n\t\tE86143781C20833200D5C240 /* sstring.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = sstring.h; sourceTree = \"<group>\"; };\n\t\tE86143791C20833200D5C240 /* str_util.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = str_util.h; sourceTree = \"<group>\"; };\n\t\tE861437A1C20833200D5C240 /* threading.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = threading.h; sourceTree = \"<group>\"; };\n\t\tE861437B1C20833200D5C240 /* timer.h 
*/ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = timer.h; sourceTree = \"<group>\"; };\n\t\tE861437C1C20833200D5C240 /* tinythread.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = tinythread.h; sourceTree = \"<group>\"; };\n\t\tE861437D1C20833200D5C240 /* tokenize.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = tokenize.h; sourceTree = \"<group>\"; };\n\t\tE861437E1C20833200D5C240 /* util.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = util.h; sourceTree = \"<group>\"; };\n\t\tE861437F1C20833200D5C240 /* word_io.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = word_io.h; sourceTree = \"<group>\"; };\n\t\tE86143801C20833200D5C240 /* zbox.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = zbox.h; sourceTree = \"<group>\"; };\n\t\tE86143811C20833200D5C240 /* aligner_bt.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_bt.cpp; sourceTree = \"<group>\"; };\n\t\tE86143821C20833200D5C240 /* aligner_cache.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_cache.cpp; sourceTree = \"<group>\"; };\n\t\tE86143831C20833200D5C240 /* aligner_seed_policy.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_seed_policy.cpp; sourceTree = \"<group>\"; };\n\t\tE86143841C20833200D5C240 /* aligner_seed.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_seed.cpp; sourceTree = \"<group>\"; };\n\t\tE86143851C20833200D5C240 /* aligner_sw.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_sw.cpp; sourceTree = \"<group>\"; };\n\t\tE86143861C20833200D5C240 /* 
aligner_swsse_ee_i16.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_swsse_ee_i16.cpp; sourceTree = \"<group>\"; };\n\t\tE86143871C20833200D5C240 /* aligner_swsse_ee_u8.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_swsse_ee_u8.cpp; sourceTree = \"<group>\"; };\n\t\tE86143881C20833200D5C240 /* aligner_swsse_loc_i16.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_swsse_loc_i16.cpp; sourceTree = \"<group>\"; };\n\t\tE86143891C20833200D5C240 /* aligner_swsse_loc_u8.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_swsse_loc_u8.cpp; sourceTree = \"<group>\"; };\n\t\tE861438A1C20833200D5C240 /* aligner_swsse.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = aligner_swsse.cpp; sourceTree = \"<group>\"; };\n\t\tE861438B1C20833200D5C240 /* alphabet.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = alphabet.cpp; sourceTree = \"<group>\"; };\n\t\tE861438C1C20833200D5C240 /* bt2_idx.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = bt2_idx.cpp; sourceTree = \"<group>\"; };\n\t\tE861438D1C20833200D5C240 /* ccnt_lut.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ccnt_lut.cpp; sourceTree = \"<group>\"; };\n\t\tE861438E1C20833200D5C240 /* centrifuge_build_main.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge_build_main.cpp; sourceTree = \"<group>\"; };\n\t\tE861438F1C20833200D5C240 /* centrifuge_build.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge_build.cpp; sourceTree = \"<group>\"; };\n\t\tE86143901C20833200D5C240 /* 
centrifuge_compress.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge_compress.cpp; sourceTree = \"<group>\"; };\n\t\tE86143911C20833200D5C240 /* centrifuge_inspect.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge_inspect.cpp; sourceTree = \"<group>\"; };\n\t\tE86143921C20833200D5C240 /* centrifuge_main.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge_main.cpp; sourceTree = \"<group>\"; };\n\t\tE86143931C20833200D5C240 /* centrifuge_report.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge_report.cpp; sourceTree = \"<group>\"; };\n\t\tE86143941C20833200D5C240 /* centrifuge.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = centrifuge.cpp; sourceTree = \"<group>\"; };\n\t\tE86143951C20833200D5C240 /* diff_sample.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = diff_sample.cpp; sourceTree = \"<group>\"; };\n\t\tE86143961C20833200D5C240 /* dp_framer.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = dp_framer.cpp; sourceTree = \"<group>\"; };\n\t\tE86143971C20833200D5C240 /* ds.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ds.cpp; sourceTree = \"<group>\"; };\n\t\tE86143981C20833200D5C240 /* edit.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = edit.cpp; sourceTree = \"<group>\"; };\n\t\tE86143991C20833200D5C240 /* group_walk.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = group_walk.cpp; sourceTree = \"<group>\"; };\n\t\tE861439A1C20833200D5C240 /* limit.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = 
sourcecode.cpp.cpp; path = limit.cpp; sourceTree = \"<group>\"; };\n\t\tE861439B1C20833200D5C240 /* ls.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ls.cpp; sourceTree = \"<group>\"; };\n\t\tE861439C1C20833200D5C240 /* mask.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = mask.cpp; sourceTree = \"<group>\"; };\n\t\tE861439D1C20833200D5C240 /* outq.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = outq.cpp; sourceTree = \"<group>\"; };\n\t\tE861439E1C20833200D5C240 /* pat.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = pat.cpp; sourceTree = \"<group>\"; };\n\t\tE861439F1C20833200D5C240 /* pe.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = pe.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A01C20833200D5C240 /* presets.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = presets.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A11C20833200D5C240 /* qual.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = qual.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A21C20833200D5C240 /* random_source.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = random_source.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A31C20833200D5C240 /* random_util.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = random_util.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A41C20833200D5C240 /* read_qseq.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = read_qseq.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A51C20833200D5C240 /* ref_coord.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; 
path = ref_coord.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A61C20833200D5C240 /* ref_read.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ref_read.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A71C20833200D5C240 /* reference.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = reference.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A81C20833200D5C240 /* scoring.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = scoring.cpp; sourceTree = \"<group>\"; };\n\t\tE86143A91C20833200D5C240 /* shmem.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = shmem.cpp; sourceTree = \"<group>\"; };\n\t\tE86143AA1C20833200D5C240 /* simple_func.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = simple_func.cpp; sourceTree = \"<group>\"; };\n\t\tE86143AB1C20833200D5C240 /* sse_util.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = sse_util.cpp; sourceTree = \"<group>\"; };\n\t\tE86143AC1C20833200D5C240 /* sstring.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = sstring.cpp; sourceTree = \"<group>\"; };\n\t\tE86143AD1C20833200D5C240 /* tinythread.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = tinythread.cpp; sourceTree = \"<group>\"; };\n\t\tE869A04A1C2095A8007600C2 /* centrifugex */ = {isa = PBXFileReference; explicitFileType = \"compiled.mach-o.executable\"; includeInIndex = 0; path = centrifugex; sourceTree = BUILT_PRODUCTS_DIR; };\n\t\tE869A0551C2095B5007600C2 /* centrifuge-inspectx */ = {isa = PBXFileReference; explicitFileType = \"compiled.mach-o.executable\"; includeInIndex = 0; path = \"centrifuge-inspectx\"; sourceTree = BUILT_PRODUCTS_DIR; };\n\t\tE869A0601C2095CA007600C2 /* 
centrifuge-compressx */ = {isa = PBXFileReference; explicitFileType = \"compiled.mach-o.executable\"; includeInIndex = 0; path = \"centrifuge-compressx\"; sourceTree = BUILT_PRODUCTS_DIR; };\n\t\tE8D181651D64DB4D00CC784A /* taxonomy.h */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.c.h; path = taxonomy.h; sourceTree = \"<group>\"; };\n/* End PBXFileReference section */\n\n/* Begin PBXFrameworksBuildPhase section */\n\t\tE8485E101C207EF000F225FA /* Frameworks */ = {\n\t\t\tisa = PBXFrameworksBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n\t\tE869A0471C2095A8007600C2 /* Frameworks */ = {\n\t\t\tisa = PBXFrameworksBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n\t\tE869A0521C2095B5007600C2 /* Frameworks */ = {\n\t\t\tisa = PBXFrameworksBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n\t\tE869A05D1C2095CA007600C2 /* Frameworks */ = {\n\t\t\tisa = PBXFrameworksBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n/* End PBXFrameworksBuildPhase section */\n\n/* Begin PBXGroup section */\n\t\tE8485E0A1C207EF000F225FA = {\n\t\t\tisa = PBXGroup;\n\t\t\tchildren = (\n\t\t\t\tE8485E1E1C207FCB00F225FA /* Document */,\n\t\t\t\tE8485E1D1C207FC400F225FA /* Source */,\n\t\t\t\tE8485E141C207EF000F225FA /* Products */,\n\t\t\t);\n\t\t\tsourceTree = \"<group>\";\n\t\t};\n\t\tE8485E141C207EF000F225FA /* Products */ = {\n\t\t\tisa = PBXGroup;\n\t\t\tchildren = (\n\t\t\t\tE8485E131C207EF000F225FA /* centrifuge-buildx */,\n\t\t\t\tE869A04A1C2095A8007600C2 /* centrifugex */,\n\t\t\t\tE869A0551C2095B5007600C2 /* centrifuge-inspectx */,\n\t\t\t\tE869A0601C2095CA007600C2 /* centrifuge-compressx */,\n\t\t\t);\n\t\t\tname = Products;\n\t\t\tsourceTree = 
\"<group>\";\n\t\t};\n\t\tE8485E1D1C207FC400F225FA /* Source */ = {\n\t\t\tisa = PBXGroup;\n\t\t\tchildren = (\n\t\t\t\tE861433C1C20833200D5C240 /* aligner_bt.h */,\n\t\t\t\tE861433D1C20833200D5C240 /* aligner_cache.h */,\n\t\t\t\tE861433E1C20833200D5C240 /* aligner_metrics.h */,\n\t\t\t\tE861433F1C20833200D5C240 /* aligner_result.h */,\n\t\t\t\tE86143401C20833200D5C240 /* aligner_seed_policy.h */,\n\t\t\t\tE86143411C20833200D5C240 /* aligner_seed.h */,\n\t\t\t\tE86143421C20833200D5C240 /* aligner_sw_common.h */,\n\t\t\t\tE86143431C20833200D5C240 /* aligner_sw_nuc.h */,\n\t\t\t\tE86143441C20833200D5C240 /* aligner_sw.h */,\n\t\t\t\tE86143451C20833200D5C240 /* aligner_swsse.h */,\n\t\t\t\tE86143461C20833200D5C240 /* aln_sink.h */,\n\t\t\t\tE86143471C20833200D5C240 /* alphabet.h */,\n\t\t\t\tE86143481C20833200D5C240 /* assert_helpers.h */,\n\t\t\t\tE86143491C20833200D5C240 /* binary_sa_search.h */,\n\t\t\t\tE861434A1C20833200D5C240 /* bitpack.h */,\n\t\t\t\tE861434B1C20833200D5C240 /* blockwise_sa.h */,\n\t\t\t\tE861434C1C20833200D5C240 /* bt2_idx.h */,\n\t\t\t\tE861434D1C20833200D5C240 /* bt2_io.h */,\n\t\t\t\tE861434E1C20833200D5C240 /* bt2_util.h */,\n\t\t\t\tE861434F1C20833200D5C240 /* btypes.h */,\n\t\t\t\tE86143501C20833200D5C240 /* classifier.h */,\n\t\t\t\tE86143511C20833200D5C240 /* diff_sample.h */,\n\t\t\t\tE86143521C20833200D5C240 /* dp_framer.h */,\n\t\t\t\tE86143531C20833200D5C240 /* ds.h */,\n\t\t\t\tE86143541C20833200D5C240 /* edit.h */,\n\t\t\t\tE86143551C20833200D5C240 /* endian_swap.h */,\n\t\t\t\tE86143561C20833200D5C240 /* fast_mutex.h */,\n\t\t\t\tE86143571C20833200D5C240 /* filebuf.h */,\n\t\t\t\tE86143581C20833200D5C240 /* formats.h */,\n\t\t\t\tE86143591C20833200D5C240 /* group_walk.h */,\n\t\t\t\tE861435A1C20833200D5C240 /* hi_aligner.h */,\n\t\t\t\tE861435B1C20833200D5C240 /* hier_idx_common.h */,\n\t\t\t\tE861435C1C20833200D5C240 /* hier_idx.h */,\n\t\t\t\tE861435D1C20833200D5C240 /* hyperloglogbias.h */,\n\t\t\t\tE861435E1C20833200D5C240 
/* hyperloglogplus.h */,\n\t\t\t\tE861435F1C20833200D5C240 /* limit.h */,\n\t\t\t\tE86143601C20833200D5C240 /* ls.h */,\n\t\t\t\tE86143611C20833200D5C240 /* mask.h */,\n\t\t\t\tE86143621C20833200D5C240 /* mem_ids.h */,\n\t\t\t\tE86143631C20833200D5C240 /* mm.h */,\n\t\t\t\tE86143641C20833200D5C240 /* multikey_qsort.h */,\n\t\t\t\tE86143651C20833200D5C240 /* opts.h */,\n\t\t\t\tE86143661C20833200D5C240 /* outq.h */,\n\t\t\t\tE86143671C20833200D5C240 /* pat.h */,\n\t\t\t\tE86143681C20833200D5C240 /* pe.h */,\n\t\t\t\tE86143691C20833200D5C240 /* presets.h */,\n\t\t\t\tE861436A1C20833200D5C240 /* processor_support.h */,\n\t\t\t\tE861436B1C20833200D5C240 /* qual.h */,\n\t\t\t\tE861436C1C20833200D5C240 /* random_source.h */,\n\t\t\t\tE861436D1C20833200D5C240 /* random_util.h */,\n\t\t\t\tE861436E1C20833200D5C240 /* read.h */,\n\t\t\t\tE861436F1C20833200D5C240 /* ref_coord.h */,\n\t\t\t\tE86143701C20833200D5C240 /* ref_read.h */,\n\t\t\t\tE86143711C20833200D5C240 /* reference.h */,\n\t\t\t\tE86143721C20833200D5C240 /* scoring.h */,\n\t\t\t\tE86143731C20833200D5C240 /* search_globals.h */,\n\t\t\t\tE86143741C20833200D5C240 /* sequence_io.h */,\n\t\t\t\tE86143751C20833200D5C240 /* shmem.h */,\n\t\t\t\tE86143761C20833200D5C240 /* simple_func.h */,\n\t\t\t\tE86143771C20833200D5C240 /* sse_util.h */,\n\t\t\t\tE86143781C20833200D5C240 /* sstring.h */,\n\t\t\t\tE86143791C20833200D5C240 /* str_util.h */,\n\t\t\t\tE8D181651D64DB4D00CC784A /* taxonomy.h */,\n\t\t\t\tE861437A1C20833200D5C240 /* threading.h */,\n\t\t\t\tE861437B1C20833200D5C240 /* timer.h */,\n\t\t\t\tE861437C1C20833200D5C240 /* tinythread.h */,\n\t\t\t\tE861437D1C20833200D5C240 /* tokenize.h */,\n\t\t\t\tE861437E1C20833200D5C240 /* util.h */,\n\t\t\t\tE861437F1C20833200D5C240 /* word_io.h */,\n\t\t\t\tE86143801C20833200D5C240 /* zbox.h */,\n\t\t\t\tE86143811C20833200D5C240 /* aligner_bt.cpp */,\n\t\t\t\tE86143821C20833200D5C240 /* aligner_cache.cpp */,\n\t\t\t\tE86143831C20833200D5C240 /* aligner_seed_policy.cpp 
*/,\n\t\t\t\tE86143841C20833200D5C240 /* aligner_seed.cpp */,\n\t\t\t\tE86143851C20833200D5C240 /* aligner_sw.cpp */,\n\t\t\t\tE86143861C20833200D5C240 /* aligner_swsse_ee_i16.cpp */,\n\t\t\t\tE86143871C20833200D5C240 /* aligner_swsse_ee_u8.cpp */,\n\t\t\t\tE86143881C20833200D5C240 /* aligner_swsse_loc_i16.cpp */,\n\t\t\t\tE86143891C20833200D5C240 /* aligner_swsse_loc_u8.cpp */,\n\t\t\t\tE861438A1C20833200D5C240 /* aligner_swsse.cpp */,\n\t\t\t\tE861438B1C20833200D5C240 /* alphabet.cpp */,\n\t\t\t\tE861438C1C20833200D5C240 /* bt2_idx.cpp */,\n\t\t\t\tE861438D1C20833200D5C240 /* ccnt_lut.cpp */,\n\t\t\t\tE861438E1C20833200D5C240 /* centrifuge_build_main.cpp */,\n\t\t\t\tE861438F1C20833200D5C240 /* centrifuge_build.cpp */,\n\t\t\t\tE86143901C20833200D5C240 /* centrifuge_compress.cpp */,\n\t\t\t\tE86143911C20833200D5C240 /* centrifuge_inspect.cpp */,\n\t\t\t\tE86143921C20833200D5C240 /* centrifuge_main.cpp */,\n\t\t\t\tE86143931C20833200D5C240 /* centrifuge_report.cpp */,\n\t\t\t\tE86143941C20833200D5C240 /* centrifuge.cpp */,\n\t\t\t\tE86143951C20833200D5C240 /* diff_sample.cpp */,\n\t\t\t\tE86143961C20833200D5C240 /* dp_framer.cpp */,\n\t\t\t\tE86143971C20833200D5C240 /* ds.cpp */,\n\t\t\t\tE86143981C20833200D5C240 /* edit.cpp */,\n\t\t\t\tE86143991C20833200D5C240 /* group_walk.cpp */,\n\t\t\t\tE861439A1C20833200D5C240 /* limit.cpp */,\n\t\t\t\tE861439B1C20833200D5C240 /* ls.cpp */,\n\t\t\t\tE861439C1C20833200D5C240 /* mask.cpp */,\n\t\t\t\tE861439D1C20833200D5C240 /* outq.cpp */,\n\t\t\t\tE861439E1C20833200D5C240 /* pat.cpp */,\n\t\t\t\tE861439F1C20833200D5C240 /* pe.cpp */,\n\t\t\t\tE86143A01C20833200D5C240 /* presets.cpp */,\n\t\t\t\tE86143A11C20833200D5C240 /* qual.cpp */,\n\t\t\t\tE86143A21C20833200D5C240 /* random_source.cpp */,\n\t\t\t\tE86143A31C20833200D5C240 /* random_util.cpp */,\n\t\t\t\tE86143A41C20833200D5C240 /* read_qseq.cpp */,\n\t\t\t\tE86143A51C20833200D5C240 /* ref_coord.cpp */,\n\t\t\t\tE86143A61C20833200D5C240 /* ref_read.cpp 
*/,\n\t\t\t\tE86143A71C20833200D5C240 /* reference.cpp */,\n\t\t\t\tE86143A81C20833200D5C240 /* scoring.cpp */,\n\t\t\t\tE86143A91C20833200D5C240 /* shmem.cpp */,\n\t\t\t\tE86143AA1C20833200D5C240 /* simple_func.cpp */,\n\t\t\t\tE86143AB1C20833200D5C240 /* sse_util.cpp */,\n\t\t\t\tE86143AC1C20833200D5C240 /* sstring.cpp */,\n\t\t\t\tE86143AD1C20833200D5C240 /* tinythread.cpp */,\n\t\t\t);\n\t\t\tname = Source;\n\t\t\tsourceTree = \"<group>\";\n\t\t};\n\t\tE8485E1E1C207FCB00F225FA /* Document */ = {\n\t\t\tisa = PBXGroup;\n\t\t\tchildren = (\n\t\t\t);\n\t\t\tname = Document;\n\t\t\tsourceTree = \"<group>\";\n\t\t};\n/* End PBXGroup section */\n\n/* Begin PBXNativeTarget section */\n\t\tE8485E121C207EF000F225FA /* centrifuge-buildx */ = {\n\t\t\tisa = PBXNativeTarget;\n\t\t\tbuildConfigurationList = E8485E1A1C207EF000F225FA /* Build configuration list for PBXNativeTarget \"centrifuge-buildx\" */;\n\t\t\tbuildPhases = (\n\t\t\t\tE8485E0F1C207EF000F225FA /* Sources */,\n\t\t\t\tE8485E101C207EF000F225FA /* Frameworks */,\n\t\t\t\tE8485E111C207EF000F225FA /* CopyFiles */,\n\t\t\t);\n\t\t\tbuildRules = (\n\t\t\t);\n\t\t\tdependencies = (\n\t\t\t);\n\t\t\tname = \"centrifuge-buildx\";\n\t\t\tproductName = centrifuge;\n\t\t\tproductReference = E8485E131C207EF000F225FA /* centrifuge-buildx */;\n\t\t\tproductType = \"com.apple.product-type.tool\";\n\t\t};\n\t\tE869A0491C2095A8007600C2 /* centrifugex */ = {\n\t\t\tisa = PBXNativeTarget;\n\t\t\tbuildConfigurationList = E869A0501C2095A8007600C2 /* Build configuration list for PBXNativeTarget \"centrifugex\" */;\n\t\t\tbuildPhases = (\n\t\t\t\tE869A0461C2095A8007600C2 /* Sources */,\n\t\t\t\tE869A0471C2095A8007600C2 /* Frameworks */,\n\t\t\t\tE869A0481C2095A8007600C2 /* CopyFiles */,\n\t\t\t);\n\t\t\tbuildRules = (\n\t\t\t);\n\t\t\tdependencies = (\n\t\t\t);\n\t\t\tname = centrifugex;\n\t\t\tproductName = centrifugex;\n\t\t\tproductReference = E869A04A1C2095A8007600C2 /* centrifugex */;\n\t\t\tproductType = 
\"com.apple.product-type.tool\";\n\t\t};\n\t\tE869A0541C2095B5007600C2 /* centrifuge-inspectx */ = {\n\t\t\tisa = PBXNativeTarget;\n\t\t\tbuildConfigurationList = E869A0591C2095B5007600C2 /* Build configuration list for PBXNativeTarget \"centrifuge-inspectx\" */;\n\t\t\tbuildPhases = (\n\t\t\t\tE869A0511C2095B5007600C2 /* Sources */,\n\t\t\t\tE869A0521C2095B5007600C2 /* Frameworks */,\n\t\t\t\tE869A0531C2095B5007600C2 /* CopyFiles */,\n\t\t\t);\n\t\t\tbuildRules = (\n\t\t\t);\n\t\t\tdependencies = (\n\t\t\t);\n\t\t\tname = \"centrifuge-inspectx\";\n\t\t\tproductName = \"centrifuge-inspectx\";\n\t\t\tproductReference = E869A0551C2095B5007600C2 /* centrifuge-inspectx */;\n\t\t\tproductType = \"com.apple.product-type.tool\";\n\t\t};\n\t\tE869A05F1C2095CA007600C2 /* centrifuge-compressx */ = {\n\t\t\tisa = PBXNativeTarget;\n\t\t\tbuildConfigurationList = E869A0641C2095CA007600C2 /* Build configuration list for PBXNativeTarget \"centrifuge-compressx\" */;\n\t\t\tbuildPhases = (\n\t\t\t\tE869A05C1C2095CA007600C2 /* Sources */,\n\t\t\t\tE869A05D1C2095CA007600C2 /* Frameworks */,\n\t\t\t\tE869A05E1C2095CA007600C2 /* CopyFiles */,\n\t\t\t);\n\t\t\tbuildRules = (\n\t\t\t);\n\t\t\tdependencies = (\n\t\t\t);\n\t\t\tname = \"centrifuge-compressx\";\n\t\t\tproductName = \"centrifuge-compressx\";\n\t\t\tproductReference = E869A0601C2095CA007600C2 /* centrifuge-compressx */;\n\t\t\tproductType = \"com.apple.product-type.tool\";\n\t\t};\n/* End PBXNativeTarget section */\n\n/* Begin PBXProject section */\n\t\tE8485E0B1C207EF000F225FA /* Project object */ = {\n\t\t\tisa = PBXProject;\n\t\t\tattributes = {\n\t\t\t\tLastUpgradeCheck = 0800;\n\t\t\t\tORGANIZATIONNAME = \"Daehwan Kim\";\n\t\t\t\tTargetAttributes = {\n\t\t\t\t\tE8485E121C207EF000F225FA = {\n\t\t\t\t\t\tCreatedOnToolsVersion = 7.2;\n\t\t\t\t\t};\n\t\t\t\t\tE869A0491C2095A8007600C2 = {\n\t\t\t\t\t\tCreatedOnToolsVersion = 7.2;\n\t\t\t\t\t};\n\t\t\t\t\tE869A0541C2095B5007600C2 = {\n\t\t\t\t\t\tCreatedOnToolsVersion = 
7.2;\n\t\t\t\t\t};\n\t\t\t\t\tE869A05F1C2095CA007600C2 = {\n\t\t\t\t\t\tCreatedOnToolsVersion = 7.2;\n\t\t\t\t\t};\n\t\t\t\t};\n\t\t\t};\n\t\t\tbuildConfigurationList = E8485E0E1C207EF000F225FA /* Build configuration list for PBXProject \"centrifuge\" */;\n\t\t\tcompatibilityVersion = \"Xcode 3.2\";\n\t\t\tdevelopmentRegion = English;\n\t\t\thasScannedForEncodings = 0;\n\t\t\tknownRegions = (\n\t\t\t\ten,\n\t\t\t);\n\t\t\tmainGroup = E8485E0A1C207EF000F225FA;\n\t\t\tproductRefGroup = E8485E141C207EF000F225FA /* Products */;\n\t\t\tprojectDirPath = \"\";\n\t\t\tprojectRoot = \"\";\n\t\t\ttargets = (\n\t\t\t\tE8485E121C207EF000F225FA /* centrifuge-buildx */,\n\t\t\t\tE869A0491C2095A8007600C2 /* centrifugex */,\n\t\t\t\tE869A0541C2095B5007600C2 /* centrifuge-inspectx */,\n\t\t\t\tE869A05F1C2095CA007600C2 /* centrifuge-compressx */,\n\t\t\t);\n\t\t};\n/* End PBXProject section */\n\n/* Begin PBXSourcesBuildPhase section */\n\t\tE8485E0F1C207EF000F225FA /* Sources */ = {\n\t\t\tisa = PBXSourcesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t\tE8AB5A231C209232009138A6 /* diff_sample.cpp in Sources */,\n\t\t\t\tE86143BA1C20833200D5C240 /* ccnt_lut.cpp in Sources */,\n\t\t\t\tE86143D41C20833200D5C240 /* reference.cpp in Sources */,\n\t\t\t\tE86143CF1C20833200D5C240 /* random_source.cpp in Sources */,\n\t\t\t\tE86143DA1C20833200D5C240 /* tinythread.cpp in Sources */,\n\t\t\t\tE86143BB1C20833200D5C240 /* centrifuge_build_main.cpp in Sources */,\n\t\t\t\tE86143C71C20833200D5C240 /* limit.cpp in Sources */,\n\t\t\t\tE86143B91C20833200D5C240 /* bt2_idx.cpp in Sources */,\n\t\t\t\tE86143B81C20833200D5C240 /* alphabet.cpp in Sources */,\n\t\t\t\tE86143D31C20833200D5C240 /* ref_read.cpp in Sources */,\n\t\t\t\tE86143C41C20833200D5C240 /* ds.cpp in Sources */,\n\t\t\t\tE86143BC1C20833200D5C240 /* centrifuge_build.cpp in Sources */,\n\t\t\t\tE86143D61C20833200D5C240 /* shmem.cpp in Sources */,\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 
0;\n\t\t};\n\t\tE869A0461C2095A8007600C2 /* Sources */ = {\n\t\t\tisa = PBXSourcesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t\tE869A0761C20A425007600C2 /* alphabet.cpp in Sources */,\n\t\t\t\tE869A0771C20A425007600C2 /* bt2_idx.cpp in Sources */,\n\t\t\t\tE869A0781C20A425007600C2 /* ccnt_lut.cpp in Sources */,\n\t\t\t\tE869A0791C20A425007600C2 /* ds.cpp in Sources */,\n\t\t\t\tE869A07A1C20A425007600C2 /* edit.cpp in Sources */,\n\t\t\t\tE869A07B1C20A425007600C2 /* random_source.cpp in Sources */,\n\t\t\t\tE869A07C1C20A425007600C2 /* ref_read.cpp in Sources */,\n\t\t\t\tE869A07D1C20A425007600C2 /* reference.cpp in Sources */,\n\t\t\t\tE869A07E1C20A425007600C2 /* shmem.cpp in Sources */,\n\t\t\t\tE869A07F1C20A425007600C2 /* tinythread.cpp in Sources */,\n\t\t\t\tE869A0751C20A308007600C2 /* limit.cpp in Sources */,\n\t\t\t\tE869A0731C209BE6007600C2 /* centrifuge_main.cpp in Sources */,\n\t\t\t\tE869A0741C209BE6007600C2 /* centrifuge.cpp in Sources */,\n\t\t\t\tE869A0671C209BCC007600C2 /* aligner_seed_policy.cpp in Sources */,\n\t\t\t\tE869A0681C209BCC007600C2 /* mask.cpp in Sources */,\n\t\t\t\tE869A0691C209BCC007600C2 /* outq.cpp in Sources */,\n\t\t\t\tE869A06A1C209BCC007600C2 /* pat.cpp in Sources */,\n\t\t\t\tE869A06B1C209BCC007600C2 /* pe.cpp in Sources */,\n\t\t\t\tE869A06C1C209BCC007600C2 /* presets.cpp in Sources */,\n\t\t\t\tE869A06D1C209BCC007600C2 /* qual.cpp in Sources */,\n\t\t\t\tE869A06E1C209BCC007600C2 /* random_util.cpp in Sources */,\n\t\t\t\tE869A06F1C209BCC007600C2 /* read_qseq.cpp in Sources */,\n\t\t\t\tE869A0701C209BCC007600C2 /* ref_coord.cpp in Sources */,\n\t\t\t\tE869A0711C209BCC007600C2 /* scoring.cpp in Sources */,\n\t\t\t\tE869A0721C209BCC007600C2 /* simple_func.cpp in Sources */,\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n\t\tE869A0511C2095B5007600C2 /* Sources */ = {\n\t\t\tisa = PBXSourcesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = 
(\n\t\t\t\tE869A0801C20A50B007600C2 /* alphabet.cpp in Sources */,\n\t\t\t\tE869A0811C20A50B007600C2 /* bt2_idx.cpp in Sources */,\n\t\t\t\tE869A0821C20A50B007600C2 /* ccnt_lut.cpp in Sources */,\n\t\t\t\tE869A0831C20A50B007600C2 /* centrifuge_inspect.cpp in Sources */,\n\t\t\t\tE869A0841C20A50B007600C2 /* ds.cpp in Sources */,\n\t\t\t\tE869A0851C20A50B007600C2 /* edit.cpp in Sources */,\n\t\t\t\tE869A0861C20A50B007600C2 /* random_source.cpp in Sources */,\n\t\t\t\tE869A0871C20A50B007600C2 /* ref_read.cpp in Sources */,\n\t\t\t\tE869A0881C20A50B007600C2 /* reference.cpp in Sources */,\n\t\t\t\tE869A0891C20A50B007600C2 /* shmem.cpp in Sources */,\n\t\t\t\tE869A08A1C20A50B007600C2 /* tinythread.cpp in Sources */,\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n\t\tE869A05C1C2095CA007600C2 /* Sources */ = {\n\t\t\tisa = PBXSourcesBuildPhase;\n\t\t\tbuildActionMask = 2147483647;\n\t\t\tfiles = (\n\t\t\t);\n\t\t\trunOnlyForDeploymentPostprocessing = 0;\n\t\t};\n/* End PBXSourcesBuildPhase section */\n\n/* Begin XCBuildConfiguration section */\n\t\tE8485E181C207EF000F225FA /* Debug */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tALWAYS_SEARCH_USER_PATHS = NO;\n\t\t\t\tCLANG_CXX_LANGUAGE_STANDARD = \"gnu++0x\";\n\t\t\t\tCLANG_CXX_LIBRARY = \"libc++\";\n\t\t\t\tCLANG_ENABLE_MODULES = YES;\n\t\t\t\tCLANG_ENABLE_OBJC_ARC = YES;\n\t\t\t\tCLANG_WARN_BOOL_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_CONSTANT_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;\n\t\t\t\tCLANG_WARN_EMPTY_BODY = YES;\n\t\t\t\tCLANG_WARN_ENUM_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_INFINITE_RECURSION = YES;\n\t\t\t\tCLANG_WARN_INT_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;\n\t\t\t\tCLANG_WARN_SUSPICIOUS_MOVE = YES;\n\t\t\t\tCLANG_WARN_UNREACHABLE_CODE = YES;\n\t\t\t\tCLANG_WARN__DUPLICATE_METHOD_MATCH = YES;\n\t\t\t\tCODE_SIGN_IDENTITY = \"-\";\n\t\t\t\tCOPY_PHASE_STRIP = NO;\n\t\t\t\tDEBUG_INFORMATION_FORMAT = 
dwarf;\n\t\t\t\tENABLE_STRICT_OBJC_MSGSEND = YES;\n\t\t\t\tENABLE_TESTABILITY = YES;\n\t\t\t\tGCC_C_LANGUAGE_STANDARD = gnu99;\n\t\t\t\tGCC_DYNAMIC_NO_PIC = NO;\n\t\t\t\tGCC_NO_COMMON_BLOCKS = YES;\n\t\t\t\tGCC_OPTIMIZATION_LEVEL = 0;\n\t\t\t\tGCC_PREPROCESSOR_DEFINITIONS = (\n\t\t\t\t\t\"$(inherited)\",\n\t\t\t\t\t\"DEBUG=1\",\n\t\t\t\t\tBOWTIE_MM,\n\t\t\t\t\t\"MACOS=1\",\n\t\t\t\t\tPOPCNT_CAPABILITY,\n\t\t\t\t\tBOWTIE2,\n\t\t\t\t\tBOWTIE_64BIT_INDEX,\n\t\t\t\t\tCENTRIFUGE,\n\t\t\t\t\t\"CENTRIFUGE_VERSION=\\\"\\\\\\\"`cat VERSION`\\\\\\\"\\\"\",\n\t\t\t\t\t\"BUILD_HOST=\\\"\\\\\\\"`hostname`\\\\\\\"\\\"\",\n\t\t\t\t\t\"BUILD_TIME=\\\"\\\\\\\"`date`\\\\\\\"\\\"\",\n\t\t\t\t\t\"COMPILER_VERSION=\\\"\\\\\\\"`$(CXX) -v 2>&1 | tail -1`\\\\\\\"\\\"\",\n\t\t\t\t\t\"COMPILER_OPTIONS=\\\"\\\\\\\"test\\\\\\\"\\\"\",\n\t\t\t\t);\n\t\t\t\tGCC_WARN_64_TO_32_BIT_CONVERSION = NO;\n\t\t\t\tGCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;\n\t\t\t\tGCC_WARN_UNDECLARED_SELECTOR = YES;\n\t\t\t\tGCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;\n\t\t\t\tGCC_WARN_UNUSED_FUNCTION = YES;\n\t\t\t\tGCC_WARN_UNUSED_VARIABLE = YES;\n\t\t\t\tMACOSX_DEPLOYMENT_TARGET = 10.11;\n\t\t\t\tMTL_ENABLE_DEBUG_INFO = YES;\n\t\t\t\tONLY_ACTIVE_ARCH = YES;\n\t\t\t\tSDKROOT = macosx;\n\t\t\t};\n\t\t\tname = Debug;\n\t\t};\n\t\tE8485E191C207EF000F225FA /* Release */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tALWAYS_SEARCH_USER_PATHS = NO;\n\t\t\t\tCLANG_CXX_LANGUAGE_STANDARD = \"gnu++0x\";\n\t\t\t\tCLANG_CXX_LIBRARY = \"libc++\";\n\t\t\t\tCLANG_ENABLE_MODULES = YES;\n\t\t\t\tCLANG_ENABLE_OBJC_ARC = YES;\n\t\t\t\tCLANG_WARN_BOOL_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_CONSTANT_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;\n\t\t\t\tCLANG_WARN_EMPTY_BODY = YES;\n\t\t\t\tCLANG_WARN_ENUM_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_INFINITE_RECURSION = YES;\n\t\t\t\tCLANG_WARN_INT_CONVERSION = YES;\n\t\t\t\tCLANG_WARN_OBJC_ROOT_CLASS = 
YES_ERROR;\n\t\t\t\tCLANG_WARN_SUSPICIOUS_MOVE = YES;\n\t\t\t\tCLANG_WARN_UNREACHABLE_CODE = YES;\n\t\t\t\tCLANG_WARN__DUPLICATE_METHOD_MATCH = YES;\n\t\t\t\tCODE_SIGN_IDENTITY = \"-\";\n\t\t\t\tCOPY_PHASE_STRIP = NO;\n\t\t\t\tDEBUG_INFORMATION_FORMAT = \"dwarf-with-dsym\";\n\t\t\t\tENABLE_NS_ASSERTIONS = NO;\n\t\t\t\tENABLE_STRICT_OBJC_MSGSEND = YES;\n\t\t\t\tGCC_C_LANGUAGE_STANDARD = gnu99;\n\t\t\t\tGCC_NO_COMMON_BLOCKS = YES;\n\t\t\t\tGCC_PREPROCESSOR_DEFINITIONS = (\n\t\t\t\t\t\"$(inherited)\",\n\t\t\t\t\t\"DEBUG=0\",\n\t\t\t\t\tBOWTIE_MM,\n\t\t\t\t\t\"MACOS=1\",\n\t\t\t\t\tPOPCNT_CAPABILITY,\n\t\t\t\t\tBOWTIE2,\n\t\t\t\t\tBOWTIE_64BIT_INDEX,\n\t\t\t\t\tCENTRIFUGE,\n\t\t\t\t\t\"CENTRIFUGE_VERSION=\\\"\\\\\\\"`cat VERSION`\\\\\\\"\\\"\",\n\t\t\t\t\t\"BUILD_HOST=\\\"\\\\\\\"`hostname`\\\\\\\"\\\"\",\n\t\t\t\t\t\"BUILD_TIME=\\\"\\\\\\\"`date`\\\\\\\"\\\"\",\n\t\t\t\t\t\"COMPILER_VERSION=\\\"\\\\\\\"`$(CXX) -v 2>&1 | tail -1`\\\\\\\"\\\"\",\n\t\t\t\t\t\"COMPILER_OPTIONS=\\\"\\\\\\\"test\\\\\\\"\\\"\",\n\t\t\t\t);\n\t\t\t\tGCC_WARN_64_TO_32_BIT_CONVERSION = NO;\n\t\t\t\tGCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;\n\t\t\t\tGCC_WARN_UNDECLARED_SELECTOR = YES;\n\t\t\t\tGCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;\n\t\t\t\tGCC_WARN_UNUSED_FUNCTION = YES;\n\t\t\t\tGCC_WARN_UNUSED_VARIABLE = YES;\n\t\t\t\tMACOSX_DEPLOYMENT_TARGET = 10.11;\n\t\t\t\tMTL_ENABLE_DEBUG_INFO = NO;\n\t\t\t\tSDKROOT = macosx;\n\t\t\t};\n\t\t\tname = Release;\n\t\t};\n\t\tE8485E1B1C207EF000F225FA /* Debug */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tGCC_PREPROCESSOR_DEFINITIONS = \"$(inherited)\";\n\t\t\t\tHEADER_SEARCH_PATHS = /usr/local/include;\n\t\t\t\tLIBRARY_SEARCH_PATHS = /usr/local/lib;\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Debug;\n\t\t};\n\t\tE8485E1C1C207EF000F225FA /* Release */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tHEADER_SEARCH_PATHS = /usr/local/include;\n\t\t\t\tLIBRARY_SEARCH_PATHS = 
/usr/local/lib;\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Release;\n\t\t};\n\t\tE869A04E1C2095A8007600C2 /* Debug */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tGCC_PREPROCESSOR_DEFINITIONS = (\n\t\t\t\t\t\"DEBUG=1\",\n\t\t\t\t\t\"$(inherited)\",\n\t\t\t\t);\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Debug;\n\t\t};\n\t\tE869A04F1C2095A8007600C2 /* Release */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Release;\n\t\t};\n\t\tE869A05A1C2095B5007600C2 /* Debug */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tGCC_PREPROCESSOR_DEFINITIONS = (\n\t\t\t\t\t\"DEBUG=1\",\n\t\t\t\t\t\"$(inherited)\",\n\t\t\t\t);\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Debug;\n\t\t};\n\t\tE869A05B1C2095B5007600C2 /* Release */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Release;\n\t\t};\n\t\tE869A0651C2095CA007600C2 /* Debug */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tGCC_PREPROCESSOR_DEFINITIONS = (\n\t\t\t\t\t\"DEBUG=1\",\n\t\t\t\t\t\"$(inherited)\",\n\t\t\t\t);\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Debug;\n\t\t};\n\t\tE869A0661C2095CA007600C2 /* Release */ = {\n\t\t\tisa = XCBuildConfiguration;\n\t\t\tbuildSettings = {\n\t\t\t\tPRODUCT_NAME = \"$(TARGET_NAME)\";\n\t\t\t};\n\t\t\tname = Release;\n\t\t};\n/* End XCBuildConfiguration section */\n\n/* Begin XCConfigurationList section */\n\t\tE8485E0E1C207EF000F225FA /* Build configuration list for PBXProject \"centrifuge\" */ = {\n\t\t\tisa = XCConfigurationList;\n\t\t\tbuildConfigurations = (\n\t\t\t\tE8485E181C207EF000F225FA /* Debug */,\n\t\t\t\tE8485E191C207EF000F225FA /* Release */,\n\t\t\t);\n\t\t\tdefaultConfigurationIsVisible = 0;\n\t\t\tdefaultConfigurationName = 
Release;\n\t\t};\n\t\tE8485E1A1C207EF000F225FA /* Build configuration list for PBXNativeTarget \"centrifuge-buildx\" */ = {\n\t\t\tisa = XCConfigurationList;\n\t\t\tbuildConfigurations = (\n\t\t\t\tE8485E1B1C207EF000F225FA /* Debug */,\n\t\t\t\tE8485E1C1C207EF000F225FA /* Release */,\n\t\t\t);\n\t\t\tdefaultConfigurationIsVisible = 0;\n\t\t\tdefaultConfigurationName = Release;\n\t\t};\n\t\tE869A0501C2095A8007600C2 /* Build configuration list for PBXNativeTarget \"centrifugex\" */ = {\n\t\t\tisa = XCConfigurationList;\n\t\t\tbuildConfigurations = (\n\t\t\t\tE869A04E1C2095A8007600C2 /* Debug */,\n\t\t\t\tE869A04F1C2095A8007600C2 /* Release */,\n\t\t\t);\n\t\t\tdefaultConfigurationIsVisible = 0;\n\t\t\tdefaultConfigurationName = Release;\n\t\t};\n\t\tE869A0591C2095B5007600C2 /* Build configuration list for PBXNativeTarget \"centrifuge-inspectx\" */ = {\n\t\t\tisa = XCConfigurationList;\n\t\t\tbuildConfigurations = (\n\t\t\t\tE869A05A1C2095B5007600C2 /* Debug */,\n\t\t\t\tE869A05B1C2095B5007600C2 /* Release */,\n\t\t\t);\n\t\t\tdefaultConfigurationIsVisible = 0;\n\t\t\tdefaultConfigurationName = Release;\n\t\t};\n\t\tE869A0641C2095CA007600C2 /* Build configuration list for PBXNativeTarget \"centrifuge-compressx\" */ = {\n\t\t\tisa = XCConfigurationList;\n\t\t\tbuildConfigurations = (\n\t\t\t\tE869A0651C2095CA007600C2 /* Debug */,\n\t\t\t\tE869A0661C2095CA007600C2 /* Release */,\n\t\t\t);\n\t\t\tdefaultConfigurationIsVisible = 0;\n\t\t\tdefaultConfigurationName = Release;\n\t\t};\n/* End XCConfigurationList section */\n\t};\n\trootObject = E8485E0B1C207EF000F225FA /* Project object */;\n}\n"
  },
  {
    "path": "centrifuge_build.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include <fstream>\n#include <string>\n#include <cassert>\n#include <getopt.h>\n#include \"assert_helpers.h\"\n#include \"endian_swap.h\"\n#include \"bt2_idx.h\"\n#include \"bt2_io.h\"\n#include \"bt2_util.h\"\n#include \"formats.h\"\n#include \"sequence_io.h\"\n#include \"tokenize.h\"\n#include \"timer.h\"\n#include \"ref_read.h\"\n#include \"filebuf.h\"\n#include \"reference.h\"\n#include \"ds.h\"\n\n/**\n * \\file Driver for the bowtie-build indexing tool.\n */\n\n// Build parameters\nint verbose;\nstatic int sanityCheck;\nstatic int format;\nstatic TIndexOffU bmax;\nstatic TIndexOffU bmaxMultSqrt;\nstatic uint32_t bmaxDivN;\nstatic int dcv;\nstatic int noDc;\nstatic int entireSA;\nstatic int seed;\nstatic int showVersion;\nstatic int nthreads;      // number of pthreads operating concurrently\n//   Ebwt parameters\nstatic int32_t lineRate;\nstatic int32_t linesPerSide;\nstatic int32_t offRate;\nstatic int32_t ftabChars;\nstatic int32_t localOffRate;\nstatic int32_t localFtabChars;\nstatic string conversion_table_fname; // conversion table file name\nstatic string taxonomy_fname; // taxonomy tree file name\nstatic string name_table_fname; // name table file name\nstatic string 
size_table_fname; // contig size table file name\nstatic int  bigEndian;\nstatic bool nsToAs;\nstatic bool doSaFile;  // make a file with just the suffix array in it\nstatic bool doBwtFile; // make a file with just the BWT string in it\nstatic bool autoMem;\nstatic bool packed;\nstatic bool writeRef;\nstatic bool justRef;\nstatic bool reverseEach;\nstatic string wrapper;\nstatic int kmer_count;\n\nstatic void resetOptions() {\n\tverbose        = true;  // be talkative (default)\n\tsanityCheck    = 0;     // do slow sanity checks\n\tformat         = FASTA; // input sequence format\n\tbmax           = OFF_MASK; // max blockwise SA bucket size\n\tbmaxMultSqrt   = OFF_MASK; // same, as multiplier of sqrt(n)\n\tbmaxDivN       = 4;          // same, as divisor of n\n\tdcv            = 1024;  // bwise SA difference-cover sample sz\n\tnoDc           = 0;     // disable difference-cover sample\n\tentireSA       = 0;     // 1 = disable blockwise SA\n\tseed           = 0;     // srandom seed\n\tshowVersion    = 0;     // just print version and quit?\n    nthreads       = 1;\n\t//   Ebwt parameters\n\tlineRate       = Ebwt<TIndexOffU>::default_lineRate;\n\tlinesPerSide   = 1;  // 1 64-byte line on a side\n\toffRate        = 4;  // sample 1 out of 16 SA elts\n\tftabChars      = 10; // 10 chars in initial lookup table\n    localOffRate   = 3;\n    localFtabChars = 6;\n    conversion_table_fname    = \"\";\n    taxonomy_fname = \"\";\n    name_table_fname = \"\";\n    size_table_fname = \"\";\n\tbigEndian      = 0;  // little endian\n\tnsToAs         = false; // convert reference Ns to As prior to indexing\n    doSaFile       = false; // make a file with just the suffix array in it\n    doBwtFile      = false; // make a file with just the BWT string in it\n\tautoMem        = true;  // automatically adjust memory usage parameters\n\tpacked         = false; //\n\twriteRef       = true;  // write compact reference to .3.bt2/.4.bt2\n\tjustRef        = false; // *just* write compact 
reference, don't index\n\treverseEach    = false;\n    wrapper.clear();\n    kmer_count     = 0; // k : k-mer to be counted\n}\n\n// Argument constants for getopts\nenum {\n\tARG_BMAX = 256,\n\tARG_BMAX_MULT,\n\tARG_BMAX_DIV,\n\tARG_DCV,\n\tARG_SEED,\n\tARG_CUTOFF,\n\tARG_PMAP,\n\tARG_NTOA,\n\tARG_USAGE,\n\tARG_REVERSE_EACH,\n    ARG_SA,\n\tARG_WRAPPER,\n    ARG_THREADS,\n    ARG_LOCAL_OFFRATE,\n    ARG_LOCAL_FTABCHARS,\n    ARG_CONVERSION_TABLE,\n    ARG_TAXONOMY_TREE,\n    ARG_NAME_TABLE,\n    ARG_SIZE_TABLE,\n    ARG_KMER_COUNT,\n};\n\n/**\n * Print a detailed usage message to the provided output stream.\n */\nstatic void printUsage(ostream& out) {\n\tout << \"Centrifuge version \" << string(CENTRIFUGE_VERSION).c_str() << \" by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)\" << endl;\n\tstring tool_name = \"centrifuge-build-bin\";\n\tif(wrapper == \"basic-0\") {\n\t\ttool_name = \"centrifuge-build\";\n\t}\n    \n\tout << \"Usage: centrifuge-build [options]* --conversion-table <table file> --taxonomy-tree <taxonomy tree file> <reference_in> <cf_index_base>\" << endl\n\t    << \"    reference_in            comma-separated list of files with ref sequences\" << endl\n\t    << \"    centrifuge_index_base          write \" << gEbwt_ext << \" data to files with this dir/basename\" << endl\n        << \"Options:\" << endl\n        << \"    -c                      reference sequences given on cmd line (as\" << endl\n        << \"                            <reference_in>)\" << endl;\n    if(wrapper == \"basic-0\") {\n        //out << \"    --large-index           force generated index to be 'large', even if ref\" << endl\n\t//\t<< \"                            has fewer than 4 billion nucleotides\" << endl;\n\t}\n    out << \"    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting\" << endl\n\t    << \"    --bmax <int>            max bucket sz for blockwise suffix-array builder\" << endl\n\t    << \"    --bmaxdivn <int>     
   max bucket sz as divisor of ref len (default: 4)\" << endl\n\t    << \"    --dcv <int>             diff-cover period for blockwise (default: 1024)\" << endl\n\t    << \"    --nodc                  disable diff-cover (algorithm becomes quadratic)\" << endl\n\t    << \"    -r/--noref              don't build .3/.4.bt2 (packed reference) portion\" << endl\n\t    << \"    -3/--justref            just build .3/.4.bt2 (packed reference) portion\" << endl\n\t    << \"    -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 4)\" << endl\n\t    << \"    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)\" << endl\n        << \"    --conversion-table <file name>  a table that converts any id to a taxonomy id\" << endl\n        << \"    --taxonomy-tree    <file name>  taxonomy tree\" << endl\n        << \"    --name-table       <file name>  names corresponding to taxonomic IDs\" << endl\n        << \"    --size-table       <file name>  table of contig (or genome) sizes\" << endl\n\t    << \"    --seed <int>            seed for random number generator\" << endl\n\t    << \"    -q/--quiet              suppress verbose output\" << endl\n        << \"    -p/--threads <int>      number of threads to launch (default: 1)\" << endl\n        << \"    --kmer-count <int>      k size for counting the number of distinct k-mers\" << endl\n\t    << \"    -h/--help               print detailed description of tool and its options\" << endl\n\t    << \"    --usage                 print this usage message\" << endl\n\t    << \"    --version               print version information and quit\" << endl\n\t    ;\n    \n    if(wrapper.empty()) {\n\t\tcerr << endl\n        << \"*** Warning ***\" << endl\n        << \"'\" << tool_name << \"' was run directly.  
It is recommended \"\n        << \"that you run the wrapper script 'centrifuge-build' instead.\"\n        << endl << endl;\n\t}\n}\n\nstatic const char *short_options = \"qrap:h?nscfl:i:o:t:h:3C\";\n\nstatic struct option long_options[] = {\n\t{(char*)\"quiet\",          no_argument,       0,            'q'},\n\t{(char*)\"sanity\",         no_argument,       0,            's'},\n\t{(char*)\"packed\",         no_argument,       0,            'p'},\n\t{(char*)\"little\",         no_argument,       &bigEndian,   0},\n\t{(char*)\"big\",            no_argument,       &bigEndian,   1},\n\t{(char*)\"bmax\",           required_argument, 0,            ARG_BMAX},\n\t{(char*)\"bmaxmultsqrt\",   required_argument, 0,            ARG_BMAX_MULT},\n\t{(char*)\"bmaxdivn\",       required_argument, 0,            ARG_BMAX_DIV},\n\t{(char*)\"dcv\",            required_argument, 0,            ARG_DCV},\n\t{(char*)\"nodc\",           no_argument,       &noDc,        1},\n\t{(char*)\"seed\",           required_argument, 0,            ARG_SEED},\n\t{(char*)\"entiresa\",       no_argument,       &entireSA,    1},\n\t{(char*)\"version\",        no_argument,       &showVersion, 1},\n\t{(char*)\"noauto\",         no_argument,       0,            'a'},\n\t{(char*)\"noblocks\",       required_argument, 0,            'n'},\n    {(char*)\"threads\",        required_argument, 0,            ARG_THREADS},\n\t{(char*)\"linerate\",       required_argument, 0,            'l'},\n\t{(char*)\"linesperside\",   required_argument, 0,            'i'},\n\t{(char*)\"offrate\",        required_argument, 0,            'o'},\n\t{(char*)\"ftabchars\",      required_argument, 0,            't'},\n    {(char*)\"localoffrate\",   required_argument, 0,            ARG_LOCAL_OFFRATE},\n\t{(char*)\"localftabchars\", required_argument, 0,            ARG_LOCAL_FTABCHARS},\n    {(char*)\"conversion-table\", required_argument, 0,          ARG_CONVERSION_TABLE},\n    {(char*)\"taxonomy-tree\",    required_argument, 0,          
ARG_TAXONOMY_TREE},\n    {(char*)\"name-table\",       required_argument, 0,          ARG_NAME_TABLE},\n    {(char*)\"size-table\",       required_argument, 0,          ARG_SIZE_TABLE},\n\t{(char*)\"help\",           no_argument,       0,            'h'},\n\t{(char*)\"ntoa\",           no_argument,       0,            ARG_NTOA},\n\t{(char*)\"justref\",        no_argument,       0,            '3'},\n\t{(char*)\"noref\",          no_argument,       0,            'r'},\n\t{(char*)\"kmer-count\",     required_argument, 0,            ARG_KMER_COUNT},\n    {(char*)\"sa\",             no_argument,       0,            ARG_SA},\n\t{(char*)\"reverse-each\",   no_argument,       0,            ARG_REVERSE_EACH},\n\t{(char*)\"usage\",          no_argument,       0,            ARG_USAGE},\n    {(char*)\"wrapper\",        required_argument, 0,            ARG_WRAPPER},\n\t{(char*)0, 0, 0, 0} // terminator\n};\n\n/**\n * Parse an int out of optarg and enforce that it be at least 'lower';\n * if it is less than 'lower', or is not a valid number, then output the\n * given error message and exit with an error and a usage message.\n */\ntemplate<typename T>\nstatic T parseNumber(T lower, const char *errmsg) {\n\tchar *endPtr = NULL;\n\tT t = (T)strtoll(optarg, &endPtr, 10);\n\t// strtoll always sets endPtr, so a NULL check cannot detect bad input;\n\t// require that digits were consumed and that nothing trails them\n\tif (endPtr != optarg && *endPtr == '\\0') {\n\t\tif (t < lower) {\n\t\t\tcerr << errmsg << endl;\n\t\t\tprintUsage(cerr);\n\t\t\tthrow 1;\n\t\t}\n\t\treturn t;\n\t}\n\tcerr << errmsg << endl;\n\tprintUsage(cerr);\n\tthrow 1;\n\treturn -1;\n}\n\n/**\n * Read command-line arguments\n */\nstatic void parseOptions(int argc, const char **argv) {\n\tint option_index = 0;\n\tint next_option;\n\tdo {\n\t\tnext_option = getopt_long(\n\t\t\targc, const_cast<char**>(argv),\n\t\t\tshort_options, long_options, &option_index);\n\t\tswitch (next_option) {\n            case ARG_WRAPPER:\n\t\t\t\twrapper = optarg;\n\t\t\t\tbreak;\n\t\t\tcase 'f': format = FASTA; break;\n\t\t\tcase 'c': format = CMDLINE; break;\n            case ARG_THREADS:\n\t\t\tcase 'p': nthreads = 
parseNumber<int>(1, \"-p/--threads arg must be at least 1\");\n                break;\n\t\t\tcase 'C':\n\t\t\t\tcerr << \"Error: -C specified but Bowtie 2 does not support colorspace input.\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t\tbreak;\n\t\t\tcase 'l':\n\t\t\t\tlineRate = parseNumber<int>(3, \"-l/--linerate arg must be at least 3\");\n\t\t\t\tbreak;\n\t\t\tcase 'i':\n\t\t\t\tlinesPerSide = parseNumber<int>(1, \"-i/--linesperside arg must be at least 1\");\n\t\t\t\tbreak;\n\t\t\tcase 'o':\n\t\t\t\toffRate = parseNumber<int>(0, \"-o/--offrate arg must be at least 0\");\n\t\t\t\tbreak;\n            case ARG_LOCAL_OFFRATE:\n                localOffRate = parseNumber<int>(0, \"--localoffrate arg must be at least 0\");\n                break;\n\t\t\tcase '3':\n\t\t\t\tjustRef = true;\n\t\t\t\tbreak;\n\t\t\tcase 't':\n\t\t\t\tftabChars = parseNumber<int>(1, \"-t/--ftabchars arg must be at least 1\");\n\t\t\t\tbreak;\n            case ARG_LOCAL_FTABCHARS:\n\t\t\t\tlocalFtabChars = parseNumber<int>(1, \"--localftabchars arg must be at least 1\");\n\t\t\t\tbreak;\n\t\t\tcase 'n':\n\t\t\t\t// all f-s is used to mean \"not set\", so put 'e' on end\n\t\t\t\tbmax = 0xfffffffe;\n\t\t\t\tbreak;\n\t\t\tcase 'h':\n\t\t\tcase ARG_USAGE:\n\t\t\t\tprintUsage(cout);\n\t\t\t\tthrow 0;\n\t\t\t\tbreak;\n\t\t\tcase ARG_BMAX:\n\t\t\t\tbmax = parseNumber<TIndexOffU>(1, \"--bmax arg must be at least 1\");\n\t\t\t\tbmaxMultSqrt = OFF_MASK; // don't use multSqrt\n\t\t\t\tbmaxDivN = 0xffffffff;     // don't use divN\n\t\t\t\tbreak;\n\t\t\tcase ARG_BMAX_MULT:\n\t\t\t\tbmaxMultSqrt = parseNumber<TIndexOffU>(1, \"--bmaxmultsqrt arg must be at least 1\");\n\t\t\t\tbmax = OFF_MASK;     // don't use bmax\n\t\t\t\tbmaxDivN = 0xffffffff; // don't use divN\n\t\t\t\tbreak;\n\t\t\tcase ARG_BMAX_DIV:\n\t\t\t\tbmaxDivN = parseNumber<uint32_t>(1, \"--bmaxdivn arg must be at least 1\");\n\t\t\t\tbmax = OFF_MASK;         // don't use bmax\n\t\t\t\tbmaxMultSqrt = OFF_MASK; // don't use 
multSqrt\n\t\t\t\tbreak;\n\t\t\tcase ARG_DCV:\n\t\t\t\tdcv = parseNumber<int>(3, \"--dcv arg must be at least 3\");\n\t\t\t\tbreak;\n\t\t\tcase ARG_SEED:\n\t\t\t\tseed = parseNumber<int>(0, \"--seed arg must be at least 0\");\n\t\t\t\tbreak;\n\t\t\tcase ARG_REVERSE_EACH:\n\t\t\t\treverseEach = true;\n\t\t\t\tbreak;\n            case ARG_SA:\n                doSaFile = true;\n                break;\n\t\t\tcase ARG_NTOA: nsToAs = true; break;\n            case ARG_CONVERSION_TABLE:\n                conversion_table_fname = optarg;\n                break;\n            case ARG_TAXONOMY_TREE:\n                taxonomy_fname = optarg;\n                break;\n            case ARG_NAME_TABLE:\n                name_table_fname = optarg;\n                break;\n            case ARG_SIZE_TABLE:\n                size_table_fname = optarg;\n                break;\n            case ARG_KMER_COUNT:\n                kmer_count = parseNumber<int>(1, \"--kmer-count arg must be at least 1\");\n                break;\n\t\t\tcase 'a': autoMem = false; break;\n\t\t\tcase 'q': verbose = false; break;\n\t\t\tcase 's': sanityCheck = true; break;\n\t\t\tcase 'r': writeRef = false; break;\n\n\t\t\tcase -1: /* Done with options. */\n\t\t\t\tbreak;\n\t\t\tcase 0:\n\t\t\t\tif (long_options[option_index].flag != 0)\n\t\t\t\t\tbreak;\n\t\t\tdefault:\n\t\t\t\tprintUsage(cerr);\n\t\t\t\tthrow 1;\n\t\t}\n\t} while(next_option != -1);\n\tif(bmax < 40) {\n\t\tcerr << \"Warning: specified bmax is very small (\" << bmax << \").  This can lead to\" << endl\n\t\t     << \"extremely slow performance and memory exhaustion.  Perhaps you meant to specify\" << endl\n\t\t     << \"a small --bmaxdivn?\" << endl;\n\t}\n}\n\nEList<string> filesWritten;\n\n/**\n * Delete all the index files that we tried to create.  
For when we had to\n * abort the index-building process due to an error.\n */\nstatic void deleteIdxFiles(\n                           const string& outfile,\n                           bool doRef,\n                           bool justRef)\n{\n\t\n\tfor(size_t i = 0; i < filesWritten.size(); i++) {\n\t\tcerr << \"Deleting \\\"\" << filesWritten[i].c_str()\n\t\t     << \"\\\" file written during aborted indexing attempt.\" << endl;\n\t\tremove(filesWritten[i].c_str());\n\t}\n}\n\nextern void initializeCntLut();\n\n/**\n * Drive the index construction process and optionally sanity-check the\n * result.\n */\ntemplate<typename TStr>\nstatic void driver(\n                   const string& infile,\n                   EList<string>& infiles,\n                   const string& conversion_table_fname,\n                   const string& taxonomy_fname,\n                   const string& name_table_fname,\n                   const string& size_table_fname,\n                   const string& outfile,\n                   bool packed,\n                   int reverse)\n{\n    initializeCntLut();\n\tEList<FileBuf*> is(MISC_CAT);\n\tbool bisulfite = false;\n\tRefReadInParams refparams(false, reverse, nsToAs, bisulfite);\n\tassert_gt(infiles.size(), 0);\n\tif(format == CMDLINE) {\n\t\t// Adapt sequence strings to stringstreams open for input\n\t\tstringstream *ss = new stringstream();\n\t\tfor(size_t i = 0; i < infiles.size(); i++) {\n\t\t\t(*ss) << \">\" << i << endl << infiles[i].c_str() << endl;\n\t\t}\n\t\tFileBuf *fb = new FileBuf(ss);\n\t\tassert(fb != NULL);\n\t\tassert(!fb->eof());\n\t\tassert(fb->get() == '>');\n\t\tASSERT_ONLY(fb->reset());\n\t\tassert(!fb->eof());\n\t\tis.push_back(fb);\n\t} else {\n\t\t// Adapt sequence files to ifstreams\n\t\tfor(size_t i = 0; i < infiles.size(); i++) {\n\t\t\tFILE *f = fopen(infiles[i].c_str(), \"r\");\n\t\t\tif (f == NULL) {\n\t\t\t\tcerr << \"Error: could not open \"<< infiles[i].c_str() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tFileBuf 
*fb = new FileBuf(f);\n\t\t\tassert(fb != NULL);\n\t\t\tif(fb->peek() == -1 || fb->eof()) {\n\t\t\t\tcerr << \"Warning: Empty fasta file: '\" << infiles[i].c_str() << \"'\" << endl;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tassert(!fb->eof());\n\t\t\tassert(fb->get() == '>');\n\t\t\tASSERT_ONLY(fb->reset());\n\t\t\tassert(!fb->eof());\n\t\t\tis.push_back(fb);\n\t\t}\n\t}\n\tif(is.empty()) {\n\t\tcerr << \"Warning: All fasta inputs were empty\" << endl;\n\t\tthrow 1;\n\t}\n\t// Vector for the ordered list of \"records\" comprising the input\n\t// sequences.  A record represents a stretch of unambiguous\n\t// characters in one of the input sequences.\n\tEList<RefRecord> szs(MISC_CAT);\n\tstd::pair<size_t, size_t> sztot;\n\t{\n\t\tif(verbose) cout << \"Reading reference sizes\" << endl;\n\t\tTimer _t(cout, \"  Time reading reference sizes: \", verbose);\n        sztot = BitPairReference::szsFromFasta(is, string(), bigEndian, refparams, szs, sanityCheck);\n\t}\n\tif(justRef) return;\n\tassert_gt(sztot.first, 0);\n\tassert_gt(sztot.second, 0);\n\tassert_gt(szs.size(), 0);\n\t// Construct index from input strings and parameters\n\tfilesWritten.push_back(outfile + \".1.\" + gEbwt_ext);\n\tfilesWritten.push_back(outfile + \".2.\" + gEbwt_ext);\n    filesWritten.push_back(outfile + \".3.\" + gEbwt_ext);\n\tTStr s;\n\tEbwt<TIndexOffU> ebwt(\n                          s,\n                          packed,\n                          0,\n                          1,  // TODO: maybe not?\n                          lineRate,\n                          offRate,      // suffix-array sampling rate\n                          ftabChars,    // number of chars in initial arrow-pair calc\n                          outfile,      // basename for .?.ebwt files\n                          reverse == 0, // fw\n                          !entireSA,    // useBlockwise\n                          bmax,         // block size for blockwise SA builder\n                          bmaxMultSqrt, // block size as 
multiplier of sqrt(len)\n                          bmaxDivN,     // block size as divisor of len\n                          noDc? 0 : dcv,// difference-cover period\n                          nthreads,\n                          is,           // list of input streams\n                          szs,          // list of reference sizes\n                          (TIndexOffU)sztot.first,  // total size of all unambiguous ref chars\n                          conversion_table_fname,\n                          taxonomy_fname,\n                          name_table_fname,\n                          size_table_fname,\n                          refparams,    // reference read-in parameters\n                          seed,         // pseudo-random number generator seed\n                          -1,           // override offRate\n                          doSaFile,     // make a file with just the suffix array in it\n                          doBwtFile,    // make a file with just the BWT string in it\n                          kmer_count,   // Count the number of distinct k-mers if non-zero\n                          verbose,      // be talkative\n                          autoMem,      // pass exceptions up to the toplevel so that we can adjust memory settings automatically\n                          sanityCheck); // verify results and internal consistency\n\t// Note that the Ebwt is *not* resident in memory at this time.  To\n\t// load it into memory, call ebwt.loadIntoMemory()\n\tif(verbose) {\n\t\t// Print Ebwt's vital stats\n\t\tebwt.eh().print(cout);\n\t}\n\tif(sanityCheck) {\n\t\t// Try restoring the original string (if there were\n\t\t// multiple texts, what we'll get back is the joined,\n\t\t// padded string, not a list)\n\t\tebwt.loadIntoMemory(\n                            0,\n                            reverse ? 
(refparams.reverse == REF_READ_REVERSE) : 0,\n                            true,  // load SA sample?\n                            true,  // load ftab?\n                            true,  // load rstarts?\n                            false,\n                            false);\n\t\tSString<char> s2;\n\t\tebwt.restore(s2);\n\t\tebwt.evictFromMemory();\n\t\t{\n\t\t\tSString<char> joinedss = Ebwt<TIndexOffU>::join<SString<char> >(\n                                                                            is,          // list of input streams\n                                                                            szs,         // list of reference sizes\n                                                                            (TIndexOffU)sztot.first, // total size of all unambiguous ref chars\n                                                                            refparams,   // reference read-in parameters\n                                                                            seed);       // pseudo-random number generator seed\n\t\t\tif(refparams.reverse == REF_READ_REVERSE) {\n\t\t\t\tjoinedss.reverse();\n\t\t\t}\n\t\t\tassert_eq(joinedss.length(), s2.length());\n\t\t\tassert(sstr_eq(joinedss, s2));\n\t\t}\n\t\tif(verbose) {\n\t\t\tif(s2.length() < 1000) {\n\t\t\t\tcout << \"Passed restore check: \" << s2.toZBuf() << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"Passed restore check: (\" << s2.length() << \" chars)\" << endl;\n\t\t\t}\n\t\t}\n\t}\n}\n\nstatic const char *argv0 = NULL;\n\nextern \"C\" {\n/**\n * main function.  
Parses command-line arguments.\n */\nint centrifuge_build(int argc, const char **argv) {\n\tstring outfile;\n\ttry {\n\t\t// Reset all global state, including getopt state\n\t\topterr = optind = 1;\n\t\tresetOptions();\n\n\t\tstring infile;\n\t\tEList<string> infiles(MISC_CAT);\n\n\t\tparseOptions(argc, argv);\n\t\targv0 = argv[0];\n\t\tif(showVersion) {\n\t\t\tcout << argv0 << \" version \" << string(CENTRIFUGE_VERSION).c_str() << endl;\n\t\t\tif(sizeof(void*) == 4) {\n\t\t\t\tcout << \"32-bit\" << endl;\n\t\t\t} else if(sizeof(void*) == 8) {\n\t\t\t\tcout << \"64-bit\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"Neither 32- nor 64-bit: sizeof(void*) = \" << sizeof(void*) << endl;\n\t\t\t}\n\t\t\tcout << \"Built on \" << BUILD_HOST << endl;\n\t\t\tcout << BUILD_TIME << endl;\n\t\t\tcout << \"Compiler: \" << COMPILER_VERSION << endl;\n\t\t\tcout << \"Options: \" << COMPILER_OPTIONS << endl;\n\t\t\tcout << \"Sizeof {int, long, long long, void*, size_t, off_t}: {\"\n\t\t\t\t << sizeof(int)\n\t\t\t\t << \", \" << sizeof(long) << \", \" << sizeof(long long)\n\t\t\t\t << \", \" << sizeof(void *) << \", \" << sizeof(size_t)\n\t\t\t\t << \", \" << sizeof(off_t) << \"}\" << endl;\n\t\t\treturn 0;\n\t\t}\n\n\t\t// Get input filename\n\t\tif(optind >= argc) {\n\t\t\tcerr << \"No input sequence or sequence file specified!\" << endl;\n\t\t\tprintUsage(cerr);\n\t\t\treturn 1;\n\t\t}\n\t\tinfile = argv[optind++];\n\n\t\t// Get output filename\n\t\tif(optind >= argc) {\n\t\t\tcerr << \"No output file specified!\" << endl;\n\t\t\tprintUsage(cerr);\n\t\t\treturn 1;\n\t\t}\n\t\toutfile = argv[optind++];\n\n\t\ttokenize(infile, \",\", infiles);\n\t\tif(infiles.size() < 1) {\n\t\t\tcerr << \"Tokenized input file list was empty!\" << endl;\n\t\t\tprintUsage(cerr);\n\t\t\treturn 1;\n\t\t}\n        \n        if(conversion_table_fname == \"\") {\n            cerr << \"Please specify --conversion-table!\" << endl;\n            printUsage(cerr);\n            return 1;\n        }\n        \n 
       if(taxonomy_fname == \"\") {\n            cerr << \"Please specify --taxonomy-tree!\" << endl;\n            printUsage(cerr);\n            return 1;\n        }\n        \n        if(name_table_fname == \"\") {\n            cerr << \"Please specify --name-table!\" << endl;\n            printUsage(cerr);\n            return 1;\n        }\n\n\t\t// Optionally summarize\n\t\tif(verbose) {\n\t\t\tcout << \"Settings:\" << endl\n\t\t\t\t << \"  Output files: \\\"\" << outfile.c_str() << \".*.\" << gEbwt_ext << \"\\\"\" << endl\n\t\t\t\t << \"  Line rate: \" << lineRate << \" (line is \" << (1<<lineRate) << \" bytes)\" << endl\n\t\t\t\t << \"  Lines per side: \" << linesPerSide << \" (side is \" << ((1<<lineRate)*linesPerSide) << \" bytes)\" << endl\n\t\t\t\t << \"  Offset rate: \" << offRate << \" (one in \" << (1<<offRate) << \")\" << endl\n\t\t\t\t << \"  FTable chars: \" << ftabChars << endl\n\t\t\t\t << \"  Strings: \" << (packed? \"packed\" : \"unpacked\") << endl\n                 << \"  Local offset rate: \" << localOffRate << \" (one in \" << (1<<localOffRate) << \")\" << endl\n                 << \"  Local fTable chars: \" << localFtabChars << endl\n\t\t\t\t ;\n\t\t\tif(bmax == OFF_MASK) {\n\t\t\t\tcout << \"  Max bucket size: default\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"  Max bucket size: \" << bmax << endl;\n\t\t\t}\n\t\t\tif(bmaxMultSqrt == OFF_MASK) {\n\t\t\t\tcout << \"  Max bucket size, sqrt multiplier: default\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"  Max bucket size, sqrt multiplier: \" << bmaxMultSqrt << endl;\n\t\t\t}\n\t\t\tif(bmaxDivN == 0xffffffff) {\n\t\t\t\tcout << \"  Max bucket size, len divisor: default\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"  Max bucket size, len divisor: \" << bmaxDivN << endl;\n\t\t\t}\n\t\t\tcout << \"  Difference-cover sample period: \" << dcv << endl;\n\t\t\tcout << \"  Endianness: \" << (bigEndian? 
\"big\":\"little\") << endl\n\t\t\t\t << \"  Actual local endianness: \" << (currentlyBigEndian()? \"big\":\"little\") << endl\n\t\t\t\t << \"  Sanity checking: \" << (sanityCheck? \"enabled\":\"disabled\") << endl;\n\t#ifdef NDEBUG\n\t\t\tcout << \"  Assertions: disabled\" << endl;\n\t#else\n\t\t\tcout << \"  Assertions: enabled\" << endl;\n\t#endif\n\t\t\tcout << \"  Random seed: \" << seed << endl;\n\t\t\tcout << \"  Sizeofs: void*:\" << sizeof(void*) << \", int:\" << sizeof(int) << \", long:\" << sizeof(long) << \", size_t:\" << sizeof(size_t) << endl;\n\t\t\tcout << \"Input files DNA, \" << file_format_names[format].c_str() << \":\" << endl;\n\t\t\tfor(size_t i = 0; i < infiles.size(); i++) {\n\t\t\t\tcout << \"  \" << infiles[i].c_str() << endl;\n\t\t\t}\n\t\t}\n\t\t// Seed random number generator\n\t\tsrand(seed);\n\t\t{\n\t\t\tTimer timer(cout, \"Total time for call to driver() for forward index: \", verbose);\n\t\t\tif(!packed) {\n\t\t\t\ttry {\n\t\t\t\t\tdriver<SString<char> >(\n                                           infile,\n                                           infiles,\n                                           conversion_table_fname,\n                                           taxonomy_fname,\n                                           size_table_fname,\n                                           name_table_fname,\n                                           outfile,\n                                           false,\n                                           REF_READ_FORWARD);\n\t\t\t\t} catch(bad_alloc& e) {\n\t\t\t\t\tif(autoMem) {\n\t\t\t\t\t\tcerr << \"Switching to a packed string representation.\" << endl;\n\t\t\t\t\t\tpacked = true;\n\t\t\t\t\t} else {\n\t\t\t\t\t\tthrow e;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(packed) {\n\t\t\t\tdriver<S2bDnaString>(\n                                     infile,\n                                     infiles,\n                                     conversion_table_fname,\n                          
           taxonomy_fname,\n                                     name_table_fname,\n                                     size_table_fname,\n                                     outfile,\n                                     true,\n                                     REF_READ_FORWARD);\n\t\t\t}\n\t\t}\n#if 0\n\t\tint reverseType = reverseEach ? REF_READ_REVERSE_EACH : REF_READ_REVERSE;\n\t\tsrand(seed);\n\t\tTimer timer(cout, \"Total time for backward call to driver() for mirror index: \", verbose);\n\t\tif(!packed) {\n\t\t\ttry {\n\t\t\t\tdriver<SString<char> >(infile, infiles, outfile + \".rev\", false, reverseType);\n\t\t\t} catch(bad_alloc& e) {\n\t\t\t\tif(autoMem) {\n\t\t\t\t\tcerr << \"Switching to a packed string representation.\" << endl;\n\t\t\t\t\tpacked = true;\n\t\t\t\t} else {\n\t\t\t\t\tthrow e;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(packed) {\n\t\t\tdriver<S2bDnaString>(infile, infiles, outfile + \".rev\", true, reverseType);\n\t\t}\n#endif\n\t\treturn 0;\n\t} catch(std::exception& e) {\n\t\tcerr << \"Error: Encountered exception: '\" << e.what() << \"'\" << endl;\n\t\tcerr << \"Command: \";\n\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\tcerr << endl;\n\t\tdeleteIdxFiles(outfile, writeRef || justRef, justRef);\n\t\treturn 1;\n\t} catch(int e) {\n\t\tif(e != 0) {\n\t\t\tcerr << \"Error: Encountered internal Centrifuge exception (#\" << e << \")\" << endl;\n\t\t\tcerr << \"Command: \";\n\t\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\t\tcerr << endl;\n\t\t}\n\t\tdeleteIdxFiles(outfile, writeRef || justRef, justRef);\n\t\treturn e;\n\t}\n}\n}\n"
  },
  {
    "path": "centrifuge_build_main.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include <fstream>\n#include <string.h>\n#include <stdlib.h>\n#include \"tokenize.h\"\n#include \"ds.h\"\n#include \"mem_ids.h\"\n\nusing namespace std;\n\nextern \"C\" {\n\tint centrifuge_build(int argc, const char **argv);\n}\n\n/**\n * bowtie-build main function.  
It is placed in a separate source file\n * to make it slightly easier to compile as a library.\n *\n * If the user specifies -A <file> as the first two arguments, main\n * will interpret that file as having one set of command-line arguments\n * per line, and will dispatch each batch of arguments one at a time to\n * centrifuge_build().\n */\nint main(int argc, const char **argv) {\n\tif(argc > 2 && strcmp(argv[1], \"-A\") == 0) {\n\t\tconst char *file = argv[2];\n\t\tifstream in;\n\t\tin.open(file);\n\t\tchar buf[4096];\n\t\tint lastret = -1;\n\t\twhile(in.getline(buf, 4095)) {\n\t\t\tEList<string> args(MISC_CAT);\n\t\t\targs.push_back(string(argv[0]));\n\t\t\ttokenize(buf, \" \\t\", args);\n\t\t\t// Skip blank lines before allocating, so nothing is leaked\n\t\t\tif(args.size() == 1) continue;\n\t\t\tconst char **myargs = (const char**)malloc(sizeof(char*)*args.size());\n\t\t\tfor(size_t i = 0; i < args.size(); i++) {\n\t\t\t\tmyargs[i] = args[i].c_str();\n\t\t\t}\n\t\t\tlastret = centrifuge_build((int)args.size(), myargs);\n\t\t\tfree(myargs);\n\t\t}\n\t\tif(lastret == -1) {\n\t\t\tcerr << \"Warning: No arg strings parsed from \" << file << endl;\n\t\t\treturn 0;\n\t\t}\n\t\treturn lastret;\n\t} else {\n\t\treturn centrifuge_build(argc, argv);\n\t}\n}\n"
  },
  {
    "path": "centrifuge_compress.cpp",
    "content": "/*\n * Copyright 2015, Daehwan Kim <infphilo@gmail.com>\n *\n * This file is part of Centrifuge.\n *\n * Centrifuge is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Centrifuge is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Centrifuge.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include <fstream>\n#include <string>\n#include <cassert>\n#include <getopt.h>\n#include \"assert_helpers.h\"\n#include \"endian_swap.h\"\n#include \"bt2_idx.h\"\n#include \"bt2_io.h\"\n#include \"bt2_util.h\"\n#include \"formats.h\"\n#include \"sequence_io.h\"\n#include \"tokenize.h\"\n#include \"timer.h\"\n#include \"ref_read.h\"\n#include \"filebuf.h\"\n#include \"reference.h\"\n#include \"ds.h\"\n#include \"aligner_sw.h\"\n\n/**\n * \\file Driver for the bowtie-build indexing tool.\n */\n\n// Build parameters\nint verbose;\nstatic int sanityCheck;\nstatic int format;\nstatic TIndexOffU bmax;\nstatic TIndexOffU bmaxMultSqrt;\nstatic uint32_t bmaxDivN;\nstatic int dcv;\nstatic int noDc;\nstatic int entireSA;\nstatic int seed;\nstatic int showVersion;\n//   Ebwt parameters\nstatic int32_t lineRate;\nstatic int32_t linesPerSide;\nstatic int32_t offRate;\nstatic int32_t ftabChars;\nstatic int32_t localOffRate;\nstatic int32_t localFtabChars;\nstatic int  bigEndian;\nstatic bool nsToAs;\nstatic bool doSaFile;  // make a file with just the suffix array in it\nstatic bool doBwtFile; // make a file with just the BWT string in it\nstatic bool autoMem;\nstatic bool packed;\nstatic bool 
writeRef;\nstatic bool justRef;\nstatic bool reverseEach;\nstatic string wrapper;\nstatic int across;\nstatic size_t minSimLen;  // minimum similar length\nstatic bool printN;\n\nstatic void resetOptions() {\n\tverbose        = true;  // be talkative (default)\n\tsanityCheck    = 0;     // do slow sanity checks\n\tformat         = FASTA; // input sequence format\n\tbmax           = OFF_MASK; // max blockwise SA bucket size\n\tbmaxMultSqrt   = OFF_MASK; // same, as multplier of sqrt(n)\n\tbmaxDivN       = 4;          // same, as divisor of n\n\tdcv            = 1024;  // bwise SA difference-cover sample sz\n\tnoDc           = 0;     // disable difference-cover sample\n\tentireSA       = 0;     // 1 = disable blockwise SA\n\tseed           = 0;     // srandom seed\n\tshowVersion    = 0;     // just print version and quit?\n\t//   Ebwt parameters\n\tlineRate       = Ebwt<TIndexOffU>::default_lineRate;\n\tlinesPerSide   = 1;  // 1 64-byte line on a side\n\toffRate        = 4;  // sample 1 out of 16 SA elts\n\tftabChars      = 10; // 10 chars in initial lookup table\n    localOffRate   = 3;\n    localFtabChars = 6;\n\tbigEndian      = 0;  // little endian\n\tnsToAs         = false; // convert reference Ns to As prior to indexing\n    doSaFile       = false; // make a file with just the suffix array in it\n    doBwtFile      = false; // make a file with just the BWT string in it\n\tautoMem        = true;  // automatically adjust memory usage parameters\n\tpacked         = false; //\n\twriteRef       = true;  // write compact reference to .3.bt2/.4.bt2\n\tjustRef        = false; // *just* write compact reference, don't index\n\treverseEach    = false;\n    across         = 60; // number of characters across in FASTA output\n    minSimLen      = 100;\n    printN         = false;\n    wrapper.clear();\n}\n\n// Argument constants for getopts\nenum {\n\tARG_BMAX = 
256,\n\tARG_BMAX_MULT,\n\tARG_BMAX_DIV,\n\tARG_DCV,\n\tARG_SEED,\n\tARG_CUTOFF,\n\tARG_PMAP,\n\tARG_NTOA,\n\tARG_USAGE,\n\tARG_REVERSE_EACH,\n    ARG_SA,\n\tARG_WRAPPER,\n    ARG_LOCAL_OFFRATE,\n    ARG_LOCAL_FTABCHARS,\n    ARG_MIN_SIMLEN,\n    ARG_PRINTN,\n};\n\n/**\n * Print a detailed usage message to the provided output stream.\n */\nstatic void printUsage(ostream& out) {\n\tout << \"Centrifuge version \" << string(CENTRIFUGE_VERSION).c_str() << \" by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)\" << endl;\n    \n#ifdef BOWTIE_64BIT_INDEX\n\tstring tool_name = \"hisat-build-l\";\n#else\n\tstring tool_name = \"hisat-build-s\";\n#endif\n\tif(wrapper == \"basic-0\") {\n\t\ttool_name = \"hisat-build\";\n\t}\n    \n\tout << \"Usage: hisat2-build [options]* <reference_in> <bt2_index_base>\" << endl\n\t    << \"    reference_in            comma-separated list of files with ref sequences\" << endl\n\t    << \"    hisat_index_base          write \" << gEbwt_ext << \" data to files with this dir/basename\" << endl\n        << \"Options:\" << endl\n        << \"    -c                      reference sequences given on cmd line (as\" << endl\n        << \"                            <reference_in>)\" << endl;\n    if(wrapper == \"basic-0\") {\n        out << \"    --large-index           force generated index to be 'large', even if ref\" << endl\n\t\t<< \"                            has fewer than 4 billion nucleotides\" << endl;\n\t}\n    out << \"    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting\" << endl\n\t    << \"    -p/--packed             use packed strings internally; slower, uses less mem\" << endl\n\t    << \"    --bmax <int>            max bucket sz for blockwise suffix-array builder\" << endl\n\t    << \"    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)\" << endl\n\t    << \"    --dcv <int>             diff-cover period for blockwise (default: 1024)\" << endl\n\t    << \"    
--nodc                  disable diff-cover (algorithm becomes quadratic)\" << endl\n\t    << \"    -r/--noref              don't build .3/.4.bt2 (packed reference) portion\" << endl\n\t    << \"    -3/--justref            just build .3/.4.bt2 (packed reference) portion\" << endl\n\t    << \"    -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 4)\" << endl\n\t    << \"    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)\" << endl\n        << \"    --localoffrate <int>    SA (local) is sampled every 2^offRate BWT chars (default: 3)\" << endl\n        << \"    --localftabchars <int>  # of chars consumed in initial lookup in a local index (default: 6)\" << endl\n\t    << \"    --seed <int>            seed for random number generator\" << endl\n\t    << \"    -q/--quiet              disable verbose output (verbose is on by default)\" << endl\n        << \"    --printN                print original sequence with mask\" << endl\n\t    << \"    -h/--help               print detailed description of tool and its options\" << endl\n\t    << \"    --usage                 print this usage message\" << endl\n\t    << \"    --version               print version information and quit\" << endl\n\t    ;\n    \n    if(wrapper.empty()) {\n\t\tcerr << endl\n        << \"*** Warning ***\" << endl\n        << \"'\" << tool_name << \"' was run directly.  
It is recommended \"\n        << \"that you run the wrapper script 'bowtie2-build' instead.\"\n        << endl << endl;\n\t}\n}\n\nstatic const char *short_options = \"qraph?nscfl:i:o:t:h:3C\";\n\nstatic struct option long_options[] = {\n\t{(char*)\"quiet\",          no_argument,       0,            'q'},\n\t{(char*)\"sanity\",         no_argument,       0,            's'},\n\t{(char*)\"packed\",         no_argument,       0,            'p'},\n\t{(char*)\"little\",         no_argument,       &bigEndian,   0},\n\t{(char*)\"big\",            no_argument,       &bigEndian,   1},\n\t{(char*)\"bmax\",           required_argument, 0,            ARG_BMAX},\n\t{(char*)\"bmaxmultsqrt\",   required_argument, 0,            ARG_BMAX_MULT},\n\t{(char*)\"bmaxdivn\",       required_argument, 0,            ARG_BMAX_DIV},\n\t{(char*)\"dcv\",            required_argument, 0,            ARG_DCV},\n\t{(char*)\"nodc\",           no_argument,       &noDc,        1},\n\t{(char*)\"seed\",           required_argument, 0,            ARG_SEED},\n\t{(char*)\"entiresa\",       no_argument,       &entireSA,    1},\n\t{(char*)\"version\",        no_argument,       &showVersion, 1},\n\t{(char*)\"noauto\",         no_argument,       0,            'a'},\n\t{(char*)\"noblocks\",       required_argument, 0,            'n'},\n\t{(char*)\"linerate\",       required_argument, 0,            'l'},\n\t{(char*)\"linesperside\",   required_argument, 0,            'i'},\n\t{(char*)\"offrate\",        required_argument, 0,            'o'},\n\t{(char*)\"ftabchars\",      required_argument, 0,            't'},\n    {(char*)\"localoffrate\",   required_argument, 0,            ARG_LOCAL_OFFRATE},\n\t{(char*)\"localftabchars\", required_argument, 0,            ARG_LOCAL_FTABCHARS},\n\t{(char*)\"help\",           no_argument,       0,            'h'},\n\t{(char*)\"ntoa\",           no_argument,       0,            ARG_NTOA},\n\t{(char*)\"justref\",        no_argument,       0,            '3'},\n\t{(char*)\"noref\",  
        no_argument,       0,            'r'},\n\t{(char*)\"color\",          no_argument,       0,            'C'},\n    {(char*)\"sa\",             no_argument,       0,            ARG_SA},\n\t{(char*)\"reverse-each\",   no_argument,       0,            ARG_REVERSE_EACH},\n    {(char*)\"min-simlen\",     required_argument, 0,            ARG_MIN_SIMLEN},\n    {(char*)\"printN\",         no_argument,       0,            ARG_PRINTN},\n\t{(char*)\"usage\",          no_argument,       0,            ARG_USAGE},\n    {(char*)\"wrapper\",        required_argument, 0,            ARG_WRAPPER},\n\t{(char*)0, 0, 0, 0} // terminator\n};\n\n/**\n * Parse an int out of optarg and enforce that it be at least 'lower';\n * if it is less than 'lower', then output the given error message and\n * exit with an error and a usage message.\n */\ntemplate<typename T>\nstatic int parseNumber(T lower, const char *errmsg) {\n\tchar *endPtr= NULL;\n\tT t = (T)strtoll(optarg, &endPtr, 10);\n\t// strtoll leaves endPtr == optarg when no digits were consumed\n\tif (endPtr != NULL && endPtr != optarg) {\n\t\tif (t < lower) {\n\t\t\tcerr << errmsg << endl;\n\t\t\tprintUsage(cerr);\n\t\t\tthrow 1;\n\t\t}\n\t\treturn t;\n\t}\n\tcerr << errmsg << endl;\n\tprintUsage(cerr);\n\tthrow 1;\n\treturn -1;\n}\n\n/**\n * Read command-line arguments\n */\nstatic void parseOptions(int argc, const char **argv) {\n\tint option_index = 0;\n\tint next_option;\n\tdo {\n\t\tnext_option = getopt_long(\n\t\t\targc, const_cast<char**>(argv),\n\t\t\tshort_options, long_options, &option_index);\n\t\tswitch (next_option) {\n            case ARG_WRAPPER:\n\t\t\t\twrapper = optarg;\n\t\t\t\tbreak;\n\t\t\tcase 'f': format = FASTA; break;\n\t\t\tcase 'c': format = CMDLINE; break;\n\t\t\tcase 'p': packed = true; break;\n\t\t\tcase 'C':\n\t\t\t\tcerr << \"Error: -C specified but Bowtie 2 does not support colorspace input.\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t\tbreak;\n\t\t\tcase 'l':\n\t\t\t\tlineRate = parseNumber<int>(3, \"-l/--lineRate arg must be at least 3\");\n\t\t\t\tbreak;\n\t\t\tcase 
'i':\n\t\t\t\tlinesPerSide = parseNumber<int>(1, \"-i/--linesPerSide arg must be at least 1\");\n\t\t\t\tbreak;\n\t\t\tcase 'o':\n\t\t\t\toffRate = parseNumber<int>(0, \"-o/--offRate arg must be at least 0\");\n\t\t\t\tbreak;\n            case ARG_LOCAL_OFFRATE:\n                localOffRate = parseNumber<int>(0, \"-o/--localoffrate arg must be at least 0\");\n                break;\n\t\t\tcase '3':\n\t\t\t\tjustRef = true;\n\t\t\t\tbreak;\n\t\t\tcase 't':\n\t\t\t\tftabChars = parseNumber<int>(1, \"-t/--ftabChars arg must be at least 1\");\n\t\t\t\tbreak;\n            case ARG_LOCAL_FTABCHARS:\n\t\t\t\tlocalFtabChars = parseNumber<int>(1, \"-t/--localftabchars arg must be at least 1\");\n\t\t\t\tbreak;\n\t\t\tcase 'n':\n\t\t\t\t// all f-s is used to mean \"not set\", so put 'e' on end\n\t\t\t\tbmax = 0xfffffffe;\n\t\t\t\tbreak;\n\t\t\tcase 'h':\n\t\t\tcase ARG_USAGE:\n\t\t\t\tprintUsage(cout);\n\t\t\t\tthrow 0;\n\t\t\t\tbreak;\n\t\t\tcase ARG_BMAX:\n\t\t\t\tbmax = parseNumber<TIndexOffU>(1, \"--bmax arg must be at least 1\");\n\t\t\t\tbmaxMultSqrt = OFF_MASK; // don't use multSqrt\n\t\t\t\tbmaxDivN = 0xffffffff;     // don't use multSqrt\n\t\t\t\tbreak;\n\t\t\tcase ARG_BMAX_MULT:\n\t\t\t\tbmaxMultSqrt = parseNumber<TIndexOffU>(1, \"--bmaxmultsqrt arg must be at least 1\");\n\t\t\t\tbmax = OFF_MASK;     // don't use bmax\n\t\t\t\tbmaxDivN = 0xffffffff; // don't use multSqrt\n\t\t\t\tbreak;\n\t\t\tcase ARG_BMAX_DIV:\n\t\t\t\tbmaxDivN = parseNumber<uint32_t>(1, \"--bmaxdivn arg must be at least 1\");\n\t\t\t\tbmax = OFF_MASK;         // don't use bmax\n\t\t\t\tbmaxMultSqrt = OFF_MASK; // don't use multSqrt\n\t\t\t\tbreak;\n\t\t\tcase ARG_DCV:\n\t\t\t\tdcv = parseNumber<int>(3, \"--dcv arg must be at least 3\");\n\t\t\t\tbreak;\n\t\t\tcase ARG_SEED:\n\t\t\t\tseed = parseNumber<int>(0, \"--seed arg must be at least 0\");\n\t\t\t\tbreak;\n\t\t\tcase ARG_REVERSE_EACH:\n\t\t\t\treverseEach = true;\n\t\t\t\tbreak;\n            case ARG_SA:\n                doSaFile = 
true;\n                break;\n\t\t\tcase ARG_NTOA: nsToAs = true; break;\n            case ARG_MIN_SIMLEN:\n                minSimLen = parseNumber<size_t>(2, \"--min-simlen arg must be at least 2\");\n                break;\n            case ARG_PRINTN: printN = true; break;\n\t\t\tcase 'a': autoMem = false; break;\n\t\t\tcase 'q': verbose = false; break;\n\t\t\tcase 's': sanityCheck = true; break;\n\t\t\tcase 'r': writeRef = false; break;\n\n\t\t\tcase -1: /* Done with options. */\n\t\t\t\tbreak;\n\t\t\tcase 0:\n\t\t\t\tif (long_options[option_index].flag != 0)\n\t\t\t\t\tbreak;\n\t\t\tdefault:\n\t\t\t\tprintUsage(cerr);\n\t\t\t\tthrow 1;\n\t\t}\n\t} while(next_option != -1);\n\tif(bmax < 40) {\n\t\tcerr << \"Warning: specified bmax is very small (\" << bmax << \").  This can lead to\" << endl\n\t\t     << \"extremely slow performance and memory exhaustion.  Perhaps you meant to specify\" << endl\n\t\t     << \"a small --bmaxdivn?\" << endl;\n\t}\n}\n\nEList<string> filesWritten;\n\nstatic void print_fasta_record(\n                               ostream& fout,\n                               const string& defline,\n                               const SString<char>& seq,\n                               size_t len)\n{\n    fout << \">\";\n    fout << defline.c_str() << endl;\n    \n    if(across > 0) {\n        size_t i = 0;\n        while (i + across < len)\n        {\n            for(size_t j = 0; j < (unsigned)across; j++) {\n                int base = seq.get(i + j);\n                assert_lt(base, 4);\n                fout << \"ACGTN\"[base];\n            }\n            fout << endl;\n            i += across;\n        }\n        if (i < len) {\n            for(size_t j = i; j < len; j++) {\n                int base = seq.get(j);\n                assert_lt(base, 4);\n                fout << \"ACGTN\"[base];\n            }\n            fout << endl;\n        }\n    } else {\n        for(size_t j = 0; j < len; j++) {\n            int base = seq.get(j);\n       
     assert_lt(base, 4);\n            fout << \"ACGTN\"[base];\n        }\n        fout << endl;\n    }\n}\n\nstruct RegionSimilar {\n    bool fw;\n    size_t pos;\n    uint32_t fw_length; // this includes the base at 'pos'\n    uint32_t bw_length;\n    uint32_t mismatches;\n    uint32_t gaps;\n    \n    void reset() {\n        fw = false;\n        pos = 0;\n        fw_length = bw_length = 0;\n        mismatches = gaps = 0;\n    }\n    \n    bool operator< (const RegionSimilar& o) const {\n        return pos < o.pos;\n    }\n};\n\nstruct Region {\n    size_t pos;\n    uint32_t fw_length;\n    uint32_t bw_length; // used for merging\n    bool   low_complexity;\n    \n    uint32_t match_begin;\n    uint32_t match_end;\n    \n    uint32_t match_size() {\n        assert_leq(match_begin, match_end);\n        return match_end - match_begin;\n    }\n    \n    void reset() {\n        pos = fw_length = bw_length = 0, low_complexity = false;\n        match_begin = match_end = 0;\n    }\n    \n    bool operator< (const Region& o) const {\n        return pos < o.pos;\n    }\n};\n\nstruct RegionToMerge {\n    bool processed;\n    EList<pair<uint32_t, uint32_t> > list;\n    \n    void reset() {\n        processed = false;\n        list.clear();\n    }\n};\n\n/**\n * Drive the index construction process and optionally sanity-check the\n * result.\n */\nstatic void driver(\n\tconst string& fafile,\n\tconst string& safile,\n\tbool packed,\n\tint reverse)\n{\n    EList<FileBuf*> is(MISC_CAT);\n\tbool bisulfite = false;\n\tRefReadInParams refparams(false, reverse, nsToAs, bisulfite);\n    FILE *f = fopen(fafile.c_str(), \"r\");\n    if (f == NULL) {\n        cerr << \"Error: could not open \"<<fafile.c_str() << endl;\n        throw 1;\n    }\n    FileBuf *fb = new FileBuf(f);\n    assert(fb != NULL);\n    if(fb->peek() == -1 || fb->eof()) {\n        cerr << \"Warning: Empty fasta file: '\" << fafile.c_str() << \"'\" << endl;\n        throw 1;\n    }\n    assert(!fb->eof());\n    
assert(fb->get() == '>');\n    ASSERT_ONLY(fb->reset());\n    assert(!fb->eof());\n    is.push_back(fb);\n    if(is.empty()) {\n\t\tcerr << \"Warning: All fasta inputs were empty\" << endl;\n\t\tthrow 1;\n\t}\n\t// Vector for the ordered list of \"records\" comprising the input\n\t// sequences.  A record represents a stretch of unambiguous\n\t// characters in one of the input sequences.\n\tEList<RefRecord> szs(MISC_CAT);\n\tBitPairReference::szsFromFasta(is, string(), bigEndian, refparams, szs, sanityCheck);\n\tassert_gt(szs.size(), 0);\n    \n    EList<string> refnames;\n    \n    SString<char> s;\n    assert_eq(szs.size(), 1);\n    size_t jlen = szs[0].len;\n    try {\n        Timer _t(cerr, \"  (1/5) Time reading reference sequence: \", verbose);\n        \n        s.resize(jlen);\n        RefReadInParams rpcp = refparams;\n        // For each filebuf\n        assert_eq(is.size(), 1);\n        FileBuf *fb = is[0];\n        assert(!fb->eof());\n        // For each *fragment* (not necessarily an entire sequence) we\n        // can pull out of istream l[i]...\n        if(!fb->eof()) {\n            // Push a new name onto our vector\n            refnames.push_back(\"\");\n            TIndexOffU distoff = 0;\n            fastaRefReadAppend(*fb, true, s, distoff, rpcp, &refnames.back());\n        }\n        fb->reset();\n        assert(!fb->eof());\n        // Joined reference sequence now in 's'\n    } catch(bad_alloc& e) {\n        // If we throw an allocation exception in the try block,\n        // that means that the joined version of the reference\n        // string itself is too large to fit in memory.  The only\n        // alternatives are to tell the user to give us more memory\n        // or to try again with a packed representation of the\n        // reference (if we haven't tried that already).\n        cerr << \"Could not allocate space for a joined string of \" << jlen << \" elements.\" << endl;\n        // There's no point passing this exception on.  
The fact\n        // that we couldn't allocate the joined string means that\n        // --bmax is irrelevant - the user should re-run with\n        // ebwt-build-packed\n        if(packed) {\n            cerr << \"Please try running bowtie-build on a computer with more memory.\" << endl;\n        } else {\n            cerr << \"Please try running bowtie-build in packed mode (-p/--packed) or in automatic\" << endl\n            << \"mode (-a/--auto), or try again on a computer with more memory.\" << endl;\n        }\n        if(sizeof(void*) == 4) {\n            cerr << \"If this computer has more than 4 GB of memory, try using a 64-bit executable;\" << endl\n            << \"this executable is 32-bit.\" << endl;\n        }\n        throw 1;\n    }\n    // Successfully obtained joined reference string\n    assert_eq(s.length(), jlen);\n    size_t sense_seq_len = s.length();\n    size_t both_seq_len = sense_seq_len * 2;\n    assert_geq(sense_seq_len, 2);\n    \n    SwAligner sw;\n\n    SimpleFunc scoreMin; scoreMin.init(SIMPLE_FUNC_LINEAR, DEFAULT_MIN_CONST, DEFAULT_MIN_LINEAR);\n    SimpleFunc nCeil; nCeil.init(SIMPLE_FUNC_LINEAR, 0.0f, std::numeric_limits<double>::max(), 2.0f, 0.1f);\n    const int gGapBarrier = 4;\n    Scoring sc(\n               DEFAULT_MATCH_BONUS,          // constant reward for match\n               DEFAULT_MATCH_BONUS_TYPE,     // how to penalize mismatches\n               DEFAULT_MM_PENALTY_MAX,       // max mm penalty\n               DEFAULT_MM_PENALTY_MIN,       // min mm penalty\n               scoreMin,                     // min score as function of read len\n               nCeil,                        // max # Ns as function of read len\n               DEFAULT_N_PENALTY_TYPE,       // how to penalize Ns in the read\n               DEFAULT_N_PENALTY,            // constant if N penalty is a constant\n               DEFAULT_N_CAT_PAIR,           // whether to concat mates before N filtering\n               DEFAULT_READ_GAP_CONST,       // 
constant coeff for read gap cost\n               DEFAULT_REF_GAP_CONST,        // constant coeff for ref gap cost\n               DEFAULT_READ_GAP_LINEAR,      // linear coeff for read gap cost\n               DEFAULT_REF_GAP_LINEAR,       // linear coeff for ref gap cost\n               gGapBarrier);                 // # rows at top/bot only entered diagonally\n    \n    size_t tmp_sense_seq_len = sense_seq_len;\n    size_t min_kmer = 0;\n    while(tmp_sense_seq_len > 0) {\n        tmp_sense_seq_len >>= 2;\n        min_kmer++;\n    }\n    //\n    min_kmer += 4;\n    const size_t min_seed_length = min_kmer * 2;\n    \n    //\n    min_kmer += 10;\n    \n    EList<Region> regions;\n    EList<RegionSimilar> regions_similar;\n    {\n        EList<size_t> sa;\n        EList<uint16_t> prefix_lengths;\n        prefix_lengths.resizeExact(sense_seq_len);\n        prefix_lengths.fillZero();\n        {\n            Timer _t(cerr, \"  (2/5) Time finding seeds using suffix array and reference sequence: \", verbose);\n            \n            ifstream in(safile.c_str(), ios::binary);\n            const size_t sa_size = readIndex<uint64_t>(in, false);\n#if 0\n            for(size_t i = 0; i < sa_size; i++) {\n                size_t pos = (size_t)readIndex<uint64_t>(in, false);\n                if(pos == sa_size) continue;\n                size_t opos = both_seq_len - pos - 1;\n                size_t cpos = min(pos, opos);\n                bool fw = pos == cpos;\n                cout << i << \" \";\n                for(size_t j = 0; j < 20; j++) {\n                    int base = 0;\n                    if(fw) {\n                        if(pos + j >= sense_seq_len) break;\n                        base = s[pos+j];\n                    } else {\n                        if(cpos < j) break;\n                        base = 3 - s[cpos-j];\n                    }\n                    cout << \"ACGT\"[base];\n                }\n                cout << \" \" << pos << endl;\n            }\n 
           exit(1);\n#endif\n            \n            assert_eq(both_seq_len + 1, sa_size);\n            size_t sa_begin = 0, sa_end = 0;\n            \n            // Compress sequences by removing redundant sub-sequences\n            size_t last_i1 = 0;\n            for(size_t i1 = 0; i1 < sa_size - 1; i1++) {\n                // daehwan - for debugging purposes\n                if((i1 + 1) % 1000000 == 0) {\n                    cerr << \"\\t\\t\" << (i1 + 1) / 1000000 << \" million\" << endl;\n                }\n                if(i1 >= sa_end) {\n                    assert_leq(sa_begin, sa_end);\n                    assert_eq(i1, sa_end);\n                    size_t sa_pos = (size_t)readIndex<uint64_t>(in, false);\n                    sa.push_back(sa_pos);\n                    sa_end++;\n                    assert_eq(sa_end - sa_begin, sa.size());\n                }\n                \n                assert_geq(i1, sa_begin); assert_lt(i1, sa_end);\n                size_t pos1 = sa[i1-sa_begin];\n                \n                if(pos1 == both_seq_len) continue;\n                if(pos1 + min_seed_length >= sense_seq_len) continue;\n                \n                // Compare with the following sequences\n                bool expanded = false;\n                size_t i2 = last_i1 + 1;\n                for(; i2 < sa_size; i2++) {\n                    if(i1 == i2) continue;\n                    if(i2 >= sa_end) {\n                        assert_leq(sa_begin, sa_end);\n                        assert_eq(i2, sa_end);\n                        size_t sa_pos = (size_t)readIndex<uint64_t>(in, false);\n                        sa.push_back(sa_pos);\n                        sa_end++;\n                        assert_eq(sa_end - sa_begin, sa.size());\n                    }\n                    \n                    assert_geq(i2, sa_begin); assert_lt(i2, sa_end);\n                    size_t pos2 = sa[i2-sa_begin];\n                    if(pos2 == both_seq_len) continue;\n 
                   // opos2 is relative pos of pos2 on the other strand\n                    size_t opos2 = both_seq_len - pos2 - 1;\n                    // cpos2 is canonical pos on the sense strand\n                    size_t cpos2 = min(pos2, opos2);\n                    bool fw = pos2 == cpos2;\n                    if(fw) {\n                        if(pos2 + min_kmer > sense_seq_len) continue;\n                    } else {\n                        if(pos2 + min_kmer > both_seq_len) continue;\n                    }\n                    \n                    size_t j1 = 0; // includes the base at 'pos1'\n                    while(pos1 + j1 < sense_seq_len && pos2 + j1 < (fw ? sense_seq_len : both_seq_len)) {\n                        if(!fw) {\n                            if(pos1 < cpos2 && pos1 + (j1 * 2) >= cpos2) break;\n                        }\n                        int base1 = s[pos1 + j1];\n                        int base2;\n                        if(fw) {\n                            base2 = s[pos2 + j1];\n                        } else {\n                            assert_geq(cpos2, j1);\n                            base2 = 3 - s[cpos2 - j1];\n                        }\n                        if(base1 > 3 || base2 > 3) break;\n                        if(base1 != base2) break;\n                        j1++;\n                    }\n                    if(j1 < min_kmer) {\n                        if(i2 > i1) break;\n                        else continue;\n                    }\n                    \n                    size_t j2 = 0; // doesn't include the base at 'pos1'\n                    while(j2 <= pos1 && (fw ? 
0 : sense_seq_len) + j2 <= pos2) {\n                        if(!fw) {\n                            if(cpos2 < pos1 && cpos2 + (j2 * 2) >= pos1) break;\n                        }\n                        int base1 = s[pos1 - j2];\n                        int base2;\n                        if(fw) {\n                            base2 = s[pos2 - j2];\n                        } else {\n                            assert_lt(cpos2 + j2, s.length());\n                            base2 = 3 - s[cpos2 + j2];\n                        }\n                        if(base1 > 3 || base2 > 3) break;\n                        if(base1 != base2) break;\n                        j2++;\n                    }\n                    if(j2 > 0) j2--;\n                    \n                    size_t j = j1 + j2;\n                    \n                    // Do not proceed if two sequences are not similar\n                    if(j < min_seed_length) continue;\n                    \n                    assert_leq(pos1 + j1, prefix_lengths.size());\n                    if(!expanded && j1 <= prefix_lengths[pos1]) continue;\n                    \n                    if(!expanded) {\n                        regions.expand();\n                        regions.back().reset();\n                        regions.back().pos = pos1;\n                        regions.back().fw_length = j1;\n                        regions.back().match_begin = regions.back().match_end = regions_similar.size();\n                        for(size_t k = 0; k < j1; k++) {\n                            size_t tmp_length = j1 - k;\n                            if(prefix_lengths[pos1 + k] < tmp_length) {\n                                prefix_lengths[pos1 + k] = tmp_length;\n                            }\n                        }\n                        expanded = true;\n                    }\n                    \n                    if(regions.back().fw_length < j1) {\n                        regions.back().fw_length = j1;\n        
            }\n                    \n                    regions_similar.expand();\n                    regions_similar.back().reset();\n                    regions_similar.back().fw = fw;\n                    regions_similar.back().pos = cpos2;\n                    if(fw) {\n                        regions_similar.back().fw_length = j1;\n                        regions_similar.back().bw_length = j2;\n                    } else {\n                        regions_similar.back().fw_length = j1 > 0 ? j2 + 1 : 0;\n                        regions_similar.back().bw_length = j1 > 0 ? j1 - 1 : 0;\n                    }\n                    \n                    regions.back().match_end = regions_similar.size();\n                    if(regions.back().match_size() >= 20) break;\n                }\n                \n                last_i1 = i1;\n                assert_geq(last_i1, sa_begin);\n                if(last_i1 >= sa_begin + 1024) {\n                    assert_lt(last_i1, sa_end);\n                    sa.erase(0, last_i1 - sa_begin);\n                    sa_begin = last_i1;                    \n                    assert_eq(sa_end - sa_begin, sa.size());\n                }\n                \n                if(expanded) {\n                    assert_gt(regions.size(), 0);\n                    assert_lt(regions.back().match_begin, regions.back().match_end);\n                    assert_eq(regions.back().match_end, regions_similar.size());\n                    if(regions.back().match_size() > 1) {\n                        regions_similar.sortPortion(regions.back().match_begin, regions.back().match_size());\n                        size_t cur_pos = regions.back().match_begin + 1;\n                        for(size_t i = regions.back().match_begin + 1; i < regions.back().match_end; i++) {\n                            assert_gt(cur_pos, 0);\n                            const RegionSimilar& last_region = regions_similar[cur_pos-1];\n                            const 
RegionSimilar& new_region = regions_similar[i];\n                            if(last_region.fw == new_region.fw) {\n                                if(last_region.fw) {\n                                    if(last_region.pos + last_region.fw_length >= new_region.pos) {\n                                        continue;\n                                    }\n                                } else {\n                                    if(last_region.pos + last_region.fw_length >= new_region.pos) {\n                                        regions_similar[cur_pos-1] = new_region;\n                                        continue;\n                                    }\n                                }\n                            }\n                            if(cur_pos != i) {\n                                assert_lt(cur_pos, regions_similar.size());\n                                regions_similar[cur_pos] = new_region;\n                            }\n                            cur_pos++;\n                        }\n                        if(cur_pos < regions.back().match_end) {\n                            regions.back().low_complexity = true;\n                        }\n                        regions_similar.resize(cur_pos);\n                        regions.back().match_end = regions_similar.size();\n                    }\n                }\n                \n                // daehwan - for debugging purposes\n#if 1\n                assert_lt(i1, i2);\n                if(i1 + 8 < i2) {\n                    i1 = i1 + (i2 - i1) / 2;\n                }\n#endif\n            }\n        }\n        \n        {\n            Timer _t(cerr, \"  (3/5) Time sorting seeds and then removing redundant seeds: \", verbose);\n            regions.sort();\n            \n            if(regions.size() > 1) {\n                size_t cur_pos = 1;\n                for(size_t i = 1; i < regions.size(); i++) {\n                    assert_gt(cur_pos, 0);\n                    const 
Region& last_region = regions[cur_pos-1];\n                    const Region& new_region = regions[i];\n                    if(last_region.low_complexity && last_region.pos + last_region.fw_length > new_region.pos) continue;\n                    if(last_region.pos + last_region.fw_length >= new_region.pos + new_region.fw_length) continue;\n                    if(cur_pos != i) {\n                        assert_lt(cur_pos, regions.size());\n                        regions[cur_pos] = new_region;\n                    }\n                    cur_pos++;\n                }\n                regions.resizeExact(cur_pos);\n            }\n        }\n    }\n\n    // Print regions\n#if 0\n    cout << \"no. of regions: \" << regions.size() << endl << endl;\n    for(size_t i = 0; i < regions.size(); i++) {\n        const Region& region = regions[i];\n        cout << \"At \" << region.pos << \"\\t\" << region.fw_length << \" bps\" << endl;\n        for(size_t j = 0; j < region.hits.size(); j++) {\n            const RegionSimilar& region2 = region.hits[j];\n            cout << \"\\t\" << (region2.fw ? 
\"+\" : \"-\") << \"\\tat \" << region2.pos\n            << \"\\t-\" << region2.bw_length\n            << \"\\t+\" << region2.fw_length << endl;\n        }\n        cout << endl << endl;\n    }\n#endif\n    \n    const size_t min_sim_length = minSimLen;\n    \n    {\n        Timer _t(cerr, \"  (4/5) Time merging seeds and masking sequence: \", verbose);\n        \n        EList<uint8_t> mask;\n        mask.resizeExact(sense_seq_len);\n        mask.fillZero();\n        \n        EList<RegionToMerge> merge_list;\n        for(size_t i = 0; i < regions.size(); i++) {\n            const Region& region = regions[i];\n            if(i == 0) {\n                for(size_t j = region.match_begin; j < region.match_end; j++) {\n                    merge_list.expand();\n                    merge_list.back().reset();\n                    merge_list.back().list.expand();\n                    merge_list.back().list.back().first = i;\n                    merge_list.back().list.back().second = j;\n                }\n            } else {\n                assert_gt(i, 0);\n                for(size_t j = region.match_begin; j < region.match_end; j++) {\n                    assert_lt(j, regions_similar.size());\n                    const RegionSimilar& cmp_region = regions_similar[j];\n                    bool added = false;\n                    for(size_t k = 0; k < merge_list.size(); k++) {\n                        RegionToMerge& merge = merge_list[k];\n                        uint32_t region_id1 = merge.list.back().first;\n                        if(region_id1 >= i) break;\n                        uint32_t region_id2 = merge.list.back().second;\n                        assert_lt(region_id1, regions.size());\n                        const Region& prev_region = regions[region_id1];\n                        assert_lt(region_id2, regions_similar.size());\n                        \n                        assert_lt(prev_region.pos, region.pos);\n                        size_t gap = 
region.pos - prev_region.pos;\n                        \n                        const RegionSimilar& prev_cmp_region = regions_similar[region_id2];\n                        if(prev_cmp_region.fw != cmp_region.fw) continue;\n                        if(prev_cmp_region.pos + cmp_region.bw_length == cmp_region.pos + prev_cmp_region.bw_length &&\n                           prev_cmp_region.pos + prev_cmp_region.fw_length == cmp_region.pos + cmp_region.fw_length)\n                            continue;\n                        \n                        if(!cmp_region.fw) {\n                            if(cmp_region.pos >= prev_region.pos && cmp_region.pos < region.pos) continue;\n                        }\n                        \n                        size_t cmp_gap = 0;\n                        if(cmp_region.fw) {\n                            if(prev_cmp_region.pos >= cmp_region.pos) continue;\n                            cmp_gap = cmp_region.pos - prev_cmp_region.pos;\n                        } else {\n                            if(prev_cmp_region.pos <= cmp_region.pos) continue;\n                            cmp_gap = prev_cmp_region.pos - cmp_region.pos;\n                        }\n                        if(cmp_gap + 10 < gap || gap + 10 < cmp_gap) continue;\n                        \n                        if(prev_region.fw_length + 200 < gap) continue;\n                        if(cmp_region.fw) {\n                            if(prev_cmp_region.fw_length + 200 < cmp_gap) continue;\n                        } else {\n                            if(cmp_region.fw_length + 200 < cmp_gap) continue;\n                        }\n                        \n                        added = true;\n                        merge.list.expand();\n                        merge.list.back().first = i;\n                        merge.list.back().second = j;\n                    }\n                    \n                    if(!added) {\n                        added = true;\n          
              merge_list.expand();\n                        merge_list.back().reset();\n                        merge_list.back().list.expand();\n                        merge_list.back().list.back().first = i;\n                        merge_list.back().list.back().second = j;\n                    }\n                }\n            }\n            \n            for(size_t j = 0; j < merge_list.size(); j++) {\n                RegionToMerge& merge = merge_list[j];\n                uint32_t region_id1 = merge.list.back().first;\n                if(i + 1 < regions.size()) {\n                    if(region_id1 == i) continue;\n                    assert_lt(region_id1, regions.size());\n                    const Region& prev_region = regions[region_id1];\n                    if(prev_region.pos + 200 > region.pos) continue;\n                }\n                merge_list[j].processed = true;\n                \n#if 0\n                bool skip_merge = true;\n                for(size_t k = 0; k < merge.list.size(); k++) {\n                    uint32_t region_id1 = merge.list[k].first;\n                    uint32_t region_id2 = merge.list[k].second;\n                    assert_lt(region_id1, regions.size());\n                    const Region& region = regions[region_id1];\n                    assert_lt(region_id2, region.hits.size());\n                    const RegionSimilar& sim_region = region.hits[region_id2];\n                    assert_lt(region.pos, mask.size()); assert_lt(sim_region.pos, mask.size());\n                    if(mask[region.pos] == 0 || mask[sim_region.pos] == 0) {\n                        skip_merge = false;\n                        break;\n                    }\n                }\n                if(skip_merge) continue;\n#endif\n                \n#if 0\n                bool output_merge = merge.list.size() > 1;\n                if(!output_merge) {\n                    assert_gt(merge.list.size(), 0);\n                    uint32_t region_id2 = 
merge.list[0].second;\n                    assert_lt(region_id2, regions_similar.size());\n                    const RegionSimilar& sim_region = regions_similar[region_id2];\n                    if(sim_region.bw_length + sim_region.fw_length >= min_sim_length) {\n                        output_merge = true;\n                    }\n                }\n                if(output_merge) {\n                    cout << endl << \":\" << endl;\n                    for(size_t k = 0; k < merge.list.size(); k++) {\n                        uint32_t region_id1 = merge.list[k].first;\n                        uint32_t region_id2 = merge.list[k].second;\n                        assert_lt(region_id1, regions.size());\n                        const Region& region = regions[region_id1];\n                        assert_lt(region_id2, regions_similar.size());\n                        const RegionSimilar& sim_region = regions_similar[region_id2];\n                        \n                        cout << \"\\t\";\n                        cout << k << \") at \" << region.pos << \"\\t\" << region.fw_length << \" bps\\t\"\n                        << (sim_region.fw ? 
\"+\" : \"-\") << \"\\tat \" << sim_region.pos << \"\\t-\" << sim_region.bw_length << \"\\t+\" << sim_region.fw_length << endl;\n                    }\n                    cout << endl << endl;\n                }\n#endif\n                \n                assert_gt(merge.list.size(), 0);\n                for(size_t k = 0; k < merge.list.size(); k++) {\n                    uint32_t region_id1 = merge.list[k].first;\n                    assert_lt(region_id1, regions.size());\n                    Region& region1 = regions[region_id1];\n                    \n                    uint32_t cmp_region_id1 = merge.list[k].second;\n                    assert_lt(cmp_region_id1, regions_similar.size());\n                    const RegionSimilar& cmp_region1 = regions_similar[cmp_region_id1];\n                    \n                    if(cmp_region1.fw) {\n                        region1.fw_length = cmp_region1.fw_length;\n                        region1.bw_length = cmp_region1.bw_length;\n                    } else {\n                        region1.fw_length = cmp_region1.fw_length > 0 ? cmp_region1.bw_length + 1 : 0;\n                        region1.bw_length = cmp_region1.fw_length > 0 ? 
cmp_region1.fw_length - 1 : 0;\n                    }\n                }\n                \n                for(size_t k = 0; k < merge.list.size(); k++) {\n                    uint32_t region_id1 = merge.list[k].first;\n                    assert_lt(region_id1, regions.size());\n                    const Region& region1 = regions[region_id1];\n\n                    uint32_t cmp_region_id1 = merge.list[k].second;\n                    assert_lt(cmp_region_id1, regions_similar.size());\n                    const RegionSimilar& cmp_region1 = regions_similar[cmp_region_id1];\n\n                    const bool fw = cmp_region1.fw;\n                    bool combined = false;\n                    if(k + 1 < merge.list.size()) {\n                        uint32_t region_id2 = merge.list[k+1].first;\n                        assert_lt(region_id1, region_id2);\n                        assert_lt(region_id2, regions.size());\n                        Region& region2 = regions[region_id2];\n                        \n                        uint32_t cmp_region_id2 = merge.list[k+1].second;\n                        assert_lt(cmp_region_id2, regions_similar.size());\n                        RegionSimilar& cmp_region2 = regions_similar[cmp_region_id2];\n                        \n                        assert_eq(cmp_region1.fw, cmp_region2.fw);\n                        size_t query_len, left = region1.pos, right = region2.pos, cmp_left, cmp_right;\n                        if(fw) {\n                            assert_lt(cmp_region1.pos, cmp_region2.pos);\n                            query_len = cmp_region2.pos - cmp_region1.pos + cmp_region2.fw_length + cmp_region1.bw_length;\n                            cmp_left = cmp_region1.pos, cmp_right = cmp_region2.pos;\n                            \n                            assert_gt(cmp_region1.fw_length, 0);\n                            left = left + cmp_region1.fw_length - 1;\n                            cmp_left = cmp_left + 
cmp_region1.fw_length - 1;\n                            \n                            assert_geq(right, cmp_region2.bw_length);\n                            right = right - cmp_region2.bw_length;\n                            assert_geq(cmp_right, cmp_region2.bw_length);\n                            cmp_right = cmp_right - cmp_region2.bw_length;\n                            \n                        } else {\n                            assert_lt(cmp_region2.pos, cmp_region1.pos);\n                            query_len = cmp_region1.pos - cmp_region2.pos + cmp_region1.fw_length + cmp_region2.bw_length;\n                            cmp_left = cmp_region2.pos, cmp_right = cmp_region1.pos;\n                            \n                            left = left + cmp_region1.bw_length;\n                            assert_gt(cmp_region2.fw_length, 0);\n                            cmp_left = cmp_left + cmp_region2.fw_length - 1;\n                            \n                            assert_geq(right + 1, cmp_region2.fw_length);\n                            right = right + 1 - cmp_region2.fw_length;\n                            assert_geq(cmp_right, cmp_region1.bw_length);\n                            cmp_right = cmp_right - cmp_region1.bw_length;\n                        }\n                        \n#if 0\n                        cout << \"query length: \" << query_len << endl;\n                        cout << \"left-right: \" << left << \"\\t\" << right << endl;\n                        cout << \"cmp left-right: \" << cmp_left << \"\\t\" << cmp_right << endl;\n#endif\n                        \n                        size_t max_diffs = (query_len + 9) / 10;\n                        if(max_diffs > cmp_region1.mismatches + cmp_region1.gaps) {\n                            max_diffs -= (cmp_region1.mismatches + cmp_region1.gaps);\n                        } else {\n                            max_diffs = 0;\n                        }\n                        \n             
           bool do_swalign = max_diffs > 0;\n                        if(left >= right && cmp_left >= cmp_right) {\n                            combined = true;\n                        } else if(left >= right) {\n                            assert_lt(cmp_left, cmp_right);\n                            size_t gap = cmp_right - cmp_left + 1 + left - right;\n                            if(gap <= max_diffs) {\n                                combined = true;\n                                cmp_region2.gaps += gap;\n                            } else {\n                                do_swalign = false;\n                            }\n                        } else if(cmp_left >= cmp_right) {\n                            assert_lt(left, right);\n                            size_t gap = right - left + 1 + cmp_left - cmp_right;\n                            if(gap <= max_diffs) {\n                                combined = true;\n                                cmp_region2.gaps += gap;\n                            } else {\n                                do_swalign = false;\n                            }\n                        }\n                        /*else if(left + max_diffs >= right && cmp_left + max_diffs >= cmp_right) {\n                            combined = true;\n                        }*/\n                        \n                        if(!combined && do_swalign) {\n                            BTString seq;\n                            BTDnaString cmp_seq;\n                            BTString cmp_qual;\n                            \n                            assert_lt(region1.pos, region2.pos);\n                            for(size_t pos = left; pos <= right; pos++) {\n                                assert_lt(pos, s.length());\n                                seq.append(1 << s[pos]);\n                            }\n                            \n                            for(size_t pos = cmp_left; pos <= cmp_right; pos++) {\n                        
        assert_lt(pos, s.length());\n                                cmp_seq.append(s[pos]);\n                            }\n                            cmp_qual.resize(cmp_seq.length());\n                            cmp_qual.fill('I');\n                            if(!fw) {\n                                cmp_seq.reverseComp();\n                                cmp_qual.reverse();\n                            }\n                            \n                            sw.initRead(cmp_seq, cmp_seq, cmp_qual, cmp_qual, 0, cmp_seq.length(), sc);\n                            \n                            DPRect rect;\n                            rect.refl = rect.refl_pretrim = rect.corel = 0;\n                            rect.refr = rect.refr_pretrim = rect.corer = seq.length();\n                            rect.triml = rect.trimr = 0;\n                            rect.maxgap = 10;\n                            \n                            // Cast before negating so the score is negative even where size_t is 32-bit\n                            TAlScore minsc = -static_cast<TAlScore>(max_diffs) * 6;\n                            if(minsc < 0) {\n                                sw.initRef(\n                                           true, // fw\n                                           0, // refidx\n                                           rect,\n                                           const_cast<char *>(seq.toZBuf()),\n                                           0,\n                                           seq.length(),\n                                           seq.length(),\n                                           sc,\n                                           minsc,\n                                           true, // enable8\n                                           2000, // cminlen\n                                           4, // cpow2\n                                           false, // doTri\n                                           true); // extend\n                                \n                                // Perform dynamic programming\n                       
         RandomSource rnd(seed);\n                                TAlScore bestCell = std::numeric_limits<TAlScore>::min();\n                                if(seq.length() <= 200) {\n                                    combined = sw.align(rnd, bestCell);\n                                }\n#if 0\n                                if(combined) {\n                                    BTDnaString seqstr;\n                                    for(size_t bi = 0; bi < seq.length(); bi++) {\n                                        seqstr.append(firsts5[(int)seq[bi]]);\n                                    }\n                                    cout << seqstr << endl;\n                                    cout << cmp_seq << endl;\n                                    \n                                    SwResult res;\n                                    res.reset();\n                                    sw.nextAlignment(res, minsc, rnd);\n                                    res.alres.ned().reverse();\n                                    cout << \"Succeeded (\" << bestCell << \"): \"; Edit::print(cout, res.alres.ned()); cout << endl;\n                                }\n#endif\n                            }\n                        }\n                        \n                        if(combined) {\n                            assert_lt(region1.pos, region2.pos);\n                            region2.bw_length = region2.pos - region1.pos + region1.bw_length;\n                            if(fw) {\n                                assert_lt(cmp_region1.pos, cmp_region2.pos);\n                                cmp_region2.bw_length = cmp_region2.pos - cmp_region1.pos + cmp_region1.bw_length;\n                            } else {\n                                assert_lt(cmp_region2.pos, cmp_region1.pos);\n                                cmp_region2.fw_length = cmp_region1.pos - cmp_region2.pos + cmp_region1.fw_length;\n                            }\n                        }\n           
         }\n                    \n                    // Mask sequence\n                    if(!combined || k + 1 == merge.list.size()) {\n                        if(cmp_region1.bw_length + cmp_region1.fw_length >= min_sim_length) {\n                            size_t mask_begin = 0, mask_end = 0;\n                            if(region1.pos < cmp_region1.pos) {\n                                assert_leq(cmp_region1.bw_length, cmp_region1.pos);\n                                mask_begin = cmp_region1.pos - cmp_region1.bw_length;\n                                assert_leq(cmp_region1.pos + cmp_region1.fw_length, s.length());\n                                mask_end = cmp_region1.pos + cmp_region1.fw_length;\n                            } else {\n                                assert_gt(region1.pos, cmp_region1.pos);\n                                assert_leq(region1.bw_length, region1.pos);\n                                mask_begin = region1.pos - region1.bw_length;\n                                assert_leq(region1.pos + region1.fw_length, s.length());\n                                mask_end = region1.pos + region1.fw_length;\n                            }\n                            for(size_t mask_pos = mask_begin; mask_pos < mask_end; mask_pos ++) {\n                                assert_lt(mask_pos, mask.size());\n                                mask[mask_pos] = 1;\n                            }\n                        }\n                    }\n                }\n                \n                merge.list.resizeExact(0);\n            }\n            \n            size_t cur_pos = 0;\n            for(size_t j = 0; j < merge_list.size(); j++) {\n                if(merge_list[j].processed) continue;\n                if(j != cur_pos) {\n                    merge_list[cur_pos] = merge_list[j];\n                }\n                cur_pos++;\n            }\n             merge_list.resize(cur_pos);\n        }\n        \n        
assert_eq(merge_list.size(), 0);\n        assert_eq(mask.size(), sense_seq_len);\n        for(size_t i = 0; i < mask.size(); i++){\n            if(mask[i] != 0) {\n                s.set(4, i);\n            }\n        }\n    }\n    \n    // Output compressed sequence\n    const size_t min_seq_len = 31;\n    size_t cur_pos = 0;\n    {\n        Timer _t(cerr, \"  (5/5) Time outputting compressed sequence: \", verbose);\n\n        if(printN) {\n            print_fasta_record(cout, refnames[0], s, s.length());\n        }\n        size_t cur_seq_len = 0;\n        for(size_t i = 0; i < sense_seq_len; i++) {\n            int base = s[i];\n            assert_leq(base, 4);\n            if(base < 4) {\n                s.set(base, cur_pos);\n                cur_pos++;\n                cur_seq_len++;\n            } else {\n                if(cur_seq_len < min_seq_len) {\n                    assert_leq(cur_seq_len, i);\n                    assert_leq(cur_seq_len, cur_pos);\n                    for(size_t j = i - cur_seq_len; j < i; j++) {\n                        assert_lt(s[j], 4);\n                        s.set(4, j);\n                    }\n                    cur_pos -= cur_seq_len;\n                }\n                cur_seq_len = 0;\n            }\n        }\n        if(!printN) {\n            print_fasta_record(cout, refnames[0], s, cur_pos);\n        }\n    }\n    \n    cerr << endl;\n    cerr << \"Compressed: \" << sense_seq_len << \" to \" << cur_pos\n         << \" bps (\" << (sense_seq_len - cur_pos) * 100.0 / sense_seq_len << \"%)\" << endl;\n}\n\nstatic const char *argv0 = NULL;\n\n/**\n * main function.  
Parses command-line arguments.\n */\nint centrifuge_compress(int argc, const char **argv) {\n\tstring outfile;\n\ttry {\n\t\t// Reset all global state, including getopt state\n\t\topterr = optind = 1;\n\t\tresetOptions();\n\n\t\tstring fafile;\n        string safile;\n\t\t\n\t\tparseOptions(argc, argv);\n\t\targv0 = argv[0];\n\t\tif(showVersion) {\n\t\t\tcout << argv0 << \" version \" << string(CENTRIFUGE_VERSION).c_str() << endl;\n\t\t\tif(sizeof(void*) == 4) {\n\t\t\t\tcout << \"32-bit\" << endl;\n\t\t\t} else if(sizeof(void*) == 8) {\n\t\t\t\tcout << \"64-bit\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"Neither 32- nor 64-bit: sizeof(void*) = \" << sizeof(void*) << endl;\n\t\t\t}\n\t\t\tcout << \"Built on \" << BUILD_HOST << endl;\n\t\t\tcout << BUILD_TIME << endl;\n\t\t\tcout << \"Compiler: \" << COMPILER_VERSION << endl;\n\t\t\tcout << \"Options: \" << COMPILER_OPTIONS << endl;\n\t\t\tcout << \"Sizeof {int, long, long long, void*, size_t, off_t}: {\"\n\t\t\t\t << sizeof(int)\n\t\t\t\t << \", \" << sizeof(long) << \", \" << sizeof(long long)\n\t\t\t\t << \", \" << sizeof(void *) << \", \" << sizeof(size_t)\n\t\t\t\t << \", \" << sizeof(off_t) << \"}\" << endl;\n\t\t\treturn 0;\n\t\t}\n\n\t\t// Get input filename\n\t\tif(optind >= argc) {\n\t\t\tcerr << \"No input sequence or sequence file specified!\" << endl;\n\t\t\tprintUsage(cerr);\n\t\t\treturn 1;\n\t\t}\n\t\tfafile = argv[optind++];\n\n\t\t// Get output filename\n\t\tif(optind >= argc) {\n\t\t\tcerr << \"No output file specified!\" << endl;\n\t\t\tprintUsage(cerr);\n\t\t\treturn 1;\n\t\t}\n\t\tsafile = argv[optind++];\n\n\t\t// Optionally summarize\n\t\tif(verbose) {\n#if 0\n\t\t\tcout << \"Settings:\" << endl\n\t\t\t\t << \"  Output files: \\\"\" << outfile.c_str() << \".*.\" << gEbwt_ext << \"\\\"\" << endl\n\t\t\t\t << \"  Line rate: \" << lineRate << \" (line is \" << (1<<lineRate) << \" bytes)\" << endl\n\t\t\t\t << \"  Lines per side: \" << linesPerSide << \" (side is \" << 
((1<<lineRate)*linesPerSide) << \" bytes)\" << endl\n\t\t\t\t << \"  Offset rate: \" << offRate << \" (one in \" << (1<<offRate) << \")\" << endl\n\t\t\t\t << \"  FTable chars: \" << ftabChars << endl\n\t\t\t\t << \"  Strings: \" << (packed? \"packed\" : \"unpacked\") << endl\n                 << \"  Local offset rate: \" << localOffRate << \" (one in \" << (1<<localOffRate) << \")\" << endl\n                 << \"  Local fTable chars: \" << localFtabChars << endl\n\t\t\t\t ;\n\t\t\tif(bmax == OFF_MASK) {\n\t\t\t\tcout << \"  Max bucket size: default\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"  Max bucket size: \" << bmax << endl;\n\t\t\t}\n\t\t\tif(bmaxMultSqrt == OFF_MASK) {\n\t\t\t\tcout << \"  Max bucket size, sqrt multiplier: default\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"  Max bucket size, sqrt multiplier: \" << bmaxMultSqrt << endl;\n\t\t\t}\n\t\t\tif(bmaxDivN == 0xffffffff) {\n\t\t\t\tcout << \"  Max bucket size, len divisor: default\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"  Max bucket size, len divisor: \" << bmaxDivN << endl;\n\t\t\t}\n\t\t\tcout << \"  Difference-cover sample period: \" << dcv << endl;\n\t\t\tcout << \"  Endianness: \" << (bigEndian? \"big\":\"little\") << endl\n\t\t\t\t << \"  Actual local endianness: \" << (currentlyBigEndian()? \"big\":\"little\") << endl\n\t\t\t\t << \"  Sanity checking: \" << (sanityCheck? 
\"enabled\":\"disabled\") << endl;\n\t#ifdef NDEBUG\n\t\t\tcout << \"  Assertions: disabled\" << endl;\n\t#else\n\t\t\tcout << \"  Assertions: enabled\" << endl;\n\t#endif\n\t\t\tcout << \"  Random seed: \" << seed << endl;\n\t\t\tcout << \"  Sizeofs: void*:\" << sizeof(void*) << \", int:\" << sizeof(int) << \", long:\" << sizeof(long) << \", size_t:\" << sizeof(size_t) << endl;\n\t\t\tcout << \"Input files DNA, \" << file_format_names[format].c_str() << \":\" << endl;\n\t\t\tfor(size_t i = 0; i < infiles.size(); i++) {\n\t\t\t\tcout << \"  \" << infiles[i].c_str() << endl;\n\t\t\t}\n#endif\n        }\n\t\t// Seed random number generator\n\t\tsrand(seed);\n\t\t{\n\t\t\ttry {\n                driver(fafile, safile, false, REF_READ_FORWARD);\n            } catch(bad_alloc& e) {\n                if(autoMem) {\n                    cerr << \"Switching to a packed string representation.\" << endl;\n                    packed = true;\n                } else {\n                    throw e;\n                }\n            }\n\t\t}\n\t\treturn 0;\n\t} catch(std::exception& e) {\n\t\tcerr << \"Error: Encountered exception: '\" << e.what() << \"'\" << endl;\n\t\tcerr << \"Command: \";\n\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\tcerr << endl;\n\t\treturn 1;\n\t} catch(int e) {\n\t\tif(e != 0) {\n\t\t\tcerr << \"Error: Encountered internal Centrifuge exception (#\" << e << \")\" << endl;\n\t\t\tcerr << \"Command: \";\n\t\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\t\tcerr << endl;\n\t\t}\n\t\treturn e;\n\t}\n}\n\n/**\n * bowtie-build main function.  
It is placed in a separate source file\n * to make it slightly easier to compile as a library.\n *\n * If the user specifies -A <file> as the first two arguments, main\n * will interpret that file as having one set of command-line arguments\n * per line, and will dispatch each batch of arguments one at a time to\n * centrifuge_compress.\n */\nint main(int argc, const char **argv) {\n    if(argc > 2 && strcmp(argv[1], \"-A\") == 0) {\n        const char *file = argv[2];\n        ifstream in;\n        in.open(file);\n        char buf[4096];\n        int lastret = -1;\n        while(in.getline(buf, 4095)) {\n            EList<string> args(MISC_CAT);\n            args.push_back(string(argv[0]));\n            tokenize(buf, \" \\t\", args);\n            // Skip lines with no arguments before allocating, so nothing leaks\n            if(args.size() == 1) continue;\n            const char **myargs = (const char**)malloc(sizeof(char*)*args.size());\n            for(size_t i = 0; i < args.size(); i++) {\n                myargs[i] = args[i].c_str();\n            }\n            lastret = centrifuge_compress((int)args.size(), myargs);\n            free(myargs);\n        }\n        if(lastret == -1) {\n            cerr << \"Warning: No arg strings parsed from \" << file << endl;\n            return 0;\n        }\n        return lastret;\n    } else {\n        return centrifuge_compress(argc, argv);\n    }\n}\n"
  },
  {
    "path": "centrifuge_inspect.cpp",
    "content": "/*\n * Copyright 2016\n *\n * This file is part of Centrifuge and based on code from Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <string>\n#include <iostream>\n#include <getopt.h>\n#include <stdexcept>\n#include <set>\n\n#include \"assert_helpers.h\"\n#include \"endian_swap.h\"\n#include \"hier_idx.h\"\n#include \"reference.h\"\n#include \"ds.h\"\n#include \"hyperloglogplus.h\"\n\nusing namespace std;\n\nstatic bool showVersion = false; // just print version and quit?\nint verbose             = 0;  // be talkative\nstatic int names_only   = 0;  // just print the sequence names in the index\nstatic int summarize_only = 0; // just print summary of index and quit\nstatic int across       = 60; // number of characters across in FASTA output\nstatic bool refFromEbwt = false; // true -> when printing reference, decode it from Ebwt instead of reading it from BitPairReference\nstatic string wrapper;\nstatic const char *short_options = \"vhnsea:\";\nstatic int conversion_table = 0;\nstatic int taxonomy_tree = 0;\nstatic int name_table = 0;\nstatic int size_table = 0;\nstatic int count_kmers = 0;\n\n//#define TEST_KMER_COUNTING\n\nenum {\n\tARG_VERSION = 256,\n    ARG_WRAPPER,\n\tARG_USAGE,\n    ARG_CONVERSION_TABLE,\n    ARG_TAXONOMY_TREE,\n    ARG_NAME_TABLE,\n    ARG_SIZE_TABLE,\n\tARG_COUNT_KMERS\n};\n\nstatic struct option 
long_options[] = {\n\t{(char*)\"verbose\",  no_argument,        0, 'v'},\n\t{(char*)\"version\",  no_argument,        0, ARG_VERSION},\n\t{(char*)\"usage\",    no_argument,        0, ARG_USAGE},\n\t{(char*)\"names\",    no_argument,        0, 'n'},\n\t{(char*)\"summary\",  no_argument,        0, 's'},\n\t{(char*)\"help\",     no_argument,        0, 'h'},\n\t{(char*)\"across\",   required_argument,  0, 'a'},\n\t{(char*)\"ebwt-ref\", no_argument,        0, 'e'},\n    {(char*)\"wrapper\",  required_argument,  0, ARG_WRAPPER},\n    {(char*)\"conversion-table\", no_argument,  0, ARG_CONVERSION_TABLE},\n    {(char*)\"taxonomy-tree\",    no_argument,  0, ARG_TAXONOMY_TREE},\n    {(char*)\"name-table\",       no_argument,  0, ARG_NAME_TABLE},\n    {(char*)\"size-table\",       no_argument,  0, ARG_SIZE_TABLE},\n\t{(char*)\"estimate-n-kmers\",      no_argument,  0, ARG_COUNT_KMERS},\n\t{(char*)0, 0, 0, 0} // terminator\n};\n\n/**\n * Print a summary usage message to the provided output stream.\n */\nstatic void printUsage(ostream& out) {\n\tout << \"Centrifuge version \" << string(CENTRIFUGE_VERSION).c_str() << \" by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)\" << endl;\n\tout\n\t<< \"Usage: centrifuge-inspect [options]* <cf_base>\" << endl\n\t<< \"  <cf_base>         cf filename minus trailing .1.\" << gEbwt_ext << \"/.2.\" << gEbwt_ext << \"/.3.\" << gEbwt_ext << endl\n\t<< endl\n\t<< \"  By default, prints FASTA records of the indexed nucleotide sequences to\" << endl\n\t<< \"  standard out.  With -n, just prints names.  With -s, just prints a summary of\" << endl\n\t<< \"  the index parameters and sequences.  
With -e, preserves colors if applicable.\" << endl\n\t<< endl\n\t<< \"Options:\" << endl;\n    if(wrapper == \"basic-0\") {\n\t\t//out << \"  --large-index      force inspection of the 'large' index, even if a\" << endl\n\t        //<< \"                     'small' one is present.\" << endl;\n\t}\n\tout << \"  -a/--across <int>  Number of characters across in FASTA output (default: 60)\" << endl\n\t<< \"  -n/--names         Print reference sequence names only\" << endl\n\t<< \"  -s/--summary       Print summary incl. ref names, lengths, index properties\" << endl\n\t<< \"  -e/--ebwt-ref      Reconstruct reference from .\" << gEbwt_ext << \" (slow, preserves colors)\" << endl\n    << \"  --conversion-table Print conversion table\" << endl\n    << \"  --taxonomy-tree    Print taxonomy tree\" << endl\n    << \"  --name-table       Print names corresponding to taxonomic IDs\" << endl\n    << \"  --size-table       Print the lengths of the sequences belonging to the same taxonomic ID\" << endl\n\t<< \"  -v/--verbose       Verbose output (for debugging)\" << endl\n\t<< \"  -h/--help          print detailed description of tool and its options\" << endl\n\t<< \"  --help             print this usage message\" << endl\n\t;\n    if(wrapper.empty()) {\n\t\tcerr << endl\n        << \"*** Warning ***\" << endl\n        << \"'centrifuge-inspect-bin' was run directly.  
It is recommended \"\n        << \"to use the wrapper script instead.\"\n        << endl << endl;\n\t}\n}\n\n/**\n * Parse an int out of optarg and enforce that it be at least 'lower';\n * if it is less than 'lower', then output the given error message and\n * exit with an error and a usage message.\n */\nstatic int parseInt(int lower, const char *errmsg) {\n\tlong l;\n\tchar *endPtr= NULL;\n\tl = strtol(optarg, &endPtr, 10);\n\tif (endPtr != NULL) {\n\t\tif (l < lower) {\n\t\t\tcerr << errmsg << endl;\n\t\t\tprintUsage(cerr);\n\t\t\tthrow 1;\n\t\t}\n\t\treturn (int32_t)l;\n\t}\n\tcerr << errmsg << endl;\n\tprintUsage(cerr);\n\tthrow 1;\n\treturn -1;\n}\n\n/**\n * Read command-line arguments\n */\nstatic void parseOptions(int argc, char **argv) {\n\tint option_index = 0;\n\tint next_option;\n\tdo {\n\t\tnext_option = getopt_long(argc, argv, short_options, long_options, &option_index);\n\t\tswitch (next_option) {\n            case ARG_WRAPPER:\n\t\t\t\twrapper = optarg;\n\t\t\t\tbreak;\n\t\t\tcase ARG_USAGE:\n\t\t\tcase 'h':\n\t\t\t\tprintUsage(cout);\n\t\t\t\tthrow 0;\n\t\t\t\tbreak;\n\t\t\tcase 'v': verbose = true; break;\n\t\t\tcase ARG_VERSION: showVersion = true; break;\n            case ARG_CONVERSION_TABLE:\n                conversion_table = true;\n                break;\n            case ARG_TAXONOMY_TREE:\n                taxonomy_tree = true;\n                break;\n            case ARG_NAME_TABLE:\n                name_table = true;\n                break;\n            case ARG_SIZE_TABLE:\n                size_table = true;\n                break;\n\t\t\tcase ARG_COUNT_KMERS:\n\t\t\t\tcount_kmers = true;\n\t\t\t\tbreak;\n\t\t\tcase 'e': refFromEbwt = true; break;\n\t\t\tcase 'n': names_only = true; break;\n\t\t\tcase 's': summarize_only = true; break;\n\t\t\tcase 'a': across = parseInt(-1, \"-a/--across arg must be at least 1\"); break;\n\t\t\tcase -1: break; /* Done with options. 
*/\n\t\t\tcase 0:\n\t\t\t\tif (long_options[option_index].flag != 0)\n\t\t\t\t\tbreak;\n\t\t\tdefault:\n\t\t\t\tprintUsage(cerr);\n\t\t\t\tthrow 1;\n\t\t}\n\t} while(next_option != -1);\n}\n\nstatic void print_fasta_record(\n\tostream& fout,\n\tconst string& defline,\n\tconst string& seq)\n{\n\tfout << \">\";\n\tfout << defline.c_str() << endl;\n\n\tif(across > 0) {\n\t\tsize_t i = 0;\n\t\twhile (i + across < seq.length())\n\t\t{\n\t\t\tfout << seq.substr(i, across).c_str() << endl;\n\t\t\ti += across;\n\t\t}\n\t\tif (i < seq.length())\n\t\t\tfout << seq.substr(i).c_str() << endl;\n\t} else {\n\t\tfout << seq.c_str() << endl;\n\t}\n}\n\n/**\n * Counts the number of unique k-mers in the reference sequence\n * that's reconstructed from the index\n */\ntemplate<typename index_t, typename TStr>\nstatic uint64_t count_idx_kmers ( Ebwt<index_t>& ebwt)\n{\n\tTStr cat_ref;\n\tebwt.restore(cat_ref);\n\tcerr << \"Index loaded\" << endl;\n#ifdef TEST_KMER_COUNTING\n\tstd::set<uint64_t> my_set;\n#endif\n\n\tHyperLogLogPlusMinus<uint64_t> kmer_counter(16);\n\tuint64_t word = 0;\n\tuint64_t curr_length = 0;\n\tuint8_t k = 32;\n\n\tTIndexOffU curr_ref = OFF_MASK;\n\tTIndexOffU last_text_off = 0;\n\tsize_t orig_len = cat_ref.length();\n\tTIndexOffU tlen = OFF_MASK;\n\tbool first = true;\n\n\tfor(size_t i = 0; i < orig_len; i++) {\n\t\tTIndexOffU tidx = OFF_MASK;\n\t\tTIndexOffU textoff = OFF_MASK;\n\t\ttlen = OFF_MASK;\n\t\tbool straddled = false;\n\t\tebwt.joinedToTextOff(1 /* qlen */, (TIndexOffU)i, tidx, textoff, tlen, true, straddled);\n\n\t\tif (tidx != OFF_MASK && textoff < tlen) {\n\t\t\tif (curr_ref != tidx) {\n\t\t\t\t// End of the sequence - reset word and counter\n\t\t\t\tcurr_ref = tidx;\n\t\t\t\tword = 0; curr_length = 0;\n\t\t\t\tlast_text_off = 0;\n\t\t\t\tfirst = true;\n\t\t\t}\n\n\t\t\tTIndexOffU textoff_adj = textoff;\n\t\t\tif(first && textoff > 0) textoff_adj++;\n\t\t\tif (textoff_adj - last_text_off > 1) {\n\t\t\t\t// there's an N - reset word and 
counter\n\t\t\t\tword = 0; curr_length = 0;\n\t\t\t}\n\t\t\t// add another char.\n            int bp = (int)cat_ref[i];\n\n            // shift the first two bits off the word\n            word = word << 2;\n            // put the base-pair code from pos at that position\n            word |= bp;\n\t\t\t++curr_length;\n\t\t\t//cerr << \"[\" << i << \"; \" << curr_length << \"; \" << word << \":\" << kmer_counter.cardinality()  << \"]\" << endl;\n\t\t\tif (curr_length >= k) {\n\t\t\t\tkmer_counter.add(word);\n#ifdef TEST_KMER_COUNTING\n\t\t\t\tmy_set.insert(word);\n\t\t\t\tcerr << \" \" << kmer_counter.cardinality()  << \" vs \" << my_set.size() << endl;\n#endif\n\t\t\t}\n\n\t\t\tlast_text_off = textoff;\n\t\t\tfirst = false;\n\n\t\t}\n\t}\n\tif (curr_length >= k) {\n\t\tkmer_counter.add(word);\n#ifdef TEST_KMER_COUNTING\n\t\tmy_set.insert(word);\n#endif\n\t}\n\n#ifdef TEST_KMER_COUNTING\n\tcerr << \"Exact count: \" << my_set.size() << endl;\n#endif\n\n\treturn kmer_counter.cardinality();\n}\n\n/**\n * Given output stream, BitPairReference, reference index, name and\n * length, print the whole nucleotide reference with the appropriate\n * number of columns.\n */\nstatic void print_ref_sequence(\n\tostream& fout,\n\tBitPairReference& ref,\n\tconst string& name,\n\tsize_t refi,\n\tsize_t len)\n{\n\tbool newlines = across > 0;\n\tint myacross = across > 0 ? 
across : 60;\n\tsize_t incr = myacross * 1000;\n\tuint32_t *buf = new uint32_t[(incr + 128)/4];\n\tfout << \">\" << name.c_str() << \"\\n\";\n\tASSERT_ONLY(SStringExpandable<uint32_t> destU32);\n\tfor(size_t i = 0; i < len; i += incr) {\n\t\tsize_t amt = min(incr, len-i);\n\t\tassert_leq(amt, incr);\n\t\tint off = ref.getStretch(buf, refi, i, amt ASSERT_ONLY(, destU32));\n\t\tuint8_t *cb = ((uint8_t*)buf) + off;\n\t\tfor(size_t j = 0; j < amt; j++) {\n\t\t\tif(newlines && j > 0 && (j % myacross) == 0) fout << \"\\n\";\n\t\t\tassert_range(0, 4, (int)cb[j]);\n\t\t\tfout << \"ACGTN\"[(int)cb[j]];\n\t\t}\n\t\tfout << \"\\n\";\n\t}\n\tdelete [] buf;\n}\n\n/**\n * Create a BitPairReference encapsulating the reference portion of the\n * index at the given basename.  Iterate through the reference\n * sequences, sending each one to print_ref_sequence to print.\n */\nstatic void print_ref_sequences(\n\tostream& fout,\n\tbool color,\n\tconst EList<string>& refnames,\n\tconst TIndexOffU* plen,\n\tconst string& adjustedEbwtFileBase)\n{\n\tBitPairReference ref(\n\t\tadjustedEbwtFileBase, // input basename\n\t\tcolor,                // true -> expect colorspace reference\n\t\tfalse,                // sanity-check reference\n\t\tNULL,                 // infiles\n\t\tNULL,                 // originals\n\t\tfalse,                // infiles are sequences\n\t\tfalse,                // memory-map\n\t\tfalse,                // use shared memory\n\t\tfalse,                // sweep mm-mapped ref\n\t\tverbose,              // be talkative\n\t\tverbose);             // be talkative at startup\n\tassert_eq(ref.numRefs(), refnames.size());\n\tfor(size_t i = 0; i < ref.numRefs(); i++) {\n\t\tprint_ref_sequence(\n\t\t\tfout,\n\t\t\tref,\n\t\t\trefnames[i],\n\t\t\ti,\n\t\t\tplen[i] + (color ? 
1 : 0));\n\t}\n}\n\n/**\n * Given an index, reconstruct the reference by LF mapping through the\n * entire thing.\n */\ntemplate<typename index_t, typename TStr>\nstatic void print_index_sequences(ostream& fout, Ebwt<index_t>& ebwt)\n{\n\tEList<string>* refnames = &(ebwt.refnames());\n\n\tTStr cat_ref;\n\tebwt.restore(cat_ref);\n\n\tHyperLogLogPlusMinus<uint64_t> kmer_counter;\n\tTIndexOffU curr_ref = OFF_MASK;\n\tstring curr_ref_seq = \"\";\n\tTIndexOffU curr_ref_len = OFF_MASK;\n\tTIndexOffU last_text_off = 0;\n\tsize_t orig_len = cat_ref.length();\n\tTIndexOffU tlen = OFF_MASK;\n\tbool first = true;\n\tfor(size_t i = 0; i < orig_len; i++) {\n\t\tTIndexOffU tidx = OFF_MASK;\n\t\tTIndexOffU textoff = OFF_MASK;\n\t\ttlen = OFF_MASK;\n\t\tbool straddled = false;\n\t\tebwt.joinedToTextOff(1 /* qlen */, (TIndexOffU)i, tidx, textoff, tlen, true, straddled);\n\n\t\tif (tidx != OFF_MASK && textoff < tlen)\n\t\t{\n\t\t\tif (curr_ref != tidx)\n\t\t\t{\n\t\t\t\tif (curr_ref != OFF_MASK)\n\t\t\t\t{\n\t\t\t\t\t// Add trailing gaps, if any exist\n\t\t\t\t\tif(curr_ref_seq.length() < curr_ref_len) {\n\t\t\t\t\t\tcurr_ref_seq += string(curr_ref_len - curr_ref_seq.length(), 'N');\n\t\t\t\t\t}\n\t\t\t\t\tprint_fasta_record(fout, (*refnames)[curr_ref], curr_ref_seq);\n\t\t\t\t}\n\t\t\t\tcurr_ref = tidx;\n\t\t\t\tcurr_ref_seq = \"\";\n\t\t\t\tcurr_ref_len = tlen;\n\t\t\t\tlast_text_off = 0;\n\t\t\t\tfirst = true;\n\t\t\t}\n\n\t\t\tTIndexOffU textoff_adj = textoff;\n\t\t\tif(first && textoff > 0) textoff_adj++;\n\t\t\tif (textoff_adj - last_text_off > 1)\n\t\t\t\tcurr_ref_seq += string(textoff_adj - last_text_off - 1, 'N');\n\n            curr_ref_seq.push_back(\"ACGT\"[int(cat_ref[i])]);\t\t\t\n\t\t\tlast_text_off = textoff;\n\t\t\tfirst = false;\n\t\t}\n\t}\n\tif (curr_ref < refnames->size())\n\t{\n\t\t// Add trailing gaps, if any exist\n\t\tif(curr_ref_seq.length() < curr_ref_len) {\n\t\t\tcurr_ref_seq += string(curr_ref_len - curr_ref_seq.length(), 
'N');\n\t\t}\n\t\tprint_fasta_record(fout, (*refnames)[curr_ref], curr_ref_seq);\n\t}\n\n}\n\nstatic char *argv0 = NULL;\n\ntemplate <typename index_t>\nstatic void print_index_sequence_names(const string& fname, ostream& fout)\n{\n\tEList<string> p_refnames;\n\treadEbwtRefnames<index_t>(fname, p_refnames);\n\tfor(size_t i = 0; i < p_refnames.size(); i++) {\n\t\tcout << p_refnames[i].c_str() << endl;\n\t}\n}\n\n/**\n * Print a short summary of what's in the index and its flags.\n */\ntemplate <typename index_t>\nstatic void print_index_summary(\n\tconst string& fname,\n\tostream& fout)\n{\n\tint32_t flags = Ebwt<index_t>::readFlags(fname);\n\tbool color = readEbwtColor(fname);\n\tEbwt<index_t> ebwt(\n\t\t\t\t\t   fname,\n\t\t\t\t\t   color,                // index is colorspace\n\t\t\t\t\t   -1,                   // don't require entire reverse\n\t\t\t\t\t   true,                 // index is for the forward direction\n\t\t\t\t\t   -1,                   // offrate (-1 = index default)\n\t\t\t\t\t   0,                    // offrate-plus (0 = index default)\n\t\t\t\t\t   false,                // use memory-mapped IO\n\t\t\t\t\t   false,                // use shared memory\n\t\t\t\t\t   false,                // sweep memory-mapped memory\n\t\t\t\t\t   true,                 // load names?\n\t\t\t\t\t   false,                // load SA sample?\n\t\t\t\t\t   false,                // load ftab?\n\t\t\t\t\t   false,                // load rstarts?\n\t\t\t\t\t   verbose,              // be talkative?\n\t\t\t\t\t   verbose,              // be talkative at startup?\n\t\t\t\t\t   false,                // pass up memory exceptions?\n\t\t\t\t\t   false);               // sanity check?\n\tEList<string> p_refnames;\n\treadEbwtRefnames<index_t>(fname, p_refnames);\n\tcout << \"Flags\" << '\\t' << (-flags) << endl;\n\tcout << \"SA-Sample\" << \"\\t1 in \" << (1 << ebwt.eh().offRate()) << endl;\n\tcout << \"FTab-Chars\" << '\\t' << ebwt.eh().ftabChars() << 
endl;\n\tassert_eq(ebwt.nPat(), p_refnames.size());\n\tfor(size_t i = 0; i < p_refnames.size(); i++) {\n\t\tcout << \"Sequence-\" << (i+1)\n\t\t     << '\\t' << p_refnames[i].c_str()\n\t\t     << '\\t' << (ebwt.plen()[i] + (color ? 1 : 0))\n\t\t     << endl;\n\t}\n}\n\nextern void initializeCntLut();\n\nstatic void driver(\n\tconst string& ebwtFileBase,\n\tconst string& query)\n{\n    initializeCntLut();\n    \n\t// Adjust\n\tstring adjustedEbwtFileBase = adjustEbwtBase(argv0, ebwtFileBase, verbose);\n\n\tif(names_only) {\n\t\tprint_index_sequence_names<TIndexOffU>(adjustedEbwtFileBase, cout);\n\t} else if(summarize_only) {\n\t\tprint_index_summary<TIndexOffU>(adjustedEbwtFileBase, cout);\n    } else {\n        // Initialize Ebwt object\n\t\tbool color = readEbwtColor(adjustedEbwtFileBase);\n\t\tHierEbwt<TIndexOffU, uint16_t> ebwt(\n                                            adjustedEbwtFileBase,\n                                            color,                // index is colorspace\n                                            -1,                   // don't care about entire-reverse\n                                            true,                 // index is for the forward direction\n                                            -1,                   // offrate (-1 = index default)\n                                            0,                    // offrate-plus (0 = index default)\n                                            false,                // use memory-mapped IO\n                                            false,                // use shared memory\n                                            false,                // sweep memory-mapped memory\n                                            true,                 // load names?\n                                            true,                 // load SA sample?\n                                            true,                 // load ftab?\n                                            true,               
  // load rstarts?\n                                            false,                // be talkative?\n                                            false,                // be talkative at startup?\n                                            false,                // pass up memory exceptions?\n                                            false);               // sanity check?        \n        \n        if(conversion_table) {\n            const EList<pair<string, uint64_t> >& uid_to_tid = ebwt.uid_to_tid();\n            for(size_t i = 0; i < uid_to_tid.size(); i++) {\n                uint64_t tid = uid_to_tid[i].second;\n                cout << uid_to_tid[i].first << \"\\t\"\n                     << (tid & 0xffffffff);\n                tid >>= 32;\n                if(tid > 0) {\n                    cout << \".\" << tid;\n                }\n                cout << endl;\n            }\n        } else if(taxonomy_tree) {\n            const map<uint64_t, TaxonomyNode>& tree = ebwt.tree();\n            for(map<uint64_t, TaxonomyNode>::const_iterator itr = tree.begin(); itr != tree.end(); itr++) {\n                string rank = get_tax_rank_string(itr->second.rank);\n                cout << itr->first << \"\\t|\\t\" << itr->second.parent_tid << \"\\t|\\t\" << rank << endl;\n            }\n        } else if(name_table) {\n            const std::map<uint64_t, string>& name_map = ebwt.name();\n            for(std::map<uint64_t, string>::const_iterator itr = name_map.begin(); itr != name_map.end(); itr++) {\n                uint64_t tid = itr->first;\n                cout << (tid & 0xffffffff);\n                tid >>= 32;\n                if(tid > 0) {\n                    cout << \".\" << tid;\n                }\n                cout << \"\\t\" << itr->second << endl;\n            }\n        } else if(size_table) {\n            const std::map<uint64_t, uint64_t>& size_map = ebwt.size();\n            for(std::map<uint64_t, uint64_t>::const_iterator itr = size_map.begin(); 
itr != size_map.end(); itr++) {\n                uint64_t tid = itr->first;\n                uint64_t size = itr->second;\n                cout << (tid & 0xffffffff);\n                tid >>= 32;\n                if(tid > 0) {\n                    cout << \".\" << tid;\n                }\n                cout << \"\\t\" << size << endl;\n            }\n        } else if (count_kmers) {\n        \tebwt.loadIntoMemory(\n        \t                                -1,     // color\n        \t                                -1,     // need entire reverse\n        \t                                true,   // load SA sample\n        \t                                true,   // load ftab\n        \t                                true,   // load rstarts\n        \t                                true,   // load names\n        \t                                verbose);  // verbose\n        \tuint64_t n_kmers = count_idx_kmers<TIndexOffU, SString<char> >(ebwt);\n        \tcout << \"Approximate number of kmers in the reference sequence: \" << n_kmers << endl;\n\n        } else {\n            ebwt.loadIntoMemory(\n                                -1,     // color\n                                -1,     // need entire reverse\n                                true,   // load SA sample\n                                true,   // load ftab\n                                true,   // load rstarts\n                                true,   // load names\n                                verbose);  // verbose\n            \n            // Load whole index into memory\n            if(refFromEbwt || true) {\n                print_index_sequences<TIndexOffU, SString<char> >(cout, ebwt);\n            } else {\n                EList<string> refnames;\n                readEbwtRefnames<TIndexOffU>(adjustedEbwtFileBase, refnames);\n                print_ref_sequences(\n                                    cout,\n                                    readEbwtColor(ebwtFileBase),\n                   
                 refnames,\n                                    ebwt.plen(),\n                                    adjustedEbwtFileBase);\n            }\n        }\n\t\t// Evict any loaded indexes from memory\n\t\tif(ebwt.isInMemory()) {\n\t\t\tebwt.evictFromMemory();\n\t\t}\n\t}\n}\n\n/**\n * main function.  Parses command-line arguments.\n */\nint main(int argc, char **argv) {\n\ttry {\n\t\tstring ebwtFile;  // read serialized Ebwt from this file\n\t\tstring query;   // read query string(s) from this file\n\t\tEList<string> queries;\n\t\tstring outfile; // write query results to this file\n\t\targv0 = argv[0];\n\t\tparseOptions(argc, argv);\n\t\tif(showVersion) {\n\t\t\tcout << argv0 << \" version \" << CENTRIFUGE_VERSION << endl;\n\t\t\tif(sizeof(void*) == 4) {\n\t\t\t\tcout << \"32-bit\" << endl;\n\t\t\t} else if(sizeof(void*) == 8) {\n\t\t\t\tcout << \"64-bit\" << endl;\n\t\t\t} else {\n\t\t\t\tcout << \"Neither 32- nor 64-bit: sizeof(void*) = \" << sizeof(void*) << endl;\n\t\t\t}\n\t\t\tcout << \"Built on \" << BUILD_HOST << endl;\n\t\t\tcout << BUILD_TIME << endl;\n\t\t\tcout << \"Compiler: \" << COMPILER_VERSION << endl;\n\t\t\tcout << \"Options: \" << COMPILER_OPTIONS << endl;\n\t\t\tcout << \"Sizeof {int, long, long long, void*, size_t, off_t}: {\"\n\t\t\t\t << sizeof(int)\n\t\t\t\t << \", \" << sizeof(long) << \", \" << sizeof(long long)\n\t\t\t\t << \", \" << sizeof(void *) << \", \" << sizeof(size_t)\n\t\t\t\t << \", \" << sizeof(off_t) << \"}\" << endl;\n\t\t\treturn 0;\n\t\t}\n\n\t\t// Get input filename\n\t\tif(optind >= argc) {\n\t\t\tcerr << \"No index name given!\" << endl;\n\t\t\tprintUsage(cerr);\n\t\t\treturn 1;\n\t\t}\n\t\tebwtFile = argv[optind++];\n\n\t\t// Optionally summarize\n\t\tif(verbose) {\n\t\t\tcout << \"Input ebwt file: \\\"\" << ebwtFile.c_str() << \"\\\"\" << endl;\n\t\t\tcout << \"Output file: \\\"\" << outfile.c_str() << \"\\\"\" << endl;\n\t\t\tcout << \"Local endianness: \" << (currentlyBigEndian()? 
\"big\":\"little\") << endl;\n#ifdef NDEBUG\n\t\t\tcout << \"Assertions: disabled\" << endl;\n#else\n\t\t\tcout << \"Assertions: enabled\" << endl;\n#endif\n\t\t}\n\t\tdriver(ebwtFile, query);\n\t\treturn 0;\n\t} catch(std::exception& e) {\n\t\tcerr << \"Error: Encountered exception: '\" << e.what() << \"'\" << endl;\n\t\tcerr << \"Command: \";\n\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\tcerr << endl;\n\t\treturn 1;\n\t} catch(int e) {\n\t\tif(e != 0) {\n\t\t\tcerr << \"Error: Encountered internal Centrifuge exception (#\" << e << \")\" << endl;\n\t\t\tcerr << \"Command: \";\n\t\t\tfor(int i = 0; i < argc; i++) cerr << argv[i] << \" \";\n\t\t\tcerr << endl;\n\t\t}\n\t\treturn e;\n\t}\n}\n"
  },
  {
    "path": "centrifuge_main.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include <fstream>\n#include <string.h>\n#include <stdlib.h>\n#include \"tokenize.h\"\n#include \"ds.h\"\n\nusing namespace std;\n\nextern \"C\" {\n\tint centrifuge(int argc, const char **argv);\n}\n\n/**\n * Bowtie main function.  
It is placed in a separate source file to\n * make it slightly easier to compile Bowtie as a library.\n *\n * If the user specifies -A <file> as the first two arguments, main\n * will interpret that file as having one set of command-line arguments\n * per line, and will dispatch each batch of arguments one at a time to\n * bowtie.\n */\nint main(int argc, const char **argv) {\n\tif(argc > 2 && strcmp(argv[1], \"-A\") == 0) {\n\t\tconst char *file = argv[2];\n\t\tifstream in;\n\t\tin.open(file);\n\t\tchar buf[4096];\n\t\tint lastret = -1;\n\t\twhile(in.getline(buf, 4095)) {\n\t\t\tEList<string> args;\n\t\t\targs.push_back(string(argv[0]));\n\t\t\ttokenize(buf, \" \\t\", args);\n\t\t\tconst char **myargs = (const char**)malloc(sizeof(char*)*args.size());\n\t\t\tfor(size_t i = 0; i < args.size(); i++) {\n\t\t\t\tmyargs[i] = args[i].c_str();\n\t\t\t}\n\t\t\tif(args.size() == 1) continue;\n\t\t\tlastret = centrifuge((int)args.size(), myargs);\n\t\t\tfree(myargs);\n\t\t}\n\t\tif(lastret == -1) {\n\t\t\tcerr << \"Warning: No arg strings parsed from \" << file << endl;\n\t\t\treturn 0;\n\t\t}\n\t\treturn lastret;\n\t} else {\n\t\treturn centrifuge(argc, argv);\n\t}\n}\n"
  },
  {
    "path": "centrifuge_report.cpp",
    "content": "/*\n * centrifuge-build.cpp\n *\n *  Created on: Apr 8, 2015\n *      Author: fbreitwieser\n */\n\n#include<iostream>\n#include<fstream>\n#include<sstream>\n#include<map>\n#include<vector>\n#include \"assert_helpers.h\"\n#include \"sstring.h\"\n#include \"ds.h\"      // EList\n#include \"bt2_idx.h\" // Ebwt\n#include \"bt2_io.h\"\n#include \"util.h\"\n\nusing namespace std;\ntypedef TIndexOffU index_t;\n\nstatic bool startVerbose = true; // be talkative at startup\nint gVerbose = 1; // be talkative always\nstatic const char *argv0 = NULL;\nstatic string adjIdxBase;\nstatic bool useShmem\t\t\t\t= false; // use shared memory to hold the index\nstatic bool useMm\t\t\t\t\t= false; // use memory-mapped files to hold the index\nstatic bool mmSweep\t\t\t\t\t= false; // sweep through memory-mapped files immediately after mapping\nstatic int offRate\t\t\t\t\t= -1;    // keep default offRate\nstatic bool noRefNames\t\t\t\t= false; // true -> print reference indexes; not names\nstatic int sanityCheck\t\t\t\t= 0;  // enable expensive sanity checks\n/**\n * Print a summary usage message to the provided output stream.\n */\nstatic void printUsage(ostream& out) {\n\tout << \"Centrifuge version \" << string(CENTRIFUGE_VERSION).c_str() << \" by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)\" << endl;\n\tstring tool_name = \"centrifuge-class\";\n\n\tout << \"Usage: \" << endl\n\t    << \"  \" << tool_name.c_str() << \" <bt2-idx> <centrifuge-out>\" << endl\n\t    << endl\n\t\t<<     \"  <bt2-idx>  Index filename prefix (minus trailing .X.\" << gEbwt_ext << \").\" << endl\n\t\t<<     \"  <centrifuge-out>  Centrifuge result file.\" << endl;\n}\n\ntemplate <typename T>\nclass Pair2ndComparator{\npublic:\n     bool operator()(const pair<T,T> &left, const pair<T,T> &right){\n    \t return left.second < right.second;\n     }\n};\n\n\ntemplate<typename TStr>\nstatic void driver(\n\tconst char * type,\n\tconst string& bt2indexBase,\n\tconst string& 
cf_out)\n{\n\tif(gVerbose || startVerbose)  {\n\t\tcerr << \"Entered driver(): \"; logTime(cerr, true);\n\t}\n\n    //initializeCntLut();  // FB: test commenting\n\n\t// Vector of the reference sequences; used for sanity-checking\n\tEList<SString<char> > names, os;\n\tEList<size_t> nameLens, seqLens;\n\n\t// Initialize Ebwt object and read in header\n\tif(gVerbose || startVerbose) {\n\t\tcerr << \"About to initialize fw Ebwt: \"; logTime(cerr, true);\n\t}\n\tadjIdxBase = adjustEbwtBase(argv0, bt2indexBase, gVerbose);\n\tEbwt<index_t> ebwt(\n\t\tadjIdxBase,\n\t    0,        // index is colorspace\n\t\t-1,       // fw index\n\t    true,     // index is for the forward direction\n\t    /* overriding: */ offRate,\n\t\t0, // amount to add to index offrate or <= 0 to do nothing\n\t    useMm,    // whether to use memory-mapped files\n\t    useShmem, // whether to use shared memory\n\t    mmSweep,  // sweep memory-mapped files\n\t    !noRefNames, // load names?\n\t\ttrue,        // load SA sample?\n\t\ttrue,        // load ftab?\n\t\ttrue,        // load rstarts?\n\t    gVerbose, // whether to be talkative\n\t    startVerbose, // talkative during initialization\n\t    false /*passMemExc*/,\n\t    sanityCheck);\n\t//Ebwt<index_t>* ebwtBw = NULL;\n\n\n\tEList<size_t> reflens;\n\tEList<string> refnames;\n\treadEbwtRefnames<index_t>(adjIdxBase, refnames);\n\tmap<uint32_t,pair<string,uint64_t> > speciesID_to_name_len;\n\tfor(size_t i = 0; i < ebwt.nPat(); i++) {\n\t\t// cerr << \"Push back to reflens: \"<<  refnames[i] << \" is so long: \" << ebwt.plen()[i] << endl;\n\t\treflens.push_back(ebwt.plen()[i]);\n\n\t\t// extract numeric id from refName\n\t\tconst string& refName = refnames[i];\n\t\tuint64_t id = extractIDFromRefName(refName);\n\t\tuint32_t speciesID = (uint32_t)(id >> 32);\n\n\t\t// extract name from refName\n\t\tconst string& name_part = refName.substr(refName.find_first_of(' '));\n\n\t\t//uint32_t genusID = (uint32_t)(id & 
0xffffffff);\n\t\tspeciesID_to_name_len[speciesID] = pair<string,uint64_t>(name_part,ebwt.plen()[i]);\n\n\t}\n//\tEList<string> refnames;\n//\treadEbwtRefnames<index_t>(adjIdxBase, refnames);\n\n\t// Read Centrifuge output file\n\tifstream infile(cf_out.c_str());\n\n\tstring line;\n\tmap<uint32_t,uint32_t> species_to_score;\n\n\twhile (getline(infile,line)) {\n\t\tstring rd_name;\n\t\tuint32_t genusID;\n\t\tuint32_t speciesID;\n\t\tuint32_t score;\n\t\tuint32_t secbest_score;\n\n\t\tistringstream iss(line);\n\t\tiss >> rd_name >> genusID >> speciesID >> score >> secbest_score;\n\t\t// cerr << rd_name << \" -> \" << genusID << \" -> \" << speciesID << \" -> \" << score << \" -> \" << secbest_score << \"\\n\";\n\t\tspecies_to_score[speciesID] += score;\n\t}\n\n\t// Sort the species by their score\n\tvector<pair<uint32_t,uint32_t> > species_to_score_v(species_to_score.begin(), species_to_score.end());\n\n\tsort(species_to_score_v.begin(),species_to_score_v.end(),Pair2ndComparator<uint32_t>());\n\n\tcout << \"Name\\tTaxonID\\tLength\\tSummed Score\\tNormalized Score\\n\";\n\t// Output the summed species scores\n\tfor (vector<pair<uint32_t,uint32_t> >::iterator species_score = species_to_score_v.begin();\n\t\t\tspecies_score != species_to_score_v.end();\n\t\t\t++species_score) {\n\t\tuint32_t speciesID = species_score->first;\n\t\tpair<string,uint64_t> name_len = speciesID_to_name_len[speciesID];\n\t\tuint64_t slength = name_len.second;\n\t\tuint64_t sumscore = species_score->second;\n\n\t\tcout << name_len.first << \"\\t\" <<\n\t\t\t\tspeciesID << \"\\t\" <<\n\t\t\t\tslength << \"\\t\" <<\n\t\t\t\tsumscore << \"\\t\" <<\n\t\t\t\t(float)sumscore/slength << \"\\n\";\n\t}\n\n\n\n}\n\n//int centrifuge_report(int argc, const char **argv) {\nint main(int argc, const char **argv) {\n\n\tif (argc < 3) {\n\t\tcerr << \"Number of arguments is \" << argc << endl;\n\t\tprintUsage(cerr);\n\t\texit(1);\n\t}\n\n\targv0 = argv[0];\n\tconst string bt2index = argv[1];\n\tconst string 
cf_out = argv[2];\n\t//static string outfile;        // write SAM output to this file\n\n\tcout << \"Input bt2 file: \\\"\" << bt2index.c_str() << \"\\\"\" << endl;\n\tcout << \"Centrifuge results file: \\\"\" << cf_out.c_str() << \"\\\"\" << endl;\n\n\tdriver<SString<char> >(\"DNA\", bt2index, cf_out);\n\treturn 0;\n}\n\n"
  },
  {
    "path": "classifier.h",
    "content": "/*\n * Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n *\n * This file is part of HISAT.\n *\n * HISAT is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * HISAT is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with HISAT.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef CLASSIFIER_H_\n#define CLASSIFIER_H_\n\n//#define LI_DEBUG\n\n#include <algorithm>\n#include <vector>\n#include \"hi_aligner.h\"\n#include \"util.h\"\n\ntemplate<typename index_t>\nstruct HitCount {\n    uint64_t uniqueID;\n    uint64_t taxID;\n    uint32_t count;\n    uint32_t score;\n    uint32_t scores[2][2];      // scores[rdi][fwi]\n    double   summedHitLen;\n    double   summedHitLens[2][2]; // summedHitLens[rdi][fwi]\n    uint32_t timeStamp;\n    EList<pair<uint32_t,uint32_t> > readPositions;\n    bool     leaf;\n    uint32_t num_leaves;\n    \n    uint8_t _rank;  // there are compilation error w/ g++ v7.2.0 on OSX when naming the member 'rank' instead of '_rank' \n    EList<uint64_t> path;\n    \n    void reset() {\n        uniqueID = taxID = count = score = timeStamp = 0;\n        scores[0][0] = scores[0][1] = scores[1][0] = scores[1][1] = 0;\n        summedHitLen = 0.0;\n        summedHitLens[0][0] = summedHitLens[0][1] = summedHitLens[1][0] = summedHitLens[1][1] = 0.0;\n        readPositions.clear();\n        _rank = 0;\n        path.clear();\n        leaf = true;\n        num_leaves = 1;\n    }\n    \n    HitCount& operator=(const HitCount& o) {\n        if(this == &o)\n            return *this;\n      
  \n        uniqueID = o.uniqueID;\n        taxID = o.taxID;\n        count = o.count;\n        score = o.score;\n        scores[0][0] = o.scores[0][0];\n        scores[0][1] = o.scores[0][1];\n        scores[1][0] = o.scores[1][0];\n        scores[1][1] = o.scores[1][1];\n        summedHitLen = o.summedHitLen;\n        summedHitLens[0][0] = o.summedHitLens[0][0];\n        summedHitLens[0][1] = o.summedHitLens[0][1];\n        summedHitLens[1][0] = o.summedHitLens[1][0];\n        summedHitLens[1][1] = o.summedHitLens[1][1];\n        timeStamp = o.timeStamp;\n        readPositions = o.readPositions;\n        leaf = o.leaf;\n        num_leaves = o.num_leaves;\n        _rank = o._rank;\n        path = o.path;\n        \n        return *this;\n    }\n\n    void finalize(\n                  bool paired,\n                  bool mate1fw,\n                  bool mate2fw) {\n        if(paired) {\n#if 1\n            score = max(scores[0][0], scores[0][1]) + max(scores[1][0], scores[1][1]);\n            summedHitLen = max(summedHitLens[0][0], summedHitLens[0][1]) + max(summedHitLens[1][0], summedHitLens[1][1]);\n#else\n            uint32_t score1 = 0, score2 = 0;\n            double summedHitLen1 = 0.0, summedHitLen2 = 0.0;\n            if(mate1fw == mate2fw) {\n                score1 = scores[0][0] + scores[1][0];\n                score2 = scores[0][1] + scores[1][1];\n                summedHitLen1 = summedHitLens[0][0] + summedHitLens[1][0];\n                summedHitLen2 = summedHitLens[0][1] + summedHitLens[1][1];\n            } else {\n                score1 = scores[0][0] + scores[1][1];\n                score2 = scores[0][1] + scores[1][0];\n                summedHitLen1 = summedHitLens[0][0] + summedHitLens[1][1];\n                summedHitLen2 = summedHitLens[0][1] + summedHitLens[1][0];\n            }\n            if(score1 >= score2) {\n                score = score1;\n                summedHitLen = summedHitLen1;\n            } else {\n                score = 
score2;\n                summedHitLen = summedHitLen2;\n            }\n#endif\n        } else {\n            score = max(scores[0][0], scores[0][1]);\n            summedHitLen = max(summedHitLens[0][0], summedHitLens[0][1]);\n        }\n    }\n};\n\n/**\n * With a hierarchical indexing, SplicedAligner provides several alignment strategies\n * , which enable effective alignment of RNA-seq reads\n */\ntemplate <typename index_t, typename local_index_t>\nclass Classifier : public HI_Aligner<index_t, local_index_t> {\n    \npublic:\n    \n    /**\n     * Initialize with index.\n     */\n    Classifier(const Ebwt<index_t>& ebwt,\n               const EList<string>& refnames,\n               bool mate1fw,\n               bool mate2fw,\n               index_t minHitLen,\n               bool tree_traverse,\n               const string& classification_rank,\n               const EList<uint64_t>& hostGenomes,\n               const EList<uint64_t>& excluded_taxIDs) :\n    HI_Aligner<index_t, local_index_t>(\n                                       ebwt,\n                                       0,    // don't make use of splice sites found by earlier reads\n                                       true), // no spliced alignment\n    _refnames(refnames),\n    _minHitLen(minHitLen),\n    _mate1fw(mate1fw),\n    _mate2fw(mate2fw),\n    _tree_traverse(tree_traverse)\n    {\n        _classification_rank = get_tax_rank_id(classification_rank.c_str());\n        _classification_rank = TaxonomyPathTable::rank_to_pathID(_classification_rank);\n        \n        const map<uint64_t, TaxonomyNode>& tree = ebwt.tree();\n        _host_taxIDs.clear();\n        if(hostGenomes.size() > 0) {\n            for(map<uint64_t, TaxonomyNode>::const_iterator itr = tree.begin(); itr != tree.end(); itr++) {\n                uint64_t tmp_taxID = itr->first;\n                while(true) {\n                    bool found = false;\n                    for(size_t t = 0; t < hostGenomes.size(); t++) {\n            
            if(tmp_taxID == hostGenomes[t]) {\n                            _host_taxIDs.insert(itr->first);\n                            found = true;\n                            break;\n                        }\n                    }\n                    if(found) break;\n                    map<uint64_t, TaxonomyNode>::const_iterator itr2 = tree.find(tmp_taxID);\n                    if(itr2 == tree.end()) break;\n                    const TaxonomyNode& node = itr2->second;\n                    if(tmp_taxID == node.parent_tid) break;\n                    tmp_taxID = node.parent_tid;\n                }\n            }\n        }\n        \n        _excluded_taxIDs.clear();\n        if(excluded_taxIDs.size() > 0) {\n            for(map<uint64_t, TaxonomyNode>::const_iterator itr = tree.begin(); itr != tree.end(); itr++) {\n                uint64_t tmp_taxID = itr->first;\n                while(true) {\n                    bool found = false;\n                    for(size_t t = 0; t < excluded_taxIDs.size(); t++) {\n                        if(tmp_taxID == excluded_taxIDs[t]) {\n                            _excluded_taxIDs.insert(itr->first);\n                            found = true;\n                            break;\n                        }\n                    }\n                    if(found) break;\n                    map<uint64_t, TaxonomyNode>::const_iterator itr2 = tree.find(tmp_taxID);\n                    if(itr2 == tree.end()) break;\n                    if(tmp_taxID == itr2->second.parent_tid) break;\n                    tmp_taxID = itr2->second.parent_tid;\n                }\n            }\n        }\n    }\n    \n    ~Classifier() {\n    }\n\n    /**\n     * Aligns a read or a pair\n     * This function is called per read or pair\n     */\n    virtual\n    int go(\n           const Scoring&           sc,\n           const Ebwt<index_t>&     ebwtFw,\n           const Ebwt<index_t>&     ebwtBw,\n           const BitPairReference&  ref,\n           
WalkMetrics&             wlm,\n           PerReadMetrics&          prm,\n           HIMetrics&               him,\n\t\t   SpeciesMetrics&          spm,\n           RandomSource&            rnd,\n           AlnSinkWrap<index_t>&    sink)\n    {\n        _hitMap.clear();\n        \n        const index_t increment = (2 * _minHitLen <= 33) ? 10 : (2 * _minHitLen - 33);\n        const ReportingParams& rp = sink.reportingParams();\n        index_t maxGenomeHitSize = rp.khits;\n\t\tbool isFw = false;\n        \n        //\n        uint32_t ts = 0; // time stamp\n        // for each mate. only called once for unpaired data\n        for(int rdi = 0; rdi < (this->_paired ? 2 : 1); rdi++) {\n            assert(this->_rds[rdi] != NULL);\n            \n            // search for partial hits on the forward and reverse strand (saved in this->_hits[rdi])\n            searchForwardAndReverse(rdi, ebwtFw, sc, rnd, rp, increment);\n            \n            // get forward or reverse hits for this read from this->_hits[rdi]\n            //  the strand is chosen based on higher average hit length in either direction\n            pair<int, int> fwp = getForwardOrReverseHit(rdi);\n            for(int fwi = fwp.first; fwi < fwp.second; fwi++) {\n                ReadBWTHit<index_t>& hit = this->_hits[rdi][fwi];\n                assert(hit.done());\n                isFw = hit._fw;  // TODO: Sync between mates!\n                \n                // choose candidate partial alignments for further alignment\n                index_t offsetSize = hit.offsetSize();\n                this->_genomeHits.clear();\n                \n                // sort partial hits by size (number of genome positions), ascending, and then length, descending\n                for(size_t hi = 0; hi < offsetSize; hi++) {\n                    const BWTHit<index_t> partialHit = hit.getPartialHit(hi);\n#ifdef LI_DEBUG\n                    cout << partialHit.len() << \" \" << partialHit.size() << endl;\n#endif\n            
        if(partialHit.len() >= _minHitLen && partialHit.size() > maxGenomeHitSize) {\n                        maxGenomeHitSize = partialHit.size();\n                    }\n                }\n                \n                if(maxGenomeHitSize > (index_t)rp.khits) {\n                    maxGenomeHitSize += rp.khits;\n                }\n                \n                hit._partialHits.sort(compareBWTHits());\n                size_t usedPortion = 0;\n                size_t genomeHitCnt = 0;\n                for(size_t hi = 0; hi < offsetSize; hi++, ts++) {\n                    const BWTHit<index_t>& partialHit = hit.getPartialHit(hi);\n                    size_t partialHitLen = partialHit.len();\n                    if(partialHitLen <= _minHitLen) continue;                    \n                    if(partialHit.size() == 0) continue;\n                    \n                    // only keep this partial hit if it is equal to or bigger than minHitLen (default: 22 bp)\n                    // TODO: consider not requiring minHitLen when we have already hits to the same genome\n                    bool considerOnlyIfPreviouslyObserved = partialHitLen < _minHitLen;\n                    \n                    // get all coordinates of the hit\n                    EList<Coord>& coords = getCoords(\n                                                     hit,\n                                                     hi,\n                                                     ebwtFw,\n                                                     ref,\n                                                     rnd,\n                                                     maxGenomeHitSize,\n                                                     wlm,\n                                                     prm,\n                                                     him);\n                    if(coords.empty())\n                        continue;\n                    \n                    usedPortion += 
partialHitLen;\n                    assert_gt(coords.size(), 0);\n                    \n                    // the maximum number of hits per read is maxGenomeHitSize (change with parameter -k)\n                    size_t nHitsToConsider = coords.size();\n                    if((THitInt)coords.size() > rp.ihits) {\n                        continue;\n                    }\n\n                    // find the genome id for all coordinates, and count the number of genomes\n                    EList<pair<uint64_t, uint64_t> > coord_ids;\n                    for(index_t k = 0; k < nHitsToConsider; k++, genomeHitCnt++) {\n                        const Coord& coord = coords[k];\n                        assert_lt(coord.ref(), _refnames.size()); // gives a warning - coord.ref() is signed integer. why?\n                        \n                        // extract numeric id from refName\n                        const EList<pair<string, uint64_t> >& uid_to_tid = ebwtFw.uid_to_tid();\n                        assert_lt(coord.ref(), uid_to_tid.size());\n                        uint64_t taxID = uid_to_tid[coord.ref()].second;\n                        bool found = false;\n                        for(index_t k2 = 0; k2 < coord_ids.size(); k2++) {\n                            // count the genome if it is not in coord_ids, yet\n                            if(coord_ids[k2].first == (uint64_t)coord.ref()) {\n                                found = true;\n                                break;\n                            }\n                        }\n                        if(found) continue;\n                        // add to coord_ids\n                        coord_ids.expand();\n                        coord_ids.back().first = coord.ref();\n                        coord_ids.back().second = taxID;\n                    }\n                    \n                    ASSERT_ONLY(size_t n_genomes = coord_ids.size());\n                    // scoring function: calculate the weight of this 
partial hit\n                    assert_gt(partialHitLen, 15);\n                    assert_gt(n_genomes, 0);\n                    uint32_t partialHitScore = (uint32_t)((partialHitLen - 15) * (partialHitLen - 15)) ; // / n_genomes;\n                    double weightedHitLen = double(partialHitLen) ; // / double(n_genomes) ;\n                    \n                    // go through all coordinates reported for partial hit\n                    for(index_t k = 0; k < coord_ids.size(); ++k) {\n                        uint64_t uniqueID = coord_ids[k].first;\n                        uint64_t taxID = coord_ids[k].second;\n                        if(_excluded_taxIDs.find(taxID) != _excluded_taxIDs.end())\n                            continue ;\n                        // add hit to genus map and get new index in the map\n                        size_t idx = addHitToHitMap(\n                                                    ebwtFw,\n                                                    _hitMap,\n                                                    rdi,\n                                                    fwi,\n                                                    uniqueID,\n                                                    taxID,\n                                                    ts,\n                                                    partialHitScore,\n                                                    weightedHitLen,\n                                                    considerOnlyIfPreviouslyObserved,\n                                                    partialHit._bwoff,\n                                                    partialHit.len());\n                        \n                        //if considerOnlyIfPreviouslyObserved and it was not found, genus Idx size is equal to the genus Map size\n                        if(idx >= _hitMap.size()) {\n                            continue;\n                        }\n                        \n#ifdef FLORIAN_DEBUG\n          
              std::cerr << speciesID << ';';\n#endif\n                    }\n                    \n                    if(genomeHitCnt >= maxGenomeHitSize)\n                        break;\n                    \n#ifdef FLORIAN_DEBUG\n                    std::cerr << \"  partialHits-done\";\n#endif\n                } // partialHits\n            } // fwi\n            \n#ifdef FLORIAN_DEBUG\n            std::cerr << \"  rdi-done\" << endl;\n#endif\n        } // rdi\n        \n        for(size_t i = 0; i < _hitMap.size(); i++) {\n            _hitMap[i].finalize(this->_paired, this->_mate1fw, this->_mate2fw);\n        }\n\n        // See if some of the assignments correspond to host taxIDs\n        int64_t best_score = 0;\n        bool only_host_taxIDs = false;\n        for(size_t gi = 0; gi < _hitMap.size(); gi++) {\n            if(_hitMap[gi].score > best_score) {\n                best_score = _hitMap[gi].score;\n                only_host_taxIDs = (_host_taxIDs.find(_hitMap[gi].taxID) != _host_taxIDs.end());\n            } else if(_hitMap[gi].score == best_score) {\n                only_host_taxIDs |= (_host_taxIDs.find(_hitMap[gi].taxID) != _host_taxIDs.end());\n            }\n        }\n \n        \n        // If the number of hits is more than -k,\n        //   traverse up the taxonomy tree to reduce the number\n        if (!only_host_taxIDs && _hitMap.size() > (size_t)rp.khits) {\n            // Find the best score among all hits\n            uint32_t best_score = _hitMap[0].score;\n            for(size_t i = 1; i < _hitMap.size(); i++) {\n                if(best_score < _hitMap[i].score) {\n                    best_score = _hitMap[i].score;\n                }\n            }\n            \n            // Remove secondary hits\n            for(int i = 0; i < (int)_hitMap.size(); i++) {\n                if(_hitMap[i].score < best_score) {\n                    if(i + 1 < (int)_hitMap.size()) {\n                        _hitMap[i] = _hitMap.back();\n                    
}\n                    _hitMap.pop_back();\n                    i--;\n                }\n            }\n            \n            if(!_tree_traverse) {\n                if(_hitMap.size() > (size_t)rp.khits)\n\t\t{\n\t\t    reportUnclassified( sink ) ;\n                    return 0;\n\t\t}\n            }\n            \n            uint8_t rank = 0;\n            while(_hitMap.size() > (size_t)rp.khits) {\n                _hitTaxCount.clear();\n                for(size_t i = 0; i < _hitMap.size(); i++) {\n                    while(_hitMap[i]._rank < rank) {\n                        if(_hitMap[i]._rank + 1 >= _hitMap[i].path.size()) {\n                            _hitMap[i]._rank = std::numeric_limits<uint8_t>::max();\n                            break;\n                        }\n                        _hitMap[i]._rank += 1;\n                        _hitMap[i].taxID = _hitMap[i].path[_hitMap[i]._rank];\n                        _hitMap[i].leaf = false;\n                    }\n                    if(_hitMap[i]._rank > rank) continue;\n                    \n                    uint64_t parent_taxID = (rank + 1 >= _hitMap[i].path.size() ? 
1 : _hitMap[i].path[rank + 1]);\n                    // Traverse up the tree more until we get non-zero taxID.\n                    if(parent_taxID == 0) continue;\n                    \n                    size_t j = 0;\n                    for(; j < _hitTaxCount.size(); j++) {\n                        if(_hitTaxCount[j].second == parent_taxID) {\n                            _hitTaxCount[j].first += 1;\n                            break;\n                        }\n                    }\n                    if(j == _hitTaxCount.size()) {\n                        _hitTaxCount.expand();\n                        _hitTaxCount.back().first = 1;\n                        _hitTaxCount.back().second = parent_taxID;\n                    }\n                }\n                if(_hitTaxCount.size() <= 0) {\n                    if(rank < _hitMap[0].path.size()) {\n                        rank++;\n                        continue;\n                    } else {\n                        break;\n                    }\n                }\n                _hitTaxCount.sort();\n                size_t j = _hitTaxCount.size();\n                while(j-- > 0) {\n                    uint64_t parent_taxID = _hitTaxCount[j].second;\n                    int64_t max_score = 0;\n                    for(size_t i = 0; i < _hitMap.size(); i++) {\n                        assert_geq(_hitMap[i]._rank, rank);\n                        if(_hitMap[i]._rank != rank) continue;\n                        uint64_t cur_parent_taxID = (rank + 1 >= _hitMap[i].path.size() ? 
1 : _hitMap[i].path[rank + 1]);\n                        if(parent_taxID == cur_parent_taxID) {\n                            _hitMap[i].uniqueID = std::numeric_limits<uint64_t>::max();\n                            _hitMap[i]._rank = rank + 1;\n                            _hitMap[i].taxID = parent_taxID;\n                            _hitMap[i].leaf = false;\n                        }\n                        if(parent_taxID == _hitMap[i].taxID) {\n                            if(_hitMap[i].score > max_score) {\n                                max_score = _hitMap[i].score;\n                            }\n                        }\n                    }\n                    \n                    bool first = true;\n                    size_t rep_i = _hitMap.size();\n                    for(size_t i = 0; i < _hitMap.size(); i++) {\n                        if(parent_taxID == _hitMap[i].taxID) {\n                            if(!first) {\n                                assert_lt(rep_i, _hitMap.size());\n                                _hitMap[rep_i].num_leaves += _hitMap[i].num_leaves;\n                                if(i + 1 < _hitMap.size()) {\n                                    _hitMap[i] = _hitMap.back();\n                                }\n                                _hitMap.pop_back();\n                                i--;\n                            } else {\n                                first = false;\n                                rep_i = i;\n                            }\n                        }\n                    }\n                    \n                    if(_hitMap.size() <= (size_t)rp.khits)\n                        break;\n                }\n                ++rank;\n                if(rank > _hitMap[0].path.size())\n                    break;\n            }\n        }\n        if(!only_host_taxIDs && _hitMap.size() > (size_t)rp.khits)\n\t{\n\t    reportUnclassified( sink ) ;\n            return 0;\n\t}\n       \n#if 0\n       \t// boost up 
the score if the assignment is unique\n        if(_hitMap.size() == 1) {\n            HitCount& hitCount = _hitMap[0];\n            hitCount.score = (hitCount.summedHitLen - 15) * (hitCount.summedHitLen - 15);\n        }\n#endif\n        \n        index_t rdlen = this->_rds[0]->length();\n        int64_t max_score = (rdlen > 15 ? (rdlen - 15) * (rdlen - 15) : 0);\n        if(this->_paired) {\n            rdlen = this->_rds[1]->length();\n            max_score += (rdlen > 15 ? (rdlen - 15) * (rdlen - 15) : 0);\n        }\n        \n      \tbool reported = false ; \n        for(size_t gi = 0; gi < _hitMap.size(); gi++) {\n            assert_gt(_hitMap[gi].score, 0);\n            HitCount<index_t>& hitCount = _hitMap[gi];\n            if(only_host_taxIDs) {\n                if(_host_taxIDs.find(_hitMap[gi].taxID) == _host_taxIDs.end())\n                    continue;\n            }\n            const EList<pair<string, uint64_t> >& uid_to_tid = ebwtFw.uid_to_tid();\n            const std::map<uint64_t, TaxonomyNode>& tree = ebwtFw.tree();\n            uint8_t taxRank = RANK_UNKNOWN;\n            std::map<uint64_t, TaxonomyNode>::const_iterator itr = tree.find(hitCount.taxID);\n            if(itr != tree.end()) {\n                taxRank = itr->second.rank;\n            }\n            // report\n            AlnRes rs;\n            rs.init(\n                    hitCount.score,\n                    max_score,\n                    hitCount.uniqueID < uid_to_tid.size() ? 
uid_to_tid[hitCount.uniqueID].first : get_tax_rank_string(taxRank),\n                    hitCount.taxID,\n                    taxRank,\n                    hitCount.summedHitLen,\n                    hitCount.readPositions,\n                    isFw);\n            sink.report(0, &rs);\n\t    reported = true ;\n        }\n\n\tif ( reported == false ) \n\t\treportUnclassified( sink ) ;\n        \n\treturn 0;\n    }\n    \n    bool getGenomeIdx(\n                      const Ebwt<index_t>&       ebwt,\n                      const BitPairReference&    ref,\n                      RandomSource&              rnd,\n                      index_t                    top,\n                      index_t                    bot,\n                      bool                       fw,\n                      index_t                    maxelt,\n                      index_t                    rdoff,\n                      index_t                    rdlen,\n                      EList<Coord>&              coords,\n                      WalkMetrics&               met,\n                      PerReadMetrics&            prm,\n                      HIMetrics&                 him,\n                      bool                       rejectStraddle,\n                      bool&                      straddled)\n    {\n        straddled = false;\n        assert_gt(bot, top);\n        index_t nelt = bot - top;\n        nelt = min<index_t>(nelt, maxelt);\n        coords.clear();\n        him.globalgenomecoords += (bot - top);\n        this->_offs.resize(nelt);\n        this->_offs.fill(std::numeric_limits<index_t>::max());\n        this->_sas.init(top, rdlen, EListSlice<index_t, 16>(this->_offs, 0, nelt));\n        this->_gws.init(ebwt, ref, this->_sas, rnd, met);\n        for(index_t off = 0; off < nelt; off++) {\n            WalkResult<index_t> wr;\n            this->_gws.advanceElement(\n                                off,\n                                ebwt,         // forward Bowtie index for 
walking left\n                                ref,          // bitpair-encoded reference\n                                this->_sas,   // SA range with offsets\n                                this->_gwstate,     // GroupWalk state; scratch space\n                                wr,           // put the result here\n                                met,          // metrics\n                                prm);         // per-read metrics\n            // Coordinate of the seed hit w/r/t the pasted reference string\n            coords.expand();\n            coords.back().init(wr.toff, 0, fw);\n        }\n        \n        return true;\n    }\n\n    void reportUnclassified( AlnSinkWrap<index_t>& sink )\n    {\n\t    AlnRes rs ;\n\t    EList<pair<uint32_t,uint32_t> > dummy ;\n\t    dummy.push_back( make_pair( 0, 0 ) ) ;\n\t    rs.init( 0, 0, string( \"unclassified\" ), 0, 0, 0, dummy, true ) ;\n\t    sink.report( 0, &rs ) ;\n    }\n\nprivate:\n    EList<string>                _refnames;\n    EList<HitCount<index_t> >    _hitMap;\n    index_t                      _minHitLen;\n    EList<uint16_t>              _tempTies;\n    bool                         _mate1fw;\n    bool                         _mate2fw;\n    \n    bool                         _tree_traverse;\n    uint8_t                      _classification_rank;\n    set<uint64_t>                _host_taxIDs; // favor these genomes\n    set<uint64_t>                _excluded_taxIDs;\n    \n    // Temporary variables\n    ReadBWTHit<index_t>          _tempHit;\n    EList<pair<uint32_t, uint64_t> > _hitTaxCount;  // pair of count and taxID\n    EList<uint64_t>              _tempPath;\n    \n    void searchForwardAndReverse(\n                                 index_t rdi,\n                                 const Ebwt<index_t>& ebwtFw,\n                                 const Scoring& sc,\n                                 RandomSource& rnd,\n                                 const ReportingParams& rp,\n                      
           const index_t increment)\n    {\n        const Read& rd = *(this->_rds[rdi]);\n\n        bool done[2] = {false, false};\n#ifdef LI_DEBUG\n        size_t cur[2] = {0, 0} ;\n#endif\n        \n        index_t rdlen = rd.length();\n        //const size_t maxDiff = (rdlen / 2 > 2 * _minHitLen) ? rdlen / 2 : (2 * _minHitLen);\n        size_t sum[2] = {0, 0} ;\n        \n        // search for partial hits on the forward and reverse strand\n        while(!done[0] || !done[1]) {\n            for(index_t fwi = 0; fwi < 2; fwi++) {\n                if(done[fwi])\n                    continue;\n                \n                size_t mineFw = 0, mineRc = 0;\n                bool fw = (fwi == 0);\n                ReadBWTHit<index_t>& hit = this->_hits[rdi][fwi];\n                this->partialSearch(\n                                    ebwtFw,\n                                    rd,\n                                    sc,\n                                    fw,\n                                    0,\n                                    mineFw,\n                                    mineRc,\n                                    hit,\n                                    rnd);\n                \n                BWTHit<index_t>& lastHit = hit.getPartialHit(hit.offsetSize() - 1);\n                if(hit.done()) {\n                    done[fwi] = true;\n#ifdef LI_DEBUG\n                    cur[fwi] = rdlen;\n#endif\n                    if(lastHit.len() >= _minHitLen) {\n                        sum[fwi] += lastHit.len();\n                        if(0) //lastHit.len() < 31 && rdlen > 31 && lastHit.size() == 1 )\n                        {\n                            ReadBWTHit<index_t> testHit ;\n                            testHit.init( fw, rdlen ) ;\n                            testHit.setOffset(hit.cur() - 1 - 31 + 1);\n                            this->partialSearch(ebwtFw,\n                                                rd,\n                                           
     sc,\n                                                fw,\n                                                0,\n                                                mineFw,\n                                                mineRc,\n                                                testHit,\n                                                rnd);\n                            index_t tmpLen = testHit.getPartialHit( testHit.offsetSize() - 1 ).len();\n#ifdef LI_DEBUG\n                            cout << \"(adjust: \" << tmpLen << \")\";\n#endif\n                            if(tmpLen >= 31) {\n                                lastHit._len = tmpLen;\n                            }\n                        }\n                    }\n                    \n                    continue;\n                }\n                \n#ifdef LI_DEBUG\n                cur[fwi] = hit.cur();\n                cout << fwi << \":\" << lastHit.len() << \" \" << cur[fwi] << \" \";\n#endif\n                if(lastHit.len() >= _minHitLen)\n                    sum[fwi] += lastHit.len();\n                \n                if(lastHit.len() > increment) {\n                    if(lastHit.len() < _minHitLen) {\n                        // daehwan - for debugging purposes\n#if 1\n                        hit.setOffset(hit.cur() + 1);\n#else\n                        hit.setOffset(hit.cur() - increment);\n#endif\n                    } else {\n                        hit.setOffset(hit.cur() + 1);\n                        if(0) //lastHit.len() < 31 && hit.cur() >= 31 && lastHit.size() == 1 )\n                        {\n                            ReadBWTHit<index_t> testHit;\n                            testHit.init(fw, rdlen);\n                            testHit.setOffset(hit.cur() - 1 - 31); // why not hit.cur() - 1 - 31 + 1? 
because we \"+1\" before the if!\n                            \n                            this->partialSearch(ebwtFw,\n                                                rd,\n                                                sc,\n                                                fw,\n                                                0,\n                                                mineFw,\n                                                mineRc,\n                                                testHit,\n                                                rnd);\n                            index_t tmpLen = testHit.getPartialHit(testHit.offsetSize() - 1 ).len();\n#ifdef LI_DEBUG\n                            cout << \"(adjust: \" << tmpLen << \")\";\n#endif\n                            if(tmpLen >= 31) {\n                                lastHit._len = tmpLen;\n                            }\n                        }\n                    }\n                }\n                if(hit.cur() + _minHitLen >= rdlen) {\n                    hit.done(true);\n                    done[fwi] = true;\n                    continue;\n                }\n\n                if(lastHit.len() <= 3) {\n                    // This happens most likely due to the Ns in the read\n                    --fwi ; // Repeat this strand again.\n                }\n            }\n#ifdef LI_DEBUG\n            cout << endl;\n#endif\n\n            // No early termination\n#if 0\n            if(sum[0] > sum[1] + (rdlen - cur[1] + 1)) {\n                this->_hits[rdi][1].done(true);\n                done[1] = true;\n            } else if(sum[1] > sum[0] + (rdlen - cur[0] + 1)) {\n                this->_hits[rdi][0].done(true);\n                done[0] = true;\n            }\n#endif\n        }\n        \n        // Extend partial hits\n        if(sum[0] >= _minHitLen && sum[1] >= _minHitLen) {\n            ReadBWTHit<index_t>& hits = this->_hits[rdi][0];\n            ReadBWTHit<index_t>& rchits = 
this->_hits[rdi][1];\n            for(size_t i = 0; i < hits.offsetSize(); i++) {\n                BWTHit<index_t>& hit = hits.getPartialHit(i);\n                index_t len = hit.len();\n                //if(len < _minHitLen) continue;\n                index_t l = hit._bwoff;\n                index_t r = hit._bwoff + len;\n                for(size_t j = 0; j < rchits.offsetSize(); j++) {\n                    BWTHit<index_t>& rchit = rchits.getPartialHit(j);\n                    index_t rclen = rchit.len();\n                    if(len < _minHitLen && rclen < _minHitLen) continue;\n                    index_t rc_l = rdlen - rchit._bwoff - rchit._len;\n                    index_t rc_r = rc_l + rclen;\n                    if(r <= rc_l) continue;\n                    if(rc_r <= l) continue;\n                    if(l == rc_l && r == rc_r) continue;\n                    if(l < rc_l && r > rc_r) continue;\n                    if(l > rc_l && r < rc_r) continue;\n                    if(l > rc_l) {\n                        _tempHit.init(true /* fw */, rdlen);\n                        _tempHit.setOffset(rc_l);\n                        size_t mineFw = 0, mineRc = 0;\n                        this->partialSearch(ebwtFw,\n                                            rd,\n                                            sc,\n                                            true, // fw\n                                            0,\n                                            mineFw,\n                                            mineRc,\n                                            _tempHit,\n                                            rnd);\n                        BWTHit<index_t>& tmphit = _tempHit.getPartialHit(0);\n                        if(tmphit.len() == len + l - rc_l) {\n                            hit = tmphit;\n                        }\n                    }\n                    if(r > rc_r) {\n                        _tempHit.init(false /* fw */, rdlen);\n                        
_tempHit.setOffset(rdlen - r);\n                        size_t mineFw = 0, mineRc = 0;\n                        this->partialSearch(ebwtFw,\n                                            rd,\n                                            sc,\n                                            false, // fw\n                                            0,\n                                            mineFw,\n                                            mineRc,\n                                            _tempHit,\n                                            rnd);\n                        BWTHit<index_t>& tmphit = _tempHit.getPartialHit(0);\n                        if(tmphit.len() == rclen + r - rc_r) {\n                            rchit = tmphit;\n                        }\n                    }\n                }\n            }\n            \n            // Remove partial hits that are mapped more than user-specified number\n            for(size_t i = 0; i < hits.offsetSize(); i++) {\n                BWTHit<index_t>& hit = hits.getPartialHit(i);\n                index_t len = hit.len();\n                index_t l = hit._bwoff;\n                index_t r = hit._bwoff + len;\n                for(size_t j = 0; j < rchits.offsetSize(); j++) {\n                    BWTHit<index_t>& rchit = rchits.getPartialHit(j);\n                    index_t rclen = rchit.len();\n                    index_t rc_l = rdlen - rchit._bwoff - rchit._len;\n                    index_t rc_r = rc_l + rclen;\n                    if(rc_l < l) break;\n                    if(len != rclen) continue;\n                    if(l == rc_l &&\n                       r == rc_r &&\n                       hit.size() + rchit.size() > rp.ihits) {\n                        hit.reset();\n                        rchit.reset();\n                        break;\n                    }\n                }\n            }\n        }\n        \n        // Trim partial hits\n        for(int fwi = 0; fwi < 2; fwi++) {\n            
ReadBWTHit<index_t>& hits = this->_hits[rdi][fwi];\n            if(hits.offsetSize() < 2) continue;\n            for(size_t i = 0; i < hits.offsetSize() - 1; i++) {\n                BWTHit<index_t>& hit = hits.getPartialHit(i);\n                for(size_t j = i + 1; j < hits.offsetSize(); j++) {\n                    BWTHit<index_t>& hit2 = hits.getPartialHit(j);\n                    if(hit._bwoff >= hit2._bwoff) {\n                        hit._len = 0;\n                        break;\n                    }\n                    if(hit._bwoff + hit._len <= hit2._bwoff) break;\n                    if(hit._len >= hit2._len) {\n                        index_t hit2_end = hit2._bwoff + hit2._len;\n                        hit2._bwoff = hit._bwoff + hit._len;\n                        hit2._len = hit2_end - hit2._bwoff;\n                    } else {\n                        hit._len = hit2._bwoff - hit._bwoff;\n                    }\n                }\n            }\n        }\n    }\n    \n    pair<int, int> getForwardOrReverseHit(index_t rdi) {\n        index_t avgHitLength[2] = {0, 0};\n        index_t hitSize[2] = {0, 0} ;\n        index_t maxHitLength[2] = {0, 0} ;\n        for(index_t fwi = 0; fwi < 2; fwi++) {\n            ReadBWTHit<index_t>& hit = this->_hits[rdi][fwi];\n            index_t numHits = 0;\n            index_t totalHitLength = 0;\n#ifdef LI_DEBUG\n            cout << fwi << \": \";\n#endif\n            for(size_t i = 0; i < hit.offsetSize(); i++) {\n                index_t len = hit.getPartialHit(i).len();\n#ifdef LI_DEBUG\n                cout << len << \" \";\n#endif\n                \n                if(len < _minHitLen) continue;\n                totalHitLength += (len - 15) * (len - 15);\n                hitSize[fwi] += hit.getPartialHit(i).size();\n                if(len > maxHitLength[fwi])\n                    maxHitLength[fwi] = len;\n                numHits++;\n            }\n#ifdef LI_DEBUG\n            cout << endl;\n#endif\n            
if(numHits > 0) {\n                avgHitLength[fwi] = totalHitLength ; /// numHits;\n            }\n        }\n        \n        // choose read direction with a higher average hit length\n        //cout<<\"strand choosing: \"<<avgHitLength[0]<<\" \"<<avgHitLength[1]<<endl ;\n        index_t fwi;//= (avgHitLength[0] > avgHitLength[1])? 0 : 1;\n        if(avgHitLength[0] != avgHitLength[1])\n            fwi = (avgHitLength[0] > avgHitLength[1]) ? 0 : 1;\n        else if(maxHitLength[0] != maxHitLength[1])\n            fwi = (maxHitLength[0] > maxHitLength[1])? 0 : 1;\n        else\n            return pair<int, int>(0, 2);\n        \n        return pair<int, int>((int)fwi, (int)fwi + 1);\n    }\n    \n    EList<Coord>& getCoords(\n                            ReadBWTHit<index_t>& hit,\n                            size_t hi,\n                            const Ebwt<index_t>& ebwtFw,\n                            const BitPairReference& ref,\n                            RandomSource& rnd,\n                            const index_t maxGenomeHitSize,\n                            WalkMetrics& wlm,\n                            PerReadMetrics& prm,\n                            HIMetrics& him)\n    {\n        BWTHit<index_t>& partialHit = hit.getPartialHit(hi);\n\tassert(!partialHit.hasGenomeCoords());\n        bool straddled = false;\n        this->getGenomeIdx(\n                           ebwtFw,     // FB: Why is it called ...FW here?\n                           ref,\n                           rnd,\n                           partialHit._top,\n                           partialHit._bot,\n                           hit._fw == 0, // FIXME: fwi and hit._fw are defined differently\n                           maxGenomeHitSize - this->_genomeHits.size(),\n                           hit._len - partialHit._bwoff - partialHit._len,\n                           partialHit._len,\n                           partialHit._coords,\n                           wlm,       // why is it called 
wlm here?\n                           prm,\n                           him,\n                           false, // reject straddled\n                           straddled);\n#ifdef FLORIAN_DEBUG\n        std::cerr <<  partialHit.len() << ':';\n#endif\n        // get all coordinates of the hit\n        return partialHit._coords;\n    }\n\n\n    // append a hit to genus map or update entry\n    size_t addHitToHitMap(\n                          const Ebwt<index_t>& ebwt,\n                          EList<HitCount<index_t> >& hitMap,\n                          int rdi,\n                          int fwi,\n                          uint64_t uniqueID,\n                          uint64_t taxID,\n                          size_t hi,\n                          uint32_t partialHitScore,\n                          double weightedHitLen,\n                          bool considerOnlyIfPreviouslyObserved,\n                          size_t offset,\n                          size_t length)\n    {\n\t    size_t idx = 0;\n#ifdef LI_DEBUG\n\t    cout << \"Add \" << taxID << \" \" << partialHitScore << \" \" << weightedHitLen << endl;\n#endif\n\t    const TaxonomyPathTable& pathTable = ebwt.paths();\n\t    pathTable.getPath(taxID, _tempPath);\n\t    uint8_t rank = _classification_rank;\n\t    if(rank > 0) {\n\t\t    for(; rank < _tempPath.size(); rank++) {\n\t\t\t    if(_tempPath[rank] != 0) {\n\t\t\t\t    taxID = _tempPath[rank];\n\t\t\t\t    break;\n\t\t\t    }\n\t\t    }\n\t    }\n\n\t    for(; idx < hitMap.size(); ++idx) {\n\t\t    bool same = false;\n\t\t    if(rank == 0) {\n\t\t\t    same = (uniqueID == hitMap[idx].uniqueID);\n\t\t    } else {\n\t\t\t    same = (taxID == hitMap[idx].taxID);\n\t\t    }\n\t\t    if(same) {\n\t\t\t    if(hitMap[idx].timeStamp != hi) {\n\t\t\t\t    hitMap[idx].count += 1;\n\t\t\t\t    hitMap[idx].scores[rdi][fwi] += partialHitScore;\n\t\t\t\t    hitMap[idx].summedHitLens[rdi][fwi] += weightedHitLen;\n\t\t\t\t    hitMap[idx].timeStamp = 
(uint32_t)hi;\n\t\t\t\t    hitMap[idx].readPositions.push_back(make_pair(offset, length));\n\t\t\t    }\n\t\t\t    break;\n\t\t    }\n\t    }\n\n\t    if(idx >= hitMap.size() && !considerOnlyIfPreviouslyObserved) {\n\t\t    hitMap.expand();\n\t\t    HitCount<index_t>& hitCount = hitMap.back();\n\t\t    hitCount.reset();\n\t\t    hitCount.uniqueID = uniqueID;\n\t\t    hitCount.count = 1;\n\t\t    hitCount.scores[rdi][fwi] = partialHitScore;\n\t\t    hitCount.summedHitLens[rdi][fwi] = weightedHitLen;\n\t\t    hitCount.timeStamp = (uint32_t)hi;\n\t\t    hitCount.readPositions.clear();\n\t\t    hitCount.readPositions.push_back(make_pair(offset, length));\n\t\t    hitCount.path = _tempPath;\n\t\t    hitCount._rank = rank;\n\t\t    hitCount.taxID = taxID;\n\t    }\n\n\t    // If considerOnlyIfPreviouslyObserved is set and no entry matched, idx equals\n\t    // hitMap.size(), i.e. the returned index is out of range\n\t    //assert_lt(genusIdx, genusMap.size());\n\t    return idx;\n    }\n\n\n\n\n    // compare BWTHits: see the rules below (size ascending, length descending, weighted length)\n    //   TODO: move this operator into BWTHits if that is the standard way we would like to sort\n    //   TODO: this ordering does not necessarily give the best results\n    struct compareBWTHits {\n        bool operator()(const BWTHit<index_t>& a, const BWTHit<index_t>& b) const {\n            if(a.len() >= 22 || b.len() >= 22) {\n                if(a.len() >= 22 && b.len() >= 22) {\n                    // sort ascending by size\n                    if (a.size() < b.size()) return true;\n                    if (a.size() > b.size()) return false;\n                }\n                \n                // sort descending by length\n                if (b.len() < a.len()) return true;\n                if (b.len() > a.len()) return false;\n            }\n            \n            // sort by the weighted len\n            if(b.len() * a.size() < a.len() * b.size()) return true;\n            if(b.len() * a.size() > a.len() * b.size()) return 
false;\n            \n            // sort ascending by size\n            if(a.size() < b.size()) return true;\n            if(a.size() > b.size()) return false;\n            \n            // sort descending by length\n            if(b.len() < a.len()) return true;\n            if(b.len() > a.len()) return false;\n            \n            return false;\n        }\n    };\n};\n\n\n#endif /*CLASSIFIER_H_*/\n"
  },
  {
    "path": "diff_sample.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"diff_sample.h\"\n\nstruct sampleEntry clDCs[16];\nbool clDCs_calced = false; /// have clDCs been calculated?\n\n/**\n * Entries 4-57 are transcribed from page 6 of Luk and Wong's paper\n * \"Two New Quorum Based Algorithms for Distributed Mutual Exclusion\",\n * which is also used and cited in the Burkhardt and Karkkainen's\n * papers on difference covers for sorting.  These samples are optimal\n * according to Luk and Wong.\n *\n * All other entries are generated via the exhaustive algorithm in\n * calcExhaustiveDC().\n *\n * The 0 is stored at the end of the sample as an end-of-list marker,\n * but 0 is also an element of each.\n *\n * Note that every difference cover has a 0 and a 1.  Intuitively,\n * any optimal difference cover sample can be oriented (i.e. rotated)\n * such that it includes 0 and 1 as elements.\n *\n * All samples in this list have been verified to be complete covers.\n *\n * A value of 0xffffffff in the first column indicates that there is no\n * sample for that value of v.  
We do not keep samples for values of v\n * less than 3, since they are trivial (and the caller probably didn't\n * mean to ask for it).\n */\nuint32_t dc0to64[65][10] = {\n\t{0xffffffff},                     // 0\n\t{0xffffffff},                     // 1\n\t{0xffffffff},                     // 2\n\t{1, 0},                           // 3\n\t{1, 2, 0},                        // 4\n\t{1, 2, 0},                        // 5\n\t{1, 3, 0},                        // 6\n\t{1, 3, 0},                        // 7\n\t{1, 2, 4, 0},                     // 8\n\t{1, 2, 4, 0},                     // 9\n\t{1, 2, 5, 0},                     // 10\n\t{1, 2, 5, 0},                     // 11\n\t{1, 3, 7, 0},                     // 12\n\t{1, 3, 9, 0},                     // 13\n\t{1, 2, 3, 7, 0},                  // 14\n\t{1, 2, 3, 7, 0},                  // 15\n\t{1, 2, 5, 8, 0},                  // 16\n\t{1, 2, 4, 12, 0},                 // 17\n\t{1, 2, 5, 11, 0},                 // 18\n\t{1, 2, 6, 9, 0},                  // 19\n\t{1, 2, 3, 6, 10, 0},              // 20\n\t{1, 4, 14, 16, 0},                // 21\n\t{1, 2, 3, 7, 11, 0},              // 22\n\t{1, 2, 3, 7, 11, 0},              // 23\n\t{1, 2, 3, 7, 15, 0},              // 24\n\t{1, 2, 3, 8, 12, 0},              // 25\n\t{1, 2, 5, 9, 15, 0},              // 26\n\t{1, 2, 5, 13, 22, 0},             // 27\n\t{1, 4, 15, 20, 22, 0},            // 28\n\t{1, 2, 3, 4, 9, 14, 0},           // 29\n\t{1, 2, 3, 4, 9, 19, 0},           // 30\n\t{1, 3, 8, 12, 18, 0},             // 31\n\t{1, 2, 3, 7, 11, 19, 0},          // 32\n\t{1, 2, 3, 6, 16, 27, 0},          // 33\n\t{1, 2, 3, 7, 12, 20, 0},          // 34\n\t{1, 2, 3, 8, 12, 21, 0},          // 35\n\t{1, 2, 5, 12, 14, 20, 0},         // 36\n\t{1, 2, 4, 10, 15, 22, 0},         // 37\n\t{1, 2, 3, 4, 8, 14, 23, 0},       // 38\n\t{1, 2, 4, 13, 18, 33, 0},         // 39\n\t{1, 2, 3, 4, 9, 14, 24, 0},       // 40\n\t{1, 2, 3, 4, 9, 15, 25, 0},       // 41\n\t{1, 2, 3, 4, 9, 15, 25, 0},   
    // 42\n\t{1, 2, 3, 4, 10, 15, 26, 0},      // 43\n\t{1, 2, 3, 6, 16, 27, 38, 0},      // 44\n\t{1, 2, 3, 5, 12, 18, 26, 0},      // 45\n\t{1, 2, 3, 6, 18, 25, 38, 0},      // 46\n\t{1, 2, 3, 5, 16, 22, 40, 0},      // 47\n\t{1, 2, 5, 9, 20, 26, 36, 0},      // 48\n\t{1, 2, 5, 24, 33, 36, 44, 0},     // 49\n\t{1, 3, 8, 17, 28, 32, 38, 0},     // 50\n\t{1, 2, 5, 11, 18, 30, 38, 0},     // 51\n\t{1, 2, 3, 4, 6, 14, 21, 30, 0},   // 52\n\t{1, 2, 3, 4, 7, 21, 29, 44, 0},   // 53\n\t{1, 2, 3, 4, 9, 15, 21, 31, 0},   // 54\n\t{1, 2, 3, 4, 6, 19, 26, 47, 0},   // 55\n\t{1, 2, 3, 4, 11, 16, 33, 39, 0},  // 56\n\t{1, 3, 13, 32, 36, 43, 52, 0},    // 57\n\n\t// Generated by calcExhaustiveDC()\n\t{1, 2, 3, 7, 21, 33, 37, 50, 0},  // 58\n\t{1, 2, 3, 6, 13, 21, 35, 44, 0},  // 59\n\t{1, 2, 4, 9, 15, 25, 30, 42, 0},  // 60\n\t{1, 2, 3, 7, 15, 25, 36, 45, 0},  // 61\n\t{1, 2, 4, 10, 32, 39, 46, 51, 0}, // 62\n\t{1, 2, 6, 8, 20, 38, 41, 54, 0},  // 63\n\t{1, 2, 5, 14, 16, 34, 42, 59, 0}  // 64\n};\n"
  },
  {
    "path": "diff_sample.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef DIFF_SAMPLE_H_\n#define DIFF_SAMPLE_H_\n\n#include <stdint.h>\n#include <string.h>\n#include \"assert_helpers.h\"\n#include \"multikey_qsort.h\"\n#include \"timer.h\"\n#include \"ds.h\"\n#include \"mem_ids.h\"\n#include \"ls.h\"\n#include \"btypes.h\"\n\nusing namespace std;\n\n#ifndef VMSG_NL\n#define VMSG_NL(...) \\\nif(this->verbose()) { \\\n\tstringstream tmp; \\\n\ttmp << __VA_ARGS__ << endl; \\\n\tthis->verbose(tmp.str()); \\\n}\n#endif\n\n#ifndef VMSG\n#define VMSG(...) 
\\\nif(this->verbose()) { \\\n\tstringstream tmp; \\\n\ttmp << __VA_ARGS__; \\\n\tthis->verbose(tmp.str()); \\\n}\n#endif\n\n/**\n * Routines for calculating, sanity-checking, and dispensing difference\n * cover samples to clients.\n */\n\n/**\n *\n */\nstruct sampleEntry {\n\tuint32_t maxV;\n\tuint32_t numSamples;\n\tuint32_t samples[128];\n};\n\n/// Array of Colbourn and Ling calculated difference covers up to\n/// r = 16 (maxV = 5953)\nextern struct sampleEntry clDCs[16];\nextern bool clDCs_calced; /// have clDCs been calculated?\n\n/**\n * Check that the given difference cover 'ds' actually covers all\n * differences for a periodicity of v.\n */\ntemplate<typename T>\nstatic bool dcRepOk(T v, EList<T>& ds) {\n\t// diffs[] records all the differences observed\n\tAutoArray<bool> covered(v, EBWT_CAT);\n\tfor(T i = 1; i < v; i++) {\n\t\tcovered[i] = false;\n\t}\n\tfor(T di = T(); di < ds.size(); di++) {\n\t\tfor(T dj = di+1; dj < ds.size(); dj++) {\n\t\t\tassert_lt(ds[di], ds[dj]);\n\t\t\tT d1 = (ds[dj] - ds[di]);\n\t\t\tT d2 = (ds[di] + v - ds[dj]);\n\t\t\tassert_lt(d1, v);\n\t\t\tassert_lt(d2, v);\n\t\t\tcovered[d1] = true;\n\t\t\tcovered[d2] = true;\n\t\t}\n\t}\n\tbool ok = true;\n\tfor(T i = 1; i < v; i++) {\n\t\tif(covered[i] == false) {\n\t\t\tok = false;\n\t\t\tbreak;\n\t\t}\n\t}\n\treturn ok;\n}\n\n/**\n * Return true iff each element of ts (with length 'limit') is greater\n * than the last.\n */\ntemplate<typename T>\nstatic bool increasing(T* ts, size_t limit) {\n\tfor(size_t i = 0; i < limit-1; i++) {\n\t\tif(ts[i+1] <= ts[i]) return false;\n\t}\n\treturn true;\n}\n\n/**\n * Return true iff the given difference cover covers difference 'diff'\n * mod 'v'.\n */\ntemplate<typename T>\nstatic inline bool hasDifference(T *ds, T d, T v, T diff) {\n\t// diffs[] records all the differences observed\n\tfor(T di = T(); di < d; di++) {\n\t\tfor(T dj = di+1; dj < d; dj++) {\n\t\t\tassert_lt(ds[di], ds[dj]);\n\t\t\tT d1 = (ds[dj] - ds[di]);\n\t\t\tT d2 = (ds[di] + v 
- ds[dj]);\n\t\t\tassert_lt(d1, v);\n\t\t\tassert_lt(d2, v);\n\t\t\tif(d1 == diff || d2 == diff) return true;\n\t\t}\n\t}\n\treturn false;\n}\n\n/**\n * Exhaustively calculate optimal difference cover samples for v = 4,\n * 8, 16, 32, 64, 128, 256 and store results in p2DCs[]\n */\ntemplate<typename T>\nvoid calcExhaustiveDC(T i, bool verbose = false, bool sanityCheck = false) {\n\tT v = i;\n\tAutoArray<bool> diffs(v, EBWT_CAT);\n\t// v is the target period\n\tT ld = (T)ceil(sqrt(v));\n\t// ud is the upper bound on |D|\n\tT ud = v / 2;\n\t// for all possible |D|s\n\tbool ok = true;\n\tT *ds = NULL;\n\tT d;\n\tfor(d = ld; d <= ud+1; d++) {\n\t\t// for all possible |D| samples\n\t\tAutoArray<T> ds(d, EBWT_CAT);\n\t\tfor(T j = 0; j < d; j++) {\n\t\t\tds[j] = j;\n\t\t}\n\t\tassert(increasing(ds, d));\n\t\twhile(true) {\n\t\t\t// reset diffs[]\n\t\t\tfor(T t = 1; t < v; t++) {\n\t\t\t\tdiffs[t] = false;\n\t\t\t}\n\t\t\tT diffCnt = 0;\n\t\t\t// diffs[] records all the differences observed\n\t\t\tfor(T di = 0; di < d; di++) {\n\t\t\t\tfor(T dj = di+1; dj < d; dj++) {\n\t\t\t\t\tassert_lt(ds[di], ds[dj]);\n\t\t\t\t\tT d1 = (ds[dj] - ds[di]);\n\t\t\t\t\tT d2 = (ds[di] + v - ds[dj]);\n\t\t\t\t\tassert_lt(d1, v);\n\t\t\t\t\tassert_lt(d2, v);\n\t\t\t\t\tassert_gt(d1, 0);\n\t\t\t\t\tassert_gt(d2, 0);\n\t\t\t\t\tif(!diffs[d1]) \n\t\t\t\t\t\tdiffCnt++; \n\t\t\t\t\tdiffs[d1] = true;\n\t\t\t\t\tif(!diffs[d2]) \n\t\t\t\t\t\tdiffCnt++; \n\t\t\t\t\tdiffs[d2] = true;\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Do we observe all possible differences (except 0)\n\t\t\tok = diffCnt == v-1;\n\t\t\tif(ok) {\n\t\t\t\t// Yes, all differences are covered\n\t\t\t\tbreak;\n\t\t\t} else {\n\t\t\t\t// Advance ds\n\t\t\t\t// (Following is commented out because it turns out\n\t\t\t\t// it's slow)\n\t\t\t\t// Find a missing difference\n\t\t\t\t//uint32_t missing = 0xffffffff;\n\t\t\t\t//for(uint32_t t = 1; t < v; t++) {\n\t\t\t\t//\tif(diffs[t] == false) {\n\t\t\t\t//\t\tmissing = 
diffs[t];\n\t\t\t\t//\t\tbreak;\n\t\t\t\t//\t}\n\t\t\t\t//}\n\t\t\t\t//assert_neq(missing, 0xffffffff);\n\t\t\t\tassert(increasing(ds, d));\n\t\t\t\tbool advanced = false;\n\t\t\t\tbool keepGoing = false;\n\t\t\t\tdo {\n\t\t\t\t\tkeepGoing = false;\n\t\t\t\t\tfor(T bd = d-1; bd > 1; bd--) {\n\t\t\t\t\t\tT dif = (d-1)-bd;\n\t\t\t\t\t\tif(ds[bd] < v-1-dif) {\n\t\t\t\t\t\t\tds[bd]++;\n\t\t\t\t\t\t\tassert_neq(0, ds[bd]);\n\t\t\t\t\t\t\t// Reset subsequent ones\n\t\t\t\t\t\t\tfor(T bdi = bd+1; bdi < d; bdi++) {\n\t\t\t\t\t\t\t\tassert_eq(0, ds[bdi]);\n\t\t\t\t\t\t\t\tds[bdi] = ds[bdi-1]+1;\n\t\t\t\t\t\t\t\tassert_gt(ds[bdi], ds[bdi-1]);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tassert(increasing(ds, d));\n\t\t\t\t\t\t\t// (Following is commented out because\n\t\t\t\t\t\t\t// it turns out it's slow)\n\t\t\t\t\t\t\t// See if the new DC has the missing value\n\t\t\t\t\t\t\t//if(!hasDifference(ds, d, v, missing)) {\n\t\t\t\t\t\t\t//\tkeepGoing = true;\n\t\t\t\t\t\t\t//\tbreak;\n\t\t\t\t\t\t\t//}\n\t\t\t\t\t\t\tadvanced = true;\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tds[bd] = 0;\n\t\t\t\t\t\t\t// keep going\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} while(keepGoing);\n\t\t\t\t// No solution for this |D|\n\t\t\t\tif(!advanced) break;\n\t\t\t\tassert(increasing(ds, d));\n\t\t\t}\n\t\t} // next sample assignment\n\t\tif(ok) {\n\t\t\tbreak;\n\t\t}\n\t} // next |D|\n\tassert(ok);\n\tcout << \"Did exhaustive v=\" << v << \" |D|=\" << d << endl;\n\tcout << \"  \";\n\tfor(T i = 0; i < d; i++) {\n\t\tcout << ds[i];\n\t\tif(i < d-1) cout << \",\";\n\t}\n\tcout << endl;\n}\n\n/**\n * Routine for calculating the 16 entries of clDCs (r = 0 through 15)\n * using the technique of Colbourn and Ling.\n *\n * See http://citeseer.ist.psu.edu/211575.html\n */\ntemplate <typename T>\nvoid calcColbournAndLingDCs(bool verbose = false, bool sanityCheck = false) {\n\tfor(T r = 0; r < 16; r++) {\n\t\tT maxv = 24*r*r + 36*r + 13; // Corollary 2.3\n\t\tT numsamp = 6*r + 4;\n\t\tclDCs[r].maxV = 
maxv;\n\t\tclDCs[r].numSamples = numsamp;\n\t\tmemset(clDCs[r].samples, 0, 4 * 128);\n\t\tT i;\n\t\t// clDCs[r].samples[0] = 0;\n\t\t// Fill in the 1^r part of the B series\n\t\tfor(i = 1; i < r+1; i++) {\n\t\t\tclDCs[r].samples[i] = clDCs[r].samples[i-1] + 1;\n\t\t}\n\t\t// Fill in the (r + 1)^1 part\n\t\tclDCs[r].samples[r+1] = clDCs[r].samples[r] + r + 1;\n\t\t// Fill in the (2r + 1)^r part\n\t\tfor(i = r+2; i < r+2+r; i++) {\n\t\t\tclDCs[r].samples[i] = clDCs[r].samples[i-1] + 2*r + 1;\n\t\t}\n\t\t// Fill in the (4r + 3)^(2r + 1) part\n\t\tfor(i = r+2+r; i < r+2+r+2*r+1; i++) {\n\t\t\tclDCs[r].samples[i] = clDCs[r].samples[i-1] + 4*r + 3;\n\t\t}\n\t\t// Fill in the (2r + 2)^(r + 1) part\n\t\tfor(i = r+2+r+2*r+1; i < r+2+r+2*r+1+r+1; i++) {\n\t\t\tclDCs[r].samples[i] = clDCs[r].samples[i-1] + 2*r + 2;\n\t\t}\n\t\t// Fill in the last 1^r part\n\t\tfor(i = r+2+r+2*r+1+r+1; i < r+2+r+2*r+1+r+1+r; i++) {\n\t\t\tclDCs[r].samples[i] = clDCs[r].samples[i-1] + 1;\n\t\t}\n\t\tassert_eq(i, numsamp);\n\t\tassert_lt(i, 128);\n\t\tif(sanityCheck) {\n\t\t\t// diffs[] records all the differences observed\n\t\t\tAutoArray<bool> diffs(maxv, EBWT_CAT);\n\t\t\tfor(T i = 0; i < numsamp; i++) {\n\t\t\t\tfor(T j = i+1; j < numsamp; j++) {\n\t\t\t\t\tT d1 = (clDCs[r].samples[j] - clDCs[r].samples[i]);\n\t\t\t\t\tT d2 = (clDCs[r].samples[i] + maxv - clDCs[r].samples[j]);\n\t\t\t\t\tassert_lt(d1, maxv);\n\t\t\t\t\tassert_lt(d2, maxv);\n\t\t\t\t\tdiffs[d1] = true;\n\t\t\t\t\tdiffs[d2] = true;\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Should have observed all possible differences (except 0)\n\t\t\tfor(T i = 1; i < maxv; i++) {\n\t\t\t\tif(diffs[i] == false) cout << r << \", \" << i << endl;\n\t\t\t\tassert(diffs[i] == true);\n\t\t\t}\n\t\t}\n\t}\n\tclDCs_calced = true;\n}\n\n/**\n * A precalculated list of difference covers.\n */\nextern uint32_t dc0to64[65][10];\n\n/**\n * Get a difference cover for the requested periodicity v.\n */\ntemplate <typename T>\nstatic EList<T> getDiffCover(\n\tT 
v,\n\tbool verbose = false,\n\tbool sanityCheck = false)\n{\n\tassert_gt(v, 2);\n\tEList<T> ret;\n\tret.clear();\n\t// Can we look it up in our hardcoded array?\n\tif(v <= 64 && dc0to64[v][0] == 0xffffffff) {\n\t\tif(verbose) cout << \"v in hardcoded area, but hardcoded entry was all-fs\" << endl;\n\t\treturn ret;\n\t} else if(v <= 64) {\n\t\tret.push_back(0);\n\t\tfor(size_t i = 0; i < 10; i++) {\n\t\t\tif(dc0to64[v][i] == 0) break;\n\t\t\tret.push_back(dc0to64[v][i]);\n\t\t}\n\t\tif(sanityCheck) assert(dcRepOk(v, ret));\n\t\treturn ret;\n\t}\n\n\t// Can we look it up in our calcColbournAndLingDCs array?\n\tif(!clDCs_calced) {\n\t\tcalcColbournAndLingDCs<uint32_t>(verbose, sanityCheck);\n\t\tassert(clDCs_calced);\n\t}\n\tfor(size_t i = 0; i < 16; i++) {\n\t\tif(v <= clDCs[i].maxV) {\n\t\t\tfor(size_t j = 0; j < clDCs[i].numSamples; j++) {\n\t\t\t\tT s = clDCs[i].samples[j];\n\t\t\t\tif(s >= v) {\n\t\t\t\t\ts %= v;\n\t\t\t\t\tfor(size_t k = 0; k < ret.size(); k++) {\n\t\t\t\t\t\tif(s == ret[k]) break;\n\t\t\t\t\t\tif(s < ret[k]) {\n\t\t\t\t\t\t\tret.insert(s, k);\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\tret.push_back(s % v);\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(sanityCheck) assert(dcRepOk(v, ret));\n\t\t\treturn ret;\n\t\t}\n\t}\n\tcerr << \"Error: Could not find a difference cover sample for v=\" << v << endl;\n\tthrow 1;\n}\n\n/**\n * Calculate and return a delta map based on the given difference cover\n * and periodicity v.\n */\ntemplate <typename T>\nstatic EList<T> getDeltaMap(T v, const EList<T>& dc) {\n\t// Declare anchor-map-related items\n\tEList<T> amap;\n\tsize_t amapEnts = 1;\n\tamap.resizeExact((size_t)v);\n\tamap.fill(0xffffffff);\n\tamap[0] = 0;\n\t// Print out difference cover (and optionally calculate\n\t// anchor map)\n\tfor(size_t i = 0; i < dc.size(); i++) {\n\t\tfor(size_t j = i+1; j < dc.size(); j++) {\n\t\t\tassert_gt(dc[j], dc[i]);\n\t\t\tT diffLeft  = dc[j] - dc[i];\n\t\t\tT diffRight = dc[i] + v - 
dc[j];\n\t\t\tassert_lt(diffLeft, v);\n\t\t\tassert_lt(diffRight, v);\n\t\t\tif(amap[diffLeft] == 0xffffffff) {\n\t\t\t\tamap[diffLeft] = dc[i];\n\t\t\t\tamapEnts++;\n\t\t\t}\n\t\t\tif(amap[diffRight] == 0xffffffff) {\n\t\t\t\tamap[diffRight] = dc[j];\n\t\t\t\tamapEnts++;\n\t\t\t}\n\t\t}\n\t}\n\treturn amap;\n}\n\n/**\n * Return population count (count of all bits set to 1) of i.\n */\ntemplate<typename T>\nstatic unsigned int popCount(T i) {\n\tunsigned int cnt = 0;\n\tfor(size_t j = 0; j < sizeof(T)*8; j++) {\n\t\tif(i & 1) cnt++;\n\t\ti >>= 1;\n\t}\n\treturn cnt;\n}\n\n/**\n * Calculate log-base-2 of i\n */\ntemplate<typename T>\nstatic unsigned int myLog2(T i) {\n\tassert_eq(1, popCount(i)); // must be power of 2\n\tfor(size_t j = 0; j < sizeof(T)*8; j++) {\n\t\tif(i & 1) return (int)j;\n\t\ti >>= 1;\n\t}\n\tassert(false);\n\treturn 0xffffffff;\n}\n\n/**\n *\n */\ntemplate<typename TStr>\nclass DifferenceCoverSample {\npublic:\n\n\tDifferenceCoverSample(const TStr& __text,\n\t                      uint32_t __v,\n\t                      bool __verbose = false,\n\t                      bool __sanity = false,\n\t                      ostream& __logger = cout) :\n\t\t_text(__text),\n\t\t_v(__v),\n\t\t_verbose(__verbose),\n\t\t_sanity(__sanity),\n\t\t_ds(getDiffCover(_v, _verbose, _sanity)),\n\t\t_dmap(getDeltaMap(_v, _ds)),\n\t\t_d((uint32_t)_ds.size()),\n\t\t_doffs(),\n\t\t_isaPrime(),\n\t\t_dInv(),\n\t\t_log2v(myLog2(_v)),\n\t\t_vmask(OFF_MASK << _log2v),\n\t\t_logger(__logger)\n\t{\n\t\tassert_gt(_d, 0);\n\t\tassert_eq(1, popCount(_v)); // must be power of 2\n\t\t// Build map from d's to idx's\n\t\t_dInv.resizeExact((size_t)v());\n\t\t_dInv.fill(0xffffffff);\n\t\tuint32_t lim = (uint32_t)_ds.size();\n\t\tfor(uint32_t i = 0; i < lim; i++) {\n\t\t\t_dInv[_ds[i]] = i;\n\t\t}\n\t}\n\t\n\t/**\n\t * Allocate an amount of memory that simulates the peak memory\n\t * usage of the DifferenceCoverSample with the given text and v.\n\t * Throws bad_alloc if it's not going to 
fit in memory.  Returns\n\t * the approximate number of bytes the Cover takes at all times.\n\t */\n\tstatic size_t simulateAllocs(const TStr& text, uint32_t v) {\n\t\tEList<uint32_t> ds(getDiffCover(v, false /*verbose*/, false /*sanity*/));\n\t\tsize_t len = text.length();\n\t\tsize_t sPrimeSz = (len / v) * ds.size();\n\t\t// sPrime, sPrimeOrder, _isaPrime all exist in memory at\n\t\t// once and that's the peak\n\t\tAutoArray<TIndexOffU> aa(sPrimeSz * 3 + (1024 * 1024 /*out of caution*/), EBWT_CAT);\n\t\treturn sPrimeSz * 4; // sPrime array\n\t}\n\n\tuint32_t v() const                   { return _v; }\n\tuint32_t log2v() const               { return _log2v; }\n\tuint32_t vmask() const               { return _vmask; }\n\tuint32_t modv(TIndexOffU i) const    { return (uint32_t)(i & ~_vmask); }\n\tTIndexOffU divv(TIndexOffU i) const  { return i >> _log2v; }\n\tuint32_t d() const                   { return _d; }\n\tbool verbose() const                 { return _verbose; }\n\tbool sanityCheck() const             { return _sanity; }\n\tconst TStr& text() const             { return _text; }\n\tconst EList<uint32_t>& ds() const    { return _ds; }\n\tconst EList<uint32_t>& dmap() const  { return _dmap; }\n\tostream& log() const                 { return _logger; }\n\n\tvoid     build(int nthreads);\n\tuint32_t tieBreakOff(TIndexOffU i, TIndexOffU j) const;\n\tint64_t  breakTie(TIndexOffU i, TIndexOffU j) const;\n\tbool     isCovered(TIndexOffU i) const;\n\tTIndexOffU rank(TIndexOffU i) const;\n\n\t/**\n\t * Print out the suffix array such that every sample offset has its\n\t * rank filled in and every non-sample offset is shown as '-'.\n\t */\n\tvoid print(ostream& out) {\n\t\tfor(size_t i = 0; i < _text.length(); i++) {\n\t\t\tif(isCovered(i)) {\n\t\t\t\tout << rank(i);\n\t\t\t} else {\n\t\t\t\tout << \"-\";\n\t\t\t}\n\t\t\tif(i < _text.length()-1) {\n\t\t\t\tout << \",\";\n\t\t\t}\n\t\t}\n\t\tout << endl;\n\t}\n\nprivate:\n\n\tvoid doBuiltSanityCheck() const;\n\tvoid 
buildSPrime(EList<TIndexOffU>& sPrime, size_t padding);\n\n\tbool built() const {\n\t\treturn _isaPrime.size() > 0;\n\t}\n\n\tvoid verbose(const string& s) const {\n\t\tif(this->verbose()) {\n\t\t\tthis->log() << s.c_str();\n\t\t\tthis->log().flush();\n\t\t}\n\t}\n\n\tconst TStr&      _text;     // text to sample\n\tuint32_t         _v;        // periodicity of sample\n\tbool             _verbose;  //\n\tbool             _sanity;   //\n\tEList<uint32_t>  _ds;       // samples: idx -> d\n\tEList<uint32_t>  _dmap;     // delta map\n\tuint32_t         _d;        // |D| - size of sample\n\tEList<TIndexOffU>  _doffs;    // offsets into sPrime/isaPrime for each d idx\n\tEList<TIndexOffU>  _isaPrime; // ISA' array\n\tEList<uint32_t>  _dInv;     // Map from d -> idx\n\tuint32_t         _log2v;\n\tTIndexOffU         _vmask;\n\tostream&         _logger;\n};\n\n/**\n * Sanity-check the difference cover by first inverting _isaPrime then\n * checking that each successive suffix really is less than the next.\n */\ntemplate <typename TStr>\nvoid DifferenceCoverSample<TStr>::doBuiltSanityCheck() const {\n\tuint32_t v = this->v();\n\tassert(built());\n\tVMSG_NL(\"  Doing sanity check\");\n\tTIndexOffU added = 0;\n\tEList<TIndexOffU> sorted;\n\tsorted.resizeExact(_isaPrime.size());\n\tsorted.fill(OFF_MASK);\n\tfor(size_t di = 0; di < this->d(); di++) {\n\t\tuint32_t d = _ds[di];\n\t\tsize_t i = 0;\n\t\tfor(size_t doi = _doffs[di]; doi < _doffs[di+1]; doi++, i++) {\n\t\t\tassert_eq(OFF_MASK, sorted[_isaPrime[doi]]);\n\t\t\t// Maps the offset of the suffix to its rank\n\t\t\tsorted[_isaPrime[doi]] = (TIndexOffU)(v*i + d);\n\t\t\tadded++;\n\t\t}\n\t}\n\tassert_eq(added, _isaPrime.size());\n#ifndef NDEBUG\n\tfor(size_t i = 0; i < sorted.size()-1; i++) {\n\t\tassert(sstr_suf_lt(this->text(), sorted[i], this->text(), sorted[i+1], false));\n\t}\n#endif\n}\n\n/**\n * Build the s' array by sampling suffixes (suffix offsets, actually)\n * from t according to the difference-cover sample and 
pack them into\n * an array of machine words in the order dictated by the \"mu\" mapping\n * described in Burkhardt.\n *\n * Also builds _doffs map.\n */\ntemplate <typename TStr>\nvoid DifferenceCoverSample<TStr>::buildSPrime(\n\tEList<TIndexOffU>& sPrime,\n\tsize_t padding)\n{\n\tconst TStr& t = this->text();\n\tconst EList<uint32_t>& ds = this->ds();\n\tTIndexOffU tlen = (TIndexOffU)t.length();\n\tuint32_t v = this->v();\n\tuint32_t d = this->d();\n\tassert_gt(v, 2);\n\tassert_lt(d, v);\n\t// Record where each d section should begin in sPrime\n\tTIndexOffU tlenDivV = this->divv(tlen);\n\tuint32_t tlenModV = this->modv(tlen);\n\tTIndexOffU sPrimeSz = 0;\n\tassert(_doffs.empty());\n\t_doffs.resizeExact((size_t)d+1);\n\tfor(uint32_t di = 0; di < d; di++) {\n\t\t// mu mapping\n\t\tTIndexOffU sz = tlenDivV + ((ds[di] <= tlenModV) ? 1 : 0);\n\t\tassert_geq(sz, 0);\n\t\t_doffs[di] = sPrimeSz;\n\t\tsPrimeSz += sz;\n\t}\n\t_doffs[d] = sPrimeSz;\n#ifndef NDEBUG\n\tif(tlenDivV > 0) {\n\t\tfor(size_t i = 0; i < d; i++) {\n\t\t\tassert_gt(_doffs[i+1], _doffs[i]);\n\t\t\tTIndexOffU diff = _doffs[i+1] - _doffs[i];\n\t\t\tassert(diff == tlenDivV || diff == tlenDivV+1);\n\t\t}\n\t}\n#endif\n\tassert_eq(_doffs.size(), d+1);\n\t// Size sPrime appropriately\n\tsPrime.resizeExact((size_t)sPrimeSz + padding);\n\tsPrime.fill(OFF_MASK);\n\t// Slot suffixes from text into sPrime according to the mu\n\t// mapping; where the mapping would leave a blank, insert a 0\n\tTIndexOffU added = 0;\n\tTIndexOffU i = 0;\n\tfor(uint64_t ti = 0; ti <= tlen; ti += v) {\n\t\tfor(uint32_t di = 0; di < d; di++) {\n\t\t\tTIndexOffU tti = ti + ds[di];\n\t\t\tif(tti > tlen) break;\n\t\t\tTIndexOffU spi = _doffs[di] + i;\n\t\t\tassert_lt(spi, _doffs[di+1]);\n\t\t\tassert_leq(tti, tlen);\n\t\t\tassert_lt(spi, sPrimeSz);\n\t\t\tassert_eq(OFF_MASK, sPrime[spi]);\n\t\t\tsPrime[spi] = tti; added++;\n\t\t}\n\t\ti++;\n\t}\n\tassert_eq(added, sPrimeSz);\n}\n\n/**\n * Return true iff suffixes with offsets suf1 and 
suf2 out of host\n * string 'host' are identical up to depth 'v'.\n */\ntemplate <typename TStr>\nstatic inline bool suffixSameUpTo(\n\tconst TStr& host,\n\tTIndexOffU suf1,\n\tTIndexOffU suf2,\n\tTIndexOffU v)\n{\n\tfor(TIndexOffU i = 0; i < v; i++) {\n\t\tbool endSuf1 = suf1+i >= host.length();\n\t\tbool endSuf2 = suf2+i >= host.length();\n\t\tif((endSuf1 && !endSuf2) || (!endSuf1 && endSuf2)) return false;\n\t\tif(endSuf1 && endSuf2) return true;\n\t\tif(host[suf1+i] != host[suf2+i]) return false;\n\t}\n\treturn true;\n}\n\ntemplate<typename TStr>\nstruct VSortingParam {\n    DifferenceCoverSample<TStr>* dcs;\n    TIndexOffU*                  sPrimeArr;\n    size_t                       sPrimeSz;\n    TIndexOffU*                  sPrimeOrderArr;\n    size_t                       depth;\n    const EList<size_t>*         boundaries;\n    size_t*                      cur;\n    MUTEX_T*                     mutex;\n};\n\ntemplate<typename TStr>\nstatic void VSorting_worker(void *vp)\n{\n    VSortingParam<TStr>* param = (VSortingParam<TStr>*)vp;\n    DifferenceCoverSample<TStr>* dcs = param->dcs;\n    const TStr& host = dcs->text();\n    const size_t hlen = host.length();\n    uint32_t v = dcs->v();\n    while(true) {\n        size_t cur = 0;\n        {\n            ThreadSafe ts(param->mutex, true);\n            cur = *(param->cur);\n            (*param->cur)++;\n        }\n        if(cur >= param->boundaries->size()) return;\n        size_t begin = (cur == 0 ? 
0 : (*param->boundaries)[cur-1]);\n        size_t end = (*param->boundaries)[cur];\n        assert_leq(begin, end);\n        if(end - begin <= 1) continue;\n        mkeyQSortSuf2(\n                      host,\n                      hlen,\n                      param->sPrimeArr,\n                      param->sPrimeSz,\n                      param->sPrimeOrderArr,\n                      4,\n                      begin,\n                      end,\n                      param->depth,\n                      v);\n    }\n}\n\n/**\n * Calculates a ranking of all suffixes in the sample and stores them,\n * packed according to the mu mapping, in _isaPrime.\n */\ntemplate <typename TStr>\nvoid DifferenceCoverSample<TStr>::build(int nthreads) {\n\t// Local names for relevant types\n\tVMSG_NL(\"Building DifferenceCoverSample\");\n\t// Local names for relevant data\n\tconst TStr& t = this->text();\n\tuint32_t v = this->v();\n\tassert_gt(v, 2);\n\t// Build s'\n\tEList<TIndexOffU> sPrime;\n\t// Need to allocate 2 extra elements at the end of the sPrime and _isaPrime\n\t// arrays.  
One element that's less than all others, and another that acts\n\t// as needed padding for the Larsson-Sadakane sorting code.\n\tsize_t padding = 1;\n\tVMSG_NL(\"  Building sPrime\");\n\tbuildSPrime(sPrime, padding);\n\tsize_t sPrimeSz = sPrime.size() - padding;\n\tassert_gt(sPrime.size(), padding);\n\tassert_leq(sPrime.size(), t.length() + padding + 1);\n\tTIndexOffU nextRank = 0;\n\t{\n\t\tVMSG_NL(\"  Building sPrimeOrder\");\n\t\tEList<TIndexOffU> sPrimeOrder;\n\t\tsPrimeOrder.resizeExact(sPrimeSz);\n\t\tfor(TIndexOffU i = 0; i < sPrimeSz; i++) {\n\t\t\tsPrimeOrder[i] = i;\n\t\t}\n\t\t// sPrime now holds suffix-offsets for DC samples.\n\t\t{\n\t\t\tTimer timer(cout, \"  V-Sorting samples time: \", this->verbose());\n\t\t\tVMSG_NL(\"  V-Sorting samples\");\n\t\t\t// Extract backing-store array from sPrime and sPrimeOrder;\n\t\t\t// the mkeyQSortSuf2 routine works on the array for maximum\n\t\t\t// efficiency\n\t\t\tTIndexOffU *sPrimeArr = (TIndexOffU*)sPrime.ptr();\n\t\t\tassert_eq(sPrimeArr[0], sPrime[0]);\n\t\t\tassert_eq(sPrimeArr[sPrimeSz-1], sPrime[sPrimeSz-1]);\n\t\t\tTIndexOffU *sPrimeOrderArr = (TIndexOffU*)sPrimeOrder.ptr();\n\t\t\tassert_eq(sPrimeOrderArr[0], sPrimeOrder[0]);\n\t\t\tassert_eq(sPrimeOrderArr[sPrimeSz-1], sPrimeOrder[sPrimeSz-1]);\n            // Sort sample suffixes up to the vth character using a\n\t\t\t// multikey quicksort.  Sort time is proportional to the\n\t\t\t// number of samples times v.  It isn't quadratic.\n\t\t\t// sPrimeOrder is passed in as a swapping partner for\n\t\t\t// sPrimeArr, i.e., every time the multikey qsort swaps\n\t\t\t// elements in sPrime, it swaps the same elements in\n\t\t\t// sPrimeOrder too.  
This allows us to easily reconstruct\n\t\t\t// what the sort did.\n            if(nthreads == 1) {\n                mkeyQSortSuf2(t, sPrimeArr, sPrimeSz, sPrimeOrderArr, 4,\n                              this->verbose(), this->sanityCheck(), v);\n            } else {\n                int query_depth = 0;\n                int tmp_nthreads = nthreads;\n                while(tmp_nthreads > 0) {\n                    query_depth++;\n                    tmp_nthreads >>= 1;\n                }\n                EList<size_t> boundaries; // bucket boundaries for parallelization\n                TIndexOffU *sOrig = NULL;\n                if(this->sanityCheck()) {\n                    sOrig = new TIndexOffU[sPrimeSz];\n                    memcpy(sOrig, sPrimeArr, OFF_SIZE * sPrimeSz);\n                }\n                mkeyQSortSuf2(t, sPrimeArr, sPrimeSz, sPrimeOrderArr, 4,\n                              this->verbose(), false, query_depth, &boundaries);\n                if(boundaries.size() > 0) {\n                    AutoArray<tthread::thread*> threads(nthreads);\n                    EList<VSortingParam<TStr> > tparams;\n                    size_t cur = 0;\n                    MUTEX_T mutex;\n                    for(int tid = 0; tid < nthreads; tid++) {\n                        // Calculate bucket sizes by doing a binary search for each\n                        // suffix and noting where it lands\n                        tparams.expand();\n                        tparams.back().dcs = this;\n                        tparams.back().sPrimeArr = sPrimeArr;\n                        tparams.back().sPrimeSz = sPrimeSz;\n                        tparams.back().sPrimeOrderArr = sPrimeOrderArr;\n                        tparams.back().depth = query_depth;\n                        tparams.back().boundaries = &boundaries;\n                        tparams.back().cur = &cur;\n                        tparams.back().mutex = &mutex;\n                        threads[tid] = new 
tthread::thread(VSorting_worker<TStr>, (void*)&tparams.back());\n                    }\n                    for (int tid = 0; tid < nthreads; tid++) {\n                        threads[tid]->join();\n                    }\n                }\n                if(this->sanityCheck()) {\n                    sanityCheckOrderedSufs(t, t.length(), sPrimeArr, sPrimeSz, v);\n                    for(size_t i = 0; i < sPrimeSz; i++) {\n                        assert_eq(sPrimeArr[i], sOrig[sPrimeOrderArr[i]]);\n                    }\n                    delete[] sOrig;\n                }\n            }\n\t\t\t// Make sure sPrime and sPrimeOrder are consistent with\n\t\t\t// their respective backing-store arrays\n\t\t\tassert_eq(sPrimeArr[0], sPrime[0]);\n\t\t\tassert_eq(sPrimeArr[sPrimeSz-1], sPrime[sPrimeSz-1]);\n\t\t\tassert_eq(sPrimeOrderArr[0], sPrimeOrder[0]);\n\t\t\tassert_eq(sPrimeOrderArr[sPrimeSz-1], sPrimeOrder[sPrimeSz-1]);\n\t\t}\n\t\t// Now assign the ranking implied by the sorted sPrime/sPrimeOrder\n\t\t// arrays back into sPrime.\n\t\tVMSG_NL(\"  Allocating rank array\");\n\t\t_isaPrime.resizeExact(sPrime.size());\n\t\tASSERT_ONLY(_isaPrime.fill(OFF_MASK));\n\t\tassert_gt(_isaPrime.size(), 0);\n\t\t{\n\t\t\tTimer timer(cout, \"  Ranking v-sort output time: \", this->verbose());\n\t\t\tVMSG_NL(\"  Ranking v-sort output\");\n\t\t\tfor(size_t i = 0; i < sPrimeSz-1; i++) {\n\t\t\t\t// Place the appropriate ranking\n\t\t\t\t_isaPrime[sPrimeOrder[i]] = nextRank;\n\t\t\t\t// If sPrime[i] and sPrime[i+1] are identical up to v, then we\n\t\t\t\t// should give the next suffix the same rank\n\t\t\t\tif(!suffixSameUpTo(t, sPrime[i], sPrime[i+1], v)) nextRank++;\n\t\t\t}\n\t\t\t_isaPrime[sPrimeOrder[sPrimeSz-1]] = nextRank; // finish off\n#ifndef NDEBUG\n\t\t\tfor(size_t i = 0; i < sPrimeSz; i++) {\n\t\t\t\tassert_neq(OFF_MASK, _isaPrime[i]);\n\t\t\t\tassert_lt(_isaPrime[i], sPrimeSz);\n\t\t\t}\n#endif\n\t\t}\n\t\t// sPrimeOrder is destroyed\n\t\t// All the information we 
need is now in _isaPrime\n\t}\n\t_isaPrime[_isaPrime.size()-1] = (TIndexOffU)sPrimeSz;\n\tsPrime[sPrime.size()-1] = (TIndexOffU)sPrimeSz;\n\t// _isaPrime[_isaPrime.size()-1] and sPrime[sPrime.size()-1] are just\n\t// spacer for the Larsson-Sadakane routine to use\n\t{\n\t\tTimer timer(cout, \"  Invoking Larsson-Sadakane on ranks time: \", this->verbose());\n\t\tVMSG_NL(\"  Invoking Larsson-Sadakane on ranks\");\n\t\tif(sPrime.size() >= LS_SIZE) {\n\t\t\tcerr << \"Error; sPrime array has so many elements that it can't be converted to a signed array without overflow.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tLarssonSadakane<TIndexOff> ls;\n\t\tls.suffixsort(\n\t\t\t(TIndexOff*)_isaPrime.ptr(),\n\t\t\t(TIndexOff*)sPrime.ptr(),\n\t\t\t(TIndexOff)sPrimeSz,\n\t\t\t(TIndexOff)sPrime.size(),\n\t\t\t0);\n\t}\n\t// chop off final character of _isaPrime\n\t_isaPrime.resizeExact(sPrimeSz);\n\tfor(size_t i = 0; i < _isaPrime.size(); i++) {\n\t\t_isaPrime[i]--;\n\t}\n#ifndef NDEBUG\n\tfor(size_t i = 0; i < sPrimeSz-1; i++) {\n\t\tassert_lt(_isaPrime[i], sPrimeSz);\n\t\tassert(i == 0 || _isaPrime[i] != _isaPrime[i-1]);\n\t}\n#endif\n\tVMSG_NL(\"  Sanity-checking and returning\");\n\tif(this->sanityCheck()) doBuiltSanityCheck();\n}\n\n/**\n * Return true iff index i within the text is covered by the difference\n * cover sample.  
Allow i to be off the end of the text; simplifies\n * logic elsewhere.\n */\ntemplate <typename TStr>\nbool DifferenceCoverSample<TStr>::isCovered(TIndexOffU i) const {\n\tassert(built());\n\tuint32_t modi = this->modv(i);\n\tassert_lt(modi, _dInv.size());\n\treturn _dInv[modi] != 0xffffffff;\n}\n\n/**\n * Given a text offset that's covered, return its lexicographical rank\n * among the sample suffixes.\n */\ntemplate <typename TStr>\nTIndexOffU DifferenceCoverSample<TStr>::rank(TIndexOffU i) const {\n\tassert(built());\n\tassert_lt(i, this->text().length());\n\tuint32_t imodv = this->modv(i);\n\tassert_neq(0xffffffff, _dInv[imodv]); // must be in the sample\n\tTIndexOffU ioff = this->divv(i);\n\tassert_lt(ioff, _doffs[_dInv[imodv]+1] - _doffs[_dInv[imodv]]);\n\tTIndexOffU isaIIdx = _doffs[_dInv[imodv]] + ioff;\n\tassert_lt(isaIIdx, _isaPrime.size());\n\tTIndexOffU isaPrimeI = _isaPrime[isaIIdx];\n\tassert_leq(isaPrimeI, _isaPrime.size());\n\treturn isaPrimeI;\n}\n\n/**\n * Return: < 0 if suffix i is lexicographically less than suffix j; > 0\n * if suffix i is lexicographically greater than suffix j.\n */\ntemplate <typename TStr>\nint64_t DifferenceCoverSample<TStr>::breakTie(TIndexOffU i, TIndexOffU j) const {\n\tassert(built());\n\tassert_neq(i, j);\n\tassert_lt(i, this->text().length());\n\tassert_lt(j, this->text().length());\n\tuint32_t imodv = this->modv(i);\n\tuint32_t jmodv = this->modv(j);\n\tassert_neq(0xffffffff, _dInv[imodv]); // must be in the sample\n\tassert_neq(0xffffffff, _dInv[jmodv]); // must be in the sample\n\tuint32_t dimodv = _dInv[imodv];\n\tuint32_t djmodv = _dInv[jmodv];\n\tTIndexOffU ioff = this->divv(i);\n\tTIndexOffU joff = this->divv(j);\n\tassert_lt(dimodv+1, _doffs.size());\n\tassert_lt(djmodv+1, _doffs.size());\n\t// assert_lt: expected (32024) < (0)\n\tassert_lt(ioff, _doffs[dimodv+1] - _doffs[dimodv]);\n\tassert_lt(joff, _doffs[djmodv+1] - _doffs[djmodv]);\n\tTIndexOffU isaIIdx = _doffs[dimodv] + ioff;\n\tTIndexOffU isaJIdx = _doffs[djmodv] + 
joff;\n\tassert_lt(isaIIdx, _isaPrime.size());\n\tassert_lt(isaJIdx, _isaPrime.size());\n\tassert_neq(isaIIdx, isaJIdx); // ranks must be unique\n\tTIndexOffU isaPrimeI = _isaPrime[isaIIdx];\n\tTIndexOffU isaPrimeJ = _isaPrime[isaJIdx];\n\tassert_neq(isaPrimeI, isaPrimeJ); // ranks must be unique\n\tassert_leq(isaPrimeI, _isaPrime.size());\n\tassert_leq(isaPrimeJ, _isaPrime.size());\n\treturn (int64_t)isaPrimeI - (int64_t)isaPrimeJ;\n}\n\n/**\n * Given i, j, return the number of additional characters that need to\n * be compared before the difference cover can break the tie.\n */\ntemplate <typename TStr>\nuint32_t DifferenceCoverSample<TStr>::tieBreakOff(TIndexOffU i, TIndexOffU j) const {\n\tconst TStr& t = this->text();\n\tconst EList<uint32_t>& dmap = this->dmap();\n\tassert(built());\n\t// It's actually convenient to allow this, but we're permitted to\n\t// return nonsense in that case\n\tif(t[i] != t[j]) return 0xffffffff;\n\t//assert_eq(t[i], t[j]); // if they're unequal, there's no tie to break\n\tuint32_t v = this->v();\n\tassert_neq(i, j);\n\tassert_lt(i, t.length());\n\tassert_lt(j, t.length());\n\tuint32_t imod = this->modv(i);\n\tuint32_t jmod = this->modv(j);\n\tuint32_t diffLeft = (jmod >= imod)? (jmod - imod) : (jmod + v - imod);\n\tuint32_t diffRight = (imod >= jmod)? (imod - jmod) : (imod + v - jmod);\n\tassert_lt(diffLeft, dmap.size());\n\tassert_lt(diffRight, dmap.size());\n\tuint32_t destLeft = dmap[diffLeft];   // offset (mod v) where i needs to be\n\tuint32_t destRight = dmap[diffRight]; // offset (mod v) where j needs to be\n\tassert(isCovered(destLeft));\n\tassert(isCovered(destLeft+diffLeft));\n\tassert(isCovered(destRight));\n\tassert(isCovered(destRight+diffRight));\n\tassert_lt(destLeft, v);\n\tassert_lt(destRight, v);\n\tuint32_t deltaLeft = (destLeft >= imod)? (destLeft - imod) : (destLeft + v - imod);\n\tif(deltaLeft == v) deltaLeft = 0;\n\tuint32_t deltaRight = (destRight >= jmod)? 
(destRight - jmod) : (destRight + v - jmod);\n\tif(deltaRight == v) deltaRight = 0;\n\tassert_lt(deltaLeft, v);\n\tassert_lt(deltaRight, v);\n\tassert(isCovered(i+deltaLeft));\n\tassert(isCovered(j+deltaLeft));\n\tassert(isCovered(i+deltaRight));\n\tassert(isCovered(j+deltaRight));\n\treturn min(deltaLeft, deltaRight);\n}\n\n#endif /*DIFF_SAMPLE_H_*/\n"
  },
  {
    "path": "doc/README",
    "content": "To populate this directory, change to the bowtie2 directory and type\n'make doc'.  You must have pandoc installed:\n\n  http://johnmacfarlane.net/pandoc/\n"
  },
  {
    "path": "doc/add.css",
    "content": ".pageStyle #leftside { \n  color: #666;\n}\n\n.pageStyle #leftside a { \n  color: #0066B3;\n  text-decoration: none;\n}\n\n.pageStyle #leftside h1 {\n  background: none;\n  margin: 0 0 10px;\n  padding: 10px 0;\n  font: bold 1.9em Arial,Verdana,sans-serif;\n}\n\n.pageStyle #leftside h2 { \n  background: none;\n  margin: 0 0 10px;\n  padding: 10px 0;\n  font: bold 1.2em Arial,Verdana,sans-serif;\n}  \n.pageStyle #leftside h3 { \n  background: none;\n  margin: 0 0 10px 5px;\n  padding: 10px 0;\n  font: 1.2em Arial,Verdana,sans-serif;\n}\n\n.pageStyle #leftside table {\n  margin: 15px 0 0;\n}\n\n.pageStyle #leftside td {\n vertical-align: top;\n}\n\n.pageStyle #leftside p { color:#444; }\n\n\n.pageStyle #leftside td p {\n margin-left:15px;\n}\n\n.pageStyle #leftside h4 {\n    margin: 0px 15px 10px 10px;\n    padding: 10px 0px;\n    font: 1.1em Arial,Verdana,sans-serif;\n    background: none;\n}\n\n.pageStyle #leftside ul { margin:0; padding-left:0; list-style-type: circle; }\n.pageStyle #leftside #TOC ul { margin:0; padding-left:0; list-style-type: none; }\n.pageStyle #leftside li { color:#444; margin-left:14px; }\n.pageStyle #leftside #TOC li { margin-left:0; }\n.pageStyle #leftside #TOC li li { margin-left:14px; }\n.pageStyle #leftside p { padding: 0; margin:0 0 10px; }\n"
  },
  {
    "path": "doc/faq.shtml",
    "content": "<!--#set var=\"Title\" value=\"Centrifuge\" -->\n<!--#set var=\"NoCrumbs\" value=\"1\" -->\n<!--#set var=\"SubTitle\" value=\"Classifier for metagenomic sequences\"-->\n<!--#set var=\"ExtraCSS\" value=\"/software/hisat/add.css\"-->\n<!--#include virtual=\"/iheader_r.shtml\"-->\n<div id=\"mainContent\">\n  <div id=\"main\">\n\t<div id=\"rightside\">\n\t<!--#include virtual=\"sidebar.inc.shtml\"-->\n\t</div> <!-- End of \"rightside\" -->\n    <div id=\"leftside\">\n  \t<h1>Frequently Asked Questions</h1><br>\n      <div id=\"toc\">\n  \t    <ul>\n\t    <!--\n\t      <li><a href=\"#edit_dist\"><br>\n\t\t</a></li>\n  \t      </ul><br>\n\t\t\n<h2 id=\"edit_dist\"><br>\n\t\t  </h2>\n\t\t  <p><br>\n\t\t  \n  </p>\n-->\n\n      </div>\n   </div>\n</div>\n</div>\n\n<!--#include virtual=\"footer.inc.html\"-->\n\n<!-- Google analytics code -->\n<script type=\"text/javascript\">\nvar gaJsHost = ((\"https:\" == document.location.protocol) ? \"https://ssl.\" : \"http://www.\");\ndocument.write(unescape(\"%3Cscript src='\" + gaJsHost + \"google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E\"));\n</script>\n<script type=\"text/javascript\">\nvar pageTracker = _gat._getTracker(\"UA-6101038-1\");\npageTracker._trackPageview();\n</script>\n\n</body>\n</html>\n"
  },
  {
    "path": "doc/footer.inc.html",
    "content": "<div id=\"footer\">\n  <table cellspacing=\"15\" width=\"100%\"><tbody><tr><td>\n   This research was supported in part by NIH grants R01-LM06845 and R01-GM083873 and NSF grant CCF-0347992.\n   </td><td align=\"right\">\n   Administrator: <a href=\"mailto:infphilo@gmail.com\">Daehwan Kim</a>. Design by <a href=\"http://www.free-css-templates.com\" title=\"Design by David Herreman\">David Herreman</a>\n   </td></tr></tbody></table>\n</div>\n"
  },
  {
    "path": "doc/index.shtml",
    "content": "<!--#set var=\"Title\" value=\"Centrifuge\" -->\n<!--#set var=\"NoCrumbs\" value=\"1\" -->\n<!--#set var=\"SubTitle\" value=\"Classifier for metagenomic sequences\"-->\n<!--#set var=\"ExtraCSS\" value=\"/software/centrifuge/add.css\"-->\n<!--#include virtual=\"/iheader_r.shtml\"-->\n<div id=\"mainContent\">\n  <div id=\"subheader\">\n    <table width=\"100%\"><tbody><tr>\n\t  <td>\n\t    <strong>Centrifuge</strong> is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples, with better sensitivity than and comparable accuracy to other leading systems. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (e.g., 4.3 GB for ~4,100 bacterial genomes) yet provides very fast classification speed, allowing it to process a typical DNA sequencing run within an hour. 
Together these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers.\n\t    <br>\n\t  </td>\n\t  <td valign=\"middle\" align=\"right\">\n\t    <a href=\"http://opensource.org\"><img alt=\"Open Source Software\" src=\"images/osi-certified.gif\" border=\"0\"></a>\n\t</td></tr>\n    </tbody></table>\n  </div>\n  <div id=\"main\">\n    <div id=\"rightside\">\n\t<!--#include virtual=\"sidebar.inc.shtml\"-->\n\t</div> <!-- End of \"rightside\" -->\n\t<div id=\"leftside\">\n          <h2>Have a look at <a href=\"https://github.com/fbreitwieser/pavian\">Pavian</a> for visual analysis of results generated with Centrifuge!</h2>\n\n\t  <h2>Centrifuge 1.0.3-beta release 12/06/2016</h2>\n          <ul>\n              <li>Fixed Perl hash bangs (thanks to Andreas Sjödin / @druvus).</li>\n              <li>Updated nt database building to work with new accession code scheme.</li>\n              <li>Added option <tt>--tab-fmt-cols</tt> to specify output format columns.</li>\n              <li>A minor fix for traversing the hitmap up the taxonomy tree.</li>\n          </ul>\n          <br/>\n\n          <h2><a href=\"http://genome.cshlp.org/content/early/2016/11/16/gr.210641.116.abstract\">Centrifuge</a> paper published at Genome Research 11/16/2016</h2>  \n\n\t  <h2>Centrifuge 1.0.2-beta release 5/25/2016</h2>\n          <ul>\n\t    <li>Fixed a runtime error during abundance analysis.</li>\n\t    <li>Changed a default report file name from centrifuge_report.csv to centrifuge_report.tsv. 
</li>\n          </ul>\n          <br/>\n\n\t  <h2>Centrifuge preprint is available <a href=\"http://biorxiv.org/content/early/2016/05/25/054965\">here</a> at bioRxiv 5/24/2016</h2>\n\t  \n\t  <h2>Centrifuge 1.0.1-beta release 3/8/2016</h2>\n          <ul>\n\t    <li>\n\t      Centrifuge is now able to work directly with SRA data: both downloaded on demand over internet and prefetched to local disks.\n\t      <ul>\n              <li>\n                For example, you can run Centrifuge with SRA data (SRR353653) as follows. <br/>\n                <i>centrifuge -x /path/to/index --sra-acc SRR353653</i>\n              </li>\n              <li> This eliminates the need to download SRA reads manually and to convert them into fasta/fastq format without affecting the run time. </li>\n            </ul>\n\t    </li>\n\t    <li>\n\t      We provide a Centrifuge index (<i>nt</i> index) for NCBI nucleotide non-redundant sequences collected from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes, totaling ~109 billion bps. 
Centrifuge is a very good alternative to Megablast (or Blast) for searching through this huge database.\n\t    </li>\n\t    <li>\n\t      Fixed Centrifuge's scripts related to sequence downloading and index building.\n\t    </li>\n          </ul>\n          <br/>\n\t  <h2>Centrifuge 1.0.0-beta release 2/19/2016 - first release</h2>\n          <ul>\n\t    <li>\n\t      The first release of Centrifuge features a dramatically reduced database size,  higher classification accuracy and sensitivity, and comparably rapid classification speed.\n\t    </li>\n\t    <li>\n\t      Please refer to the manual for details on how to run Centrifuge and interpret Centrifuge’s classification results.\n\t    </li>\n\t    <li>\n\t      We provide several standard indexes designed to meet the needs of most users (see the side panel - Indexes)\n\t      <ul>\n\t\t<li> For compressed indexes, we first combined bacterial genomes belonging to the same species and removed redundant sequences, and built indexes using the combined sequences.\n\t\tAs a result, those compressed indexes are much smaller than uncompressed indexes.  Centrifuge classifies reads at the species level when using the compressed indexes and at the strain level (or the genome level) when using the uncompressed indexes. </li>\n\t      </ul>\n\t    </li>\n          </ul>\n          <br/>\n\t  <h2>The Centrifuge source code is available in a <a href=\"https://github.com/infphilo/centrifuge\">public GitHub repository</a> (7/14/2015).</h2>\n\t</div>\n  </div>\n</div>\n\n<!--#include virtual=\"footer.inc.html\"-->\n\n<!-- Google analytics code -->\n<script type=\"text/javascript\">\nvar gaJsHost = ((\"https:\" == document.location.protocol) ? 
\"https://ssl.\" : \"http://www.\");\ndocument.write(unescape(\"%3Cscript src='\" + gaJsHost + \"google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E\"));\n</script>\n<script type=\"text/javascript\">\nvar pageTracker = _gat._getTracker(\"UA-6101038-1\");\npageTracker._trackPageview();\n</script>\n<!-- End google analytics code -->\n</body>\n</html>\n"
  },
  {
    "path": "doc/manual.html",
    "content": "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n  <meta http-equiv=\"Content-Style-Type\" content=\"text/css\" />\n  <meta name=\"generator\" content=\"pandoc\" />\n  <title>HISAT Manual - </title>\n  <style type=\"text/css\">code{white-space: pre;}</style>\n  <link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\" />\n</head>\n<body>\n<h1>Table of Contents</h1>\n<div id=\"TOC\">\n<ul>\n<li><a href=\"#introduction\">Introduction</a><ul>\n<li><a href=\"#what-is-hisat\">What is HISAT?</a></li>\n</ul></li>\n<li><a href=\"#obtaining-bowtie-2\">Obtaining Bowtie 2</a><ul>\n<li><a href=\"#building-from-source\">Building from source</a></li>\n<li><a href=\"#adding-to-path\">Adding to PATH</a></li>\n<li><a href=\"#reporting\">Reporting</a><ul>\n<li><a href=\"#distinct-alignments-map-a-read-to-different-places\">Distinct alignments map a read to different places</a></li>\n<li><a href=\"#default-mode-search-for-multiple-alignments-report-the-best-one\">Default mode: search for multiple alignments, report the best one</a></li>\n<li><a href=\"#k-mode-search-for-one-or-more-alignments-report-each\">-k mode: search for one or more alignments, report each</a></li>\n</ul></li>\n<li><a href=\"#alignment-summmary\">Alignment summmary</a></li>\n<li><a href=\"#wrapper\">Wrapper</a></li>\n<li><a href=\"#performance-tuning\">Performance tuning</a></li>\n<li><a href=\"#command-line\">Command Line</a><ul>\n<li><a href=\"#setting-function-options\">Setting function options</a></li>\n<li><a href=\"#usage\">Usage</a></li>\n<li><a href=\"#main-arguments\">Main arguments</a></li>\n<li><a href=\"#options\">Options</a></li>\n</ul></li>\n<li><a href=\"#sam-output\">SAM output</a></li>\n</ul></li>\n<li><a href=\"#the-bowtie2-build-indexer\">The 
<code>bowtie2-build</code> indexer</a><ul>\n<li><a href=\"#command-line-1\">Command Line</a><ul>\n<li><a href=\"#main-arguments-1\">Main arguments</a></li>\n<li><a href=\"#options-1\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#the-bowtie2-inspect-index-inspector\">The <code>bowtie2-inspect</code> index inspector</a><ul>\n<li><a href=\"#command-line-2\">Command Line</a><ul>\n<li><a href=\"#main-arguments-2\">Main arguments</a></li>\n<li><a href=\"#options-2\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#getting-started-with-bowtie-2-lambda-phage-example\">Getting started with Bowtie 2: Lambda phage example</a><ul>\n<li><a href=\"#indexing-a-reference-genome\">Indexing a reference genome</a></li>\n<li><a href=\"#aligning-example-reads\">Aligning example reads</a></li>\n<li><a href=\"#paired-end-example\">Paired-end example</a></li>\n<li><a href=\"#local-alignment-example\">Local alignment example</a></li>\n<li><a href=\"#using-samtoolsbcftools-downstream\">Using SAMtools/BCFtools downstream</a></li>\n</ul></li>\n</ul>\n</div>\n<!--\n ! This manual is written in \"markdown\" format and thus contains some\n ! distracting formatting clutter.  See 'MANUAL' for an easier-to-read version\n ! of this text document, or see the HTML manual online.\n ! -->\n\n<h1 id=\"introduction\">Introduction</h1>\n<h2 id=\"what-is-hisat\">What is HISAT?</h2>\n<p><a href=\"http://www.ccb.jhu.edu/software/hisat\">HISAT</a> is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes. 
Bowtie 2 indexes the genome with an <a href=\"http://portal.acm.org/citation.cfm?id=796543\">FM Index</a> (based on the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler Transform</a> or <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">BWT</a>) to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie 2 outputs alignments in <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> format, enabling interoperation with a large number of other tools (e.g. <a href=\"http://samtools.sourceforge.net/\">SAMtools</a>, <a href=\"http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit\">GATK</a>) that use SAM. Bowtie 2 is distributed under the <a href=\"http://www.gnu.org/licenses/gpl-3.0.html\">GPLv3 license</a>, and it runs on the command line under Windows, Mac OS X and Linux.</p>\n<p><a href=\"http://bowtie-bio.sf.net/bowtie2\">Bowtie 2</a> is often the first step in pipelines for comparative genomics, including for variation calling, ChIP-seq, RNA-seq, BS-seq. 
<a href=\"http://bowtie-bio.sf.net/bowtie2\">Bowtie 2</a> and <a href=\"http://bowtie-bio.sf.net\">Bowtie</a> (also called &quot;<a href=\"http://bowtie-bio.sf.net\">Bowtie 1</a>&quot; here) are also tightly integrated into some tools, including <a href=\"http://tophat.cbcb.umd.edu/\">TopHat</a>: a fast splice junction mapper for RNA-seq reads, <a href=\"http://cufflinks.cbcb.umd.edu/\">Cufflinks</a>: a tool for transcriptome assembly and isoform quantitation from RNA-seq reads, <a href=\"http://bowtie-bio.sf.net/crossbow\">Crossbow</a>: a cloud-enabled software tool for analyzing resequencing data, and <a href=\"http://bowtie-bio.sf.net/myrna\">Myrna</a>: a cloud-enabled software tool for aligning RNA-seq reads and measuring differential gene expression.</p>\n<h1 id=\"obtaining-bowtie-2\">Obtaining Bowtie 2</h1>\n<p>Download Bowtie 2 sources and binaries from the <a href=\"https://sourceforge.net/projects/bowtie-bio/files/bowtie2/\">Download</a> section of the Sourceforge site. Binaries are available for Intel architectures (<code>i386</code> and <code>x86_64</code>) running Linux, and Mac OS X. A 32-bit version is available for Windows. If you plan to compile Bowtie 2 yourself, make sure to get the source package, i.e., the filename that ends in &quot;-source.zip&quot;.</p>\n<h2 id=\"building-from-source\">Building from source</h2>\n<p>Building Bowtie 2 from source requires a GNU-like environment with GCC, GNU Make and other basics. It should be possible to build Bowtie 2 on most vanilla Linux installations or on a Mac installation with <a href=\"http://developer.apple.com/xcode/\">Xcode</a> installed. Bowtie 2 can also be built on Windows using <a href=\"http://www.cygwin.com/\">Cygwin</a> or <a href=\"http://www.mingw.org/\">MinGW</a> (MinGW recommended). For a MinGW build, the choice of compiler is important, since it determines whether 32-bit or 64-bit code can be compiled successfully. 
If there is a need to generate both 32 and 64 bit on the same machine then a multilib MinGW has to be properly installed. <a href=\"http://www.mingw.org/wiki/msys\">MSYS</a>, the <a href=\"http://cygwin.com/packages/mingw-zlib/\">zlib</a> library, and depending on architecture <a href=\"http://sourceware.org/pthreads-win32/\">pthreads</a> library are also required. We are recommending a 64 bit build since it has some clear advantages in real life research problems. In order to simplify the MinGW setup it might be worth investigating popular MinGW personal builds since these are coming already prepared with most of the toolchains needed.</p>\n<p>First, download the source package from the <a href=\"https://sourceforge.net/projects/bowtie-bio/files/bowtie2/\">sourceforge site</a>. Make sure you're getting the source package; the file downloaded should end in <code>-source.zip</code>. Unzip the file, change to the unzipped directory, and build the Bowtie 2 tools by running GNU <code>make</code> (usually with the command <code>make</code>, but sometimes with <code>gmake</code>) with no arguments. If building with MinGW, run <code>make</code> from the MSYS environment.</p>\n<p>Bowtie 2 is using the multithreading software model in order to speed up execution times on SMP architectures where this is possible. On POSIX platforms (like linux, Mac OS, etc) it needs the pthread library. Although it is possible to use pthread library on non-POSIX platform like Windows, due to performance reasons bowtie 2 will try to use Windows native multithreading if possible.</p>\n<h2 id=\"adding-to-path\">Adding to PATH</h2>\n<p>By adding your new Bowtie 2 directory to your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH environment variable</a>, you ensure that whenever you run <code>bowtie2</code>, <code>bowtie2-build</code> or <code>bowtie2-inspect</code> from the command line, you will get the version you just installed without having to specify the entire path. 
This is recommended for most users. To do this, follow your operating system's instructions for adding the directory to your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH</a>.</p>\n<p>If you would like to install Bowtie 2 by copying the Bowtie 2 executable files to an existing directory in your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH</a>, make sure that you copy all the executables, including <code>bowtie2</code>, <code>bowtie2-align</code>, <code>bowtie2-build</code> and <code>bowtie2-inspect</code>.</p>\n<h2 id=\"reporting\">Reporting</h2>\n<p>The reporting mode governs how many alignments Bowtie 2 looks for, and how to report them. Bowtie 2 has three distinct reporting modes. The default reporting mode is similar to the default reporting mode of many other read alignment tools, including <a href=\"http://bio-bwa.sourceforge.net/\">BWA</a>. It is also similar to Bowtie 1's <code>-M</code> alignment mode.</p>\n<p>In general, when we say that a read has an alignment, we mean that it has a <a href=\"#valid-alignments-meet-or-exceed-the-minimum-score-threshold\">valid alignment</a>. When we say that a read has multiple alignments, we mean that it has multiple alignments that are valid and distinct from one another.</p>\n<h3 id=\"distinct-alignments-map-a-read-to-different-places\">Distinct alignments map a read to different places</h3>\n<p>Two alignments for the same individual read are &quot;distinct&quot; if they map the same read to different places. Specifically, we say that two alignments are distinct if there are no alignment positions where a particular read offset is aligned opposite a particular reference offset in both alignments with the same orientation. E.g. 
if the first alignment is in the forward orientation and aligns the read character at read offset 10 to the reference character at chromosome 3, offset 3,445,245, and the second alignment is also in the forward orientation and also aligns the read character at read offset 10 to the reference character at chromosome 3, offset 3,445,245, they are not distinct alignments.</p>\n<p>Two alignments for the same pair are distinct if either the mate 1s in the two paired-end alignments are distinct or the mate 2s in the two alignments are distinct or both.</p>\n<h3 id=\"default-mode-search-for-multiple-alignments-report-the-best-one\">Default mode: search for multiple alignments, report the best one</h3>\n<p>By default, Bowtie 2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it generally will continue to look for alignments that are nearly as good or better. It will eventually stop looking, either because it exceeded a limit placed on search effort (see [<code>-D</code>] and <a href=\"#bowtie2-options-r\"><code>-R</code></a>) or because it already knows all it needs to know to report an alignment. Information from the best alignments is used to estimate mapping quality (the <code>MAPQ</code> <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> field) and to set SAM optional fields, such as <a href=\"#bowtie2-build-opt-fields-as\"><code>AS:i</code></a> and <a href=\"#bowtie2-build-opt-fields-xs\"><code>XS:i</code></a>. Bowtie 2 does not guarantee that the alignment reported is the best possible in terms of alignment score.</p>\n<p>See also: [<code>-D</code>], which puts an upper limit on the number of dynamic programming problems (i.e. seed extensions) that can &quot;fail&quot; in a row before Bowtie 2 stops searching. 
Increasing [<code>-D</code>] makes Bowtie 2 slower, but increases the likelihood that it will report the correct alignment for a read that aligns many places.</p>\n<p>See also: <a href=\"#bowtie2-options-r\"><code>-R</code></a>, which sets the maximum number of times Bowtie 2 will &quot;re-seed&quot; when attempting to align a read with repetitive seeds. Increasing <a href=\"#bowtie2-options-r\"><code>-R</code></a> makes Bowtie 2 slower, but increases the likelihood that it will report the correct alignment for a read that aligns many places.</p>\n<h3 id=\"k-mode-search-for-one-or-more-alignments-report-each\">-k mode: search for one or more alignments, report each</h3>\n<p>In <a href=\"#bowtie2-options-k\"><code>-k</code></a> mode, Bowtie 2 searches for up to N distinct, valid alignments for each read, where N equals the integer specified with the <code>-k</code> parameter. That is, if <code>-k 2</code> is specified, Bowtie 2 will search for at most 2 distinct alignments. It reports all alignments found, in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field. See the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM specification</a> for details.</p>\n<p>Bowtie 2 does not &quot;find&quot; alignments in any specific order, so for reads that have more than N distinct, valid alignments, Bowtie 2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than where exactly it originated.</p>\n<h2 id=\"alignment-summmary\">Alignment summary</h2>\n<p>When Bowtie 2 finishes running, it prints messages summarizing what happened. 
These messages are printed to the &quot;standard error&quot; (&quot;stderr&quot;) filehandle. For datasets consisting of unpaired reads, the summary might look like this:</p>\n<pre><code>20000 reads; of these:\n  20000 (100.00%) were unpaired; of these:\n    1247 (6.24%) aligned 0 times\n    18739 (93.69%) aligned exactly 1 time\n    14 (0.07%) aligned &gt;1 times\n93.77% overall alignment rate</code></pre>\n<p>For datasets consisting of pairs, the summary might look like this:</p>\n<pre><code>10000 reads; of these:\n  10000 (100.00%) were paired; of these:\n    650 (6.50%) aligned concordantly 0 times\n    8823 (88.23%) aligned concordantly exactly 1 time\n    527 (5.27%) aligned concordantly &gt;1 times\n    ----\n    650 pairs aligned concordantly 0 times; of these:\n      34 (5.23%) aligned discordantly 1 time\n    ----\n    616 pairs aligned 0 times concordantly or discordantly; of these:\n      1232 mates make up the pairs; of these:\n        660 (53.57%) aligned 0 times\n        571 (46.35%) aligned exactly 1 time\n        1 (0.08%) aligned &gt;1 times\n96.70% overall alignment rate</code></pre>\n<p>The indentation indicates how subtotals relate to totals.</p>\n<h2 id=\"wrapper\">Wrapper</h2>\n<p>The <code>bowtie2</code> executable is actually a Perl wrapper script that calls the compiled <code>bowtie2-align</code> binary. It is recommended that you always run the <code>bowtie2</code> wrapper and not run <code>bowtie2-align</code> directly.</p>\n<h2 id=\"performance-tuning\">Performance tuning</h2>\n<ol style=\"list-style-type: decimal\">\n<li><p>Use 64-bit version if possible</p>\n<p>The 64-bit version of Bowtie 2 is faster than the 32-bit version, owing to its use of 64-bit arithmetic. If possible, download the 64-bit binaries for Bowtie 2 and run on a 64-bit computer. 
If you are building Bowtie 2 from sources, you may need to pass the <code>-m64</code> option to <code>g++</code> to compile the 64-bit version; you can do this by including <code>BITS=64</code> in the arguments to the <code>make</code> command; e.g.: <code>make BITS=64 bowtie2</code>. To determine whether your version of bowtie is 64-bit or 32-bit, run <code>bowtie2 --version</code>.</p></li>\n<li><p>If your computer has multiple processors/cores, use <code>-p</code></p>\n<p>The <a href=\"#bowtie2-options-p\"><code>-p</code></a> option causes Bowtie 2 to launch a specified number of parallel search threads. Each thread runs on a different processor/core and all threads find alignments in parallel, increasing alignment throughput by approximately a multiple of the number of threads (though in practice, speedup is somewhat worse than linear).</p></li>\n</ol>\n<h2 id=\"command-line\">Command Line</h2>\n<h3 id=\"setting-function-options\">Setting function options</h3>\n<p>Some Bowtie 2 options specify a function rather than an individual number or setting. In these cases the user specifies three parameters: (a) a function type <code>F</code>, (b) a constant term <code>B</code>, and (c) a coefficient <code>A</code>. The available function types are constant (<code>C</code>), linear (<code>L</code>), square-root (<code>S</code>), and natural log (<code>G</code>). The parameters are specified as <code>F,B,A</code> - that is, the function type, the constant term, and the coefficient are separated by commas with no whitespace. 
The constant term and coefficient may be negative and/or floating-point numbers.</p>\n<p>For example, if the function specification is <code>L,-0.4,-0.6</code>, then the function defined is:</p>\n<pre><code>f(x) = -0.4 + -0.6 * x</code></pre>\n<p>If the function specification is <code>G,1,5.4</code>, then the function defined is:</p>\n<pre><code>f(x) = 1.0 + 5.4 * ln(x)</code></pre>\n<p>See the documentation for the option in question to learn what the parameter <code>x</code> is for. For example, in the case of the <a href=\"#bowtie2-options-score-min\"><code>--score-min</code></a> option, the function <code>f(x)</code> sets the minimum alignment score necessary for an alignment to be considered valid, and <code>x</code> is the read length.</p>\n<h3 id=\"usage\">Usage</h3>\n<pre><code>bowtie2 [options]* -x &lt;bt2-idx&gt; {-1 &lt;m1&gt; -2 &lt;m2&gt; | -U &lt;r&gt;} -S [&lt;hit&gt;]</code></pre>\n<h3 id=\"main-arguments\">Main arguments</h3>\n<table><tr><td>\n\n<pre><code>-x &lt;bt2-idx&gt;</code></pre>\n</td><td>\n\n<p>The basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final <code>.1.bt2</code> / <code>.rev.1.bt2</code> / etc. <code>bowtie2</code> looks for the specified index first in the current directory, then in the directory specified in the <code>BOWTIE2_INDEXES</code> environment variable.</p>\n</td></tr><tr><td>\n\n<pre><code>-1 &lt;m1&gt;</code></pre>\n</td><td>\n\n<p>Comma-separated list of files containing mate 1s (filename usually includes <code>_1</code>), e.g. <code>-1 flyA_1.fq,flyB_1.fq</code>. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <code>&lt;m2&gt;</code>. Reads may be a mix of different lengths. 
If <code>-</code> is specified, <code>bowtie2</code> will read the mate 1s from the &quot;standard in&quot; or &quot;stdin&quot; filehandle.</p>\n</td></tr><tr><td>\n\n<pre><code>-2 &lt;m2&gt;</code></pre>\n</td><td>\n\n<p>Comma-separated list of files containing mate 2s (filename usually includes <code>_2</code>), e.g. <code>-2 flyA_2.fq,flyB_2.fq</code>. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <code>&lt;m1&gt;</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>bowtie2</code> will read the mate 2s from the &quot;standard in&quot; or &quot;stdin&quot; filehandle.</p>\n</td></tr><tr><td>\n\n<pre><code>-U &lt;r&gt;</code></pre>\n</td><td>\n\n<p>Comma-separated list of files containing unpaired reads to be aligned, e.g. <code>lane1.fq,lane2.fq,lane3.fq,lane4.fq</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>bowtie2</code> gets the reads from the &quot;standard in&quot; or &quot;stdin&quot; filehandle.</p>\n</td></tr><tr><td>\n\n<pre><code>-S &lt;hit&gt;</code></pre>\n</td><td>\n\n<p>File to write SAM alignments to. By default, alignments are written to the &quot;standard out&quot; or &quot;stdout&quot; filehandle (i.e. the console).</p>\n</td></tr></table>\n\n<h3 id=\"options\">Options</h3>\n<h4 id=\"input-options\">Input options</h4>\n<table>\n<tr><td id=\"bowtie2-options-q\">\n\n<pre><code>-q</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are FASTQ files. FASTQ files usually have extension <code>.fq</code> or <code>.fastq</code>. FASTQ is the default format. 
See also: <a href=\"#bowtie2-options-solexa-quals\"><code>--solexa-quals</code></a> and <a href=\"#bowtie2-options-int-quals\"><code>--int-quals</code></a>.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-qseq\">\n\n<pre><code>--qseq</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are QSEQ files. QSEQ files usually end in <code>_qseq.txt</code>. See also: <a href=\"#bowtie2-options-solexa-quals\"><code>--solexa-quals</code></a> and <a href=\"#bowtie2-options-int-quals\"><code>--int-quals</code></a>.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-f\">\n\n<pre><code>-f</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are FASTA files. FASTA files usually have extension <code>.fa</code>, <code>.fasta</code>, <code>.mfa</code>, <code>.fna</code> or similar. FASTA files do not have a way of specifying quality values, so when <code>-f</code> is set, the result is as if <code>--ignore-quals</code> is also set.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-r\">\n\n<pre><code>-r</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are files with one input sequence per line, without any other information (no read names, no qualities). When <code>-r</code> is set, the result is as if <code>--ignore-quals</code> is also set.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-c\">\n\n<pre><code>-c</code></pre>\n</td><td>\n\n<p>The read sequences are given on command line. I.e. <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code> and <code>&lt;singles&gt;</code> are comma-separated lists of reads rather than lists of read files. There is no way to specify read names or qualities, so <code>-c</code> also implies <code>--ignore-quals</code>.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-s\">\n\n<pre><code>-s/--skip &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Skip (i.e. 
do not align) the first <code>&lt;int&gt;</code> reads or pairs in the input.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-u\">\n\n<pre><code>-u/--qupto &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Align the first <code>&lt;int&gt;</code> reads or read pairs from the input (after the <a href=\"#bowtie2-options-s\"><code>-s</code>/<code>--skip</code></a> reads or pairs have been skipped), then stop. Default: no limit.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-5\">\n\n<pre><code>-5/--trim5 &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Trim <code>&lt;int&gt;</code> bases from 5' (left) end of each read before alignment (default: 0).</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-3\">\n\n<pre><code>-3/--trim3 &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Trim <code>&lt;int&gt;</code> bases from 3' (right) end of each read before alignment (default: 0).</p>\n</td></tr><tr><td id=\"bowtie2-options-phred33-quals\">\n\n<pre><code>--phred33</code></pre>\n</td><td>\n\n<p>Input qualities are ASCII chars equal to the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> plus 33. This is also called the &quot;Phred+33&quot; encoding, which is used by the very latest Illumina pipelines.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-phred64-quals\">\n\n<pre><code>--phred64</code></pre>\n</td><td>\n\n<p>Input qualities are ASCII chars equal to the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> plus 64. This is also called the &quot;Phred+64&quot; encoding.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-solexa-quals\">\n\n<pre><code>--solexa-quals</code></pre>\n</td><td>\n\n<p>Convert input qualities from <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Solexa</a> (which can be negative) to <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred</a> (which can't). This scheme was used in older Illumina GA Pipeline versions (prior to 1.3). 
Default: off.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-int-quals\">\n\n<pre><code>--int-quals</code></pre>\n</td><td>\n\n<p>Quality values are represented in the read input file as space-separated ASCII integers, e.g., <code>40 40 30 40</code>..., rather than ASCII characters, e.g., <code>II?I</code>.... Integers are treated as being on the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> scale unless <a href=\"#bowtie2-options-solexa-quals\"><code>--solexa-quals</code></a> is also specified. Default: off.</p>\n</td></tr></table>\n\n<h4 id=\"alignment-options\">Alignment options</h4>\n<table>\n\n<tr><td id=\"bowtie2-options-n-ceil\">\n\n<pre><code>--n-ceil &lt;func&gt;</code></pre>\n</td><td>\n\n<p>Sets a function governing the maximum number of ambiguous characters (usually <code>N</code>s and/or <code>.</code>s) allowed in a read as a function of read length. For instance, specifying <code>L,0,0.15</code> sets the N-ceiling function <code>f</code> to <code>f(x) = 0 + 0.15 * x</code>, where <code>x</code> is the read length. See also: [setting function options]. Reads exceeding this ceiling are <a href=\"#filtering\">filtered out</a>. Default: <code>L,0,0.15</code>.</p>\n</td></tr>\n\n<tr><td id=\"bowtie2-options-ignore-quals\">\n\n<pre><code>--ignore-quals</code></pre>\n</td><td>\n\n<p>When calculating a mismatch penalty, always consider the quality value at the mismatched position to be the highest possible, regardless of the actual value. I.e. input is treated as though all quality values are high. This is also the default behavior when the input doesn't specify quality values (e.g. 
in <a href=\"#bowtie2-options-f\"><code>-f</code></a>, <a href=\"#bowtie2-options-r\"><code>-r</code></a>, or <a href=\"#bowtie2-options-c\"><code>-c</code></a> modes).</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-nofw\">\n\n<pre><code>--nofw/--norc</code></pre>\n</td><td>\n\n<p>If <code>--nofw</code> is specified, <code>bowtie2</code> will not attempt to align unpaired reads to the forward (Watson) reference strand. If <code>--norc</code> is specified, <code>bowtie2</code> will not attempt to align unpaired reads against the reverse-complement (Crick) reference strand. In paired-end mode, <code>--nofw</code> and <code>--norc</code> pertain to the fragments; i.e. specifying <code>--nofw</code> causes <code>bowtie2</code> to explore only those paired-end configurations corresponding to fragments from the reverse-complement (Crick) strand. Default: both strands enabled.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-end-to-end\">\n\n<pre><code>--end-to-end</code></pre>\n</td><td>\n\n<p>In this mode, Bowtie 2 requires that the entire read align from one end to the other, without any trimming (or &quot;soft clipping&quot;) of characters from either end. The match bonus <a href=\"#bowtie2-options-ma\"><code>--ma</code></a> always equals 0 in this mode, so all alignment scores are less than or equal to 0, and the greatest possible alignment score is 0. This is mutually exclusive with <a href=\"#bowtie2-options-local\"><code>--local</code></a>. <code>--end-to-end</code> is the default mode.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-local\">\n\n<pre><code>--local</code></pre>\n</td><td>\n\n<p>In this mode, Bowtie 2 does not require that the entire read align from one end to the other. Rather, some characters may be omitted (&quot;soft clipped&quot;) from the ends in order to achieve the greatest possible alignment score. 
The match bonus <a href=\"#bowtie2-options-ma\"><code>--ma</code></a> is used in this mode, and the best possible alignment score is equal to the match bonus (<a href=\"#bowtie2-options-ma\"><code>--ma</code></a>) times the length of the read. Specifying <code>--local</code> and one of the presets (e.g. <code>--local --very-fast</code>) is equivalent to specifying the local version of the preset (<code>--very-fast-local</code>). This is mutually exclusive with <a href=\"#bowtie2-options-end-to-end\"><code>--end-to-end</code></a>. <code>--end-to-end</code> is the default mode.</p>\n</td></tr>\n</table>\n\n<h4 id=\"scoring-options\">Scoring options</h4>\n<table>\n\n<tr><td id=\"bowtie2-options-ma\">\n\n<pre><code>--ma &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Sets the match bonus. In <a href=\"#bowtie2-options-local\"><code>--local</code></a> mode <code>&lt;int&gt;</code> is added to the alignment score for each position where a read character aligns to a reference character and the characters match. Not used in <a href=\"#bowtie2-options-end-to-end\"><code>--end-to-end</code></a> mode. Default: 2.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-mp\">\n\n<pre><code>--mp MX,MN</code></pre>\n</td><td>\n\n<p>Sets the maximum (<code>MX</code>) and minimum (<code>MN</code>) mismatch penalties, both integers. A number less than or equal to <code>MX</code> and greater than or equal to <code>MN</code> is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an <code>N</code>. If <a href=\"#bowtie2-options-ignore-quals\"><code>--ignore-quals</code></a> is specified, the number subtracted equals <code>MX</code>. Otherwise, the number subtracted is <code>MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )</code> where Q is the Phred quality value. 
Default: <code>MX</code> = 6, <code>MN</code> = 2.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-np\">\n\n<pre><code>--np &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as <code>N</code>. Default: 1.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-rdg\">\n\n<pre><code>--rdg &lt;int1&gt;,&lt;int2&gt;</code></pre>\n</td><td>\n\n<p>Sets the read gap open (<code>&lt;int1&gt;</code>) and extend (<code>&lt;int2&gt;</code>) penalties. A read gap of length N gets a penalty of <code>&lt;int1&gt;</code> + N * <code>&lt;int2&gt;</code>. Default: 5, 3.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-rfg\">\n\n<pre><code>--rfg &lt;int1&gt;,&lt;int2&gt;</code></pre>\n</td><td>\n\n<p>Sets the reference gap open (<code>&lt;int1&gt;</code>) and extend (<code>&lt;int2&gt;</code>) penalties. A reference gap of length N gets a penalty of <code>&lt;int1&gt;</code> + N * <code>&lt;int2&gt;</code>. Default: 5, 3.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-score-min\">\n\n<pre><code>--score-min &lt;func&gt;</code></pre>\n</td><td>\n\n<p>Sets a function governing the minimum alignment score needed for an alignment to be considered &quot;valid&quot; (i.e. good enough to report). This is a function of read length. For instance, specifying <code>L,0,-0.6</code> sets the minimum-score function <code>f</code> to <code>f(x) = 0 + -0.6 * x</code>, where <code>x</code> is the read length. See also: [setting function options]. 
The default in <a href=\"#bowtie2-options-end-to-end\"><code>--end-to-end</code></a> mode is <code>L,-0.6,-0.6</code> and the default in <a href=\"#bowtie2-options-local\"><code>--local</code></a> mode is <code>G,20,8</code>.</p>\n</td></tr>\n</table>\n\n<h4 id=\"reporting-options\">Reporting options</h4>\n<table>\n\n<tr><td id=\"bowtie2-options-k\">\n\n<pre><code>-k &lt;int&gt;</code></pre>\n</td><td>\n\n<p>By default, <code>bowtie2</code> searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied). Information about the best alignments is used to estimate mapping quality and to set SAM optional fields, such as <a href=\"#bowtie2-build-opt-fields-as\"><code>AS:i</code></a> and <a href=\"#bowtie2-build-opt-fields-xs\"><code>XS:i</code></a>.</p>\n<p>When <code>-k</code> is specified, however, <code>bowtie2</code> behaves differently. Instead, it searches for at most <code>&lt;int&gt;</code> distinct, valid alignments for each read. The search terminates when it can't find more distinct valid alignments, or when it finds <code>&lt;int&gt;</code>, whichever happens first. All alignments found are reported in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field. For reads that have more than <code>&lt;int&gt;</code> distinct, valid alignments, <code>bowtie2</code> does not guarantee that the <code>&lt;int&gt;</code> alignments reported are the best possible in terms of alignment score. 
<code>-k</code> is mutually exclusive with <a href=\"#bowtie2-options-a\"><code>-a</code></a>.</p>\n<p>Note: Bowtie 2 is not designed with large values for <code>-k</code> in mind, and when aligning reads to long, repetitive genomes large <code>-k</code> can be very, very slow.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-a\">\n\n<pre><code>-a</code></pre>\n</td><td>\n\n<p>Like <a href=\"#bowtie2-options-k\"><code>-k</code></a> but with no upper limit on number of alignments to search for. <code>-a</code> is mutually exclusive with <a href=\"#bowtie2-options-k\"><code>-k</code></a>.</p>\n<p>Note: Bowtie 2 is not designed with <code>-a</code> mode in mind, and when aligning reads to long, repetitive genomes this mode can be very, very slow.</p>\n</td></tr>\n</table>\n\n<h4 id=\"paired-end-options\">Paired-end options</h4>\n<table>\n\n<tr><td id=\"bowtie2-options-I\">\n\n<pre><code>-I/--minins &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The minimum fragment length for valid paired-end alignments. E.g. if <code>-I 60</code> is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as <a href=\"#bowtie2-options-X\"><code>-X</code></a> is also satisfied). A 19-bp gap would not be valid in that case. If trimming options <a href=\"#bowtie2-options-3\"><code>-3</code></a> or <a href=\"#bowtie2-options-5\"><code>-5</code></a> are also used, the <a href=\"#bowtie2-options-I\"><code>-I</code></a> constraint is applied with respect to the untrimmed mates.</p>\n<p>The larger the difference between <a href=\"#bowtie2-options-I\"><code>-I</code></a> and <a href=\"#bowtie2-options-X\"><code>-X</code></a>, the slower Bowtie 2 will run. 
This is because larger differences between <a href=\"#bowtie2-options-I\"><code>-I</code></a> and <a href=\"#bowtie2-options-X\"><code>-X</code></a> require that Bowtie 2 scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), Bowtie 2 is very efficient.</p>\n<p>Default: 0 (essentially imposing no minimum)</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-X\">\n\n<pre><code>-X/--maxins &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The maximum fragment length for valid paired-end alignments. E.g. if <code>-X 100</code> is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as <a href=\"#bowtie2-options-I\"><code>-I</code></a> is also satisfied). A 61-bp gap would not be valid in that case. If trimming options <a href=\"#bowtie2-options-3\"><code>-3</code></a> or <a href=\"#bowtie2-options-5\"><code>-5</code></a> are also used, the <code>-X</code> constraint is applied with respect to the untrimmed mates, not the trimmed mates.</p>\n<p>The larger the difference between <a href=\"#bowtie2-options-I\"><code>-I</code></a> and <a href=\"#bowtie2-options-X\"><code>-X</code></a>, the slower Bowtie 2 will run. This is because larger differences between <a href=\"#bowtie2-options-I\"><code>-I</code></a> and <a href=\"#bowtie2-options-X\"><code>-X</code></a> require that Bowtie 2 scan a larger window to determine if a concordant alignment exists. 
For typical fragment length ranges (200 to 400 nucleotides), Bowtie 2 is very efficient.</p>\n<p>Default: 500.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-fr\">\n\n<pre><code>--fr/--rf/--ff</code></pre>\n</td><td>\n\n<p>The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. 
E.g., if <code>--fr</code> is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints (<a href=\"#bowtie2-options-I\"><code>-I</code></a> and <a href=\"#bowtie2-options-X\"><code>-X</code></a>) are met, that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 and all other constraints are met, that too is valid. <code>--rf</code> likewise requires that an upstream mate1 be reverse-complemented and a downstream mate2 be forward-oriented. <code>--ff</code> requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented. Default: <code>--fr</code> (appropriate for Illumina's Paired-end Sequencing Assay).</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-no-mixed\">\n\n<pre><code>--no-mixed</code></pre>\n</td><td>\n\n<p>By default, when <code>bowtie2</code> cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-no-discordant\">\n\n<pre><code>--no-discordant</code></pre>\n</td><td>\n\n<p>By default, <code>bowtie2</code> looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (<a href=\"#bowtie2-options-fr\"><code>--fr</code>/<code>--rf</code>/<code>--ff</code></a>, <a href=\"#bowtie2-options-I\"><code>-I</code></a>, <a href=\"#bowtie2-options-X\"><code>-X</code></a>). This option disables that behavior.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-dovetail\">\n\n<pre><code>--dovetail</code></pre>\n</td><td>\n\n<p>If the mates &quot;dovetail&quot;, that is if one mate alignment extends past the beginning of the other such that the wrong mate begins upstream, consider that to be concordant. 
See also: <a href=\"#mates-can-overlap-contain-or-dovetail-each-other\">Mates can overlap, contain or dovetail each other</a>. Default: mates cannot dovetail in a concordant alignment.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-no-contain\">\n\n<pre><code>--no-contain</code></pre>\n</td><td>\n\n<p>If one mate alignment contains the other, consider that to be non-concordant. See also: <a href=\"#mates-can-overlap-contain-or-dovetail-each-other\">Mates can overlap, contain or dovetail each other</a>. Default: a mate can contain the other in a concordant alignment.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-no-overlap\">\n\n<pre><code>--no-overlap</code></pre>\n</td><td>\n\n<p>If one mate alignment overlaps the other at all, consider that to be non-concordant. See also: <a href=\"#mates-can-overlap-contain-or-dovetail-each-other\">Mates can overlap, contain or dovetail each other</a>. Default: mates can overlap in a concordant alignment.</p>\n</td></tr></table>\n\n<h4 id=\"output-options\">Output options</h4>\n<table>\n\n<tr><td id=\"bowtie2-options-t\">\n\n<pre><code>-t/--time</code></pre>\n</td><td>\n\n<p>Print the wall-clock time required to load the index files and align the reads. This is printed to the &quot;standard error&quot; (&quot;stderr&quot;) filehandle. Default: off.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-un\">\n\n<pre><code>--un &lt;path&gt;\n--un-gz &lt;path&gt;\n--un-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write unpaired reads that fail to align to file at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code> bit set and neither the <code>0x40</code> nor <code>0x80</code> bits set. If <code>--un-gz</code> is specified, output will be gzip compressed. If <code>--un-bz2</code> is specified, output will be bzip2 compressed. 
Reads written in this way will appear exactly as they did in the input file, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the input.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-al\">\n\n<pre><code>--al &lt;path&gt;\n--al-gz &lt;path&gt;\n--al-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write unpaired reads that align at least once to file at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code>, <code>0x40</code>, and <code>0x80</code> bits unset. If <code>--al-gz</code> is specified, output will be gzip compressed. If <code>--al-bz2</code> is specified, output will be bzip2 compressed. Reads written in this way will appear exactly as they did in the input file, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the input.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-un-conc\">\n\n<pre><code>--un-conc &lt;path&gt;\n--un-conc-gz &lt;path&gt;\n--un-conc-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write paired-end reads that fail to align concordantly to file(s) at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code> bit set and either the <code>0x40</code> or <code>0x80</code> bit set (depending on whether it's mate #1 or #2). <code>.1</code> and <code>.2</code> strings are added to the filename to distinguish which file contains mate #1 and mate #2. If a percent symbol, <code>%</code>, is used in <code>&lt;path&gt;</code>, the percent symbol is replaced with <code>1</code> or <code>2</code> to make the per-mate filenames. Otherwise, <code>.1</code> or <code>.2</code> are added before the final dot in <code>&lt;path&gt;</code> to make the per-mate filenames. 
Reads written in this way will appear exactly as they did in the input files, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the inputs.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-al-conc\">\n\n<pre><code>--al-conc &lt;path&gt;\n--al-conc-gz &lt;path&gt;\n--al-conc-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write paired-end reads that align concordantly at least once to file(s) at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code> bit unset and either the <code>0x40</code> or <code>0x80</code> bit set (depending on whether it's mate #1 or #2). <code>.1</code> and <code>.2</code> strings are added to the filename to distinguish which file contains mate #1 and mate #2. If a percent symbol, <code>%</code>, is used in <code>&lt;path&gt;</code>, the percent symbol is replaced with <code>1</code> or <code>2</code> to make the per-mate filenames. Otherwise, <code>.1</code> or <code>.2</code> are added before the final dot in <code>&lt;path&gt;</code> to make the per-mate filenames. Reads written in this way will appear exactly as they did in the input files, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the inputs.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-quiet\">\n\n<pre><code>--quiet</code></pre>\n</td><td>\n\n<p>Print nothing besides alignments and serious errors.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-met-file\">\n\n<pre><code>--met-file &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write <code>bowtie2</code> metrics to file <code>&lt;path&gt;</code>. Having alignment metrics can be useful for debugging certain problems, especially performance issues. See also: <a href=\"#bowtie2-options-met\"><code>--met</code></a>. 
Default: metrics disabled.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-met-stderr\">\n\n<pre><code>--met-stderr &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write <code>bowtie2</code> metrics to the &quot;standard error&quot; (&quot;stderr&quot;) filehandle. This is not mutually exclusive with <a href=\"#bowtie2-options-met-file\"><code>--met-file</code></a>. Having alignment metrics can be useful for debugging certain problems, especially performance issues. See also: <a href=\"#bowtie2-options-met\"><code>--met</code></a>. Default: metrics disabled.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-met\">\n\n<pre><code>--met &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Write a new <code>bowtie2</code> metrics record every <code>&lt;int&gt;</code> seconds. Only matters if either <a href=\"#bowtie2-options-met-stderr\"><code>--met-stderr</code></a> or <a href=\"#bowtie2-options-met-file\"><code>--met-file</code></a> is specified. Default: 1.</p>\n</td></tr>\n</table>\n\n<h4 id=\"sam-options\">SAM options</h4>\n<table>\n\n<tr><td id=\"bowtie2-options-no-unal\">\n\n<pre><code>--no-unal</code></pre>\n</td><td>\n\n<p>Suppress SAM records for reads that failed to align.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-no-hd\">\n\n<pre><code>--no-hd</code></pre>\n</td><td>\n\n<p>Suppress SAM header lines (starting with <code>@</code>).</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-no-sq\">\n\n<pre><code>--no-sq</code></pre>\n</td><td>\n\n<p>Suppress <code>@SQ</code> SAM header lines.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-rg-id\">\n\n<pre><code>--rg-id &lt;text&gt;</code></pre>\n</td><td>\n\n<p>Set the read group ID to <code>&lt;text&gt;</code>. This causes the SAM <code>@RG</code> header line to be printed, with <code>&lt;text&gt;</code> as the value associated with the <code>ID:</code> tag. 
It also causes the <code>RG:Z:</code> extra field to be attached to each SAM output record, with value set to <code>&lt;text&gt;</code>.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-rg\">\n\n<pre><code>--rg &lt;text&gt;</code></pre>\n</td><td>\n\n<p>Add <code>&lt;text&gt;</code> (usually of the form <code>TAG:VAL</code>, e.g. <code>SM:Pool1</code>) as a field on the <code>@RG</code> header line. Note: in order for the <code>@RG</code> line to appear, <a href=\"#bowtie2-options-rg-id\"><code>--rg-id</code></a> must also be specified. This is because the <code>ID</code> tag is required by the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM Spec</a>. Specify <code>--rg</code> multiple times to set multiple fields. See the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM Spec</a> for details about what fields are legal.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-omit-sec-seq\">\n\n<pre><code>--omit-sec-seq</code></pre>\n</td><td>\n\n<p>When printing secondary alignments, Bowtie 2 by default will write out the <code>SEQ</code> and <code>QUAL</code> strings. Specifying this option causes Bowtie 2 to print an asterisk in those fields instead.</p>\n</td></tr>\n\n\n</table>\n\n<h4 id=\"performance-options\">Performance options</h4>\n<table><tr>\n\n<td id=\"bowtie2-options-o\">\n\n<pre><code>-o/--offrate &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Override the offrate of the index with <code>&lt;int&gt;</code>. If <code>&lt;int&gt;</code> is greater than the offrate used to build the index, then some row markings are discarded when the index is read into memory. This reduces the memory footprint of the aligner but requires more time to calculate text offsets. <code>&lt;int&gt;</code> must be greater than the value used to build the index.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-p\">\n\n<pre><code>-p/--threads NTHREADS</code></pre>\n</td><td>\n\n<p>Launch <code>NTHREADS</code> parallel search threads (default: 1). 
Threads will run on separate processors/cores and synchronize when parsing reads and outputting alignments. Searching for alignments is highly parallel, and speedup is close to linear. Increasing <code>-p</code> increases Bowtie 2's memory footprint. E.g. when aligning to a human genome index, increasing <code>-p</code> from 1 to 8 increases the memory footprint by a few hundred megabytes. This option is only available if <code>bowtie</code> is linked with the <code>pthreads</code> library (i.e. if <code>BOWTIE_PTHREADS=0</code> is not specified at build time).</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-reorder\">\n\n<pre><code>--reorder</code></pre>\n</td><td>\n\n<p>Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file, even when <a href=\"#bowtie2-options-p\"><code>-p</code></a> is set greater than 1. Specifying <code>--reorder</code> and setting <a href=\"#bowtie2-options-p\"><code>-p</code></a> greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory than if <code>--reorder</code> were not specified. Has no effect if <a href=\"#bowtie2-options-p\"><code>-p</code></a> is set to 1, since output order will naturally correspond to input order in that case.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-mm\">\n\n<pre><code>--mm</code></pre>\n</td><td>\n\n<p>Use memory-mapped I/O to load the index, rather than typical file I/O. Memory-mapping allows many concurrent <code>bowtie</code> processes on the same computer to share the same memory image of the index (i.e. you pay the memory overhead just once). 
This facilitates memory-efficient parallelization of <code>bowtie</code> in situations where using <a href=\"#bowtie2-options-p\"><code>-p</code></a> is not possible or not preferable.</p>\n</td></tr></table>\n\n<h4 id=\"other-options\">Other options</h4>\n<table>\n<tr><td id=\"bowtie2-options-qc-filter\">\n\n<pre><code>--qc-filter</code></pre>\n</td><td>\n\n<p>Filter out reads for which the QSEQ filter field is non-zero. Only has an effect when read format is <a href=\"#bowtie2-options-qseq\"><code>--qseq</code></a>. Default: off.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-seed\">\n\n<pre><code>--seed &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Use <code>&lt;int&gt;</code> as the seed for pseudo-random number generator. Default: 0.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-non-deterministic\">\n\n<pre><code>--non-deterministic</code></pre>\n</td><td>\n\n<p>Normally, Bowtie 2 re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the <a href=\"#bowtie2-options-seed\"><code>--seed</code></a> option. This means that if two reads are identical (same name, same nucleotides, same qualities) Bowtie 2 will find and report the same alignment(s) for both, even if there was ambiguity. When <code>--non-deterministic</code> is specified, Bowtie 2 re-initializes its pseudo-random generator for each read using the current time. This means that Bowtie 2 will not necessarily report the same alignment for two identical reads. 
This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-version\">\n\n<pre><code>--version</code></pre>\n</td><td>\n\n<p>Print version information and quit.</p>\n</td></tr>\n<tr><td id=\"bowtie2-options-h\">\n\n<pre><code>-h/--help</code></pre>\n</td><td>\n\n<p>Print usage information and quit.</p>\n</td></tr></table>\n\n<h2 id=\"sam-output\">SAM output</h2>\n<p>Following is a brief description of the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> format as output by <code>bowtie2</code>. For more details, see the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM format specification</a>.</p>\n<p>By default, <code>bowtie2</code> prints a SAM header with <code>@HD</code>, <code>@SQ</code> and <code>@PG</code> lines. When one or more <a href=\"#bowtie2-options-rg\"><code>--rg</code></a> arguments are specified, <code>bowtie2</code> will also print an <code>@RG</code> line that includes all user-specified <a href=\"#bowtie2-options-rg\"><code>--rg</code></a> tokens separated by tabs.</p>\n<p>Each subsequent line describes an alignment or, if the read failed to align, a read. Each line is a collection of at least 12 fields separated by tabs; from left to right, the fields are:</p>\n<ol style=\"list-style-type: decimal\">\n<li><p>Name of read that aligned.</p>\n<p>Note that the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM specification</a> disallows whitespace in the read name. If the read name contains any whitespace characters, Bowtie 2 will truncate the name at the first whitespace character. This is similar to the behavior of other tools.</p></li>\n<li><p>Sum of all applicable flags. 
Flags relevant to Bowtie are:</p>\n<table><tr><td>\n\n<pre><code>1</code></pre>\n</td><td>\n\n<p>The read is one of a pair</p>\n</td></tr><tr><td>\n\n<pre><code>2</code></pre>\n</td><td>\n\n<p>The alignment is one end of a proper paired-end alignment</p>\n</td></tr><tr><td>\n\n<pre><code>4</code></pre>\n</td><td>\n\n<p>The read has no reported alignments</p>\n</td></tr><tr><td>\n\n<pre><code>8</code></pre>\n</td><td>\n\n<p>The read is one of a pair and has no reported alignments</p>\n</td></tr><tr><td>\n\n<pre><code>16</code></pre>\n</td><td>\n\n<p>The alignment is to the reverse reference strand</p>\n</td></tr><tr><td>\n\n<pre><code>32</code></pre>\n</td><td>\n\n<p>The other mate in the paired-end alignment is aligned to the reverse reference strand</p>\n</td></tr><tr><td>\n\n<pre><code>64</code></pre>\n</td><td>\n\n<p>The read is mate 1 in a pair</p>\n</td></tr><tr><td>\n\n<pre><code>128</code></pre>\n</td><td>\n\n<p>The read is mate 2 in a pair</p>\n</td></tr></table>\n\n<p>Thus, an unpaired read that aligns to the reverse reference strand will have flag 16. A paired-end read that aligns and is the first mate in the pair will have flag 83 (= 64 + 16 + 2 + 1).</p></li>\n<li><p>Name of reference sequence where alignment occurs</p></li>\n<li><p>1-based offset into the forward reference strand where leftmost character of the alignment occurs</p></li>\n<li><p>Mapping quality</p></li>\n<li><p>CIGAR string representation of alignment</p></li>\n<li><p>Name of reference sequence where mate's alignment occurs. Set to <code>=</code> if the mate's reference sequence is the same as this alignment's, or <code>*</code> if there is no mate.</p></li>\n<li><p>1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.</p></li>\n<li><p>Inferred fragment length. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if the mates did not align concordantly. 
However, size is non-0 if the mates aligned discordantly to the same chromosome.</p></li>\n<li><p>Read sequence (reverse-complemented if aligned to the reverse strand)</p></li>\n<li><p>ASCII-encoded read qualities (reversed if the read aligned to the reverse strand). The encoded quality values are on the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> scale and the encoding is ASCII-offset by 33 (ASCII char <code>!</code>), similarly to a <a href=\"http://en.wikipedia.org/wiki/FASTQ_format\">FASTQ</a> file.</p></li>\n<li><p>Optional fields. Fields are tab-separated. <code>bowtie2</code> outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:</p>\n<table>\n<tr><td id=\"bowtie2-build-opt-fields-as\">\n\n<pre><code>AS:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>Alignment score. Can be negative. Can be greater than 0 in <code>--local</code> mode (but not in <code>--end-to-end</code> mode). Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-xs\">\n\n<pre><code>XS:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>Alignment score for second-best alignment. Can be negative. Can be greater than 0 in <code>--local</code> mode (but not in <code>--end-to-end</code> mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-ys\">\n\n<pre><code>YS:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-xn\">\n\n<pre><code>XN:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-xm\">\n\n<pre><code>XM:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>The number of mismatches in the alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-xo\">\n\n<pre><code>XO:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-xg\">\n\n<pre><code>XG:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-nm\">\n\n<pre><code>NM:i:&lt;N&gt;</code></pre>\n</td><td>\n\n<p>The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-yf\">\n\n<pre><code>YF:Z:&lt;S&gt;</code></pre>\n</td><td>\n\n<p>String indicating reason why the read was filtered out. See also: Filtering. Only appears for reads that were filtered out.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-yt\">\n\n<pre><code>YT:Z:&lt;S&gt;</code></pre>\n</td><td>\n\n<p>Value of <code>UU</code> indicates the read was not part of a pair. Value of <code>CP</code> indicates the read was part of a pair and the pair aligned concordantly. Value of <code>DP</code> indicates the read was part of a pair and the pair aligned discordantly. Value of <code>UP</code> indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-opt-fields-md\">\n\n<pre><code>MD:Z:&lt;S&gt;</code></pre>\n</td><td>\n\n<p>A string representation of the mismatched reference bases in the alignment. See the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> format specification for details. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n</table>\n</li>\n</ol>\n<h1 id=\"the-bowtie2-build-indexer\">The <code>bowtie2-build</code> indexer</h1>\n<p><code>bowtie2-build</code> builds a Bowtie index from a set of DNA sequences. <code>bowtie2-build</code> outputs a set of 6 files with suffixes <code>.1.bt2</code>, <code>.2.bt2</code>, <code>.3.bt2</code>, <code>.4.bt2</code>, <code>.rev.1.bt2</code>, and <code>.rev.2.bt2</code>. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.</p>\n<p>Bowtie 2's <code>.bt2</code> index format is different from Bowtie 1's <code>.ebwt</code> format, and they are not compatible with each other.</p>\n<p>Use of Karkkainen's <a href=\"http://portal.acm.org/citation.cfm?id=1314852\">blockwise algorithm</a> allows <code>bowtie2-build</code> to trade off between running time and memory usage. 
<code>bowtie2-build</code> has three options governing how it makes this trade: <a href=\"#bowtie2-build-options-p\"><code>-p</code>/<code>--packed</code></a>, <a href=\"#bowtie2-build-options-bmax\"><code>--bmax</code></a>/<a href=\"#bowtie2-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>, and <a href=\"#bowtie2-build-options-dcv\"><code>--dcv</code></a>. By default, <code>bowtie2-build</code> will automatically search for the settings that yield the best running time without exhausting memory. This behavior can be disabled using the <a href=\"#bowtie2-build-options-a\"><code>-a</code>/<code>--noauto</code></a> option.</p>\n<p>The indexer provides options pertaining to the &quot;shape&quot; of the index, e.g. <a href=\"#bowtie2-build-options-o\"><code>--offrate</code></a> governs the fraction of <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> rows that are &quot;marked&quot; (i.e., the density of the suffix-array sample; see the original <a href=\"http://portal.acm.org/citation.cfm?id=796543\">FM Index</a> paper for details). All of these options are potentially profitable trade-offs depending on the application. They have been set to defaults that are reasonable for most cases according to our experiments. See <a href=\"#performance-tuning\">Performance tuning</a> for details.</p>\n<p>Because <code>bowtie2-build</code> uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, <code>bowtie2-build</code> will print an error message and abort. 
To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each.</p>\n<p>If your computer has more than 3-4 GB of memory and you would like to exploit that fact to make index building faster, use a 64-bit version of the <code>bowtie2-build</code> binary. The 32-bit version of the binary is restricted to using less than 4 GB of memory. If a 64-bit pre-built binary does not yet exist for your platform on the sourceforge download site, you will need to build one from source.</p>\n<p>The Bowtie 2 index is based on the <a href=\"http://portal.acm.org/citation.cfm?id=796543\">FM Index</a> of Ferragina and Manzini, which in turn is based on the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> transform. The algorithm used to build the index is based on the <a href=\"http://portal.acm.org/citation.cfm?id=1314852\">blockwise algorithm</a> of Karkkainen.</p>\n<h2 id=\"command-line-1\">Command Line</h2>\n<p>Usage:</p>\n<pre><code>bowtie2-build [options]* &lt;reference_in&gt; &lt;bt2_base&gt;</code></pre>\n<h3 id=\"main-arguments-1\">Main arguments</h3>\n<table><tr><td>\n\n<pre><code>&lt;reference_in&gt;</code></pre>\n</td><td>\n\n<p>A comma-separated list of FASTA files containing the reference sequences to be aligned to, or, if <a href=\"#bowtie2-build-options-c\"><code>-c</code></a> is specified, the sequences themselves. E.g., <code>&lt;reference_in&gt;</code> might be <code>chr1.fa,chr2.fa,chrX.fa,chrY.fa</code>, or, if <a href=\"#bowtie2-build-options-c\"><code>-c</code></a> is specified, this might be <code>GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA</code>.</p>\n</td></tr><tr><td>\n\n<pre><code>&lt;bt2_base&gt;</code></pre>\n</td><td>\n\n<p>The basename of the index files to write. 
By default, <code>bowtie2-build</code> writes files named <code>NAME.1.bt2</code>, <code>NAME.2.bt2</code>, <code>NAME.3.bt2</code>, <code>NAME.4.bt2</code>, <code>NAME.rev.1.bt2</code>, and <code>NAME.rev.2.bt2</code>, where <code>NAME</code> is <code>&lt;bt2_base&gt;</code>.</p>\n</td></tr></table>\n\n<h3 id=\"options-1\">Options</h3>\n<table><tr><td>\n\n<pre><code>-f</code></pre>\n</td><td>\n\n<p>The reference input files (specified as <code>&lt;reference_in&gt;</code>) are FASTA files (usually having extension <code>.fa</code>, <code>.mfa</code>, <code>.fna</code> or similar).</p>\n</td></tr><tr><td id=\"bowtie2-build-options-c\">\n\n<pre><code>-c</code></pre>\n</td><td>\n\n<p>The reference sequences are given on the command line. I.e. <code>&lt;reference_in&gt;</code> is a comma-separated list of sequences rather than a list of FASTA files.</p>\n</td></tr>\n<tr><td id=\"bowtie2-build-options-a\">\n\n<pre><code>-a/--noauto</code></pre>\n</td><td>\n\n<p>Disable the default behavior whereby <code>bowtie2-build</code> automatically selects values for the <a href=\"#bowtie2-build-options-bmax\"><code>--bmax</code></a>, <a href=\"#bowtie2-build-options-dcv\"><code>--dcv</code></a> and <a href=\"#bowtie2-build-options-p\"><code>--packed</code></a> parameters according to available memory. Instead, user may specify values for those parameters. If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.</p>\n</td></tr><tr><td id=\"bowtie2-build-options-p\">\n\n<pre><code>-p/--packed</code></pre>\n</td><td>\n\n<p>Use a packed (2-bits-per-nucleotide) representation for DNA strings. This saves memory but makes indexing 2-3 times slower. Default: off. 
This is configured automatically by default; use <a href=\"#bowtie2-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"bowtie2-build-options-bmax\">\n\n<pre><code>--bmax &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. Setting this option overrides any previous setting for <a href=\"#bowtie2-build-options-bmax\"><code>--bmax</code></a>, or <a href=\"#bowtie2-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>. Default (in terms of the <a href=\"#bowtie2-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> parameter) is <a href=\"#bowtie2-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> 4. This is configured automatically by default; use <a href=\"#bowtie2-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"bowtie2-build-options-bmaxdivn\">\n\n<pre><code>--bmaxdivn &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for <a href=\"#bowtie2-build-options-bmax\"><code>--bmax</code></a>, or <a href=\"#bowtie2-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>. Default: <a href=\"#bowtie2-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> 4. This is configured automatically by default; use <a href=\"#bowtie2-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"bowtie2-build-options-dcv\">\n\n<pre><code>--dcv &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Use <code>&lt;int&gt;</code> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. 
This is configured automatically by default; use <a href=\"#bowtie2-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"bowtie2-build-options-nodc\">\n\n<pre><code>--nodc</code></pre>\n</td><td>\n\n<p>Disable use of the difference-cover sample. Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.</p>\n</td></tr><tr><td>\n\n<pre><code>-r/--noref</code></pre>\n</td><td>\n\n<p>Do not build the <code>NAME.3.bt2</code> and <code>NAME.4.bt2</code> portions of the index, which contain a bitpacked version of the reference sequences and are used for paired-end alignment.</p>\n</td></tr><tr><td>\n\n<pre><code>-3/--justref</code></pre>\n</td><td>\n\n<p>Build only the <code>NAME.3.bt2</code> and <code>NAME.4.bt2</code> portions of the index, which contain a bitpacked version of the reference sequences and are used for paired-end alignment.</p>\n</td></tr><tr><td id=\"bowtie2-build-options-o\">\n\n<pre><code>-o/--offrate &lt;int&gt;</code></pre>\n</td><td>\n\n<p>To map alignments back to positions on the reference sequences, it's necessary to annotate (&quot;mark&quot;) some or all of the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> rows with their corresponding location on the genome. <a href=\"#bowtie2-build-options-o\"><code>-o</code>/<code>--offrate</code></a> governs how many rows get marked: the indexer will mark every 2^<code>&lt;int&gt;</code> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. 
The default is 5 (every 32nd row is marked; for human genome, annotations occupy about 340 megabytes).</p>\n</td></tr><tr><td>\n\n<pre><code>-t/--ftabchars &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The ftab is the lookup table used to calculate an initial <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> range with respect to the first <code>&lt;int&gt;</code> characters of the query. A larger <code>&lt;int&gt;</code> yields a larger lookup table but faster query times. The ftab has size 4^(<code>&lt;int&gt;</code>+1) bytes. The default setting is 10 (ftab is 4MB).</p>\n</td></tr><tr><td>\n\n<pre><code>--seed &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Use <code>&lt;int&gt;</code> as the seed for pseudo-random number generator.</p>\n</td></tr><tr><td>\n\n<pre><code>--cutoff &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Index only the first <code>&lt;int&gt;</code> bases of the reference sequences (cumulative across sequences) and ignore the rest.</p>\n</td></tr><tr><td>\n\n<pre><code>-q/--quiet</code></pre>\n</td><td>\n\n<p><code>bowtie2-build</code> is verbose by default. With this option <code>bowtie2-build</code> will print only error messages.</p>\n</td></tr><tr><td>\n\n<pre><code>-h/--help</code></pre>\n</td><td>\n\n<p>Print usage information and quit.</p>\n</td></tr><tr><td>\n\n<pre><code>--version</code></pre>\n</td><td>\n\n<p>Print version information and quit.</p>\n</td></tr></table>\n\n<h1 id=\"the-bowtie2-inspect-index-inspector\">The <code>bowtie2-inspect</code> index inspector</h1>\n<p><code>bowtie2-inspect</code> extracts information from a Bowtie index about what kind of index it is and what reference sequences were used to build it. When run without any options, the tool will output a FASTA file containing the sequences of the original references (with all non-<code>A</code>/<code>C</code>/<code>G</code>/<code>T</code> characters converted to <code>N</code>s). 
It can also be used to extract just the reference sequence names using the <a href=\"#bowtie2-inspect-options-n\"><code>-n</code>/<code>--names</code></a> option or a more verbose summary using the <a href=\"#bowtie2-inspect-options-s\"><code>-s</code>/<code>--summary</code></a> option.</p>\n<h2 id=\"command-line-2\">Command Line</h2>\n<p>Usage:</p>\n<pre><code>bowtie2-inspect [options]* &lt;bt2_base&gt;</code></pre>\n<h3 id=\"main-arguments-2\">Main arguments</h3>\n<table><tr><td>\n\n<pre><code>&lt;bt2_base&gt;</code></pre>\n</td><td>\n\n<p>The basename of the index to be inspected. The basename is the name of any of the index files but with the <code>.X.bt2</code> or <code>.rev.X.bt2</code> suffix omitted. <code>bowtie2-inspect</code> first looks in the current directory for the index files, then in the directory specified in the <code>BOWTIE2_INDEXES</code> environment variable.</p>\n</td></tr></table>\n\n<h3 id=\"options-2\">Options</h3>\n<table><tr><td>\n\n<pre><code>-a/--across &lt;int&gt;</code></pre>\n</td><td>\n\n<p>When printing FASTA output, output a newline character every <code>&lt;int&gt;</code> bases (default: 60).</p>\n</td></tr><tr><td id=\"bowtie2-inspect-options-n\">\n\n<pre><code>-n/--names</code></pre>\n</td><td>\n\n<p>Print reference sequence names, one per line, and quit.</p>\n</td></tr><tr><td id=\"bowtie2-inspect-options-s\">\n\n<pre><code>-s/--summary</code></pre>\n</td><td>\n\n<p>Print a summary that includes information about index settings, as well as the names and lengths of the input sequences. The summary has this format:</p>\n<pre><code>Colorspace  &lt;0 or 1&gt;\nSA-Sample   1 in &lt;sample&gt;\nFTab-Chars  &lt;chars&gt;\nSequence-1  &lt;name&gt;  &lt;len&gt;\nSequence-2  &lt;name&gt;  &lt;len&gt;\n...\nSequence-N  &lt;name&gt;  &lt;len&gt;</code></pre>\n<p>Fields are separated by tabs. 
Colorspace is always set to 0 for Bowtie 2.</p>\n</td></tr><tr><td>\n\n<pre><code>-v/--verbose</code></pre>\n</td><td>\n\n<p>Print verbose output (for debugging).</p>\n</td></tr><tr><td>\n\n<pre><code>--version</code></pre>\n</td><td>\n\n<p>Print version information and quit.</p>\n</td></tr><tr><td>\n\n<pre><code>-h/--help</code></pre>\n</td><td>\n\n<p>Print usage information and quit.</p>\n</td></tr></table>\n\n<h1 id=\"getting-started-with-bowtie-2-lambda-phage-example\">Getting started with Bowtie 2: Lambda phage example</h1>\n<p>Bowtie 2 comes with some example files to get you started. The example files are not scientifically significant; we use the <a href=\"http://en.wikipedia.org/wiki/Lambda_phage\">Lambda phage</a> reference genome simply because it's short, and the reads were generated by a computer program, not a sequencer. However, these files will let you start running Bowtie 2 and downstream tools right away.</p>\n<p>First follow the manual instructions to <a href=\"#obtaining-bowtie-2\">obtain Bowtie 2</a>. Set the <code>BT2_HOME</code> environment variable to point to the new Bowtie 2 directory containing the <code>bowtie2</code>, <code>bowtie2-build</code> and <code>bowtie2-inspect</code> binaries. This is important, as the <code>BT2_HOME</code> variable is used in the commands below to refer to that directory.</p>\n<h2 id=\"indexing-a-reference-genome\">Indexing a reference genome</h2>\n<p>To create an index for the <a href=\"http://en.wikipedia.org/wiki/Lambda_phage\">Lambda phage</a> reference genome included with Bowtie 2, create a new temporary directory (it doesn't matter where), change into that directory, and run:</p>\n<pre><code>$BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus</code></pre>\n<p>The command should print many lines of output then quit. 
When the command completes, the current directory will contain six new files that all start with <code>lambda_virus</code> and end with <code>.1.bt2</code>, <code>.2.bt2</code>, <code>.3.bt2</code>, <code>.4.bt2</code>, <code>.rev.1.bt2</code>, and <code>.rev.2.bt2</code>. These files constitute the index - you're done!</p>\n<p>You can use <code>bowtie2-build</code> to create an index for a set of FASTA files obtained from any source, including sites such as <a href=\"http://genome.ucsc.edu/cgi-bin/hgGateway\">UCSC</a>, <a href=\"http://www.ncbi.nlm.nih.gov/sites/genome\">NCBI</a>, and <a href=\"http://www.ensembl.org/\">Ensembl</a>. When indexing multiple FASTA files, specify all the files using commas to separate file names. For more details on how to create an index with <code>bowtie2-build</code>, see the <a href=\"#the-bowtie2-build-indexer\">manual section on index building</a>. You may also want to bypass this process by obtaining a pre-built index. See <a href=\"#using-a-pre-built-index\">using a pre-built index</a> below for an example.</p>\n<h2 id=\"aligning-example-reads\">Aligning example reads</h2>\n<p>Stay in the directory created in the previous step, which now contains the <code>lambda_virus</code> index files. Next, run:</p>\n<pre><code>$BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam</code></pre>\n<p>This runs the Bowtie 2 aligner, which aligns a set of unpaired reads to the <a href=\"http://en.wikipedia.org/wiki/Lambda_phage\">Lambda phage</a> reference genome using the index generated in the previous step. The alignment results in SAM format are written to the file <code>eg1.sam</code>, and a short alignment summary is written to the console. 
(Actually, the summary is written to the &quot;standard error&quot; or &quot;stderr&quot; filehandle, which is typically printed to the console.)</p>\n<p>To see the first few lines of the SAM output, run:</p>\n<pre><code>head eg1.sam</code></pre>\n<p>You will see something like this:</p>\n<pre><code>@HD VN:1.0  SO:unsorted\n@SQ SN:gi|9626243|ref|NC_001416.1|  LN:48502\n@PG ID:bowtie2  PN:bowtie2  VN:2.0.1\nr1  0   gi|9626243|ref|NC_001416.1| 18401   42  122M    *   0   0   TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG  +&quot;@6&lt;:27(F&amp;5)9)&quot;B:%B+A-%5A?2$HCB0B+0=D&lt;7E/&lt;.03#!.F77@6B==?C&quot;7&gt;;))%;,3-$.A06+&lt;-1/@@?,26&quot;&gt;=?*@&#39;0;$:;??G+:#+(A?9+10!8!?()?7C&gt;  AS:i:-5 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:59G13G21G26    YT:Z:UU\nr2  0   gi|9626243|ref|NC_001416.1| 8886    42  275M    *   0   0   NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC (#!!&#39;+!$&quot;&quot;%+(+)&#39;%)%!+!(&amp;++)&#39;&#39;&quot;#&quot;#&amp;#&quot;!&#39;!(&quot;%&#39;&quot;&quot;(&quot;+&amp;%$%*%%#$%#%#!)*&#39;(#&quot;)(($&amp;$&#39;&amp;%+&amp;#%*)*#*%*&#39;)(%+!%%*&quot;$%&quot;#+)$&amp;&amp;+)&amp;)*+!&quot;*)!*!(&quot;&amp;&amp;&quot;*#+&quot;&amp;&quot;&#39;(%)*(&quot;&#39;!$*!!%$&amp;&amp;&amp;$!!&amp;&amp;&quot;(*&quot;$&amp;&quot;#&amp;!$%&#39;%&quot;#)$#+%*+)!&amp;*)+(&quot;&quot;#!)!%*#&quot;*)*&#39;)&amp;&quot;)($+*%%)!*)!(&#39;(%&quot;&quot;+%&quot;$##&quot;#+((&#39;!*(($*&#39;!&quot;*(&#39;&quot;+)&amp;%#&amp;$+(&#39;**$$&amp;+*&amp;!#%)&#39;)&#39;(+(!%+ AS:i:-14    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:0A0C0G0A108C23G9T81T46 YT:Z:UU\nr3  16  gi|9626243|ref|NC_001416.1| 11599   42  338M    *   0   0   
GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT  7F$%6=$:9B@/F&#39;&gt;=?!D?@0(:A*)7/&gt;9C&gt;6#1&lt;6:C(.CC;#.;&gt;;2&#39;$4D:?&amp;B!&gt;689?(0(G7+0=@37F)GG=&gt;?958.D2E04C&lt;E,*AD%G0.%$+A:&#39;H;?8&lt;72:88?E6((CF)6DF#.)=&gt;B&gt;D-=&quot;C&#39;B080E&#39;5BH&quot;77&#39;:&quot;@70#4%A5=6.2/1&gt;;9&quot;&amp;-H6)=$/0;5E:&lt;8G!@::1?2DC7C*;@*#.1C0.D&gt;H/20,!&quot;C-#,6@%&lt;+&lt;D(AG-).?&amp;#0.00&#39;@)/F8?B!&amp;&quot;170,)&gt;:?&lt;A7#1(A@0E#&amp;A.*DC.E&quot;)AH&quot;+.,5,2&gt;5&quot;2?:G,F&quot;D0B8D-6$65D&lt;D!A/38860.*4;4B&lt;*31?6  AS:i:-22    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:80C4C16A52T23G30A8T76A41   YT:Z:UU\nr4  0   gi|9626243|ref|NC_001416.1| 40075   42  184M    *   0   0   GGGCCAATGCGCTTACTGATGCGGAATTACGCCGTAAGGCCGCAGATGAGCTTGTCCATATGACTGCGAGAATTAACNGTGGTGAGGCGATCCCTGAACCAGTAAAACAACTTCCTGTCATGGGCGGTAGACCTCTAAATCGTGCACAGGCTCTGGCGAAGATCGCAGAAATCAAAGCTAAGT(=8B)GD04*G%&amp;4F,1&#39;A&gt;.C&amp;7=F$,+#6!))43C,5/5+)?-/0&gt;/D3=-,2/+.1?@-&gt;;)00!&#39;3!7BH$G)HG+ADC&#39;#-9F)7&lt;7&quot;$?&amp;.&gt;0)@5;4,!0-#C!15CF8&amp;HB+B==H&gt;7,/)C5)5*+(F5A%D,EA&lt;(&gt;G9E0&gt;7&amp;/E?4%;#&#39;92)&lt;5+@7:A.(BG@BG86@.G AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:77C106 YT:Z:UU\nr5  0   gi|9626243|ref|NC_001416.1| 48010   42  138M    *   0   0   GTCAGGAAAGTGGTAAAACTGCAACTCAATTACTGCAATGCCCTCGTAATTAAGTGAATTTACAATATCGTCCTGTTCGGAGGGAAGAACGCGGGATGTTCATTCTTCATCACTTTTAATTGATGTATATGCTCTCTT  9&#39;&#39;%&lt;D)A03E1-*7=),:F/0!6,D9:H,&lt;9D%:0B(%&#39;E,(8EFG$E89B$27G8F*2+4,-!,0D5()&amp;=(FGG:5;3*@/.0F-G#5#3-&gt;(&#39;FDFEG?)5.!)&quot;AGADB3?6(@H(:B&lt;&gt;6!&gt;;&gt;6&gt;G,.&quot;?%  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:138    YT:Z:UU\nr6  16  
gi|9626243|ref|NC_001416.1| 41607   42  72M2D119M   *   0   0   TCGATTTGCAAATACCGGAACATCTCGGTAACTGCATATTCTGCATTAAAAAATCAACGCAAAAAATCGGACGCCTGCAAAGATGAGGAGGGATTGCAGCGTGTTTTTAATGAGGTCATCACGGGATNCCATGTGCGTGACGGNCATCGGGAAACGCCAAAGGAGATTATGTACCGAGGAAGAATGTCGCT 1H#G;H&quot;$E*E#&amp;&quot;*)2%66?=9/9&#39;=;4)4/&gt;@%+5#@#$4A*!&lt;D==&quot;8#1*A9BA=:(1+#C&amp;.#(3#H=9E)AC*5,AC#E&#39;536*2?)H14?&gt;9&#39;B=7(3H/B:+A:8%1-+#(E%&amp;$$&amp;14&quot;76D?&gt;7(&amp;20H5%*&amp;CF8!G5B+A4F$7(:&quot;&#39;?0$?G+$)B-?2&lt;0&lt;F=D!38BH,%=8&amp;5@+ AS:i:-13    XN:i:0  XM:i:2  XO:i:1  XG:i:2  NM:i:4  MD:Z:72^TT55C15A47  YT:Z:UU\nr7  16  gi|9626243|ref|NC_001416.1| 4692    42  143M    *   0   0   TCAGCCGGACGCGGGCGCTGCAGCCGTACTCGGGGATGACCGGTTACAACGGCATTATCGCCCGTCTGCAACAGGCTGCCAGCGATCCGATGGTGGACAGCATTCTGCTCGATATGGACANGCCCGGCGGGATGGTGGCGGGG -&quot;/@*7A0)&gt;2,AAH@&amp;&quot;%B)*5*23B/,)90.B@%=FE,E063C9?,:26$-0:,.,1849&#39;4.;F&gt;FA;76+5&amp;$&lt;C&quot;:$!A*,&lt;B,&lt;)@&lt;&#39;85D%C*:)30@85;?.B$05=@95DCDH&lt;53!8G:F:B7/A.E&#39;:434&gt; AS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:98G21C22   YT:Z:UU</code></pre>\n<p>The first few lines (beginning with <code>@</code>) are SAM header lines, and the rest of the lines are SAM alignments, one line per read or mate. 
See the <a href=\"#sam-output\">Bowtie 2 manual section on SAM output</a> and the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM specification</a> for details about how to interpret the SAM file format.</p>\n<h2 id=\"paired-end-example\">Paired-end example</h2>\n<p>To align paired-end reads included with Bowtie 2, stay in the same directory and run:</p>\n<pre><code>$BT2_HOME/bowtie2 -x lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam</code></pre>\n<p>This aligns a set of paired-end reads to the reference genome, with results written to the file <code>eg2.sam</code>.</p>\n<h2 id=\"local-alignment-example\">Local alignment example</h2>\n<p>To use <a href=\"#end-to-end-alignment-versus-local-alignment\">local alignment</a> to align some longer reads included with Bowtie 2, stay in the same directory and run:</p>\n<pre><code>$BT2_HOME/bowtie2 --local -x lambda_virus -U $BT2_HOME/example/reads/longreads.fq -S eg3.sam</code></pre>\n<p>This aligns the long reads to the reference genome using local alignment, with results written to the file <code>eg3.sam</code>.</p>\n<h2 id=\"using-samtoolsbcftools-downstream\">Using SAMtools/BCFtools downstream</h2>\n<p><a href=\"http://samtools.sourceforge.net/\">SAMtools</a> is a collection of tools for manipulating and analyzing SAM and BAM alignment files. <a href=\"http://samtools.sourceforge.net/mpileup.shtml\">BCFtools</a> is a collection of tools for calling variants and manipulating VCF and BCF files, and it is typically distributed with <a href=\"http://samtools.sourceforge.net/\">SAMtools</a>. Using these tools together allows you to get from alignments in SAM format to variant calls in VCF format. 
This example assumes that <code>samtools</code> and <code>bcftools</code> are installed and that the directories containing these binaries are in your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH environment variable</a>.</p>\n<p>Run the paired-end example:</p>\n<pre><code>$BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam</code></pre>\n<p>Use <code>samtools view</code> to convert the SAM file into a BAM file. BAM is the binary format corresponding to the SAM text format. Run:</p>\n<pre><code>samtools view -bS eg2.sam &gt; eg2.bam</code></pre>\n<p>Use <code>samtools sort</code> to convert the BAM file to a sorted BAM file.</p>\n<pre><code>samtools sort eg2.bam eg2.sorted</code></pre>\n<p>We now have a sorted BAM file called <code>eg2.sorted.bam</code>. Sorted BAM is a useful format because the alignments are (a) compressed, which is convenient for long-term storage, and (b) sorted, which is convenient for variant discovery. To generate variant calls in VCF format, run:</p>\n<pre><code>samtools mpileup -uf $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -bvcg - &gt; eg2.raw.bcf</code></pre>\n<p>Then to view the variants, run:</p>\n<pre><code>bcftools view eg2.raw.bcf</code></pre>\n<p>See the official SAMtools guide to <a href=\"http://samtools.sourceforge.net/mpileup.shtml\">Calling SNPs/INDELs with SAMtools/BCFtools</a> for more details and variations on this process.</p>\n</body>\n</html>\n"
  },
  {
    "path": "doc/manual.inc.html",
    "content": "<nav id=\"TOC\">\n<ul>\n<li><a href=\"#introduction\">Introduction</a><ul>\n<li><a href=\"#what-is-centrifuge\">What is Centrifuge?</a></li>\n</ul></li>\n<li><a href=\"#obtaining-centrifuge\">Obtaining Centrifuge</a><ul>\n<li><a href=\"#building-from-source\">Building from source</a></li>\n</ul></li>\n<li><a href=\"#running-centrifuge\">Running Centrifuge</a><ul>\n<li><a href=\"#adding-to-path\">Adding to PATH</a></li>\n<li><a href=\"#before-running-centrifuge\">Before running Centrifuge</a></li>\n<li><a href=\"#database-download-and-index-building\">Database download and index building</a><ul>\n<li><a href=\"#building-index-on-all-complete-bacterial-and-viral-genomes\">Building index on all complete bacterial and viral genomes</a></li>\n<li><a href=\"#adding-human-or-mouse-genome-to-the-index\">Adding human or mouse genome to the index</a></li>\n<li><a href=\"#nt-database\">nt database</a></li>\n<li><a href=\"#custom-database\">Custom database</a></li>\n<li><a href=\"#centrifuge-classification-output\">Centrifuge classification output</a></li>\n<li><a href=\"#centrifuge-summary-output-the-default-filename-is-centrifuge_report.tsv\">Centrifuge summary output (the default filename is centrifuge_report.tsv)</a></li>\n<li><a href=\"#kraken-style-report\">Kraken-style report</a></li>\n</ul></li>\n<li><a href=\"#inspecting-the-centrifuge-index\">Inspecting the Centrifuge index</a></li>\n<li><a href=\"#wrapper\">Wrapper</a></li>\n<li><a href=\"#performance-tuning\">Performance tuning</a></li>\n<li><a href=\"#command-line\">Command Line</a><ul>\n<li><a href=\"#usage\">Usage</a></li>\n<li><a href=\"#main-arguments\">Main arguments</a></li>\n<li><a href=\"#options\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#the-centrifuge-build-indexer\">The <code>centrifuge-build</code> indexer</a><ul>\n<li><a href=\"#command-line-1\">Command Line</a><ul>\n<li><a href=\"#main-arguments-1\">Main arguments</a></li>\n<li><a 
href=\"#options-1\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#the-centrifuge-inspect-index-inspector\">The <code>centrifuge-inspect</code> index inspector</a><ul>\n<li><a href=\"#command-line-2\">Command Line</a><ul>\n<li><a href=\"#main-arguments-2\">Main arguments</a></li>\n<li><a href=\"#options-2\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#getting-started-with-centrifuge\">Getting started with Centrifuge</a><ul>\n<li><a href=\"#indexing-a-reference-genome\">Indexing a reference genome</a></li>\n<li><a href=\"#classifying-example-reads\">Classifying example reads</a></li>\n</ul></li>\n</ul>\n</nav>\n<!--\n ! This manual is written in \"markdown\" format and thus contains some\n ! distracting formatting clutter.  See 'MANUAL' for an easier-to-read version\n ! of this text document, or see the HTML manual online.\n ! -->\n<h1 id=\"introduction\">Introduction</h1>\n<h2 id=\"what-is-centrifuge\">What is Centrifuge?</h2>\n<p><a href=\"http://www.ccb.jhu.edu/software/centrifuge\">Centrifuge</a> is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (5.8 GB for all complete bacterial and viral genomes plus the human genome) and classifies sequences at a very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers.</p>\n<h1 id=\"obtaining-centrifuge\">Obtaining Centrifuge</h1>\n<p>Download Centrifuge and binaries from the Releases sections on the right side. 
Binaries are available for Intel architectures (<code>x86_64</code>) running Linux and Mac OS X.</p>\n<h2 id=\"building-from-source\">Building from source</h2>\n<p>Building Centrifuge from source requires a GNU-like environment with GCC, GNU Make and other basics. It should be possible to build Centrifuge on most vanilla Linux installations or on a Mac installation with <a href=\"http://developer.apple.com/xcode/\">Xcode</a> installed. Centrifuge can also be built on Windows using <a href=\"http://www.cygwin.com/\">Cygwin</a> or <a href=\"http://www.mingw.org/\">MinGW</a> (MinGW recommended). For a MinGW build, the choice of compiler is important, since it determines whether 32-bit or 64-bit code can be compiled successfully. If both 32-bit and 64-bit builds are needed on the same machine, a multilib MinGW has to be properly installed. <a href=\"http://www.mingw.org/wiki/msys\">MSYS</a>, the <a href=\"http://cygwin.com/packages/mingw-zlib/\">zlib</a> library, and depending on architecture the <a href=\"http://sourceware.org/pthreads-win32/\">pthreads</a> library are also required. We recommend a 64-bit build since it has some clear advantages in real-life research problems. To simplify the MinGW setup, it might be worth investigating popular MinGW personal builds, since these come already prepared with most of the toolchains needed.</p>\n<p>First, download the source package from the Releases section on the right side. Unzip the file, change to the unzipped directory, and build the Centrifuge tools by running GNU <code>make</code> (usually with the command <code>make</code>, but sometimes with <code>gmake</code>) with no arguments. If building with MinGW, run <code>make</code> from the MSYS environment.</p>\n<p>Centrifuge uses multithreading to speed up execution on SMP architectures where this is possible. 
On POSIX platforms (like Linux, Mac OS, etc.) it needs the pthread library. Although it is possible to use the pthread library on non-POSIX platforms like Windows, for performance reasons Centrifuge will try to use native Windows multithreading if possible.</p>\n<p>To support SRA data access in Centrifuge, please download and install the <a href=\"https://github.com/ncbi/ngs/wiki/Downloads\">NCBI-NGS</a> toolkit. When running <code>make</code>, specify additional variables as follows: <code>make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory</code>, where <code>NCBI_NGS_DIR</code> and <code>NCBI_VDB_DIR</code> will be used in the Makefile for the -I and -L compilation options. For example, $(NCBI_NGS_DIR)/include and $(NCBI_NGS_DIR)/lib64 will be used.</p>\n<h1 id=\"running-centrifuge\">Running Centrifuge</h1>\n<h2 id=\"adding-to-path\">Adding to PATH</h2>\n<p>By adding your new Centrifuge directory to your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH environment variable</a>, you ensure that whenever you run <code>centrifuge</code>, <code>centrifuge-build</code>, <code>centrifuge-download</code> or <code>centrifuge-inspect</code> from the command line, you will get the version you just installed without having to specify the entire path. This is recommended for most users. To do this, follow your operating system’s instructions for adding the directory to your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH</a>.</p>\n<p>If you would like to install Centrifuge by copying the Centrifuge executable files to an existing directory in your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH</a>, make sure that you copy all the executables, including <code>centrifuge</code>, <code>centrifuge-class</code>, <code>centrifuge-build</code>, <code>centrifuge-build-bin</code>, <code>centrifuge-download</code>, <code>centrifuge-inspect</code> and <code>centrifuge-inspect-bin</code>. 
Furthermore you need the programs in the scripts/ folder if you opt for genome compression in the database construction.</p>\n<h2 id=\"before-running-centrifuge\">Before running Centrifuge</h2>\n<p>Classification is considerably different from alignment in that classification is performed on a large set of genomes as opposed to on just one reference genome as in alignment. Currently, an enormous number of complete genomes are available at the GenBank (e.g. &gt;4,000 bacterial genomes, &gt;10,000 viral genomes, …). These genomes are organized in a taxonomic tree where each genome is located at the bottom of the tree, at the strain or subspecies level. On the taxonomic tree, genomes have ancestors usually situated at the species level, and those ancestors also have ancestors at the genus level and so on up the family level, the order level, class level, phylum, kingdom, and finally at the root level.</p>\n<p>Given the gigantic number of genomes available, which continues to expand at a rapid rate, and the development of the taxonomic tree, which continues to evolve with new advancements in research, we have designed Centrifuge to be flexible and general enough to reflect this huge database. We provide several standard indexes that will meet most of users’ needs (see the side panel - Indexes). In our approach our indexes not only include raw genome sequences, but also genome names/sizes and taxonomic trees. This enables users to perform additional analyses on Centrifuge’s classification output without the need to download extra database sources. This also eliminates the potential issue of discrepancy between the indexes we provide and the databases users may otherwise download. 
We plan to provide a couple of additional standard indexes in the near future, and update the indexes on a regular basis.</p>\n<p>We encourage first-time users to take a look at and follow a <a href=\"#centrifuge-example\"><code>small example</code></a> that illustrates how to build an index, how to run Centrifuge using the index, how to interpret the classification results, and how to extract additional genomic information from the index. Those who choose to build customized indexes should take a close look at the following description.</p>\n<h2 id=\"database-download-and-index-building\">Database download and index building</h2>\n<p>Centrifuge indexes can be built from arbitrary sequences. Usually an ensemble of genomes is used, such as all complete microbial genomes in the RefSeq database, or all sequences in the BLAST nt database.</p>\n<p>To map sequence identifiers to taxonomy IDs, and taxonomy IDs to their names and parents, three files are necessary in addition to the sequence files:</p>\n<ul>\n<li>taxonomy tree: typically nodes.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their parents</li>\n<li>names file: typically names.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their scientific name</li>\n<li>a tab-separated sequence ID to taxonomy ID mapping</li>\n</ul>\n<p>When using the provided scripts to download the genomes, these files are automatically downloaded or generated. 
When using a custom taxonomy or custom sequence files, please refer to the section <a href=\"#custom-database\">Custom database</a> to learn more about their format.</p>\n<h3 id=\"building-index-on-all-complete-bacterial-and-viral-genomes\">Building index on all complete bacterial and viral genomes</h3>\n<p>Use <code>centrifuge-download</code> to download genomes from NCBI. The following two commands download the NCBI taxonomy to <code>taxonomy/</code> in the current directory, and all complete archaeal, bacterial and viral genomes to <code>library/</code>. Low-complexity regions in the genomes are masked after download (parameter <code>-m</code>) using blast+’s <code>dustmasker</code>. <code>centrifuge-download</code> writes tab-separated sequence ID to taxonomy ID mappings to standard out, which are required by <code>centrifuge-build</code>.</p>\n<pre><code>centrifuge-download -o taxonomy taxonomy\ncentrifuge-download -o library -m -d &quot;archaea,bacteria,viral&quot; refseq &gt; seqid2taxid.map</code></pre>\n<p>To build the index, first concatenate all downloaded sequences into a single file, and then run <code>centrifuge-build</code>:</p>\n<pre><code>cat library/*/*.fna &gt; input-sequences.fna\n\n## build centrifuge index with 4 threads\ncentrifuge-build -p 4 --conversion-table seqid2taxid.map \\\n                 --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n                 input-sequences.fna abv</code></pre>\n<p>If you also want to include the human and/or the mouse genome, add their sequences to the <code>library/</code> folder before building the index, as described in the next section.</p>\n<p>After the index is built, all files except the *.[123].cf index files may be removed. I.e. 
the files in the <code>library/</code> and <code>taxonomy/</code> directories are no longer needed.</p>\n<h3 id=\"adding-human-or-mouse-genome-to-the-index\">Adding human or mouse genome to the index</h3>\n<p>The human and mouse genomes can also be downloaded using <code>centrifuge-download</code>. They are in the domain “vertebrate_mammalian” (argument <code>-d</code>), are assembled at the chromosome level (argument <code>-a</code>) and categorized as reference genomes by RefSeq (<code>-c</code>). The argument <code>-t</code> takes a comma-separated list of taxonomy IDs, e.g. <code>9606</code> for human and <code>10090</code> for mouse:</p>\n<pre><code># download mouse and human reference genomes\ncentrifuge-download -o library -d &quot;vertebrate_mammalian&quot; -a &quot;Chromosome&quot; -t 9606,10090 -c &#39;reference genome&#39; refseq &gt;&gt; seqid2taxid.map\n# only human\ncentrifuge-download -o library -d &quot;vertebrate_mammalian&quot; -a &quot;Chromosome&quot; -t 9606 -c &#39;reference genome&#39; refseq &gt;&gt; seqid2taxid.map\n# only mouse\ncentrifuge-download -o library -d &quot;vertebrate_mammalian&quot; -a &quot;Chromosome&quot; -t 10090 -c &#39;reference genome&#39; refseq &gt;&gt; seqid2taxid.map</code></pre>\n<h3 id=\"nt-database\">nt database</h3>\n<p>NCBI BLAST’s nt database contains all spliced non-redundant coding sequences from multiple databases, inferred from genomic sequences. Traditionally used with BLAST, a download of the FASTA is provided on the NCBI homepage. 
Building an index with any database requires the user to create a sequence ID to taxonomy ID map that can be generated from a GI taxid dump:</p>\n<pre><code>wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz\ngunzip nt.gz &amp;&amp; mv -v nt nt.fa\n\n# Get mapping file\nwget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz\ngunzip -c gi_taxid_nucl.dmp.gz | sed &#39;s/^/gi|/&#39; &gt; gi_taxid_nucl.map\n\n# build index using 16 cores and a small bucket size, which will require less memory\ncentrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map \\\n                 --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n                 nt.fa nt</code></pre>\n<h3 id=\"custom-database\">Custom database</h3>\n<p>To build a custom database, you need to provide the following four files to <code>centrifuge-build</code>:</p>\n<ul>\n<li><code>--conversion-table</code>: tab-separated file mapping sequence IDs to taxonomy IDs. Sequence IDs are the header up to the first space or second pipe (<code>|</code>).<br />\n</li>\n<li><code>--taxonomy-tree</code>: <code>\\t|\\t</code>-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree. When using NCBI taxonomy IDs, this will be the <code>nodes.dmp</code> from <code>ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz</code>.</li>\n<li><code>--name-table</code>: ‘|’-separated file mapping taxonomy IDs to a name. A further column (typically column 4) must specify <code>scientific name</code>. 
When using NCBI taxonomy IDs, <code>names.dmp</code> is the appropriate file.</li>\n<li>reference sequences: the ID of each sequence is the header up to the first space or second pipe (<code>|</code>)</li>\n</ul>\n<p>When using custom taxonomy IDs, use only positive integers (greater than or equal to <code>1</code>) and use <code>1</code> for the root of the tree.</p>\n<h4 id=\"more-info-on---taxonomy-tree-and---name-table\">More info on <code>--taxonomy-tree</code> and <code>--name-table</code></h4>\n<p>The format of these files is based on <code>nodes.dmp</code> and <code>names.dmp</code> from the NCBI taxonomy database dump.</p>\n<ul>\n<li>Field terminator is <code>\\t|\\t</code></li>\n<li>Row terminator is <code>\\t|\\n</code></li>\n</ul>\n<p>The <code>taxonomy-tree</code> / nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields:</p>\n<pre><code>tax_id                  -- node id in GenBank taxonomy database\nparent tax_id           -- parent node id in GenBank taxonomy database\nrank                    -- rank of this node (superkingdom, kingdom, ..., no rank)</code></pre>\n<p>Further fields are ignored.</p>\n<p>The <code>name-table</code> / names.dmp is the taxonomy names file:</p>\n<pre><code>tax_id                  -- the id of node associated with this name\nname_txt                -- name itself\nunique name             -- the unique variant of this name if name not unique\nname class              -- (scientific name, synonym, common name, ...)</code></pre>\n<p><code>name class</code> <strong>has</strong> to be <code>scientific name</code> to be included in the build. 
All other lines are ignored.</p>\n<h4 id=\"example\">Example</h4>\n<p><em>Conversion table <code>ex.conv</code></em>:</p>\n<pre><code>Seq1    11\nSeq2    12\nSeq3    13\nSeq4    11</code></pre>\n<p><em>Taxonomy tree <code>ex.tree</code></em>:</p>\n<pre><code>1   |   1   |   root\n10  |   1   |   kingdom\n11  |   10  |   species\n12  |   10  |   species\n13  |   1   |   species</code></pre>\n<p><em>Name table <code>ex.name</code></em>:</p>\n<pre><code>1   |   root    |       |   scientific name |\n10  |   Bacteria    |       |   scientific name |\n11  |   Bacterium A |       |   scientific name |\n12  |   Bacterium B |       |   scientific name |\n13  |   Some other species  |       |   scientific name |</code></pre>\n<p><em>Reference sequences <code>ex.fa</code></em>:</p>\n<pre><code>&gt;Seq1\nAAAACGTACGA.....\n&gt;Seq2\nAAAACGTACGA.....\n&gt;Seq3\nAAAACGTACGA.....\n&gt;Seq4\nAAAACGTACGA.....</code></pre>\n<p>To build the database, call</p>\n<pre><code>centrifuge-build --conversion-table ex.conv \\\n                 --taxonomy-tree ex.tree --name-table ex.name \\\n                 ex.fa ex</code></pre>\n<p>which results in three index files named <code>ex.1.cf</code>, <code>ex.2.cf</code> and <code>ex.3.cf</code>.</p>\n<h3 id=\"centrifuge-classification-output\">Centrifuge classification output</h3>\n<p>The following example shows classification assignments for a read. 
The assignment output has 8 columns.</p>\n<pre><code>readID    seqID   taxID score      2ndBestScore    hitLength    queryLength numMatches\n1_1       gi|4    9646  4225       0               80   80      1\n\nThe first column is the read ID from a raw sequencing read (e.g., 1_1 in the example).\nThe second column is the sequence ID of the genomic sequence, where the read is classified (e.g., gi|4).\nThe third column is the taxonomic ID of the genomic sequence in the second column (e.g., 9646).\nThe fourth column is the score for the classification, which is the weighted sum of hits (e.g., 4225).\nThe fifth column is the score for the next best classification (e.g., 0).\nThe sixth column is an approximate number of base pairs of the read that match the genomic sequence (e.g., 80).\nThe seventh column is the length of a read or the combined length of mate pairs (e.g., 80).\nThe eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).</code></pre>\n<h3 id=\"centrifuge-summary-output-the-default-filename-is-centrifuge_report.tsv\">Centrifuge summary output (the default filename is centrifuge_report.tsv)</h3>\n<p>The following example shows a classification summary for each genome or taxonomic unit. 
The summary output has 7 columns.</p>\n<pre><code>name                                                            taxID   taxRank    genomeSize   numReads   numUniqueReads   abundance\nWigglesworthia glossinidia endosymbiont of Glossina brevipalpis 36870   leaf       703004       5981       5964             0.0152317\n\nThe first column is the name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).\nThe second column is the taxonomic ID (e.g., 36870).\nThe third column is the taxonomic rank (e.g., leaf).\nThe fourth column is the length of the genome sequence (e.g., 703004).\nThe fifth column is the number of reads classified to this genomic sequence including multi-classified reads (e.g., 5981).\nThe sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).\nThe seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).</code></pre>\n<p>As the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments. To perform more conservative analyses, users may want to discard assignments for reads whose matching length (hitLength, the 6th column of the classification output) is 40% or less of the read length (queryLength, the 7th column). It may also be helpful to use the score (4th column) to filter out some assignments. 
Our future research plans include working on developing methods that estimate confidence scores for assignments.</p>\n<h3 id=\"kraken-style-report\">Kraken-style report</h3>\n<p><code>centrifuge-kreport</code> can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:</p>\n<p><code>centrifuge-kreport -x &lt;centrifuge index&gt; &lt;centrifuge out file&gt;</code></p>\n<h2 id=\"inspecting-the-centrifuge-index\">Inspecting the Centrifuge index</h2>\n<p>The index can be inspected with <code>centrifuge-inspect</code>. To extract raw sequences:</p>\n<pre><code>centrifuge-inspect &lt;centrifuge index&gt;</code></pre>\n<p>Extract the sequence ID to taxonomy ID conversion table from the index</p>\n<pre><code>centrifuge-inspect --conversion-table &lt;centrifuge index&gt;</code></pre>\n<p>Extract the taxonomy tree from the index:</p>\n<pre><code>centrifuge-inspect --taxonomy-tree &lt;centrifuge index&gt;</code></pre>\n<p>Extract the lengths of the sequences from the index (each row has two columns: taxonomic ID and length):</p>\n<pre><code>centrifuge-inspect --size-table &lt;centrifuge index&gt;</code></pre>\n<p>Extract the names from the index (each row has two columns: taxonomic ID and name):</p>\n<pre><code>centrifuge-inspect --name-table &lt;centrifuge index&gt;</code></pre>\n<h2 id=\"wrapper\">Wrapper</h2>\n<p>The <code>centrifuge</code>, <code>centrifuge-build</code> and <code>centrifuge-inspect</code> executables are actually wrapper scripts that call binary programs as appropriate. 
Also, the <code>centrifuge</code> wrapper provides some key functionality, like the ability to handle compressed inputs, and the functionality for [<code>--un</code>], [<code>--al</code>] and related options.</p>\n<p>It is recommended that you always run the centrifuge wrappers and not run the binaries directly.</p>\n<h2 id=\"performance-tuning\">Performance tuning</h2>\n<ol type=\"1\">\n<li><p>If your computer has multiple processors/cores, use <code>-p NTHREADS</code></p>\n<p>The <a href=\"#centrifuge-build-options-p\"><code>-p</code></a> option causes Centrifuge to launch a specified number of parallel search threads. Each thread runs on a different processor/core and all threads find alignments in parallel, increasing alignment throughput by approximately a multiple of the number of threads (though in practice, speedup is somewhat worse than linear).</p></li>\n</ol>\n<h2 id=\"command-line\">Command Line</h2>\n<h3 id=\"usage\">Usage</h3>\n<pre><code>centrifuge [options]* -x &lt;centrifuge-idx&gt; {-1 &lt;m1&gt; -2 &lt;m2&gt; | -U &lt;r&gt; | --sra-acc &lt;SRA accession number&gt;} [--report-file &lt;report file name&gt; -S &lt;classification output file name&gt;]</code></pre>\n<h3 id=\"main-arguments\">Main arguments</h3>\n<table>\n<tr>\n<td>\n<pre><code>-x &lt;centrifuge-idx&gt;</code></pre>\n</td>\n<td>\n<p>The basename of the index for the reference genomes. The basename is the name of any of the index files up to but not including the final <code>.1.cf</code> / etc.<br />\n<code>centrifuge</code> looks for the specified index first in the current directory, then in the directory specified in the <code>CENTRIFUGE_INDEXES</code> environment variable.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-1 &lt;m1&gt;</code></pre>\n</td>\n<td>\n<p>Comma-separated list of files containing mate 1s (filename usually includes <code>_1</code>), e.g. <code>-1 flyA_1.fq,flyB_1.fq</code>. 
Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <code>&lt;m2&gt;</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>centrifuge</code> will read the mate 1s from the “standard in” or “stdin” filehandle.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-2 &lt;m2&gt;</code></pre>\n</td>\n<td>\n<p>Comma-separated list of files containing mate 2s (filename usually includes <code>_2</code>), e.g. <code>-2 flyA_2.fq,flyB_2.fq</code>. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <code>&lt;m1&gt;</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>centrifuge</code> will read the mate 2s from the “standard in” or “stdin” filehandle.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-U &lt;r&gt;</code></pre>\n</td>\n<td>\n<p>Comma-separated list of files containing unpaired reads to be classified, e.g. <code>lane1.fq,lane2.fq,lane3.fq,lane4.fq</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>centrifuge</code> gets the reads from the “standard in” or “stdin” filehandle.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>--sra-acc &lt;SRA accession number&gt;</code></pre>\n</td>\n<td>\n<p>Comma-separated list of SRA accession numbers, e.g. <code>--sra-acc SRR353653,SRR353654</code>. Information about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&amp;acc=<b>sra-acc</b>&amp;retmode=xml, where <b>sra-acc</b> is the SRA accession number. If users run Centrifuge on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at <a href=\"https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration\">SRA-MANUAL</a>).</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-S &lt;filename&gt;</code></pre>\n</td>\n<td>\n<p>File to write classification results to. 
By default, assignments are written to the “standard out” or “stdout” filehandle (i.e. the console).</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>--report-file &lt;filename&gt;</code></pre>\n</td>\n<td>\n<p>File to write a classification summary to (default: centrifuge_report.tsv).</p>\n</td>\n</tr>\n</table>\n<h3 id=\"options\">Options</h3>\n<h4 id=\"input-options\">Input options</h4>\n<table>\n<tr>\n<td id=\"centrifuge-options-q\">\n<pre><code>-q</code></pre>\n</td>\n<td>\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are FASTQ files. FASTQ files usually have extension <code>.fq</code> or <code>.fastq</code>. FASTQ is the default format. See also: <a href=\"#centrifuge-options-solexa-quals\"><code>--solexa-quals</code></a> and <a href=\"#centrifuge-options-int-quals\"><code>--int-quals</code></a>.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-qseq\">\n<pre><code>--qseq</code></pre>\n</td>\n<td>\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are QSEQ files. QSEQ files usually end in <code>_qseq.txt</code>. See also: <a href=\"#centrifuge-options-solexa-quals\"><code>--solexa-quals</code></a> and <a href=\"#centrifuge-options-int-quals\"><code>--int-quals</code></a>.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-f\">\n<pre><code>-f</code></pre>\n</td>\n<td>\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are FASTA files. FASTA files usually have extension <code>.fa</code>, <code>.fasta</code>, <code>.mfa</code>, <code>.fna</code> or similar. 
FASTA files do not have a way of specifying quality values, so when <code>-f</code> is set, the result is as if <code>--ignore-quals</code> is also set.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-r\">\n<pre><code>-r</code></pre>\n</td>\n<td>\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are files with one input sequence per line, without any other information (no read names, no qualities). When <code>-r</code> is set, the result is as if <code>--ignore-quals</code> is also set.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-c\">\n<pre><code>-c</code></pre>\n</td>\n<td>\n<p>The read sequences are given on command line. I.e. <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code> and <code>&lt;singles&gt;</code> are comma-separated lists of reads rather than lists of read files. There is no way to specify read names or qualities, so <code>-c</code> also implies <code>--ignore-quals</code>.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-s\">\n<pre><code>-s/--skip &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Skip (i.e. do not align) the first <code>&lt;int&gt;</code> reads or pairs in the input.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-u\">\n<pre><code>-u/--qupto &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Align the first <code>&lt;int&gt;</code> reads or read pairs from the input (after the <a href=\"#centrifuge-options-s\"><code>-s</code>/<code>--skip</code></a> reads or pairs have been skipped), then stop. 
Default: no limit.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-5\">\n<pre><code>-5/--trim5 &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Trim <code>&lt;int&gt;</code> bases from 5’ (left) end of each read before alignment (default: 0).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-3\">\n<pre><code>-3/--trim3 &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Trim <code>&lt;int&gt;</code> bases from 3’ (right) end of each read before alignment (default: 0).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-phred33-quals\">\n<pre><code>--phred33</code></pre>\n</td>\n<td>\n<p>Input qualities are ASCII chars equal to the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> plus 33. This is also called the “Phred+33” encoding, which is used by the very latest Illumina pipelines.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-phred64-quals\">\n<pre><code>--phred64</code></pre>\n</td>\n<td>\n<p>Input qualities are ASCII chars equal to the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> plus 64. This is also called the “Phred+64” encoding.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-solexa-quals\">\n<pre><code>--solexa-quals</code></pre>\n</td>\n<td>\n<p>Convert input qualities from <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Solexa</a> (which can be negative) to <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred</a> (which can’t). This scheme was used in older Illumina GA Pipeline versions (prior to 1.3). Default: off.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-int-quals\">\n<pre><code>--int-quals</code></pre>\n</td>\n<td>\n<p>Quality values are represented in the read input file as space-separated ASCII integers, e.g., <code>40 40 30 40</code>…, rather than ASCII characters, e.g., <code>II?I</code>…. 
Integers are treated as being on the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> scale unless <a href=\"#centrifuge-options-solexa-quals\"><code>--solexa-quals</code></a> is also specified. Default: off.</p>\n</td>\n</tr>\n</table>\n<h4 id=\"classification\">Classification</h4>\n<table>\n<tr>\n<td id=\"centrifuge-options-min-hitlen\">\n<pre><code>--min-hitlen &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Minimum length of partial hits, which must be greater than 15 (default: 22).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-k\">\n<pre><code>-k &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Search for at most <code>&lt;int&gt;</code> distinct, primary assignments for each read or pair.<br />\nPrimary assignments are assignments whose score is equal to or higher than that of any other assignment. If there are more primary assignments than this value, the search will merge some of the assignments into a higher taxonomic rank. The assignment score for a paired-end assignment equals the sum of the assignment scores of the individual mates. Default: 5</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-host-taxids\">\n<pre><code>--host-taxids</code></pre>\n</td>\n<td>\n<p>A comma-separated list of taxonomic IDs that will be preferred in the classification procedure. The descendants of these IDs will also be preferred. If some of a read’s assignments correspond to these taxonomic IDs, only those corresponding assignments will be reported.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-exclude-taxids\">\n<pre><code>--exclude-taxids</code></pre>\n</td>\n<td>\n<p>A comma-separated list of taxonomic IDs that will be excluded in the classification procedure. 
The descendants of these IDs will also be excluded.</p>\n</td>\n</tr>\n</table>\n<!--\n#### Alignment options\n\n<table>\n\n<tr><td id=\"centrifuge-options-n-ceil\">\n\n[`--n-ceil`]: #centrifuge-options-n-ceil\n\n    --n-ceil <func>\n\n</td><td>\n\nSets a function governing the maximum number of ambiguous characters (usually\n`N`s and/or `.`s) allowed in a read as a function of read length.  For instance,\nspecifying `-L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`,\nwhere x is the read length.  See also: [setting function options].  Reads\nexceeding this ceiling are [filtered out].  Default: `L,0,0.15`.\n\n[filtered out]: #filtering\n\n</td></tr>\n\n<tr><td id=\"centrifuge-options-ignore-quals\">\n\n[`--ignore-quals`]: #centrifuge-options-ignore-quals\n\n    --ignore-quals\n\n</td><td>\n\nWhen calculating a mismatch penalty, always consider the quality value at the\nmismatched position to be the highest possible, regardless of the actual value. \nI.e. input is treated as though all quality values are high.  This is also the\ndefault behavior when the input doesn't specify quality values (e.g. in [`-f`],\n[`-r`], or [`-c`] modes).\n\n</td></tr>\n<tr><td id=\"centrifuge-options-nofw\">\n\n[`--nofw`]: #centrifuge-options-nofw\n\n    --nofw/--norc\n\n</td><td>\n\nIf `--nofw` is specified, `centrifuge` will not attempt to align unpaired reads to\nthe forward (Watson) reference strand.  If `--norc` is specified, `centrifuge` will\nnot attempt to align unpaired reads against the reverse-complement (Crick)\nreference strand. In paired-end mode, `--nofw` and `--norc` pertain to the\nfragments; i.e. specifying `--nofw` causes `centrifuge` to explore only those\npaired-end configurations corresponding to fragments from the reverse-complement\n(Crick) strand.  Default: both strands enabled. 
\n\n</td></tr>\n\n</table>\n\n#### Paired-end options\n\n<table>\n\n<tr><td id=\"centrifuge-options-fr\">\n\n[`--fr`/`--rf`/`--ff`]: #centrifuge-options-fr\n[`--fr`]: #centrifuge-options-fr\n[`--rf`]: #centrifuge-options-fr\n[`--ff`]: #centrifuge-options-fr\n\n    --fr/--rf/--ff\n\n</td><td>\n\nThe upstream/downstream mate orientations for a valid paired-end alignment\nagainst the forward reference strand.  E.g., if `--fr` is specified and there is\na candidate paired-end alignment where mate 1 appears upstream of the reverse\ncomplement of mate 2 and the fragment length constraints ([`-I`] and [`-X`]) are\nmet, that alignment is valid.  Also, if mate 2 appears upstream of the reverse\ncomplement of mate 1 and all other constraints are met, that too is valid.\n`--rf` likewise requires that an upstream mate1 be reverse-complemented and a\ndownstream mate2 be forward-oriented. ` --ff` requires both an upstream mate 1\nand a downstream mate 2 to be forward-oriented.  Default: `--fr` (appropriate\nfor Illumina's Paired-end Sequencing Assay).\n\n</td></tr></table>\n-->\n<h4 id=\"output-options\">Output options</h4>\n<table>\n<tr>\n<td id=\"centrifuge-options-t\">\n<pre><code>-t/--time</code></pre>\n</td>\n<td>\n<p>Print the wall-clock time required to load the index files and align the reads. This is printed to the “standard error” (“stderr”) filehandle. Default: off.</p>\n</td>\n</tr>\n<!--\n<tr><td id=\"centrifuge-options-un\">\n\n[`--un`]: #centrifuge-options-un\n[`--un-gz`]: #centrifuge-options-un\n[`--un-bz2`]: #centrifuge-options-un\n\n    --un <path>\n    --un-gz <path>\n    --un-bz2 <path>\n\n</td><td>\n\nWrite unpaired reads that fail to align to file at `<path>`.  These reads\ncorrespond to the SAM records with the FLAGS `0x4` bit set and neither the\n`0x40` nor `0x80` bits set.  If `--un-gz` is specified, output will be gzip\ncompressed. If `--un-bz2` is specified, output will be bzip2 compressed.  
Reads\nwritten in this way will appear exactly as they did in the input file, without\nany modification (same sequence, same name, same quality string, same quality\nencoding).  Reads will not necessarily appear in the same order as they did in\nthe input.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-al\">\n\n[`--al`]: #centrifuge-options-al\n[`--al-gz`]: #centrifuge-options-al\n[`--al-bz2`]: #centrifuge-options-al\n\n    --al <path>\n    --al-gz <path>\n    --al-bz2 <path>\n\n</td><td>\n\nWrite unpaired reads that align at least once to file at `<path>`.  These reads\ncorrespond to the SAM records with the FLAGS `0x4`, `0x40`, and `0x80` bits\nunset.  If `--al-gz` is specified, output will be gzip compressed. If `--al-bz2`\nis specified, output will be bzip2 compressed.  Reads written in this way will\nappear exactly as they did in the input file, without any modification (same\nsequence, same name, same quality string, same quality encoding).  Reads will\nnot necessarily appear in the same order as they did in the input.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-un-conc\">\n\n[`--un-conc`]: #centrifuge-options-un-conc\n[`--un-conc-gz`]: #centrifuge-options-un-conc\n[`--un-conc-bz2`]: #centrifuge-options-un-conc\n\n    --un-conc <path>\n    --un-conc-gz <path>\n    --un-conc-bz2 <path>\n\n</td><td>\n\nWrite paired-end reads that fail to align concordantly to file(s) at `<path>`.\nThese reads correspond to the SAM records with the FLAGS `0x4` bit set and\neither the `0x40` or `0x80` bit set (depending on whether it's mate #1 or #2).\n`.1` and `.2` strings are added to the filename to distinguish which file\ncontains mate #1 and mate #2.  If a percent symbol, `%`, is used in `<path>`,\nthe percent symbol is replaced with `1` or `2` to make the per-mate filenames.\nOtherwise, `.1` or `.2` are added before the final dot in `<path>` to make the\nper-mate filenames.  
Reads written in this way will appear exactly as they did\nin the input files, without any modification (same sequence, same name, same\nquality string, same quality encoding).  Reads will not necessarily appear in\nthe same order as they did in the inputs.\n\n</td></tr>\n<tr><td id=\"centrifuge-options-al-conc\">\n\n[`--al-conc`]: #centrifuge-options-al-conc\n[`--al-conc-gz`]: #centrifuge-options-al-conc\n[`--al-conc-bz2`]: #centrifuge-options-al-conc\n\n    --al-conc <path>\n    --al-conc-gz <path>\n    --al-conc-bz2 <path>\n\n</td><td>\n\nWrite paired-end reads that align concordantly at least once to file(s) at\n`<path>`. These reads correspond to the SAM records with the FLAGS `0x4` bit\nunset and either the `0x40` or `0x80` bit set (depending on whether it's mate #1\nor #2). `.1` and `.2` strings are added to the filename to distinguish which\nfile contains mate #1 and mate #2.  If a percent symbol, `%`, is used in\n`<path>`, the percent symbol is replaced with `1` or `2` to make the per-mate\nfilenames. Otherwise, `.1` or `.2` are added before the final dot in `<path>` to\nmake the per-mate filenames.  Reads written in this way will appear exactly as\nthey did in the input files, without any modification (same sequence, same name,\nsame quality string, same quality encoding).  Reads will not necessarily appear\nin the same order as they did in the inputs.\n\n</td></tr>\n-->\n<tr>\n<td id=\"centrifuge-options-quiet\">\n<pre><code>--quiet</code></pre>\n</td>\n<td>\n<p>Print nothing besides alignments and serious errors.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-met-file\">\n<pre><code>--met-file &lt;path&gt;</code></pre>\n</td>\n<td>\n<p>Write <code>centrifuge</code> metrics to file <code>&lt;path&gt;</code>. Alignment metrics can be useful for debugging certain problems, especially performance issues. See also: <a href=\"#centrifuge-options-met\"><code>--met</code></a>. 
Default: metrics disabled.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-met-stderr\">\n<pre><code>--met-stderr</code></pre>\n</td>\n<td>\n<p>Write <code>centrifuge</code> metrics to the “standard error” (“stderr”) filehandle. This is not mutually exclusive with <a href=\"#centrifuge-options-met-file\"><code>--met-file</code></a>. Alignment metrics can be useful for debugging certain problems, especially performance issues. See also: <a href=\"#centrifuge-options-met\"><code>--met</code></a>. Default: metrics disabled.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-met\">\n<pre><code>--met &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Write a new <code>centrifuge</code> metrics record every <code>&lt;int&gt;</code> seconds. Only matters if either <a href=\"#centrifuge-options-met-stderr\"><code>--met-stderr</code></a> or <a href=\"#centrifuge-options-met-file\"><code>--met-file</code></a> is specified. Default: 1.</p>\n</td>\n</tr>\n</table>\n<h4 id=\"performance-options\">Performance options</h4>\n<table>\n<tr>\n<td id=\"centrifuge-options-o\">\n<pre><code>-o/--offrate &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Override the offrate of the index with <code>&lt;int&gt;</code>. If <code>&lt;int&gt;</code> is greater than the offrate used to build the index, then some row markings are discarded when the index is read into memory. This reduces the memory footprint of the aligner but requires more time to calculate text offsets. <code>&lt;int&gt;</code> must be greater than the value used to build the index.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-p\">\n<pre><code>-p/--threads NTHREADS</code></pre>\n</td>\n<td>\n<p>Launch <code>NTHREADS</code> parallel search threads (default: 1). Threads will run on separate processors/cores and synchronize when parsing reads and outputting alignments. Searching for alignments is highly parallel, and speedup is close to linear. Increasing <code>-p</code> increases Centrifuge’s memory footprint. E.g. 
when aligning to a human genome index, increasing <code>-p</code> from 1 to 8 increases the memory footprint by a few hundred megabytes. This option is only available if <code>centrifuge</code> is linked with the <code>pthreads</code> library (i.e. if <code>BOWTIE_PTHREADS=0</code> is not specified at build time).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-reorder\">\n<pre><code>--reorder</code></pre>\n</td>\n<td>\n<p>Guarantees that output records are printed in an order corresponding to the order of the reads in the original input file, even when <a href=\"#centrifuge-build-options-p\"><code>-p</code></a> is set greater than 1. Specifying <code>--reorder</code> and setting <a href=\"#centrifuge-build-options-p\"><code>-p</code></a> greater than 1 causes Centrifuge to run somewhat slower and use somewhat more memory than if <code>--reorder</code> were not specified. Has no effect if <a href=\"#centrifuge-build-options-p\"><code>-p</code></a> is set to 1, since output order will naturally correspond to input order in that case.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-mm\">\n<pre><code>--mm</code></pre>\n</td>\n<td>\n<p>Use memory-mapped I/O to load the index, rather than typical file I/O. Memory-mapping allows many concurrent <code>centrifuge</code> processes on the same computer to share the same memory image of the index (i.e. you pay the memory overhead just once). This facilitates memory-efficient parallelization of <code>centrifuge</code> in situations where using <a href=\"#centrifuge-build-options-p\"><code>-p</code></a> is not possible or not preferable.</p>\n</td>\n</tr>\n</table>\n<h4 id=\"other-options\">Other options</h4>\n<table>\n<tr>\n<td id=\"centrifuge-options-qc-filter\">\n<pre><code>--qc-filter</code></pre>\n</td>\n<td>\n<p>Filter out reads for which the QSEQ filter field is non-zero. Only has an effect when read format is <a href=\"#centrifuge-options-qseq\"><code>--qseq</code></a>. 
Default: off.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-seed\">\n<pre><code>--seed &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Use <code>&lt;int&gt;</code> as the seed for the pseudo-random number generator. Default: 0.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-non-deterministic\">\n<pre><code>--non-deterministic</code></pre>\n</td>\n<td>\n<p>Normally, Centrifuge re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, and (d) the value of the <a href=\"#centrifuge-options-seed\"><code>--seed</code></a> option. This means that if two reads are identical (same name, same nucleotides, same qualities) Centrifuge will find and report the same classification(s) for both, even if there was ambiguity. When <code>--non-deterministic</code> is specified, Centrifuge re-initializes its pseudo-random generator for each read using the current time. This means that Centrifuge will not necessarily report the same classification for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-version\">\n<pre><code>--version</code></pre>\n</td>\n<td>\n<p>Print version information and quit.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-options-h\">\n<pre><code>-h/--help</code></pre>\n</td>\n<td>\n<p>Print usage information and quit.</p>\n</td>\n</tr>\n</table>\n<h1 id=\"the-centrifuge-build-indexer\">The <code>centrifuge-build</code> indexer</h1>\n<p><code>centrifuge-build</code> builds a Centrifuge index from a set of DNA sequences. <code>centrifuge-build</code> outputs a set of 3 files with suffixes <code>.1.cf</code>, <code>.2.cf</code>, and <code>.3.cf</code>. These files together constitute the index: they are all that is needed to align reads to that reference. 
The original sequence FASTA files are no longer used by Centrifuge once the index is built.</p>\n<p>Use of Karkkainen’s <a href=\"http://portal.acm.org/citation.cfm?id=1314852\">blockwise algorithm</a> allows <code>centrifuge-build</code> to trade off between running time and memory usage. <code>centrifuge-build</code> has two options governing how it makes this trade: <a href=\"#centrifuge-build-options-bmax\"><code>--bmax</code></a>/<a href=\"#centrifuge-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>, and <a href=\"#centrifuge-build-options-dcv\"><code>--dcv</code></a>. By default, <code>centrifuge-build</code> will automatically search for the settings that yield the best running time without exhausting memory. This behavior can be disabled using the <a href=\"#centrifuge-build-options-a\"><code>-a</code>/<code>--noauto</code></a> option.</p>\n<p>The indexer provides options pertaining to the “shape” of the index, e.g. <a href=\"#centrifuge-build-options-o\"><code>--offrate</code></a> governs the fraction of <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> rows that are “marked” (i.e., the density of the suffix-array sample; see the original <a href=\"http://en.wikipedia.org/wiki/FM-index\">FM Index</a> paper for details). All of these options are potentially profitable trade-offs depending on the application. They have been set to defaults that are reasonable for most cases according to our experiments. See <a href=\"#performance-tuning\">Performance tuning</a> for details.</p>\n<p>The Centrifuge index is based on the <a href=\"http://en.wikipedia.org/wiki/FM-index\">FM Index</a> of Ferragina and Manzini, which in turn is based on the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> transform. 
The algorithm used to build the index is based on the <a href=\"http://portal.acm.org/citation.cfm?id=1314852\">blockwise algorithm</a> of Karkkainen.</p>\n<h2 id=\"command-line-1\">Command Line</h2>\n<p>Usage:</p>\n<pre><code>centrifuge-build [options]* --conversion-table &lt;table_in&gt; --taxonomy-tree &lt;taxonomy_in&gt; --name-table &lt;table_in2&gt; &lt;reference_in&gt; &lt;cf_base&gt;</code></pre>\n<h3 id=\"main-arguments-1\">Main arguments</h3>\n<table>\n<tr>\n<td>\n<pre><code>&lt;reference_in&gt;</code></pre>\n</td>\n<td>\n<p>A comma-separated list of FASTA files containing the reference sequences to be aligned to, or, if <a href=\"#centrifuge-build-options-c\"><code>-c</code></a> is specified, the sequences themselves. E.g., <code>&lt;reference_in&gt;</code> might be <code>chr1.fa,chr2.fa,chrX.fa,chrY.fa</code>, or, if <a href=\"#centrifuge-build-options-c\"><code>-c</code></a> is specified, this might be <code>GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA</code>.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>&lt;cf_base&gt;</code></pre>\n</td>\n<td>\n<p>The basename of the index files to write. By default, <code>centrifuge-build</code> writes files named <code>NAME.1.cf</code>, <code>NAME.2.cf</code>, and <code>NAME.3.cf</code>, where <code>NAME</code> is <code>&lt;cf_base&gt;</code>.</p>\n</td>\n</tr>\n</table>\n<h3 id=\"options-1\">Options</h3>\n<table>\n<tr>\n<td>\n<pre><code>-f</code></pre>\n</td>\n<td>\n<p>The reference input files (specified as <code>&lt;reference_in&gt;</code>) are FASTA files (usually having extension <code>.fa</code>, <code>.mfa</code>, <code>.fna</code> or similar).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-c\">\n<pre><code>-c</code></pre>\n</td>\n<td>\n<p>The reference sequences are given on the command line. I.e. 
<code>&lt;reference_in&gt;</code> is a comma-separated list of sequences rather than a list of FASTA files.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-a\">\n<pre><code>-a/--noauto</code></pre>\n</td>\n<td>\n<p>Disable the default behavior whereby <code>centrifuge-build</code> automatically selects values for the <a href=\"#centrifuge-build-options-bmax\"><code>--bmax</code></a>, <a href=\"#centrifuge-build-options-dcv\"><code>--dcv</code></a> and <code>--packed</code> parameters according to available memory. Instead, the user may specify values for those parameters. If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-p\">\n<pre><code>-p/--threads &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Launch <code>&lt;int&gt;</code> parallel threads (default: 1).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-conversion-table\">\n<pre><code>--conversion-table &lt;file&gt;</code></pre>\n</td>\n<td>\n<p>List of UIDs (unique IDs) and corresponding taxonomic IDs.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-taxonomy-tree\">\n<pre><code>--taxonomy-tree &lt;file&gt;</code></pre>\n</td>\n<td>\n<p>Taxonomic tree (e.g. nodes.dmp).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-name-table\">\n<pre><code>--name-table &lt;file&gt;</code></pre>\n</td>\n<td>\n<p>Name table (e.g. names.dmp).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-taxonomy-tree\">\n<pre><code>--size-table &lt;file&gt;</code></pre>\n</td>\n<td>\n<p>List of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-bmax\">\n<pre><code>--bmax &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. 
Setting this option overrides any previous setting for <a href=\"#centrifuge-build-options-bmax\"><code>--bmax</code></a>, or <a href=\"#centrifuge-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>. Default (in terms of the <a href=\"#centrifuge-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> parameter) is <a href=\"#centrifuge-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> 4. This is configured automatically by default; use <a href=\"#centrifuge-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-bmaxdivn\">\n<pre><code>--bmaxdivn &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for <a href=\"#centrifuge-build-options-bmax\"><code>--bmax</code></a>, or <a href=\"#centrifuge-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>. Default: <a href=\"#centrifuge-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> 4. This is configured automatically by default; use <a href=\"#centrifuge-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-dcv\">\n<pre><code>--dcv &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Use <code>&lt;int&gt;</code> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. This is configured automatically by default; use <a href=\"#centrifuge-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-nodc\">\n<pre><code>--nodc</code></pre>\n</td>\n<td>\n<p>Disable use of the difference-cover sample. 
Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-build-options-o\">\n<pre><code>-o/--offrate &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>To map alignments back to positions on the reference sequences, it’s necessary to annotate (“mark”) some or all of the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> rows with their corresponding location on the genome. <a href=\"#centrifuge-build-options-o\"><code>-o</code>/<code>--offrate</code></a> governs how many rows get marked: the indexer will mark every 2^<code>&lt;int&gt;</code> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 4 (every 16th row is marked; for human genome, annotations occupy about 680 megabytes).</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-t/--ftabchars &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>The ftab is the lookup table used to calculate an initial <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> range with respect to the first <code>&lt;int&gt;</code> characters of the query. A larger <code>&lt;int&gt;</code> yields a larger lookup table but faster query times. The ftab has size 4^(<code>&lt;int&gt;</code>+1) bytes. The default setting is 10 (ftab is 4MB).</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>--seed &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Use <code>&lt;int&gt;</code> as the seed for pseudo-random number generator.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>--kmer-count &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>Use <code>&lt;int&gt;</code> as kmer-size for counting the distinct number of k-mers in the input sequences.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-q/--quiet</code></pre>\n</td>\n<td>\n<p><code>centrifuge-build</code> is verbose by default. 
With this option <code>centrifuge-build</code> will print only error messages.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-h/--help</code></pre>\n</td>\n<td>\n<p>Print usage information and quit.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>--version</code></pre>\n</td>\n<td>\n<p>Print version information and quit.</p>\n</td>\n</tr>\n</table>\n<h1 id=\"the-centrifuge-inspect-index-inspector\">The <code>centrifuge-inspect</code> index inspector</h1>\n<p><code>centrifuge-inspect</code> extracts information from a Centrifuge index about what kind of index it is and what reference sequences were used to build it. When run without any options, the tool will output a FASTA file containing the sequences of the original references (with all non-<code>A</code>/<code>C</code>/<code>G</code>/<code>T</code> characters converted to <code>N</code>s). It can also be used to extract just the reference sequence names using the <a href=\"#centrifuge-inspect-options-n\"><code>-n</code>/<code>--names</code></a> option or a more verbose summary using the <a href=\"#centrifuge-inspect-options-s\"><code>-s</code>/<code>--summary</code></a> option.</p>\n<h2 id=\"command-line-2\">Command Line</h2>\n<p>Usage:</p>\n<pre><code>centrifuge-inspect [options]* &lt;cf_base&gt;</code></pre>\n<h3 id=\"main-arguments-2\">Main arguments</h3>\n<table>\n<tr>\n<td>\n<pre><code>&lt;cf_base&gt;</code></pre>\n</td>\n<td>\n<p>The basename of the index to be inspected. The basename is the name of any of the index files but with the <code>.X.cf</code> suffix omitted. 
<code>centrifuge-inspect</code> first looks in the current directory for the index files, then in the directory specified in the <code>CENTRIFUGE_INDEXES</code> environment variable.</p>\n</td>\n</tr>\n</table>\n<h3 id=\"options-2\">Options</h3>\n<table>\n<tr>\n<td>\n<pre><code>-a/--across &lt;int&gt;</code></pre>\n</td>\n<td>\n<p>When printing FASTA output, output a newline character every <code>&lt;int&gt;</code> bases (default: 60).</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-inspect-options-n\">\n<pre><code>-n/--names</code></pre>\n</td>\n<td>\n<p>Print reference sequence names, one per line, and quit.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-inspect-options-s\">\n<pre><code>-s/--summary</code></pre>\n</td>\n<td>\n<p>Print a summary that includes information about index settings, as well as the names and lengths of the input sequences. The summary has this format:</p>\n<pre><code>Colorspace  &lt;0 or 1&gt;\nSA-Sample   1 in &lt;sample&gt;\nFTab-Chars  &lt;chars&gt;\nSequence-1  &lt;name&gt;  &lt;len&gt;\nSequence-2  &lt;name&gt;  &lt;len&gt;\n...\nSequence-N  &lt;name&gt;  &lt;len&gt;</code></pre>\n<p>Fields are separated by tabs. 
Colorspace is always set to 0 for Centrifuge.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-inspect-options-conversion-table\">\n<pre><code>--conversion-table</code></pre>\n</td>\n<td>\n<p>Print a list of UIDs (unique ID) and corresponding taxonomic IDs.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-inspect-options-taxonomy-tree\">\n<pre><code>--taxonomy-tree</code></pre>\n</td>\n<td>\n<p>Print taxonomic tree.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-inspect-options-name-table\">\n<pre><code>--name-table</code></pre>\n</td>\n<td>\n<p>Print name table.</p>\n</td>\n</tr>\n<tr>\n<td id=\"centrifuge-inspect-options-taxonomy-tree\">\n<pre><code>--size-table</code></pre>\n</td>\n<td>\n<p>Print a list of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-v/--verbose</code></pre>\n</td>\n<td>\n<p>Print verbose output (for debugging).</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>--version</code></pre>\n</td>\n<td>\n<p>Print version information and quit.</p>\n</td>\n</tr>\n<tr>\n<td>\n<pre><code>-h/--help</code></pre>\n</td>\n<td>\n<p>Print usage information and quit.</p>\n</td>\n</tr>\n</table>\n<h1 id=\"getting-started-with-centrifuge\">Getting started with Centrifuge</h1>\n<p>Centrifuge comes with some example files to get you started. The example files are not scientifically significant; these files will simply let you start running Centrifuge and downstream tools right away.</p>\n<p>First follow the manual instructions to <a href=\"#obtaining-centrifuge\">obtain Centrifuge</a>. Set the <code>CENTRIFUGE_HOME</code> environment variable to point to the new Centrifuge directory containing the <code>centrifuge</code>, <code>centrifuge-build</code> and <code>centrifuge-inspect</code> binaries. 
This is important, as the <code>CENTRIFUGE_HOME</code> variable is used in the commands below to refer to that directory.</p>\n<h2 id=\"indexing-a-reference-genome\">Indexing a reference genome</h2>\n<p>To create an index for two small sequences included with Centrifuge, create a new temporary directory (it doesn’t matter where), change into that directory, and run:</p>\n<pre><code>$CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test</code></pre>\n<p>The command should print many lines of output and then quit. When the command completes, the current directory will contain three new files that all start with <code>test</code> and end with <code>.1.cf</code>, <code>.2.cf</code>, or <code>.3.cf</code>. These files constitute the index - you’re done!</p>\n<p>You can use <code>centrifuge-build</code> to create an index for a set of FASTA files obtained from any source, including sites such as <a href=\"http://genome.ucsc.edu/cgi-bin/hgGateway\">UCSC</a>, <a href=\"http://www.ncbi.nlm.nih.gov/sites/genome\">NCBI</a>, and <a href=\"http://www.ensembl.org/\">Ensembl</a>. When indexing multiple FASTA files, specify all the files using commas to separate file names. For more details on how to create an index with <code>centrifuge-build</code>, see the <a href=\"#the-centrifuge-build-indexer\">manual section on index building</a>. You may also want to bypass this process by obtaining a pre-built index.</p>\n<h2 id=\"classifying-example-reads\">Classifying example reads</h2>\n<p>Stay in the directory created in the previous step, which now contains the <code>test</code> index files. 
Next, run:</p>\n<pre><code>$CENTRIFUGE_HOME/centrifuge -f -x test $CENTRIFUGE_HOME/example/reads/input.fa</code></pre>\n<p>This runs the Centrifuge classifier, which classifies a set of unpaired reads to the genomes using the index generated in the previous step. The classification results are reported to stdout, and a short classification summary is written to <code>centrifuge-species_report.tsv</code>.</p>\n<p>You will see something like this:</p>\n<pre><code>readID  seqID taxID     score   2ndBestScore    hitLength   numMatches\nC_1 gi|7     9913      4225 4225        80      2\nC_1 gi|4     9646      4225 4225        80      2\nC_2 gi|4     9646      4225 4225        80      2\nC_2 gi|7     9913      4225 4225        80      2\nC_3 gi|7     9913      4225 4225        80      2\nC_3 gi|4     9646      4225 4225        80      2\nC_4 gi|4     9646      4225 4225        80      2\nC_4 gi|7     9913      4225 4225        80      2\n1_1 gi|4     9646      4225 0       80      1\n1_2 gi|4     9646      4225 0       80      1\n2_1 gi|7     9913      4225 0       80      1\n2_2 gi|7     9913      4225 0       80      1\n2_3 gi|7     9913      4225 0       80      1\n2_4 gi|7     9913      4225 0       80      1\n2_5 gi|7     9913      4225 0       80      1\n2_6 gi|7     9913      4225 0       80      1</code></pre>\n"
  },
  {
    "path": "doc/manual.inc.html.old",
    "content": "<div id=\"TOC\">\n<ul>\n<li><a href=\"#introduction\">Introduction</a><ul>\n<li><a href=\"#what-is-hisat\">What is HISAT?</a></li>\n</ul></li>\n<li><a href=\"#obtaining-hisat\">Obtaining HISAT</a><ul>\n<li><a href=\"#building-from-source\">Building from source</a></li>\n</ul></li>\n<li><a href=\"#running-hisat\">Running HISAT</a><ul>\n<li><a href=\"#adding-to-path\">Adding to PATH</a></li>\n<li><a href=\"#reporting\">Reporting</a></li>\n<li><a href=\"#alignment-summmary\">Alignment summmary</a></li>\n<li><a href=\"#wrapper\">Wrapper</a></li>\n<li><a href=\"#small-and-large-indexes\">Small and large indexes</a></li>\n<li><a href=\"#performance-tuning\">Performance tuning</a></li>\n<li><a href=\"#command-line\">Command Line</a><ul>\n<li><a href=\"#setting-function-options\">Setting function options</a></li>\n<li><a href=\"#usage\">Usage</a></li>\n<li><a href=\"#main-arguments\">Main arguments</a></li>\n<li><a href=\"#options\">Options</a></li>\n</ul></li>\n<li><a href=\"#sam-output\">SAM output</a></li>\n</ul></li>\n<li><a href=\"#the-hisat-build-indexer\">The <code>hisat-build</code> indexer</a><ul>\n<li><a href=\"#command-line-1\">Command Line</a><ul>\n<li><a href=\"#main-arguments-1\">Main arguments</a></li>\n<li><a href=\"#options-1\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#the-hisat-inspect-index-inspector\">The <code>hisat-inspect</code> index inspector</a><ul>\n<li><a href=\"#command-line-2\">Command Line</a><ul>\n<li><a href=\"#main-arguments-2\">Main arguments</a></li>\n<li><a href=\"#options-2\">Options</a></li>\n</ul></li>\n</ul></li>\n<li><a href=\"#getting-started-with-hisat\">Getting started with HISAT</a><ul>\n<li><a href=\"#indexing-a-reference-genome\">Indexing a reference genome</a></li>\n<li><a href=\"#aligning-example-reads\">Aligning example reads</a></li>\n<li><a href=\"#paired-end-example\">Paired-end example</a></li>\n<li><a href=\"#using-samtoolsbcftools-downstream\">Using SAMtools/BCFtools 
downstream</a></li>\n</ul></li>\n</ul>\n</div>\n<!--\n ! This manual is written in \"markdown\" format and thus contains some\n ! distracting formatting clutter.  See 'MANUAL' for an easier-to-read version\n ! of this text document, or see the HTML manual online.\n ! -->\n\n<h1 id=\"introduction\">Introduction</h1>\n<h2 id=\"what-is-hisat\">What is HISAT?</h2>\n<p><a href=\"http://ccb.jhu.edu/software/hisat\">HISAT</a> is a fast and sensitive spliced alignment program. As part of HISAT, we have developed a new indexing scheme based on the Burrows-Wheeler transform (<a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">BWT</a>) and the <a href=\"http://en.wikipedia.org/wiki/FM-index\">FM index</a>, called hierarchical indexing, that employs two types of indexes: (1) one global FM index representing the whole genome, and (2) many separate local FM indexes for small regions collectively covering the genome. Our hierarchical index for the human genome (about 3 billion bp) includes ~48,000 local FM indexes, each representing a genomic region of ~64,000 bp. As the basis for non-gapped alignment, the FM index is extremely fast with a low memory footprint, as demonstrated by <a href=\"http://bowtie-bio.sf.net\">Bowtie</a>. In addition, HISAT provides several alignment strategies specifically designed for mapping different types of RNA-seq reads. Taken together, these features enable extremely fast and sensitive alignment of reads, in particular those spanning two or more exons. As a result, HISAT is more than 50 times faster than <a href=\"http://ccb.jhu.edu/software/tophat\">TopHat2</a>, with better alignment quality. Although it uses a large number of indexes, the memory requirement of HISAT is still modest, approximately 4.3 GB for human. HISAT uses the <a href=\"http://bowtie-bio.sf.net/bowtie2\">Bowtie2</a> implementation to handle most of the operations on the FM index. 
In addition to spliced alignment, HISAT handles reads involving indels and supports a paired-end alignment mode. Multiple processors can be used simultaneously to achieve greater alignment speed. HISAT outputs alignments in <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> format, enabling interoperation with a large number of other tools (e.g. <a href=\"http://samtools.sourceforge.net\">SAMtools</a>, <a href=\"http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit\">GATK</a>) that use SAM. HISAT is distributed under the <a href=\"http://www.gnu.org/licenses/gpl-3.0.html\">GPLv3 license</a>, and it runs on the command line under Linux, Mac OS X and Windows.</p>\n<h1 id=\"obtaining-hisat\">Obtaining HISAT</h1>\n<p>Download HISAT sources and binaries from the Releases section on the right side. Binaries are available for Intel architectures (<code>x86_64</code>) running Linux and Mac OS X.</p>\n<h2 id=\"building-from-source\">Building from source</h2>\n<p>Building HISAT from source requires a GNU-like environment with GCC, GNU Make and other basics. It should be possible to build HISAT on most vanilla Linux installations or on a Mac installation with <a href=\"http://developer.apple.com/xcode/\">Xcode</a> installed. HISAT can also be built on Windows using <a href=\"http://www.cygwin.com/\">Cygwin</a> or <a href=\"http://www.mingw.org/\">MinGW</a> (MinGW recommended). For a MinGW build, the choice of compiler is important, since it determines whether 32-bit or 64-bit code can be compiled successfully. If you need to generate both 32-bit and 64-bit builds on the same machine, a multilib MinGW has to be properly installed. <a href=\"http://www.mingw.org/wiki/msys\">MSYS</a>, the <a href=\"http://cygwin.com/packages/mingw-zlib/\">zlib</a> library and, depending on architecture, the <a href=\"http://sourceware.org/pthreads-win32/\">pthreads</a> library are also required. 
We recommend a 64-bit build since it has clear advantages for real-world research problems. To simplify the MinGW setup, it may be worth investigating popular MinGW personal builds, since these come already prepared with most of the toolchains needed.</p>\n<p>First, download the <a href=\"http://ccb.jhu.edu/software/hisat/downloads/hisat-0.1.0-beta.zip\">source package</a> from the Releases section on the right side. Unzip the file, change to the unzipped directory, and build the HISAT tools by running GNU <code>make</code> (usually with the command <code>make</code>, but sometimes with <code>gmake</code>) with no arguments. If building with MinGW, run <code>make</code> from the MSYS environment.</p>\n<p>HISAT uses the multithreading software model to speed up execution on SMP architectures where this is possible. On POSIX platforms (such as Linux and Mac OS X) it needs the pthread library. Although it is possible to use the pthread library on a non-POSIX platform like Windows, for performance reasons HISAT will try to use Windows native multithreading if possible.</p>\n<h1 id=\"running-hisat\">Running HISAT</h1>\n<h2 id=\"adding-to-path\">Adding to PATH</h2>\n<p>By adding your new HISAT directory to your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH environment variable</a>, you ensure that whenever you run <code>hisat</code>, <code>hisat-build</code> or <code>hisat-inspect</code> from the command line, you will get the version you just installed without having to specify the entire path. This is recommended for most users. 
To do this, follow your operating system's instructions for adding the directory to your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH</a>.</p>\n<p>If you would like to install HISAT by copying the HISAT executable files to an existing directory in your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH</a>, make sure that you copy all the executables, including <code>hisat</code>, <code>hisat-align-s</code>, <code>hisat-align-l</code>, <code>hisat-build</code>, <code>hisat-build-s</code>, <code>hisat-build-l</code>, <code>hisat-inspect</code>, <code>hisat-inspect-s</code> and <code>hisat-inspect-l</code>.</p>\n<h2 id=\"reporting\">Reporting</h2>\n<!--\nThe reporting mode governs how many alignments HISAT looks for, and how to\nreport them.  HISAT has three distinct reporting modes.  The default\nreporting mode is similar to the default reporting mode of many other read\nalignment tools, including [BWA].\n\nIn general, when we say that a read has an alignment, we mean that it has a\n[valid alignment].  When we say that a read has multiple alignments, we mean\nthat it has multiple alignments that are valid and distinct from one another. \n\n[valid alignment]: #valid-alignments-meet-or-exceed-the-minimum-score-threshold\n[BWA]: http://bio-bwa.sourceforge.net/\n\n### Distinct alignments map a read to different places\n\nTwo alignments for the same individual read are \"distinct\" if they map the same\nread to different places.  Specifically, we say that two alignments are distinct\nif there are no alignment positions where a particular read offset is aligned\nopposite a particular reference offset in both alignments with the same\norientation.  E.g. 
if the first alignment is in the forward orientation and\naligns the read character at read offset 10 to the reference character at\nchromosome 3, offset 3,445,245, and the second alignment is also in the forward\norientation and also aligns the read character at read offset 10 to the\nreference character at chromosome 3, offset 3,445,245, they are not distinct\nalignments.\n\nTwo alignments for the same pair are distinct if either the mate 1s in the two\npaired-end alignments are distinct or the mate 2s in the two alignments are\ndistinct or both.\n\n### Default mode: search for multiple alignments, report the best one\n\nBy default, HISAT searches for distinct, valid alignments for each read. When\nit finds a valid alignment, it generally will continue to look for alignments\nthat are nearly as good or better.  It will eventually stop looking, either\nbecause it exceeded a limit placed on search effort (see [`-D`] and [`-R`]) or\nbecause it already knows all it needs to know to report an alignment.\nInformation from the best alignments are used to estimate mapping quality (the\n`MAPQ` [SAM] field) and to set SAM optional fields, such as [`AS:i`] and\n[`XS:i`].  HISAT does not gaurantee that the alignment reported is the best\npossible in terms of alignment score.\n\nSee also: [`-D`], which puts an upper limit on the number of dynamic programming\nproblems (i.e. seed extensions) that can \"fail\" in a row before HISAT stops\nsearching.  Increasing [`-D`] makes HISAT slower, but increases the\nlikelihood that it will report the correct alignment for a read that aligns many\nplaces.\n\nSee also: [`-R`], which sets the maximum number of times HISAT will \"re-seed\"\nwhen attempting to align a read with repetitive seeds.  
Increasing [`-R`] makes\nHISAT slower, but increases the likelihood that it will report the correct\nalignment for a read that aligns many places.\n\n### -k mode: search for one or more alignments, report each\n\nIn [`-k`] mode, HISAT searches for up to N distinct, valid alignments for\neach read, where N equals the integer specified with the `-k` parameter.  That\nis, if `-k 2` is specified, HISAT will search for at most 2 distinct\nalignments.  It reports all alignments found, in descending order by alignment\nscore.  The alignment score for a paired-end alignment equals the sum of the\nalignment scores of the individual mates.  Each reported read or pair alignment\nbeyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS\nfield.  See the [SAM specification] for details.\n\nHISAT does not \"find\" alignments in any specific order, so for reads that\nhave more than N distinct, valid alignments, HISAT does not guarantee that\nthe N alignments reported are the best possible in terms of alignment score.\nStill, this mode can be effective and fast in situations where the user cares\nmore about whether a read aligns (or aligns a certain number of times) than\nwhere exactly it originated.\n-->\n\n<h2 id=\"alignment-summmary\">Alignment summary</h2>\n<p>When HISAT finishes running, it prints messages summarizing what happened. These messages are printed to the &quot;standard error&quot; (&quot;stderr&quot;) filehandle. 
For datasets consisting of unpaired reads, the summary might look like this:</p>\n<pre><code>20000 reads; of these:\n  20000 (100.00%) were unpaired; of these:\n    1247 (6.24%) aligned 0 times\n    18739 (93.69%) aligned exactly 1 time\n    14 (0.07%) aligned &gt;1 times\n93.77% overall alignment rate</code></pre>\n<p>For datasets consisting of pairs, the summary might look like this:</p>\n<pre><code>10000 reads; of these:\n  10000 (100.00%) were paired; of these:\n    650 (6.50%) aligned concordantly 0 times\n    8823 (88.23%) aligned concordantly exactly 1 time\n    527 (5.27%) aligned concordantly &gt;1 times\n    ----\n    650 pairs aligned concordantly 0 times; of these:\n      34 (5.23%) aligned discordantly 1 time\n    ----\n    616 pairs aligned 0 times concordantly or discordantly; of these:\n      1232 mates make up the pairs; of these:\n        660 (53.57%) aligned 0 times\n        571 (46.35%) aligned exactly 1 time\n        1 (0.08%) aligned &gt;1 times\n96.70% overall alignment rate</code></pre>\n<p>The indentation indicates how subtotals relate to totals.</p>\n<h2 id=\"wrapper\">Wrapper</h2>\n<p>The <code>hisat</code>, <code>hisat-build</code> and <code>hisat-inspect</code> executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between &quot;small&quot; and &quot;large&quot; index formats, discussed briefly in the following section. Also, the <code>hisat</code> wrapper provides some key functionality, like the ability to handle compressed inputs, and the functionality for <a href=\"#hisat-options-un\"><code>--un</code></a>, <a href=\"#hisat-options-al\"><code>--al</code></a> and related options.</p>\n<p>It is recommended that you always run the <code>hisat</code> wrappers rather than the binaries directly.</p>\n<h2 id=\"small-and-large-indexes\">Small and large indexes</h2>\n<p><code>hisat-build</code> can index reference genomes of any size. 
For genomes less than about 4 billion nucleotides in length, <code>hisat-build</code> builds a &quot;small&quot; index using 32-bit numbers in various parts of the index. When the genome is longer, <code>hisat-build</code> builds a &quot;large&quot; index using 64-bit numbers. Small indexes are stored in files with the <code>.bt2</code> extension, and large indexes are stored in files with the <code>.bt2l</code> extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.</p>\n<h2 id=\"performance-tuning\">Performance tuning</h2>\n<ol style=\"list-style-type: decimal\">\n<li><p>If your computer has multiple processors/cores, use <code>-p</code></p>\n<p>The <a href=\"#hisat-options-p\"><code>-p</code></a> option causes HISAT to launch a specified number of parallel search threads. Each thread runs on a different processor/core and all threads find alignments in parallel, increasing alignment throughput by approximately a multiple of the number of threads (though in practice, speedup is somewhat worse than linear).</p></li>\n</ol>\n<h2 id=\"command-line\">Command Line</h2>\n<h3 id=\"setting-function-options\">Setting function options</h3>\n<p>Some HISAT options specify a function rather than an individual number or setting. In these cases the user specifies three parameters: (a) a function type <code>F</code>, (b) a constant term <code>B</code>, and (c) a coefficient <code>A</code>. The available function types are constant (<code>C</code>), linear (<code>L</code>), square-root (<code>S</code>), and natural log (<code>G</code>). The parameters are specified as <code>F,B,A</code> - that is, the function type, the constant term, and the coefficient are separated by commas with no whitespace. 
The constant term and coefficient may be negative and/or floating-point numbers.</p>\n<p>For example, if the function specification is <code>L,-0.4,-0.6</code>, then the function defined is:</p>\n<pre><code>f(x) = -0.4 + -0.6 * x</code></pre>\n<p>If the function specification is <code>G,1,5.4</code>, then the function defined is:</p>\n<pre><code>f(x) = 1.0 + 5.4 * ln(x)</code></pre>\n<p>See the documentation for the option in question to learn what the parameter <code>x</code> represents. For example, in the case of the <a href=\"#hisat-options-score-min\"><code>--score-min</code></a> option, the function <code>f(x)</code> sets the minimum alignment score necessary for an alignment to be considered valid, and <code>x</code> is the read length.</p>\n<h3 id=\"usage\">Usage</h3>\n<pre><code>hisat [options]* -x &lt;hisat-idx&gt; {-1 &lt;m1&gt; -2 &lt;m2&gt; | -U &lt;r&gt;} -S [&lt;hit&gt;]</code></pre>\n<h3 id=\"main-arguments\">Main arguments</h3>\n<table><tr><td>\n\n<pre><code>-x &lt;hisat-idx&gt;</code></pre>\n</td><td>\n\n<p>The basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final <code>.1.bt2</code> / <code>.rev.1.bt2</code> / etc. <code>hisat</code> looks for the specified index first in the current directory, then in the directory specified in the <code>HISAT_INDEXES</code> environment variable.</p>\n</td></tr><tr><td>\n\n<pre><code>-1 &lt;m1&gt;</code></pre>\n</td><td>\n\n<p>Comma-separated list of files containing mate 1s (filename usually includes <code>_1</code>), e.g. <code>-1 flyA_1.fq,flyB_1.fq</code>. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <code>&lt;m2&gt;</code>. Reads may be a mix of different lengths. 
If <code>-</code> is specified, <code>hisat</code> will read the mate 1s from the &quot;standard in&quot; or &quot;stdin&quot; filehandle.</p>\n</td></tr><tr><td>\n\n<pre><code>-2 &lt;m2&gt;</code></pre>\n</td><td>\n\n<p>Comma-separated list of files containing mate 2s (filename usually includes <code>_2</code>), e.g. <code>-2 flyA_2.fq,flyB_2.fq</code>. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <code>&lt;m1&gt;</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>hisat</code> will read the mate 2s from the &quot;standard in&quot; or &quot;stdin&quot; filehandle.</p>\n</td></tr><tr><td>\n\n<pre><code>-U &lt;r&gt;</code></pre>\n</td><td>\n\n<p>Comma-separated list of files containing unpaired reads to be aligned, e.g. <code>lane1.fq,lane2.fq,lane3.fq,lane4.fq</code>. Reads may be a mix of different lengths. If <code>-</code> is specified, <code>hisat</code> gets the reads from the &quot;standard in&quot; or &quot;stdin&quot; filehandle.</p>\n</td></tr><tr><td>\n\n<pre><code>-S &lt;hit&gt;</code></pre>\n</td><td>\n\n<p>File to write SAM alignments to. By default, alignments are written to the &quot;standard out&quot; or &quot;stdout&quot; filehandle (i.e. the console).</p>\n</td></tr></table>\n\n<h3 id=\"options\">Options</h3>\n<h4 id=\"input-options\">Input options</h4>\n<table>\n<tr><td id=\"hisat-options-q\">\n\n<pre><code>-q</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are FASTQ files. FASTQ files usually have extension <code>.fq</code> or <code>.fastq</code>. FASTQ is the default format. 
See also: <a href=\"#hisat-options-solexa-quals\"><code>--solexa-quals</code></a> and <a href=\"#hisat-options-int-quals\"><code>--int-quals</code></a>.</p>\n</td></tr>\n<tr><td id=\"hisat-options-qseq\">\n\n<pre><code>--qseq</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are QSEQ files. QSEQ files usually end in <code>_qseq.txt</code>. See also: <a href=\"#hisat-options-solexa-quals\"><code>--solexa-quals</code></a> and <a href=\"#hisat-options-int-quals\"><code>--int-quals</code></a>.</p>\n</td></tr>\n<tr><td id=\"hisat-options-f\">\n\n<pre><code>-f</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are FASTA files. FASTA files usually have extension <code>.fa</code>, <code>.fasta</code>, <code>.mfa</code>, <code>.fna</code> or similar. FASTA files do not have a way of specifying quality values, so when <code>-f</code> is set, the result is as if <code>--ignore-quals</code> is also set.</p>\n</td></tr>\n<tr><td id=\"hisat-options-r\">\n\n<pre><code>-r</code></pre>\n</td><td>\n\n<p>Reads (specified with <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code>, <code>&lt;s&gt;</code>) are files with one input sequence per line, without any other information (no read names, no qualities). When <code>-r</code> is set, the result is as if <code>--ignore-quals</code> is also set.</p>\n</td></tr>\n<tr><td id=\"hisat-options-c\">\n\n<pre><code>-c</code></pre>\n</td><td>\n\n<p>The read sequences are given on command line. I.e. <code>&lt;m1&gt;</code>, <code>&lt;m2&gt;</code> and <code>&lt;singles&gt;</code> are comma-separated lists of reads rather than lists of read files. There is no way to specify read names or qualities, so <code>-c</code> also implies <code>--ignore-quals</code>.</p>\n</td></tr>\n<tr><td id=\"hisat-options-s\">\n\n<pre><code>-s/--skip &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Skip (i.e. 
do not align) the first <code>&lt;int&gt;</code> reads or pairs in the input.</p>\n</td></tr>\n<tr><td id=\"hisat-options-u\">\n\n<pre><code>-u/--qupto &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Align the first <code>&lt;int&gt;</code> reads or read pairs from the input (after the <a href=\"#hisat-options-s\"><code>-s</code>/<code>--skip</code></a> reads or pairs have been skipped), then stop. Default: no limit.</p>\n</td></tr>\n<tr><td id=\"hisat-options-5\">\n\n<pre><code>-5/--trim5 &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Trim <code>&lt;int&gt;</code> bases from 5' (left) end of each read before alignment (default: 0).</p>\n</td></tr>\n<tr><td id=\"hisat-options-3\">\n\n<pre><code>-3/--trim3 &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Trim <code>&lt;int&gt;</code> bases from 3' (right) end of each read before alignment (default: 0).</p>\n</td></tr><tr><td id=\"hisat-options-phred33-quals\">\n\n<pre><code>--phred33</code></pre>\n</td><td>\n\n<p>Input qualities are ASCII chars equal to the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> plus 33. This is also called the &quot;Phred+33&quot; encoding, which is used by the very latest Illumina pipelines.</p>\n</td></tr>\n<tr><td id=\"hisat-options-phred64-quals\">\n\n<pre><code>--phred64</code></pre>\n</td><td>\n\n<p>Input qualities are ASCII chars equal to the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> plus 64. This is also called the &quot;Phred+64&quot; encoding.</p>\n</td></tr>\n<tr><td id=\"hisat-options-solexa-quals\">\n\n<pre><code>--solexa-quals</code></pre>\n</td><td>\n\n<p>Convert input qualities from <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Solexa</a> (which can be negative) to <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred</a> (which can't). This scheme was used in older Illumina GA Pipeline versions (prior to 1.3). 
Default: off.</p>\n</td></tr>\n<tr><td id=\"hisat-options-int-quals\">\n\n<pre><code>--int-quals</code></pre>\n</td><td>\n\n<p>Quality values are represented in the read input file as space-separated ASCII integers, e.g., <code>40 40 30 40</code>..., rather than ASCII characters, e.g., <code>II?I</code>.... Integers are treated as being on the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> scale unless <a href=\"#hisat-options-solexa-quals\"><code>--solexa-quals</code></a> is also specified. Default: off.</p>\n</td></tr></table>\n\n<h4 id=\"alignment-options\">Alignment options</h4>\n<table>\n\n<tr><td id=\"hisat-options-n-ceil\">\n\n<pre><code>--n-ceil &lt;func&gt;</code></pre>\n</td><td>\n\n<p>Sets a function governing the maximum number of ambiguous characters (usually <code>N</code>s and/or <code>.</code>s) allowed in a read as a function of read length. For instance, specifying <code>L,0,0.15</code> sets the N-ceiling function <code>f</code> to <code>f(x) = 0 + 0.15 * x</code>, where <code>x</code> is the read length. See also: <a href=\"#setting-function-options\">setting function options</a>. Reads exceeding this ceiling are <a href=\"#filtering\">filtered out</a>. Default: <code>L,0,0.15</code>.</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-ignore-quals\">\n\n<pre><code>--ignore-quals</code></pre>\n</td><td>\n\n<p>When calculating a mismatch penalty, always consider the quality value at the mismatched position to be the highest possible, regardless of the actual value. I.e. input is treated as though all quality values are high. This is also the default behavior when the input doesn't specify quality values (e.g. 
in <a href=\"#hisat-options-f\"><code>-f</code></a>, <a href=\"#hisat-options-r\"><code>-r</code></a>, or <a href=\"#hisat-options-c\"><code>-c</code></a> modes).</p>\n</td></tr>\n<tr><td id=\"hisat-options-nofw\">\n\n<pre><code>--nofw/--norc</code></pre>\n</td><td>\n\n<p>If <code>--nofw</code> is specified, <code>hisat</code> will not attempt to align unpaired reads to the forward (Watson) reference strand. If <code>--norc</code> is specified, <code>hisat</code> will not attempt to align unpaired reads against the reverse-complement (Crick) reference strand. In paired-end mode, <code>--nofw</code> and <code>--norc</code> pertain to the fragments; i.e. specifying <code>--nofw</code> causes <code>hisat</code> to explore only those paired-end configurations corresponding to fragments from the reverse-complement (Crick) strand. Default: both strands enabled.</p>\n</td></tr>\n\n<!--\n<tr><td id=\"hisat-options-end-to-end\">\n\n[`--end-to-end`]: #hisat-options-end-to-end\n\n    --end-to-end\n\n</td><td>\n\nIn this mode, HISAT requires that the entire read align from one end to the\nother, without any trimming (or \"soft clipping\") of characters from either end.\nThe match bonus [`--ma`] always equals 0 in this mode, so all alignment scores\nare less than or equal to 0, and the greatest possible alignment score is 0.\nThis is mutually exclusive with [`--local`].  `--end-to-end` is the default mode.\n\n</td></tr>\n<tr><td id=\"hisat-options-local\">\n\n[`--local`]: #hisat-options-local\n\n    --local\n\n</td><td>\n\nIn this mode, HISAT does not require that the entire read align from one end\nto the other.  Rather, some characters may be omitted (\"soft clipped\") from the\nends in order to achieve the greatest possible alignment score.  The match bonus\n[`--ma`] is used in this mode, and the best possible alignment score is equal to\nthe match bonus ([`--ma`]) times the length of the read.  Specifying `--local`\nand one of the presets (e.g. 
`--local --very-fast`) is equivalent to specifying\nthe local version of the preset (`--very-fast-local`).  This is mutually\nexclusive with [`--end-to-end`].  `--end-to-end` is the default mode.\n\n</td></tr>\n-->\n\n</table>\n\n<h4 id=\"scoring-options\">Scoring options</h4>\n<table>\n\n<tr><td id=\"hisat-options-ma\">\n\n<pre><code>--ma &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Sets the match bonus. In <code>--local</code> mode <code>&lt;int&gt;</code> is added to the alignment score for each position where a read character aligns to a reference character and the characters match. Not used in <code>--end-to-end</code> mode. Default: 2.</p>\n</td></tr>\n<tr><td id=\"hisat-options-mp\">\n\n<pre><code>--mp MX,MN</code></pre>\n</td><td>\n\n<p>Sets the maximum (<code>MX</code>) and minimum (<code>MN</code>) mismatch penalties, both integers. A number less than or equal to <code>MX</code> and greater than or equal to <code>MN</code> is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an <code>N</code>. If <a href=\"#hisat-options-ignore-quals\"><code>--ignore-quals</code></a> is specified, the number subtracted equals <code>MX</code>. Otherwise, the number subtracted is <code>MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )</code> where <code>Q</code> is the Phred quality value. Default: <code>MX</code> = 6, <code>MN</code> = 2.</p>\n</td></tr>\n<tr><td id=\"hisat-options-np\">\n\n<pre><code>--np &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Sets the penalty for positions where the read, reference, or both, contain an ambiguous character such as <code>N</code>. Default: 1.</p>\n</td></tr>\n<tr><td id=\"hisat-options-rdg\">\n\n<pre><code>--rdg &lt;int1&gt;,&lt;int2&gt;</code></pre>\n</td><td>\n\n<p>Sets the read gap open (<code>&lt;int1&gt;</code>) and extend (<code>&lt;int2&gt;</code>) penalties. 
A read gap of length N gets a penalty of <code>&lt;int1&gt;</code> + N * <code>&lt;int2&gt;</code>. Default: 5, 3.</p>\n</td></tr>\n<tr><td id=\"hisat-options-rfg\">\n\n<pre><code>--rfg &lt;int1&gt;,&lt;int2&gt;</code></pre>\n</td><td>\n\n<p>Sets the reference gap open (<code>&lt;int1&gt;</code>) and extend (<code>&lt;int2&gt;</code>) penalties. A reference gap of length N gets a penalty of <code>&lt;int1&gt;</code> + N * <code>&lt;int2&gt;</code>. Default: 5, 3.</p>\n</td></tr>\n<tr><td id=\"hisat-options-score-min\">\n\n<pre><code>--score-min &lt;func&gt;</code></pre>\n</td><td>\n\n<p>Sets a function governing the minimum alignment score needed for an alignment to be considered &quot;valid&quot; (i.e. good enough to report). This is a function of read length. For instance, specifying <code>L,0,-0.6</code> sets the minimum-score function <code>f</code> to <code>f(x) = 0 + -0.6 * x</code>, where <code>x</code> is the read length. See also: <a href=\"#setting-function-options\">setting function options</a>. Default: <code>C,-18,0</code>.</p>\n</td></tr>\n</table>\n\n<h4 id=\"spliced-alignment-options\">Spliced alignment options</h4>\n<table>\n\n<tr><td id=\"hisat-options-pen-cansplice\">\n\n<pre><code>--pen-cansplice &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Sets the penalty for a canonical splice site. Default: 0.</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-pen-noncansplice\">\n\n<pre><code>--pen-noncansplice &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Sets the penalty for a non-canonical splice site. Default: 3.</p>\n</td></tr>\n<tr><td id=\"hisat-options-pen-intronlen\">\n\n<pre><code>--pen-intronlen &lt;func&gt;</code></pre>\n</td><td>\n\n<p>Sets the penalty for long introns so that alignments with shorter introns are preferred to those with longer introns. 
Default: <code>G,-8,1</code>.</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-known-splicesite-infile\">\n\n<pre><code>--known-splicesite-infile &lt;path&gt;</code></pre>\n</td><td>\n\n<p>With this option, you can provide a list of known splice sites, which HISAT uses to align reads with small anchors.<br />You can create such a list using <code>python extract_splice_sites.py genes.gtf &gt; splicesites.txt</code>, where <code>extract_splice_sites.py</code> is included in the HISAT package, <code>genes.gtf</code> is a gene annotation file, and <code>splicesites.txt</code> is the resulting list of splice sites that you pass to HISAT with this option.</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-novel-splice-outfile\">\n\n<pre><code>--novel-splicesite-outfile &lt;path&gt;</code></pre>\n</td><td>\n\n<p>With this option, HISAT reports a list of splice sites in the file <code>&lt;path&gt;</code>, in the following format:<br /> chromosome name &quot;tab&quot; genomic position of the flanking base on the left side of an intron &quot;tab&quot; genomic position of the flanking base on the right &quot;tab&quot; strand</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-novel-splicesite-infile\">\n\n<pre><code>--novel-splicesite-infile &lt;path&gt;</code></pre>\n</td><td>\n\n<p>With this option, you can provide a list of novel splice sites that was generated with the <code>--novel-splicesite-outfile</code> option above.</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-no-temp-splicesite\">\n\n<pre><code>--no-temp-splicesite</code></pre>\n</td><td>\n\n<p>By default, HISAT uses splice sites found by earlier reads to align later reads in the same run, in particular reads with small anchors (&lt;= 15 bp).<br />This option disables that default alignment strategy.</p>\n</td></tr>\n\n<tr><td id=\"hisat-options-no-spliced-alignment\">\n\n<pre><code>--no-spliced-alignment</code></pre>\n</td><td>\n\n<p>Disable spliced alignment.</p>\n</td></tr>\n\n\n<tr><td id=\"hisat-options-rna-strandness\">\n\n<pre><code>--rna-strandness 
&lt;string&gt;</code></pre>\n</td><td>\n\n<p>Specify strand-specific information: the default is unstranded.<br />For single-end reads, use F or R. 'F' means a read corresponds to a transcript. 'R' means a read corresponds to the reverse-complemented counterpart of a transcript. For paired-end reads, use either FR or RF.<br />Every read alignment will have an XS attribute tag: '+' means a read belongs to a transcript on the '+' strand of the genome. '-' means a read belongs to a transcript on the '-' strand of the genome. <br />\n(TopHat has a similar option, <code>--library-type</code>, where fr-firststrand corresponds to R and RF; fr-secondstrand corresponds to F and FR.)</p>\n</td></tr>\n\n</table>\n\n<h4 id=\"reporting-options\">Reporting options</h4>\n<table>\n\n<tr><td id=\"hisat-options-k\">\n\n<pre><code>-k &lt;int&gt;</code></pre>\n</td><td>\n\n<p>HISAT searches for at most <code>&lt;int&gt;</code> distinct, valid alignments for each read. The search terminates when it can't find more distinct valid alignments, or when it finds <code>&lt;int&gt;</code>, whichever happens first. All alignments found are reported in descending order by alignment score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field. For reads that have more than <code>&lt;int&gt;</code> distinct, valid alignments, <code>hisat</code> does not guarantee that the <code>&lt;int&gt;</code> alignments reported are the best possible in terms of alignment score. 
Default: 5</p>\n<p>Note: HISAT is not designed with large values for <code>-k</code> in mind, and when aligning reads to long, repetitive genomes, a large <code>-k</code> can be very slow.</p>\n</td></tr>\n\n</table>\n\n<h4 id=\"paired-end-options\">Paired-end options</h4>\n<table>\n\n<tr><td id=\"hisat-options-I\">\n\n<pre><code>-I/--minins &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The minimum fragment length for valid paired-end alignments. E.g. if <code>-I 60</code> is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as <a href=\"#hisat-options-X\"><code>-X</code></a> is also satisfied). A 19-bp gap would not be valid in that case. If trimming options <a href=\"#hisat-options-3\"><code>-3</code></a> or <a href=\"#hisat-options-5\"><code>-5</code></a> are also used, the <a href=\"#hisat-options-I\"><code>-I</code></a> constraint is applied with respect to the untrimmed mates.</p>\n<p>The larger the difference between <a href=\"#hisat-options-I\"><code>-I</code></a> and <a href=\"#hisat-options-X\"><code>-X</code></a>, the slower HISAT will run. This is because larger differences between <a href=\"#hisat-options-I\"><code>-I</code></a> and <a href=\"#hisat-options-X\"><code>-X</code></a> require that HISAT scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), HISAT is very efficient.</p>\n<p>Default: 0 (essentially imposing no minimum)</p>\n</td></tr>\n<tr><td id=\"hisat-options-X\">\n\n<pre><code>-X/--maxins &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The maximum fragment length for valid paired-end alignments. E.g. 
if <code>-X 100</code> is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as <a href=\"#hisat-options-I\"><code>-I</code></a> is also satisfied). A 61-bp gap would not be valid in that case. If trimming options <a href=\"#hisat-options-3\"><code>-3</code></a> or <a href=\"#hisat-options-5\"><code>-5</code></a> are also used, the <code>-X</code> constraint is applied with respect to the untrimmed mates, not the trimmed mates.</p>\n<p>The larger the difference between <a href=\"#hisat-options-I\"><code>-I</code></a> and <a href=\"#hisat-options-X\"><code>-X</code></a>, the slower HISAT will run. This is because larger differences between <a href=\"#hisat-options-I\"><code>-I</code></a> and <a href=\"#hisat-options-X\"><code>-X</code></a> require that HISAT scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), HISAT is very efficient.</p>\n<p>Default: 500.</p>\n</td></tr>\n<tr><td id=\"hisat-options-fr\">\n\n<pre><code>--fr/--rf/--ff</code></pre>\n</td><td>\n\n<p>The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if <code>--fr</code> is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints (<a href=\"#hisat-options-I\"><code>-I</code></a> and <a href=\"#hisat-options-X\"><code>-X</code></a>) are met, that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 and all other constraints are met, that too is valid. <code>--rf</code> likewise requires that an upstream mate 1 be reverse-complemented and a downstream mate 2 be forward-oriented. <code>--ff</code> requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented. 
Default: <code>--fr</code> (appropriate for Illumina's Paired-end Sequencing Assay).</p>\n</td></tr>\n<tr><td id=\"hisat-options-no-mixed\">\n\n<pre><code>--no-mixed</code></pre>\n</td><td>\n\n<p>By default, when <code>hisat</code> cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior.</p>\n</td></tr>\n<tr><td id=\"hisat-options-no-discordant\">\n\n<pre><code>--no-discordant</code></pre>\n</td><td>\n\n<p>By default, <code>hisat</code> looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (<a href=\"#hisat-options-fr\"><code>--fr</code>/<code>--rf</code>/<code>--ff</code></a>, <a href=\"#hisat-options-I\"><code>-I</code></a>, <a href=\"#hisat-options-X\"><code>-X</code></a>). This option disables that behavior.</p>\n</td></tr>\n<tr><td id=\"hisat-options-dovetail\">\n\n<pre><code>--dovetail</code></pre>\n</td><td>\n\n<p>If the mates &quot;dovetail&quot;, that is if one mate alignment extends past the beginning of the other such that the wrong mate begins upstream, consider that to be concordant. See also: <a href=\"#mates-can-overlap-contain-or-dovetail-each-other\">Mates can overlap, contain or dovetail each other</a>. Default: mates cannot dovetail in a concordant alignment.</p>\n</td></tr>\n<tr><td id=\"hisat-options-no-contain\">\n\n<pre><code>--no-contain</code></pre>\n</td><td>\n\n<p>If one mate alignment contains the other, consider that to be non-concordant. See also: <a href=\"#mates-can-overlap-contain-or-dovetail-each-other\">Mates can overlap, contain or dovetail each other</a>. 
Default: a mate can contain the other in a concordant alignment.</p>\n</td></tr>\n<tr><td id=\"hisat-options-no-overlap\">\n\n<pre><code>--no-overlap</code></pre>\n</td><td>\n\n<p>If one mate alignment overlaps the other at all, consider that to be non-concordant. See also: <a href=\"#mates-can-overlap-contain-or-dovetail-each-other\">Mates can overlap, contain or dovetail each other</a>. Default: mates can overlap in a concordant alignment.</p>\n</td></tr></table>\n\n<h4 id=\"output-options\">Output options</h4>\n<table>\n\n<tr><td id=\"hisat-options-t\">\n\n<pre><code>-t/--time</code></pre>\n</td><td>\n\n<p>Print the wall-clock time required to load the index files and align the reads. This is printed to the &quot;standard error&quot; (&quot;stderr&quot;) filehandle. Default: off.</p>\n</td></tr>\n<tr><td id=\"hisat-options-un\">\n\n<pre><code>--un &lt;path&gt;\n--un-gz &lt;path&gt;\n--un-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write unpaired reads that fail to align to file at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code> bit set and neither the <code>0x40</code> nor <code>0x80</code> bits set. If <code>--un-gz</code> is specified, output will be gzip compressed. If <code>--un-bz2</code> is specified, output will be bzip2 compressed. Reads written in this way will appear exactly as they did in the input file, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the input.</p>\n</td></tr>\n<tr><td id=\"hisat-options-al\">\n\n<pre><code>--al &lt;path&gt;\n--al-gz &lt;path&gt;\n--al-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write unpaired reads that align at least once to file at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code>, <code>0x40</code>, and <code>0x80</code> bits unset. 
If <code>--al-gz</code> is specified, output will be gzip compressed. If <code>--al-bz2</code> is specified, output will be bzip2 compressed. Reads written in this way will appear exactly as they did in the input file, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the input.</p>\n</td></tr>\n<tr><td id=\"hisat-options-un-conc\">\n\n<pre><code>--un-conc &lt;path&gt;\n--un-conc-gz &lt;path&gt;\n--un-conc-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write paired-end reads that fail to align concordantly to file(s) at <code>&lt;path&gt;</code>. These reads correspond to the SAM records with the FLAGS <code>0x4</code> bit set and either the <code>0x40</code> or <code>0x80</code> bit set (depending on whether it's mate #1 or #2). <code>.1</code> and <code>.2</code> strings are added to the filename to distinguish which file contains mate #1 and mate #2. If a percent symbol, <code>%</code>, is used in <code>&lt;path&gt;</code>, the percent symbol is replaced with <code>1</code> or <code>2</code> to make the per-mate filenames. Otherwise, <code>.1</code> or <code>.2</code> are added before the final dot in <code>&lt;path&gt;</code> to make the per-mate filenames. Reads written in this way will appear exactly as they did in the input files, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the inputs.</p>\n</td></tr>\n<tr><td id=\"hisat-options-al-conc\">\n\n<pre><code>--al-conc &lt;path&gt;\n--al-conc-gz &lt;path&gt;\n--al-conc-bz2 &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write paired-end reads that align concordantly at least once to file(s) at <code>&lt;path&gt;</code>. 
These reads correspond to the SAM records with the FLAGS <code>0x4</code> bit unset and either the <code>0x40</code> or <code>0x80</code> bit set (depending on whether it's mate #1 or #2). <code>.1</code> and <code>.2</code> strings are added to the filename to distinguish which file contains mate #1 and mate #2. If a percent symbol, <code>%</code>, is used in <code>&lt;path&gt;</code>, the percent symbol is replaced with <code>1</code> or <code>2</code> to make the per-mate filenames. Otherwise, <code>.1</code> or <code>.2</code> are added before the final dot in <code>&lt;path&gt;</code> to make the per-mate filenames. Reads written in this way will appear exactly as they did in the input files, without any modification (same sequence, same name, same quality string, same quality encoding). Reads will not necessarily appear in the same order as they did in the inputs.</p>\n</td></tr>\n<tr><td id=\"hisat-options-quiet\">\n\n<pre><code>--quiet</code></pre>\n</td><td>\n\n<p>Print nothing besides alignments and serious errors.</p>\n</td></tr>\n<tr><td id=\"hisat-options-met-file\">\n\n<pre><code>--met-file &lt;path&gt;</code></pre>\n</td><td>\n\n<p>Write <code>hisat</code> metrics to file <code>&lt;path&gt;</code>. Having alignment metrics can be useful for debugging certain problems, especially performance issues. See also: <a href=\"#hisat-options-met\"><code>--met</code></a>. Default: metrics disabled.</p>\n</td></tr>\n<tr><td id=\"hisat-options-met-stderr\">\n\n<pre><code>--met-stderr</code></pre>\n</td><td>\n\n<p>Write <code>hisat</code> metrics to the &quot;standard error&quot; (&quot;stderr&quot;) filehandle. This is not mutually exclusive with <a href=\"#hisat-options-met-file\"><code>--met-file</code></a>. Having alignment metrics can be useful for debugging certain problems, especially performance issues. See also: <a href=\"#hisat-options-met\"><code>--met</code></a>. 
Default: metrics disabled.</p>\n</td></tr>\n<tr><td id=\"hisat-options-met\">\n\n<pre><code>--met &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Write a new <code>hisat</code> metrics record every <code>&lt;int&gt;</code> seconds. Only matters if either <a href=\"#hisat-options-met-stderr\"><code>--met-stderr</code></a> or <a href=\"#hisat-options-met-file\"><code>--met-file</code></a> are specified. Default: 1.</p>\n</td></tr>\n</table>\n\n<h4 id=\"sam-options\">SAM options</h4>\n<table>\n\n<tr><td id=\"hisat-options-no-unal\">\n\n<pre><code>--no-unal</code></pre>\n</td><td>\n\n<p>Suppress SAM records for reads that failed to align.</p>\n</td></tr>\n<tr><td id=\"hisat-options-no-hd\">\n\n<pre><code>--no-hd</code></pre>\n</td><td>\n\n<p>Suppress SAM header lines (starting with <code>@</code>).</p>\n</td></tr>\n<tr><td id=\"hisat-options-no-sq\">\n\n<pre><code>--no-sq</code></pre>\n</td><td>\n\n<p>Suppress <code>@SQ</code> SAM header lines.</p>\n</td></tr>\n<tr><td id=\"hisat-options-rg-id\">\n\n<pre><code>--rg-id &lt;text&gt;</code></pre>\n</td><td>\n\n<p>Set the read group ID to <code>&lt;text&gt;</code>. This causes the SAM <code>@RG</code> header line to be printed, with <code>&lt;text&gt;</code> as the value associated with the <code>ID:</code> tag. It also causes the <code>RG:Z:</code> extra field to be attached to each SAM output record, with value set to <code>&lt;text&gt;</code>.</p>\n</td></tr>\n<tr><td id=\"hisat-options-rg\">\n\n<pre><code>--rg &lt;text&gt;</code></pre>\n</td><td>\n\n<p>Add <code>&lt;text&gt;</code> (usually of the form <code>TAG:VAL</code>, e.g. <code>SM:Pool1</code>) as a field on the <code>@RG</code> header line. Note: in order for the <code>@RG</code> line to appear, <a href=\"#hisat-options-rg-id\"><code>--rg-id</code></a> must also be specified. This is because the <code>ID</code> tag is required by the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM Spec</a>. Specify <code>--rg</code> multiple times to set multiple fields. 
See the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM Spec</a> for details about what fields are legal.</p>\n</td></tr>\n<tr><td id=\"hisat-options-omit-sec-seq\">\n\n<pre><code>--omit-sec-seq</code></pre>\n</td><td>\n\n<p>When printing secondary alignments, HISAT by default will write out the <code>SEQ</code> and <code>QUAL</code> strings. Specifying this option causes HISAT to print an asterisk in those fields instead.</p>\n</td></tr>\n\n\n</table>\n\n<h4 id=\"performance-options\">Performance options</h4>\n<table><tr>\n\n<td id=\"hisat-options-o\">\n\n<pre><code>-o/--offrate &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Override the offrate of the index with <code>&lt;int&gt;</code>. If <code>&lt;int&gt;</code> is greater than the offrate used to build the index, then some row markings are discarded when the index is read into memory. This reduces the memory footprint of the aligner but requires more time to calculate text offsets. <code>&lt;int&gt;</code> must be greater than the value used to build the index.</p>\n</td></tr>\n<tr><td id=\"hisat-options-p\">\n\n<pre><code>-p/--threads NTHREADS</code></pre>\n</td><td>\n\n<p>Launch <code>NTHREADS</code> parallel search threads (default: 1). Threads will run on separate processors/cores and synchronize when parsing reads and outputting alignments. Searching for alignments is highly parallel, and speedup is close to linear. Increasing <code>-p</code> increases HISAT's memory footprint. E.g. when aligning to a human genome index, increasing <code>-p</code> from 1 to 8 increases the memory footprint by a few hundred megabytes. This option is only available if <code>bowtie</code> is linked with the <code>pthreads</code> library (i.e. 
if <code>BOWTIE_PTHREADS=0</code> is not specified at build time).</p>\n</td></tr>\n<tr><td id=\"hisat-options-reorder\">\n\n<pre><code>--reorder</code></pre>\n</td><td>\n\n<p>Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file, even when <a href=\"#hisat-options-p\"><code>-p</code></a> is set greater than 1. Specifying <code>--reorder</code> and setting <a href=\"#hisat-options-p\"><code>-p</code></a> greater than 1 causes HISAT to run somewhat slower and use somewhat more memory than if <code>--reorder</code> were not specified. Has no effect if <a href=\"#hisat-options-p\"><code>-p</code></a> is set to 1, since output order will naturally correspond to input order in that case.</p>\n</td></tr>\n<tr><td id=\"hisat-options-mm\">\n\n<pre><code>--mm</code></pre>\n</td><td>\n\n<p>Use memory-mapped I/O to load the index, rather than typical file I/O. Memory-mapping allows many concurrent <code>bowtie</code> processes on the same computer to share the same memory image of the index (i.e. you pay the memory overhead just once). This facilitates memory-efficient parallelization of <code>bowtie</code> in situations where using <a href=\"#hisat-options-p\"><code>-p</code></a> is not possible or not preferable.</p>\n</td></tr></table>\n\n<h4 id=\"other-options\">Other options</h4>\n<table>\n<tr><td id=\"hisat-options-qc-filter\">\n\n<pre><code>--qc-filter</code></pre>\n</td><td>\n\n<p>Filter out reads for which the QSEQ filter field is non-zero. Only has an effect when read format is <a href=\"#hisat-options-qseq\"><code>--qseq</code></a>. Default: off.</p>\n</td></tr>\n<tr><td id=\"hisat-options-seed\">\n\n<pre><code>--seed &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Use <code>&lt;int&gt;</code> as the seed for the pseudo-random number generator. 
Default: 0.</p>\n</td></tr>\n<tr><td id=\"hisat-options-non-deterministic\">\n\n<pre><code>--non-deterministic</code></pre>\n</td><td>\n\n<p>Normally, HISAT re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the <a href=\"#hisat-options-seed\"><code>--seed</code></a> option. This means that if two reads are identical (same name, same nucleotides, same qualities) HISAT will find and report the same alignment(s) for both, even if there was ambiguity. When <code>--non-deterministic</code> is specified, HISAT re-initializes its pseudo-random generator for each read using the current time. This means that HISAT will not necessarily report the same alignment for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.</p>\n</td></tr>\n<tr><td id=\"hisat-options-version\">\n\n<pre><code>--version</code></pre>\n</td><td>\n\n<p>Print version information and quit.</p>\n</td></tr>\n<tr><td id=\"hisat-options-h\">\n\n<pre><code>-h/--help</code></pre>\n</td><td>\n\n<p>Print usage information and quit.</p>\n</td></tr></table>\n\n<h2 id=\"sam-output\">SAM output</h2>\n<p>Following is a brief description of the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> format as output by <code>hisat</code>. For more details, see the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM format specification</a>.</p>\n<p>By default, <code>hisat</code> prints a SAM header with <code>@HD</code>, <code>@SQ</code> and <code>@PG</code> lines. 
When one or more <a href=\"#hisat-options-rg\"><code>--rg</code></a> arguments are specified, <code>hisat</code> will also print an <code>@RG</code> line that includes all user-specified <a href=\"#hisat-options-rg\"><code>--rg</code></a> tokens separated by tabs.</p>\n<p>Each subsequent line describes an alignment or, if the read failed to align, a read. Each line is a collection of at least 12 fields separated by tabs; from left to right, the fields are:</p>\n<ol style=\"list-style-type: decimal\">\n<li><p>Name of read that aligned.</p>\n<p>Note that the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM specification</a> disallows whitespace in the read name. If the read name contains any whitespace characters, HISAT will truncate the name at the first whitespace character. This is similar to the behavior of other tools.</p></li>\n<li><p>Sum of all applicable flags. Flags relevant to HISAT are:</p>\n<table><tr><td>\n\n<pre><code>1</code></pre>\n</td><td>\n\n<p>The read is one of a pair</p>\n</td></tr><tr><td>\n\n<pre><code>2</code></pre>\n</td><td>\n\n<p>The alignment is one end of a proper paired-end alignment</p>\n</td></tr><tr><td>\n\n<pre><code>4</code></pre>\n</td><td>\n\n<p>The read has no reported alignments</p>\n</td></tr><tr><td>\n\n<pre><code>8</code></pre>\n</td><td>\n\n<p>The read is one of a pair and has no reported alignments</p>\n</td></tr><tr><td>\n\n<pre><code>16</code></pre>\n</td><td>\n\n<p>The alignment is to the reverse reference strand</p>\n</td></tr><tr><td>\n\n<pre><code>32</code></pre>\n</td><td>\n\n<p>The other mate in the paired-end alignment is aligned to the reverse reference strand</p>\n</td></tr><tr><td>\n\n<pre><code>64</code></pre>\n</td><td>\n\n<p>The read is mate 1 in a pair</p>\n</td></tr><tr><td>\n\n<pre><code>128</code></pre>\n</td><td>\n\n<p>The read is mate 2 in a pair</p>\n</td></tr></table>\n\n<p>Thus, an unpaired read that aligns to the reverse reference strand will have flag 16. 
A paired-end read that aligns and is the first mate in the pair will have flag 83 (= 64 + 16 + 2 + 1).</p></li>\n<li><p>Name of reference sequence where alignment occurs</p></li>\n<li><p>1-based offset into the forward reference strand where leftmost character of the alignment occurs</p></li>\n<li><p>Mapping quality</p></li>\n<li><p>CIGAR string representation of alignment</p></li>\n<li><p>Name of reference sequence where mate's alignment occurs. Set to <code>=</code> if the mate's reference sequence is the same as this alignment's, or <code>*</code> if there is no mate.</p></li>\n<li><p>1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.</p></li>\n<li><p>Inferred fragment length. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if the mates did not align concordantly. However, size is non-0 if the mates aligned discordantly to the same chromosome.</p></li>\n<li><p>Read sequence (reverse-complemented if aligned to the reverse strand)</p></li>\n<li><p>ASCII-encoded read qualities (reversed if the read aligned to the reverse strand). The encoded quality values are on the <a href=\"http://en.wikipedia.org/wiki/Phred_quality_score\">Phred quality</a> scale and the encoding is ASCII-offset by 33 (ASCII char <code>!</code>), similarly to a <a href=\"http://en.wikipedia.org/wiki/FASTQ_format\">FASTQ</a> file.</p></li>\n<li><p>Optional fields. Fields are tab-separated. <code>hisat</code> outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:</p>\n<table>\n<tr><td id=\"hisat-build-opt-fields-as\">\n\n<pre><code>AS:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>Alignment score. Can be negative. Can be greater than 0 in [<code>--local</code>] mode (but not in [<code>--end-to-end</code>] mode). 
Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-xs\">\n\n<pre><code>XS:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>Alignment score for second-best alignment. Can be negative. Can be greater than 0 in [<code>--local</code>] mode (but not in [<code>--end-to-end</code>] mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-ys\">\n\n<pre><code>YS:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-xn\">\n\n<pre><code>XN:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-xm\">\n\n<pre><code>XM:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>The number of mismatches in the alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-xo\">\n\n<pre><code>XO:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-xg\">\n\n<pre><code>XG:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-nm\">\n\n<pre><code>NM:i:&lt;N&gt;</code></pre>\n</td>\n<td>\n\n<p>The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. 
Only present if SAM record is for an aligned read.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-yf\">\n\n<pre><code>YF:Z:&lt;S&gt;</code></pre>\n</td><td>\n\n<p>String indicating reason why the read was filtered out. See also: [Filtering]. Only appears for reads that were filtered out.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-yt\">\n\n<pre><code>YT:Z:&lt;S&gt;</code></pre>\n</td><td>\n\n<p>Value of <code>UU</code> indicates the read was not part of a pair. Value of <code>CP</code> indicates the read was part of a pair and the pair aligned concordantly. Value of <code>DP</code> indicates the read was part of a pair and the pair aligned discordantly. Value of <code>UP</code> indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.</p>\n</td></tr>\n<tr><td id=\"hisat-build-opt-fields-md\">\n\n<pre><code>MD:Z:&lt;S&gt;</code></pre>\n</td><td>\n\n<p>A string representation of the mismatched reference bases in the alignment. See <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM</a> format specification for details. Only present if SAM record is for an aligned read.</p>\n</td></tr>\n</table>\n</li>\n</ol>\n<h1 id=\"the-hisat-build-indexer\">The <code>hisat-build</code> indexer</h1>\n<p><code>hisat-build</code> builds a HISAT index from a set of DNA sequences. <code>hisat-build</code> outputs a set of 6 files with suffixes <code>.1.bt2</code>, <code>.2.bt2</code>, <code>.3.bt2</code>, <code>.4.bt2</code>, <code>.rev.1.bt2</code>, and <code>.rev.2.bt2</code>. In the case of a large index these suffixes will have a <code>bt2l</code> termination. These files together constitute the index: they are all that is needed to align reads to that reference. 
The original sequence FASTA files are no longer used by HISAT once the index is built.</p>\n<p>Use of Karkkainen's <a href=\"http://portal.acm.org/citation.cfm?id=1314852\">blockwise algorithm</a> allows <code>hisat-build</code> to trade off between running time and memory usage. <code>hisat-build</code> has three options governing how it makes this trade: <a href=\"#hisat-build-options-p\"><code>-p</code>/<code>--packed</code></a>, <a href=\"#hisat-build-options-bmax\"><code>--bmax</code></a>/<a href=\"#hisat-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>, and <a href=\"#hisat-build-options-dcv\"><code>--dcv</code></a>. By default, <code>hisat-build</code> will automatically search for the settings that yield the best running time without exhausting memory. This behavior can be disabled using the <a href=\"#hisat-build-options-a\"><code>-a</code>/<code>--noauto</code></a> option.</p>\n<p>The indexer provides options pertaining to the &quot;shape&quot; of the index, e.g. <a href=\"#hisat-build-options-o\"><code>--offrate</code></a> governs the fraction of <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> rows that are &quot;marked&quot; (i.e., the density of the suffix-array sample; see the original <a href=\"http://en.wikipedia.org/wiki/FM-index\">FM Index</a> paper for details). All of these options are potentially profitable trade-offs depending on the application. They have been set to defaults that are reasonable for most cases according to our experiments. See <a href=\"#performance-tuning\">Performance tuning</a> for details.</p>\n<p><code>hisat-build</code> can generate either <a href=\"#small-and-large-indexes\">small or large indexes</a>. The wrapper will decide which based on the length of the input genome. 
If the reference does not exceed 4 billion characters but a large index is preferred, the user can specify <a href=\"#hisat-build-options-large-index\"><code>--large-index</code></a> to force <code>hisat-build</code> to build a large index instead.</p>\n<p>The HISAT index is based on the <a href=\"http://en.wikipedia.org/wiki/FM-index\">FM Index</a> of Ferragina and Manzini, which in turn is based on the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> transform. The algorithm used to build the index is based on the <a href=\"http://portal.acm.org/citation.cfm?id=1314852\">blockwise algorithm</a> of Karkkainen.</p>\n<h2 id=\"command-line-1\">Command Line</h2>\n<p>Usage:</p>\n<pre><code>hisat-build [options]* &lt;reference_in&gt; &lt;bt2_base&gt;</code></pre>\n<h3 id=\"main-arguments-1\">Main arguments</h3>\n<table><tr><td>\n\n<pre><code>&lt;reference_in&gt;</code></pre>\n</td><td>\n\n<p>A comma-separated list of FASTA files containing the reference sequences to be aligned to, or, if <a href=\"#hisat-build-options-c\"><code>-c</code></a> is specified, the sequences themselves. E.g., <code>&lt;reference_in&gt;</code> might be <code>chr1.fa,chr2.fa,chrX.fa,chrY.fa</code>, or, if <a href=\"#hisat-build-options-c\"><code>-c</code></a> is specified, this might be <code>GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA</code>.</p>\n</td></tr><tr><td>\n\n<pre><code>&lt;bt2_base&gt;</code></pre>\n</td><td>\n\n<p>The basename of the index files to write. 
By default, <code>hisat-build</code> writes files named <code>NAME.1.bt2</code>, <code>NAME.2.bt2</code>, <code>NAME.3.bt2</code>, <code>NAME.4.bt2</code>, <code>NAME.5.bt2</code>, <code>NAME.6.bt2</code>, <code>NAME.rev.1.bt2</code>, <code>NAME.rev.2.bt2</code>, <code>NAME.rev.5.bt2</code>, and <code>NAME.rev.6.bt2</code> where <code>NAME</code> is <code>&lt;bt2_base&gt;</code>.</p>\n</td></tr></table>\n\n<h3 id=\"options-1\">Options</h3>\n<table><tr><td>\n\n<pre><code>-f</code></pre>\n</td><td>\n\n<p>The reference input files (specified as <code>&lt;reference_in&gt;</code>) are FASTA files (usually having extension <code>.fa</code>, <code>.mfa</code>, <code>.fna</code> or similar).</p>\n</td></tr><tr><td id=\"hisat-build-options-c\">\n\n<pre><code>-c</code></pre>\n</td><td>\n\n<p>The reference sequences are given on the command line. I.e. <code>&lt;reference_in&gt;</code> is a comma-separated list of sequences rather than a list of FASTA files.</p>\n</td></tr>\n<tr><td id=\"hisat-build-options-large-index\">\n\n<pre><code>--large-index</code></pre>\n</td><td>\n\n<p>Force <code>hisat-build</code> to build a <a href=\"#small-and-large-indexes\">large index</a>, even if the reference is less than ~ 4 billion nucleotides long.</p>\n</td></tr>\n<tr><td id=\"hisat-build-options-a\">\n\n<pre><code>-a/--noauto</code></pre>\n</td><td>\n\n<p>Disable the default behavior whereby <code>hisat-build</code> automatically selects values for the <a href=\"#hisat-build-options-bmax\"><code>--bmax</code></a>, <a href=\"#hisat-build-options-dcv\"><code>--dcv</code></a> and <a href=\"#hisat-build-options-p\"><code>--packed</code></a> parameters according to available memory. Instead, the user may specify values for those parameters. 
If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.</p>\n</td></tr><tr><td id=\"hisat-build-options-p\">\n\n<pre><code>-p/--packed</code></pre>\n</td><td>\n\n<p>Use a packed (2-bits-per-nucleotide) representation for DNA strings. This saves memory but makes indexing 2-3 times slower. Default: off. This is configured automatically by default; use <a href=\"#hisat-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"hisat-build-options-bmax\">\n\n<pre><code>--bmax &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. Setting this option overrides any previous setting for <a href=\"#hisat-build-options-bmax\"><code>--bmax</code></a>, or <a href=\"#hisat-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>. Default (in terms of the <a href=\"#hisat-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> parameter) is <a href=\"#hisat-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> 4. This is configured automatically by default; use <a href=\"#hisat-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"hisat-build-options-bmaxdivn\">\n\n<pre><code>--bmaxdivn &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for <a href=\"#hisat-build-options-bmax\"><code>--bmax</code></a>, or <a href=\"#hisat-build-options-bmaxdivn\"><code>--bmaxdivn</code></a>. Default: <a href=\"#hisat-build-options-bmaxdivn\"><code>--bmaxdivn</code></a> 4. 
This is configured automatically by default; use <a href=\"#hisat-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"hisat-build-options-dcv\">\n\n<pre><code>--dcv &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Use <code>&lt;int&gt;</code> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. This is configured automatically by default; use <a href=\"#hisat-build-options-a\"><code>-a</code>/<code>--noauto</code></a> to configure manually.</p>\n</td></tr><tr><td id=\"hisat-build-options-nodc\">\n\n<pre><code>--nodc</code></pre>\n</td><td>\n\n<p>Disable use of the difference-cover sample. Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.</p>\n</td></tr><tr><td>\n\n<pre><code>-r/--noref</code></pre>\n</td><td>\n\n<p>Do not build the <code>NAME.3.bt2</code> and <code>NAME.4.bt2</code> portions of the index, which contain a bitpacked version of the reference sequences and are used for paired-end alignment.</p>\n</td></tr><tr><td>\n\n<pre><code>-3/--justref</code></pre>\n</td><td>\n\n<p>Build only the <code>NAME.3.bt2</code> and <code>NAME.4.bt2</code> portions of the index, which contain a bitpacked version of the reference sequences and are used for paired-end alignment.</p>\n</td></tr><tr><td id=\"hisat-build-options-o\">\n\n<pre><code>-o/--offrate &lt;int&gt;</code></pre>\n</td><td>\n\n<p>To map alignments back to positions on the reference sequences, it's necessary to annotate (&quot;mark&quot;) some or all of the <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> rows with their corresponding location on the genome. 
<a href=\"#hisat-build-options-o\"><code>-o</code>/<code>--offrate</code></a> governs how many rows get marked: the indexer will mark every 2^<code>&lt;int&gt;</code> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 4 (every 16th row is marked; for human genome, annotations occupy about 680 megabytes).</p>\n</td></tr><tr><td>\n\n<pre><code>-t/--ftabchars &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The ftab is the lookup table used to calculate an initial <a href=\"http://en.wikipedia.org/wiki/Burrows-Wheeler_transform\">Burrows-Wheeler</a> range with respect to the first <code>&lt;int&gt;</code> characters of the query. A larger <code>&lt;int&gt;</code> yields a larger lookup table but faster query times. The ftab has size 4^(<code>&lt;int&gt;</code>+1) bytes. The default setting is 10 (ftab is 4MB).</p>\n</td></tr><tr><td id=\"hisat-build-options-localoffrate\">\n\n<pre><code>--localoffrate &lt;int&gt;</code></pre>\n</td><td>\n\n<p>This option governs how many rows get marked in a local index: the indexer will mark every 2^<code>&lt;int&gt;</code> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 3 (every 8th row is marked, this occupies about 16KB per local index).</p>\n</td></tr><tr><td>\n\n<pre><code>--localftabchars &lt;int&gt;</code></pre>\n</td><td>\n\n<p>The local ftab is the lookup table in a local index. 
The default setting is 6 (ftab is 8KB per local index).</p>\n</td></tr><tr><td>\n\n<pre><code>--seed &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Use <code>&lt;int&gt;</code> as the seed for pseudo-random number generator.</p>\n</td></tr><tr><td>\n\n<pre><code>--cutoff &lt;int&gt;</code></pre>\n</td><td>\n\n<p>Index only the first <code>&lt;int&gt;</code> bases of the reference sequences (cumulative across sequences) and ignore the rest.</p>\n</td></tr><tr><td>\n\n<pre><code>-q/--quiet</code></pre>\n</td><td>\n\n<p><code>hisat-build</code> is verbose by default. With this option <code>hisat-build</code> will print only error messages.</p>\n</td></tr><tr><td>\n\n<pre><code>-h/--help</code></pre>\n</td><td>\n\n<p>Print usage information and quit.</p>\n</td></tr><tr><td>\n\n<pre><code>--version</code></pre>\n</td><td>\n\n<p>Print version information and quit.</p>\n</td></tr></table>\n\n<h1 id=\"the-hisat-inspect-index-inspector\">The <code>hisat-inspect</code> index inspector</h1>\n<p><code>hisat-inspect</code> extracts information from a HISAT index about what kind of index it is and what reference sequences were used to build it. When run without any options, the tool will output a FASTA file containing the sequences of the original references (with all non-<code>A</code>/<code>C</code>/<code>G</code>/<code>T</code> characters converted to <code>N</code>s). It can also be used to extract just the reference sequence names using the <a href=\"#hisat-inspect-options-n\"><code>-n</code>/<code>--names</code></a> option or a more verbose summary using the <a href=\"#hisat-inspect-options-s\"><code>-s</code>/<code>--summary</code></a> option.</p>\n<h2 id=\"command-line-2\">Command Line</h2>\n<p>Usage:</p>\n<pre><code>hisat-inspect [options]* &lt;bt2_base&gt;</code></pre>\n<h3 id=\"main-arguments-2\">Main arguments</h3>\n<table><tr><td>\n\n<pre><code>&lt;bt2_base&gt;</code></pre>\n</td><td>\n\n<p>The basename of the index to be inspected. 
The basename is the name of any of the index files but with the <code>.X.bt2</code> or <code>.rev.X.bt2</code> suffix omitted. <code>hisat-inspect</code> first looks in the current directory for the index files, then in the directory specified in the <code>HISAT_INDEXES</code> environment variable.</p>\n</td></tr></table>\n\n<h3 id=\"options-2\">Options</h3>\n<table><tr><td>\n\n<pre><code>-a/--across &lt;int&gt;</code></pre>\n</td><td>\n\n<p>When printing FASTA output, output a newline character every <code>&lt;int&gt;</code> bases (default: 60).</p>\n</td></tr><tr><td id=\"hisat-inspect-options-n\">\n\n<pre><code>-n/--names</code></pre>\n</td><td>\n\n<p>Print reference sequence names, one per line, and quit.</p>\n</td></tr><tr><td id=\"hisat-inspect-options-s\">\n\n<pre><code>-s/--summary</code></pre>\n</td><td>\n\n<p>Print a summary that includes information about index settings, as well as the names and lengths of the input sequences. The summary has this format:</p>\n<pre><code>Colorspace  &lt;0 or 1&gt;\nSA-Sample   1 in &lt;sample&gt;\nFTab-Chars  &lt;chars&gt;\nSequence-1  &lt;name&gt;  &lt;len&gt;\nSequence-2  &lt;name&gt;  &lt;len&gt;\n...\nSequence-N  &lt;name&gt;  &lt;len&gt;</code></pre>\n<p>Fields are separated by tabs. Colorspace is always set to 0 for HISAT.</p>\n</td></tr><tr><td>\n\n<pre><code>-v/--verbose</code></pre>\n</td><td>\n\n<p>Print verbose output (for debugging).</p>\n</td></tr><tr><td>\n\n<pre><code>--version</code></pre>\n</td><td>\n\n<p>Print version information and quit.</p>\n</td></tr><tr><td>\n\n<pre><code>-h/--help</code></pre>\n</td><td>\n\n<p>Print usage information and quit.</p>\n</td></tr></table>\n\n<h1 id=\"getting-started-with-hisat\">Getting started with HISAT</h1>\n<p>HISAT comes with some example files to get you started. 
The example files are not scientifically significant; these files will simply let you start running HISAT and downstream tools right away.</p>\n<p>First follow the manual instructions to <a href=\"#obtaining-hisat\">obtain HISAT</a>. Set the <code>HISAT_HOME</code> environment variable to point to the new HISAT directory containing the <code>hisat</code>, <code>hisat-build</code> and <code>hisat-inspect</code> binaries. This is important, as the <code>HISAT_HOME</code> variable is used in the commands below to refer to that directory.</p>\n<h2 id=\"indexing-a-reference-genome\">Indexing a reference genome</h2>\n<p>To create an index for the genomic region (1 million bps from the human chromosome 22 between 20,000,000 and 20,999,999) included with HISAT, create a new temporary directory (it doesn't matter where), change into that directory, and run:</p>\n<pre><code>$HISAT_HOME/hisat-build $HISAT_HOME/example/reference/22_20-21M.fa 22_20-21M_hisat</code></pre>\n<p>The command should print many lines of output then quit. When the command completes, the current directory will contain ten new files that all start with <code>22_20-21M_hisat</code> and end with <code>.1.bt2</code>, <code>.2.bt2</code>, <code>.3.bt2</code>, <code>.4.bt2</code>, <code>.5.bt2</code>, <code>.6.bt2</code>, <code>.rev.1.bt2</code>, <code>.rev.2.bt2</code>, <code>.rev.5.bt2</code>, and <code>.rev.6.bt2</code>. These files constitute the index - you're done!</p>\n<p>You can use <code>hisat-build</code> to create an index for a set of FASTA files obtained from any source, including sites such as <a href=\"http://genome.ucsc.edu/cgi-bin/hgGateway\">UCSC</a>, <a href=\"http://www.ncbi.nlm.nih.gov/sites/genome\">NCBI</a>, and <a href=\"http://www.ensembl.org/\">Ensembl</a>. When indexing multiple FASTA files, specify all the files using commas to separate file names. 
For more details on how to create an index with <code>hisat-build</code>, see the <a href=\"#the-hisat-build-indexer\">manual section on index building</a>. You may also want to bypass this process by obtaining a pre-built index.</p>\n<h2 id=\"aligning-example-reads\">Aligning example reads</h2>\n<p>Stay in the directory created in the previous step, which now contains the <code>22_20-21M_hisat</code> index files. Next, run:</p>\n<pre><code>$HISAT_HOME/hisat -x 22_20-21M_hisat -U $HISAT_HOME/example/reads/reads_1.fq -S eg1.sam</code></pre>\n<p>This runs the HISAT aligner, which aligns a set of unpaired reads to the genome region using the index generated in the previous step. The alignment results in SAM format are written to the file <code>eg1.sam</code>, and a short alignment summary is written to the console. (Actually, the summary is written to the &quot;standard error&quot; or &quot;stderr&quot; filehandle, which is typically printed to the console.)</p>\n<p>To see the first few lines of the SAM output, run:</p>\n<pre><code>head eg1.sam</code></pre>\n<p>You will see something like this:</p>\n<pre><code>@HD VN:1.0   SO:unsorted\n@SQ SN:22:20000000-20999999 LN:1000000\n@PG ID:hisat            PN:hisat    VN:0.1.0\n1   0               22:20000000-20999999    4115    255 100M            *   0   0   GGAGCGCAGCGTGGGCGGCCCCGCAGCGCGGCCTCGGACCCCAGAAGGGCTTCCCCGGGTCCGTTGGCGCGCGGGGAGCGGCGTTCCCAGGGCGCGGCGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n2   16              22:20000000-20999999    4197    255 100M            *   0   0   GTTCCCAGGGCGCGGCGCGGTGCGGCGCGGCGCGGGTCGCAGTCCACGCGGCCGCAACTCGGACCGGTGCGGGGGCCGCCCCCTCCCTCCAGGCCCAGCG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n3   0               22:20000000-20999999    4113    255 100M       
     *   0   0   CTGGAGCGCAGCGTGGGCGGCCCCGCAGCGCGGCCTCGGACCCCAGAAGGGCTTCCCCGGGTCCGTTGGCGCGCGGGGAGCGGCGTTCCCAGGGCGCGGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n4   0               22:20000000-20999999    52358   255 100M            *   0   0   TTCAGGGTCTGCCTTTATGCCAGTGAGGAGCAGCAGAGTCTGATACTAGGTCTAGGACCGGCCGAGGTATACCATGAACATGTGGATACACCTGAGCCCA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n5   16              22:20000000-20999999    52680   255 100M            *   0   0   CTTCTGGCCAGTAGGTCTTTGTTCTGGTCCAACGACAGGAGTAGGCTTGTATTTAAAAGCGGCCCCTCCTCTCCTGTGGCCACAGAACACAGGCGTGCTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n6   16              22:20000000-20999999    52664   255 100M            *   0   0   TCTCACCTCTCATGTGCTTCTGGCCAGTAGGTCTTTGTTCTGGTCCAACGACAGGAGTAGGCTTGTATTTAAAAGCGGCCCCTCCTCTCCTGTGGCCACA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n7   0               22:20000000-20999999    52468   255 100M            *   0   0   TGTACACAGGCACTCACATGGCACACACATACACTCCTGCGTGTGCACAAGCACACACATGCAAGCCATATACATGGACACCGACACAGGCACATGTACG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU\n8   0               22:20000000-20999999    4538    255 100M            *   0   0   CGGCCCCGCACCTGCCCGAACCTCTGCGGCGGCGGTGGCAGGGTACGCGGGACCGCTCCCTCCCAGCCGACTTACGAGAACATCCCCCGACCATCCAGCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 
YT:Z:UU\n9   16              22:20000000-20999999    4667    255 50M19567N50M    *   0   0   CTTCCCCGGACTCTGGCCGCGTAGCCTCCGCCACCACTCCCAGTTCACAGACCTCGCGACCTGTGTCAGCAGAGCCGCCCTGCACCACCATGTGCATCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:+\n10  0               22:20000000-20999999    30948   255 20M9021N80M     *   0   0   CAACAACGAGATCCTCAGTGGGCTGGACATGGAGGAAGGCAAGGAAGGAGGCACATGGCTGGGCATCAGCACACGTGGCAAGCTGGCAGCACTCACCAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:+\n11  16              22:20000000-20999999    40044   255 65M8945N35M     *   0   0   TGGCAAGCTGGCAGCACTCACCAACTACCTGCAGCCGCAGCTGGACTGGCAGGCCCGAGGGCGAGGCACCTACGGGCTGAGCAACGCGCTGCTGGAGACT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:+</code></pre>\n<p>The first few lines (beginning with <code>@</code>) are SAM header lines, and the rest of the lines are SAM alignments, one line per read or mate. 
See the <a href=\"#sam-output\">HISAT manual section on SAM output</a> and the <a href=\"http://samtools.sourceforge.net/SAM1.pdf\">SAM specification</a> for details about how to interpret the SAM file format.</p>\n<h2 id=\"paired-end-example\">Paired-end example</h2>\n<p>To align paired-end reads included with HISAT, stay in the same directory and run:</p>\n<pre><code>$HISAT_HOME/hisat -x 22_20-21M_hisat -1 $HISAT_HOME/example/reads/reads_1.fq -2 $HISAT_HOME/example/reads/reads_2.fq -S eg2.sam</code></pre>\n<p>This aligns a set of paired-end reads to the reference genome, with results written to the file <code>eg2.sam</code>.</p>\n<h2 id=\"using-samtoolsbcftools-downstream\">Using SAMtools/BCFtools downstream</h2>\n<p><a href=\"http://samtools.sourceforge.net\">SAMtools</a> is a collection of tools for manipulating and analyzing SAM and BAM alignment files. <a href=\"http://samtools.sourceforge.net/mpileup.shtml\">BCFtools</a> is a collection of tools for calling variants and manipulating VCF and BCF files, and it is typically distributed with <a href=\"http://samtools.sourceforge.net\">SAMtools</a>. Using these tools together allows you to get from alignments in SAM format to variant calls in VCF format. This example assumes that <code>samtools</code> and <code>bcftools</code> are installed and that the directories containing these binaries are in your <a href=\"http://en.wikipedia.org/wiki/PATH_(variable)\">PATH environment variable</a>.</p>\n<p>Run the paired-end example:</p>\n<pre><code>$HISAT_HOME/hisat -x $HISAT_HOME/example/index/22_20-21M_hisat -1 $HISAT_HOME/example/reads/reads_1.fq -2 $HISAT_HOME/example/reads/reads_2.fq -S eg2.sam</code></pre>\n<p>Use <code>samtools view</code> to convert the SAM file into a BAM file. BAM is the binary format corresponding to the SAM text format. 
Run:</p>\n<pre><code>samtools view -bS eg2.sam &gt; eg2.bam</code></pre>\n<p>Use <code>samtools sort</code> to convert the BAM file to a sorted BAM file.</p>\n<pre><code>samtools sort eg2.bam eg2.sorted</code></pre>\n<p>We now have a sorted BAM file called <code>eg2.sorted.bam</code>. Sorted BAM is a useful format because the alignments are (a) compressed, which is convenient for long-term storage, and (b) sorted, which is convenient for variant discovery. To generate variant calls in VCF format, run:</p>\n<pre><code>samtools mpileup -uf $HISAT_HOME/example/reference/22_20-21M.fa eg2.sorted.bam | bcftools view -bvcg - &gt; eg2.raw.bcf</code></pre>\n<p>Then to view the variants, run:</p>\n<pre><code>bcftools view eg2.raw.bcf</code></pre>\n<p>See the official SAMtools guide to <a href=\"http://samtools.sourceforge.net/mpileup.shtml\">Calling SNPs/INDELs with SAMtools/BCFtools</a> for more details and variations on this process.</p>\n"
  },
  {
    "path": "doc/manual.shtml",
    "content": "<!--#set var=\"Title\" value=\"Centrifuge\" -->\n<!--#set var=\"NoCrumbs\" value=\"1\" -->\n<!--#set var=\"SubTitle\" value=\"Classifier for metagenomic sequences\"-->\n<!--#set var=\"ExtraCSS\" value=\"/software/centrifuge/add.css\"-->\n<!--#include virtual=\"/iheader_r.shtml\"-->\n<div id=\"mainContent\">\n  <div id=\"main\">\n  \n     <div id=\"rightside\">\n\n <!--  #  set var=\"BwtIndexes\" value=\"1\" -->\n <!--#include virtual=\"sidebar.inc.shtml\"-->\n          \n\t</div> <!-- End of \"rightside\" -->\n\n\t\n  <div id=\"leftside\">\n  <h1>Table of Contents</h1>\n <!--#include virtual=\"manual.inc.html\"-->\n  </div>\n  </div>\n</div>\n\n<!--#include virtual=\"footer.inc.html\"-->\n\n<!-- Google analytics code -->\n<script type=\"text/javascript\">\nvar gaJsHost = ((\"https:\" == document.location.protocol) ? \"https://ssl.\" : \"http://www.\");\ndocument.write(unescape(\"%3Cscript src='\" + gaJsHost + \"google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E\"));\n</script>\n<script type=\"text/javascript\">\nvar pageTracker = _gat._getTracker(\"UA-6101038-1\");\npageTracker._trackPageview();\n</script>\n\n</body>\n</html>\n"
  },
  {
    "path": "doc/sidebar.inc.shtml",
    "content": "<h2>Site Map</h2>\n<div class=\"box\">\n <ul>\n   <li><a href=\"index.shtml\">Home</a></li>\n   <li><a href=\"manual.shtml\">Manual</a></li>\n   <li><a href=\"faq.shtml\">FAQ</a></li>\n </ul>\n</div>\n\n<!--\n<h2>News and Updates</h2>\n<div class=\"box\">\n <ul>\n   <table width=\"100%\">\n\t <tbody><tr><td>New releases and related tools will be announced through the Bowtie\n     <a href=\"https://lists.sourceforge.net/lists/listinfo/bowtie-bio-announce\"><b>mailing list</b></a>.</td></tr>\n   </tbody></table>\n </ul>\n</div>\n-->\n\n<h2>Getting Help</h2>\n<div class=\"box\">\n <ul>\n   <table width=\"100%\">\n     <tbody><tr><td>\n        Please submit an issue on <a href=\"https://github.com/infphilo/centrifuge/issues\">GitHub</a>, or send an E-Mail to\n        <a href=\"mailto:centrifuge.metagenomics@gmail.com\">centrifuge.metagenomics@gmail.com</a> for private communications.</td></tr>\n   </tbody></table>\n </ul>\n</div>\n\n<a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads\"><h2><u>Releases</u></h2></a>\n<div class=\"box\">\n <ul>\n   <table width=\"100%\"><tbody><tr><td>version 1.0.3-beta</td> <td align=\"right\">12/06/2016</td></tr>\n       <tr>\n\t <td><a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-source.zip\" onclick=\"javascript: pageTracker._trackPageview('/downloads/centrifuge'); \">&nbsp;&nbsp;&nbsp;Source code</a></td>\n       </tr>\n       <tr>\n\t <td><a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-Linux_x86_64.zip\" onclick=\"javascript: pageTracker._trackPageview('/downloads/centrifuge'); \">&nbsp;&nbsp;&nbsp;Linux x86_64 binary</a></td>\n       </tr>\n       <tr>\n\t <td><a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-OSX_x86_64.zip\" onclick=\"javascript: pageTracker._trackPageview('/downloads/centrifuge'); \">&nbsp;&nbsp;&nbsp;Mac OS X x86_64 binary</a></td>\n       </tr>\n   
</tbody></table>\n </ul>\n</div>\n\n<a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data\"><h2><u>Indexes</u></h2></a> \n  <div class=\"box\">\n    <table width=\"100%\"><tr><td>last updated:</td> <td align=\"right\">12/06/2016</td></tr>\n      <tr>\n        <td>\n\t  <a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz\"><i>&nbsp;&nbsp;&nbsp;Bacteria, Archaea (compressed)</i></a>\n        </td>\n\t<td align=\"right\" style=\"font-size: x-small\">\n\t  <b>4.4 GB</b>\n        </td>\n      </tr>\n      <tr>\n        <td>\n\t  <a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz\"><i>&nbsp;&nbsp;&nbsp;Bacteria, Archaea, Viruses, Human (compressed)</i></a>\n        </td>\n\t<td align=\"right\" style=\"font-size: x-small\">\n\t  <b>5.4 GB</b>\n        </td>\n      </tr>\n      <tr>\n        <td>\n\t  <a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz\"><i>&nbsp;&nbsp;&nbsp;Bacteria, Archaea, Viruses, Human </i></a>\n        </td>\n\t<td align=\"right\" style=\"font-size: x-small\">\n\t  <b>7.9 GB</b>\n        </td>\n      </tr>\n      <tr>\n        <td>\n\t  <a href=\"ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz\"><i>&nbsp;&nbsp;&nbsp;NCBI nucleotide non-redundant sequences </i></a>\n        </td>\n\t<td align=\"right\" style=\"font-size: x-small\">\n\t  <b>50 GB</b>\n        </td>\n      </tr>\n    </table>\n  </div>\n\n<h2>Related Tools</h2>\n<div class=\"box\">\n <ul>\n   <li><a href=\"https://github.com/fbreitwieser/pavian\">Pavian</a>: Tool for interactive analysis of pathogen and metagenomics data</li>\n   <li><a href=\"http://www.ccb.jhu.edu/software/hisat2\">HISAT2</a>: Graph-based alignment to a population of genomes</li>\n   <li><a href=\"http://bowtie-bio.sourceforge.net/bowtie2\">Bowtie2</a>: Ultrafast read alignment</li>\n </ul>\n</div>\n\n<h2>Publications</h2>\n<div class=\"box\">\n <ul>\n   <li><p>Kim D, Song L, Breitwieser FP, and Salzberg SL. 
<a href=\"http://genome.cshlp.org/content/early/2016/11/16/gr.210641.116.abstract\"><b>Centrifuge: rapid and sensitive classification of metagenomic sequences</b></a>. <i>Genome Research</i> 2016</p></li>\n </ul>\n</div>\n\n<h2>Contributors</h2>\n<div class=\"box\">\n <ul>\n   <li><a href=\"http://www.ccb.jhu.edu/people/infphilo\">Daehwan Kim</a></li>\n   <li><a href=\"http://ccb.jhu.edu/people/lsong/\">Li Song</a></li>\n   <li><a href=\"http://www.ccb.jhu.edu/people/fbreitwieser\">Florian Breitwieser</a></li>\n   <li><a href=\"http://salzberg-lab.org/about-me/\">Steven Salzberg</a></li>\n </ul>\n</div>\n\n<h2>Links</h2>\n<div class=\"box\">\n <ul>\n   <li><a href=\"http://www.ccb.jhu.edu/\">Center for Computational Biology at Johns Hopkins University </a></li>\n   <li><a href=\"http://www.cs.jhu.edu/\">Computer Science Department at Johns Hopkins University </a></li>\n </ul>\n</div>        \n"
  },
  {
    "path": "doc/strip_markdown.pl",
    "content": "#!/usr/bin/env perl -w\n\n##\n# strip_markdown.pl\n#\n# Used to convert MANUAL.markdown to MANUAL.  Leaves all manual content, but\n# strips away some of the clutter that makes it hard to read the markdown.\n#\n\nuse strict;\nuse warnings;\n\nmy $lastBlank = 0;\n\nwhile(<>) {\n\t# Skip comments\n\tnext if /^\\s*<!--/;\n\tnext if /^\\s*!/;\n\tnext if /^\\s*-->/;\n\t# Skip internal links\n\tnext if /\\[.*\\]: #/;\n\t# Skip HTML\n\tnext if /^\\s?\\s?\\s?<.*>\\s*$/;\n\t# Skip HTML\n\tnext if /^\\s*<table/;\n\tnext if /^\\s*<\\/td/;\n\tnext if /^\\s*<.*>\\s*$/;\n\t# Strip [`...`]\n\ts/\\[`/`/g;\n\ts/`\\]/`/g;\n\t# Strip [#...]\n\t#s/\\[#[^\\]]*\\]//g;\n\t# Strip (#...)\n\ts/\\(#[^\\)]*\\)//g;\n\t# Turn hashes into spaces\n\t#s/^####/   /;\n\t#s/^###/ /;\n\tif(/^\\s*$/) {\n\t\tnext if $lastBlank;\n\t\t$lastBlank = 1;\n\t} else {\n\t\t$lastBlank = 0;\n\t}\n\tprint $_;\n}\n"
  },
  {
    "path": "doc/style.css",
    "content": "/* \nStylesheet for the free sNews15_1 template\nfrom http://www.free-css-templates.com\n*/\n\n/* Reset all margins and paddings for browsers */\n* { \n\tpadding: 0;\n\tmargin: 0;\n}\n\nbody { \n\tfont: .8em Verdana, Arial, Sans-Serif; \n\tline-height: 1.6em; \n\tmargin: 0;\n\t/* background-image: url(../images/bg.jpg); */\n\t/* background-repeat: repeat */\n}\n\n#wrap {\tmargin: 0 auto;\twidth: 95% }\n\n/* TOP HEADER -------- */\n#top {\n\tmargin: 0 auto;\n\tpadding: 0;\n\tbackground:#1E6BAC url(../images/ccbstrip.jpg) repeat-x top;\n\theight: 141px;\n}\n#top h1 { padding: 10px 0 0 25px; color: #FFF; font-size: 240%; background: transparent;}\n#top h2 { padding: 0px 0 0 25px; color: #bbb; font-size: 100%; background: transparent;}\n#top .padding { padding-top: 5px; }\n/*\n#top .lefts { \n\tbackground: transparent url(../images/topl.jpg) no-repeat left; \n\theight: 81px; \n}\n#top .rights {\n\tbackground: transparent url(../images/topr.jpg) no-repeat right;\n\tfloat: right;\n\theight: 81px;\n\twidth: 18px;\n}\n*/\n/* SEARCH BOX AND BUTTON ----------*/\n#search { float: right;  padding: 10px 25px 0 0;  }\n\n#search input.text { \n\tborder: 1px solid #eee;\n\tdisplay: inline;\n\tmargin-top: 5px;\n\twidth: 120px;\n\theight: 12px;\n\tfont-size: 10px;\n }\n #search input.searchbutton {\n\tborder: 0;\n\tbackground: transparent;\n\tcolor: #FFF;\n\tcursor: pointer;\n\tfont: bold 0.8em Arial, Arial, Sans-Serif\n }\n\n#subheader { \n\tclear: both; \n\tborder-top: 1px dotted #888;\t\n\tborder-bottom: 1px dotted #888;\n\tbackground: #eaeaea;\n\tcolor: #505050;\n\tpadding: 1em;\n\tmargin: 15px 0px 10px 0px;\n\t\n}\n#subheader a { text-decoration: none; /* border-bottom: 1px dashed #0066B3; */ } \n \n \n/* TOP MENU ---------- */\n#topmenu {  \tmargin: 0px 8px 0 8px; \n\t\t\tpadding: 0;\n\t\t\tbackground: url(../images/menu.jpg) repeat-x top;\n\t\t\theight: 30px;\n\t\t\t\n}\n#topmenu .lefts { \n\tbackground: url(../images/menul.jpg) no-repeat left; \n\theight: 
30px; \n\tpadding-left: 0px;\n}\n#topmenu .rights {\n\tbackground: url(../images/menur.jpg) no-repeat right;\n\tfloat: right;\n\theight: 30px;\n\twidth: 8px;\n}\n#topmenu li a { \n\tcolor: #FFF;\n\ttext-align: left;\n\tpadding-left: 10px;\n\tpadding-right: 15px;\n\ttext-decoration: none;\n\tbackground: transparent;\n\tfont-weight: bold\n} \n#topmenu li { padding: 0px;\n\tfloat: left;\n\tmargin: 0;\n\tfont-size: 11px;\n\tline-height: 30px;\n\twhite-space: nowrap;\n\t/* list-style-type: none; */\n\twidth: auto;\n\tbackground: url(../images/sep.gif) no-repeat top right\n\t\n}\n\n#main { background: #FFF; margin: 25px 0 15px 0; color: #666; }\n\n#main #rightside {\n\twidth: 300px;\n\tfloat: right;\n\tbackground: #FFF;\n\tmargin-right: 0px;\n\tcolor: #555;\n\t\n} \n\n#main #rightside .box {\n\tbackground: #efefef;\n\tmargin-bottom: 10px;\n\tpadding: 5px;\n\tcolor: #555;\n}\n\n#main #rightside h2 {\n\tfont: bold 1.0em Arial, Arial, Sans-Serif; \n    background: #CDCDCD url(../images/greyc.gif) no-repeat top right;\n\theight: 18px;\n\tpadding: 3px;\n\tcolor: #666;\n}\n\n/* LEFT SIDE - ARTICLES AREA -------- */\n#leftside {\n\tpadding-left: 8px;\n\tcolor: #555;\n\tbackground: #FFF;\n\tmargin-right: 255px;\n\tmargin-left: 0px;\n\t\n}\n\n#manual {\n\tmargin-right: 305px;\n\tmargin-left: 0px;\n\twidth: auto;\n}\n\n#leftside h1 { padding: 15px 0 10px 0 }\n#leftside h2 { padding: 15px 0 10px 0; color: #555; text-indent: 17px; background: #FFF url(../images/head.gif) no-repeat left; }\n#leftside h3 { padding: 15px 0 10px 0; font-size: 100%; margin-left: 5px; text-indent: 17px; background: #FFF url(../images/head.gif) no-repeat left; }\n#leftside ul { margin-left: 24px; padding-left 24px; list-style-type: circle }\n#leftside li { }\n#leftside p { padding: 0px 0 10px 0 }\n\n#footer {\n\tclear: both;\n\tbackground: #FFF url(../images/footer.jpg) repeat-x;\n\theight: 46px;\n\tmargin-left: 0px;\n\tmargin-right: 0px;\n\tfont-size: 75%;\n\tcolor: #666;\n}\n#footer p  { padding: 5px 
}\n#footer .rside { float: right; display: inline; padding: 5px; text-align: right}\n\n#toc ol { list-style: roman }\n\na { color: #0066B3; background: inherit; text-decoration: none }\nh1 { font: bold 1.9em Arial, Arial, Sans-Serif }\nh2 { font: bold 1.2em Arial, Arial, Sans-Serif; padding: 0; margin: 0 }\nul { padding: 0; margin: 0; list-style-type: none }\nli {  }\nol { margin-left: 24px;\n     padding-left 24px;\n     list-style: decimal }\n/* blockquote { margin-left: 35px; font-family: \"Courier New\", Courier, monospace; } */\nblockquote { margin-left: 35px; font-family: \"Courier New\", Courier; }\ntt { font-family: \"Courier New\", Courier, monospace; }\n.date { border-top: 1px solid #e5e5e5; text-align: right; margin-bottom: 25px; margin-top: 5px;}\n#main #leftside .date a, #main #rightside a { border: 0; text-decoration: none; }\n \n.comment .date { text-align: left; border: 0;}\t\n\n\n#breadcrumbs { \n\tfloat: left;\n\tpadding-left: 8px;\n\tpadding-top: 0px;\n\tfont: bold .8em Arial, Arial, Sans-Serif; \n\tcolor: #666;\n\twidth: 100%;\n\theight: 25px;\n\tmargin-top: 10px;\n\tmargin-bottom: 10px;\n\tclear: both;\n}\n\n\n\n#leftside #txt {width: 100%; height: 10em; padding: 3px 3px 3px 6px; margin-left:0em;}\n#leftside textarea { border: 1px solid #bbb; width: 100%;  }\n\n\n/* SNEWS */\n#main #leftside fieldset { float: left; width: 100%; border: 1px solid #ccc; padding: 10px 8px; margin: 0 10px 8px 0; background: #FFF; color: #000; }\n#main #leftside fieldset p { width: 100%; }\n#main input { padding: 3px; margin: 0; border: 1px solid #bbb }\n/*p { margin-top: 5px; }*/\np { margin-top: 10px; }\n/*input.search { border: 1px solid #ccc; padding: 4px; width: 160px; }*/\n.comment { background: #FFF; color: #808080; padding: 10px; margin: 0 0 10px 0; border-top: 1px solid #ccc; }\n.commentsbox { background: #FFF; color: #808080; padding: 10px; margin: 0 0 10px 0; border-top: 1px solid #ccc; }\n\n\n#box-table-a\n{\n\tfont-family: .8em Verdana, Arial, 
Sans-Serif; \n\t/*font-size: 12px;*/\n\tmargin: 45px;\n\twidth: 600px;\n\ttext-align: left;\n\tborder-collapse: collapse;\n}\n#box-table-a th\n{\n\tfont-size: 13px;\n\tfont-weight: normal;\n\tpadding: 8px;\n\tbackground: #b9c9fe;\n\tborder-top: 4px solid #aabcfe;\n\tborder-bottom: 1px solid #fff;\n\tcolor: #039;\n}\n#box-table-a td\n{\n\tpadding: 8px;\n\tbackground: #e8edff; \n\tborder-bottom: 2px solid #fff;\n\tcolor: #669;\n\tborder-top: 2px solid transparent;\n}\n#box-table-a tr:hover td\n{\n\tbackground: #d0dafd;\n\tcolor: #339;\n}\n\n\n#box-table-b\n{\n\tfont-family: .8em Verdana, Arial, Sans-Serif; \n\t/*font-size: 12px;*/\n\tmargin: 45px;\n\twidth: 480px;\n\ttext-align: center;\n\tborder-collapse: collapse;\n\tborder-top: 7px solid #9baff1;\n\tborder-bottom: 7px solid #9baff1;\n}\n#box-table-b th\n{\n\tfont-size: 13px;\n\tfont-weight: normal;\n\tpadding: 8px;\n\tbackground: #e8edff;\n\tborder-right: 1px solid #9baff1;\n\tborder-left: 1px solid #9baff1;\n\tcolor: #039;\n}\n#box-table-b td\n{\n\tpadding: 8px;\n\tbackground: #e8edff; \n\tborder-right: 1px solid #aabcfe;\n\tborder-left: 1px solid #aabcfe;\n\tcolor: #669;\n}\n\n#manual h1  { margin: 0 15px 10px 15px; padding: 10px 0 10px 0; font: bold 1.9em Arial, Arial, Sans-Serif }\n#manual h2  { margin: 0 15px 10px 15px; padding: 10px 0 10px 0; font: bold 1.2em Arial, Arial, Sans-Serif }\n#manual h3  { margin: 0 15px 10px 20px; padding: 10px 0 10px 0; font: 1.2em Arial, Arial, Sans-Serif }\n#manual h4  { margin: 0 15px 10px 25px; padding: 10px 0 10px 0; font: 1.1em Arial, Arial, Sans-Serif }\n#manual p   { margin: 0 15px 10px 15px; color: #444 }\n#manual table { margin-top: 15px }\n#manual ul  { margin: 0 15px 10px 15px; padding: 0; margin: 0 }\n#manual pre { margin: 0 15px 15px 25px }\n#manual li  { margin: 0 15px 1px 15px; color: #444 }\n#manual ol  { margin-left: 24px; padding-left 24px; list-style: decimal }\n#manual td  { vertical-align: top; }\n#manual blockquote { margin-left: 35px; font-family: 
\"Courier New\", Courier; }\n#manual tt { font: .8em; font-family: \"Courier New\", Courier; }\n#manual code { font: .8em; font-family: \"Courier New\", Courier; }\n#manual .date { border-top: 1px solid #e5e5e5; text-align: right; margin-bottom: 25px; margin-top: 5px;}\n#manual .date a, #main #rightside a { border: 0; text-decoration: none; }\n#manual .date a, #main #rightside a { border: 0; text-decoration: none; }\n#manual td { vertical-align: top; }\n"
  },
  {
    "path": "dp_framer.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"dp_framer.h\"\n\nusing namespace std;\n\n/**\n * Set up variables that describe the shape of a dynamic programming matrix to\n * be filled in.  The matrix is built around the diagonal containing the seed\n * hit: the \"seed diagonal\".  The N diagonals to the right of the seed diagonal\n * are the \"RHS gap\" diagonals, where N is the maximum number of read or\n * reference gaps permitted (whichever is larger).  The N diagonals to the left\n * of the seed diagonal are the \"LHS gap\" diagonals.\n *\n * The way the rectangle is currently formulated, there are another N diagonals\n * to the left of the \"LHS gap\" diagonals called the \"LHS extra diagonals\".  
It\n * might also be possible to split the \"extra diagonals\" into two subsets and\n * place them both to the left of the LHS gap diagonals and to the right of the\n * RHS gap diagonals.\n *\n * The purpose of arranging these groupings of diagonals is that a subset\n * of them, the \"core diagonals\", can now be considered \"covered.\"  By\n * \"covered\" I mean that any alignment that overlaps a cell in any of the core\n * diagonals cannot possibly overlap another, higher-scoring alignment that\n * falls partially outside the rectangle.\n *\n * Say the read is 5 characters long, the maximum number of read or ref gaps is\n * 2, and the seed hit puts the main diagonal at offset 10 in the reference.\n * The larger rectangle explored looks like this:\n *\n *  off=10, maxgap=2\n *\n * Ref      1\n * off: 67890123456   0: seed diagonal\n *      **OO0oo++----   o: \"RHS gap\" diagonals\n *      -**OO0oo++---   O: \"LHS gap\" diagonals\n *      --**OO0oo++--   *: \"LHS extra\" diagonals\n *      ---**OO0oo++-   +: \"RHS extra\" diagonals\n *      ----**OO0oo++   -: cells that can't possibly be involved in a valid    \n *                         alignment that overlaps one of the core diagonals\n *\n * The \"core diagonals\" are marked with 0's, O's or o's.\n *\n * A caveat is that, for performance reasons, we place an upper limit on N -\n * the maximum number of read or reference gaps.  It is constrained to be no\n * greater than 'maxhalf'.  This means that in some situations, we may report an\n * alignment that spuriously trumps a better alignment that falls partially\n * outside the rectangle.  
Also, we may fail to find a valid alignment with\n * more than 'maxgap' gaps.\n *\n * Another issue is trimming: if the seed hit is sufficiently close to one or\n * both ends of the reference sequence, and either (a) overhang is not\n * permitted, or (b) the number of Ns permitted is less than the number of\n * columns that overhang the reference, then we want to exclude the trimmed\n * columns from the rectangle.\n *\n * We need to return enough information so that downstream routines can fully\n * understand the shape of the rectangle, which diagonals are which (esp. which\n * are the \"core\" diagonals, since we needn't examine any more seed hits from\n * those columns in the future), and how the rectangle is trimmed.  The\n * information returned should be compatible with the sort of information\n * returned by the routines that set up rectangles for mate finding.\n */\nbool DynProgFramer::frameSeedExtensionRect(\n\tint64_t  off,      // ref offset implied by seed hit assuming no gaps\n\tsize_t   rdlen,    // length of read sequence used in DP table (so len\n\t                   // of +1 nucleotide sequence for colorspace reads)\n\tint64_t  reflen,   // length of reference sequence aligned to\n\tsize_t   maxrdgap, // max # of read gaps permitted in opp mate alignment\n\tsize_t   maxrfgap, // max # of ref gaps permitted in opp mate alignment\n\tint64_t  maxns,    // # Ns permitted\n\tsize_t   maxhalf,  // max width in either direction\n\tDPRect&  rect)     // out: DP rectangle\n{\n\tassert_gt(rdlen, 0);\n\tassert_gt(reflen, 0);\n\t// Set N, the maximum number of reference or read gaps permitted, whichever\n\t// is larger.  
Also, enforce ceiling: can't be larger than 'maxhalf'.\n\tsize_t maxgap = max(maxrdgap, maxrfgap);\n\tmaxgap = min(maxgap, maxhalf);\n\t// Leave room for \"LHS gap\" and \"LHS extra\" diagonals\n\tint64_t refl = off - 2 * maxgap;               // inclusive\n\t// Leave room for \"RHS gap\" and \"RHS extra\" diagonals\n\tint64_t refr = off + (rdlen - 1) + 2 * maxgap; // inclusive\n\tsize_t triml = 0, trimr = 0;\n\t// Check if we have to trim to fit the extents of the reference\n\tif(trimToRef_) {\n\t\tmaxns = 0; // no leeway\n\t} else if(maxns == (int64_t)rdlen) {\n\t\tmaxns--;\n\t}\n\t// Trim from RHS of rectangle\n\tif(refr >= reflen + maxns) {\n\t\ttrimr = (size_t)(refr - (reflen + maxns - 1));\n\t}\n\t// Trim from LHS of rectangle\n\tif(refl < -maxns) {\n\t\ttriml = (size_t)(-refl) - (size_t)maxns;\n\t}\n\trect.refl_pretrim = refl;\n\trect.refr_pretrim = refr;\n\trect.refl  = refl + triml;\n\trect.refr  = refr - trimr;\n\trect.triml = triml;\n\trect.trimr = trimr;\n\trect.maxgap = maxgap;\n\t// Remember which diagonals are \"core\" as offsets from the LHS of the\n\t// untrimmed rectangle\n\trect.corel = maxgap;\n\trect.corer = rect.corel + 2 * maxgap; // inclusive\n\tassert(rect.repOk());\n\treturn !rect.entirelyTrimmed();\n}\n\n/**\n * Set up variables that describe the shape of a dynamic programming matrix to\n * be filled in.  The matrix is built around the diagonals that terminate in\n * the range of columns where the RHS of the opposite mate must fall in order\n * to satisfy the fragment-length constraint.  These are the \"mate\" diagonals\n * and they also happen to be the \"core\" diagonals in this case.\n *\n * The N diagonals to the right of the mate diagonals are the \"RHS gap\"\n * diagonals, where N is the maximum number of read or reference gaps permitted\n * (whichever is larger).  
The N diagonals to the left of the mate diagonals\n * are the \"LHS gap\" diagonals.\n *\n * The purpose of arranging these groupings of diagonals is that a subset\n * of them, the \"core diagonals\", can now be considered \"covered.\"  By\n * \"covered\" I mean that any alignment that overlaps a cell in any of the core\n * diagonals cannot possibly overlap another, higher-scoring alignment that\n * falls partially outside the rectangle.\n *\n *   |Anchor| \n *   o---------OO0000000000000oo------  0: mate diagonal (also core diags!)\n *   -o---------OO0000000000000oo-----  o: \"RHS gap\" diagonals\n *   --o---------OO0000000000000oo----  O: \"LHS gap\" diagonals\n *   ---oo--------OO0000000000000oo---  *: \"LHS extra\" diagonals\n *   -----o--------OO0000000000000oo--  -: cells that can't possibly be\n *   ------o--------OO0000000000000oo-     involved in a valid alignment that\n *   -------o--------OO0000000000000oo     overlaps one of the core diagonals\n *                     XXXXXXXXXXXXX\n *                     | RHS Range |\n *                     ^           ^\n *                     rl          rr\n *\n * The \"core diagonals\" are marked with 0s.\n *\n * A caveat is that, for performance reasons, we place an upper limit on N -\n * the maximum number of read or reference gaps.  It is constrained to be no\n * greater than 'maxgap'.  This means that in some situations, we may report an\n * alignment that spuriously trumps a better alignment that falls partially\n * outside the rectangle.  
Also, we may fail to find a valid alignment with\n * more than 'maxgap' gaps.\n *\n * Another issue is trimming: if the seed hit is sufficiently close to one or\n * both ends of the reference sequence, and either (a) overhang is not\n * permitted, or (b) the number of Ns permitted is less than the number of\n * columns that overhang the reference, then we want to exclude the trimmed\n * columns from the rectangle.\n */\nbool DynProgFramer::frameFindMateAnchorLeftRect(\n\tint64_t ll,       // leftmost Watson off for LHS of opp alignment\n\tint64_t lr,       // rightmost Watson off for LHS of opp alignment\n\tint64_t rl,       // leftmost Watson off for RHS of opp alignment\n\tint64_t rr,       // rightmost Watson off for RHS of opp alignment\n\tsize_t  rdlen,    // length of opposite mate\n\tint64_t reflen,   // length of reference sequence aligned to\n\tsize_t  maxrdgap, // max # of read gaps permitted in opp mate alignment\n\tsize_t  maxrfgap, // max # of ref gaps permitted in opp mate alignment\n\tint64_t maxns,    // max # ns permitted in the alignment\n\tsize_t  maxhalf,  // max width in either direction\n\tDPRect& rect)     // out: DP rectangle\n\tconst\n{\n\tassert_geq(lr, ll);  // LHS rightmost must be >= LHS leftmost\n\tassert_geq(rr, rl);  // RHS rightmost must be >= RHS leftmost\n\tassert_geq(rr, lr);  // RHS rightmost must be >= LHS rightmost\n\tassert_geq(rl, ll);  // RHS leftmost must be >= LHS leftmost\n\tassert_gt(rdlen, 0);\n\tassert_gt(reflen, 0);\n\tsize_t triml = 0, trimr = 0;\n\tsize_t maxgap = max(maxrdgap, maxrfgap);\n\tmaxgap = max(maxgap, maxhalf);\n\t// Amount of padding we have to add to account for the fact that alignments\n\t// ending between en_left/en_right might start in various columns in the\n\t// first row\n\tint64_t pad_left = maxgap;\n\tint64_t pad_right = maxgap;\n\tint64_t en_left  = rl;\n\tint64_t en_right = rr;\n\tint64_t st_left  = en_left - (rdlen-1);\n\tASSERT_ONLY(int64_t st_right = en_right - (rdlen-1));\n\tint64_t 
en_right_pad = en_right + pad_right;\n\tASSERT_ONLY(int64_t en_left_pad  = en_left  - pad_left);\n\tASSERT_ONLY(int64_t st_right_pad = st_right + pad_right);\n\tint64_t st_left_pad  = st_left  - pad_left;\n\tassert_leq(st_left, en_left);\n\tassert_geq(en_right, st_right);\n\tassert_leq(st_left_pad, en_left_pad);\n\tassert_geq(en_right_pad, st_right_pad);\n\tint64_t refl = st_left_pad;\n\tint64_t refr = en_right_pad;\n\tif(trimToRef_) {\n\t\tmaxns = 0;\n\t} else if(maxns == (int64_t)rdlen) {\n\t\tmaxns--;\n\t}\n\t// Trim from the RHS of the rectangle?\n\tif(refr >= reflen + maxns) {\n\t\ttrimr = (size_t)(refr - (reflen + maxns - 1));\n\t}\n\t// Trim from the LHS of the rectangle?\n\tif(refl < -maxns) {\n\t\ttriml = (size_t)(-refl) - (size_t)maxns;\n\t}\n\tsize_t width = (size_t)(refr - refl + 1);\n\trect.refl_pretrim = refl;\n\trect.refr_pretrim = refr;\n\trect.refl  = refl + triml;\n\trect.refr  = refr - trimr;\n\trect.triml = triml;\n\trect.trimr = trimr;\n\trect.maxgap = maxgap;\n\trect.corel = maxgap;\n\trect.corer = width - maxgap - 1; // inclusive\n\tassert(rect.repOk());\n\treturn !rect.entirelyTrimmed();\n}\n\n/**\n * Set up variables that describe the shape of a dynamic programming matrix to\n * be filled in.  The matrix is built around the diagonals that begin in the\n * range of columns where the LHS of the opposite mate must fall in order to\n * satisfy the fragment-length constraint.  These are the \"mate\" diagonals and\n * they also happen to be the \"core\" diagonals in this case.\n *\n * The N diagonals to the right of the mate diagonals are the \"RHS gap\"\n * diagonals, where N is the maximum number of read or reference gaps permitted\n * (whichever is larger).  
The N diagonals to the left of the mate diagonals\n * are the \"LHS gap\" diagonals.\n *\n * The purpose of arranging these groupings of diagonals is that a subset\n * of them, the \"core diagonals\", can now be considered \"covered.\"  By\n * \"covered\" I mean that any alignment that overlaps a cell in any of the core\n * diagonals cannot possibly overlap another, higher-scoring alignment that\n * falls partially outside the rectangle.\n *\n *    ll          lr\n *    v           v\n *    | LHS Range |\n *    XXXXXXXXXXXXX          |Anchor|\n *  OO0000000000000oo--------o--------  0: mate diagonal (also core diags!)\n *  -OO0000000000000oo--------o-------  o: \"RHS gap\" diagonals\n *  --OO0000000000000oo--------o------  O: \"LHS gap\" diagonals\n *  ---OO0000000000000oo--------oo----  *: \"LHS extra\" diagonals\n *  ----OO0000000000000oo---------o---  -: cells that can't possibly be\n *  -----OO0000000000000oo---------o--     involved in a valid alignment that\n *  ------OO0000000000000oo---------o-     overlaps one of the core diagonals\n *\n * The \"core diagonals\" are marked with 0s.\n *\n * A caveat is that, for performance reasons, we place an upper limit on N -\n * the maximum number of read or reference gaps.  It is constrained to be no\n * greater than 'maxgap'.  This means that in some situations, we may report an\n * alignment that spuriously trumps a better alignment that falls partially\n * outside the rectangle.  
Also, we may fail to find a valid alignment with\n * more than 'maxgap' gaps.\n *\n * Another issue is trimming: if the seed hit is sufficiently close to one or\n * both ends of the reference sequence, and either (a) overhang is not\n * permitted, or (b) the number of Ns permitted is less than the number of\n * columns that overhang the reference, then we want to exclude the trimmed\n * columns from the rectangle.\n */\nbool DynProgFramer::frameFindMateAnchorRightRect(\n\tint64_t ll,       // leftmost Watson off for LHS of opp alignment\n\tint64_t lr,       // rightmost Watson off for LHS of opp alignment\n\tint64_t rl,       // leftmost Watson off for RHS of opp alignment\n\tint64_t rr,       // rightmost Watson off for RHS of opp alignment\n\tsize_t rdlen,     // length of opposite mate\n\tint64_t reflen,   // length of reference sequence aligned to\n\tsize_t maxrdgap,  // max # of read gaps permitted in opp mate alignment\n\tsize_t maxrfgap,  // max # of ref gaps permitted in opp mate alignment\n\tint64_t maxns,    // max # ns permitted in the alignment\n\tsize_t maxhalf,   // max width in either direction\n\tDPRect& rect)     // out: DP rectangle\n\tconst\n{\n\tassert_geq(lr, ll);\n\tassert_geq(rr, rl);\n\tassert_geq(rr, lr);\n\tassert_geq(rl, ll);\n\tassert_gt(rdlen, 0);\n\tassert_gt(reflen, 0);\n\tsize_t triml = 0, trimr = 0;\n\tsize_t maxgap = max(maxrdgap, maxrfgap);\n\tmaxgap = max(maxgap, maxhalf);\n\tint64_t pad_left = maxgap;\n\tint64_t pad_right = maxgap;\n\tint64_t st_left = ll;\n\tint64_t st_right = lr;\n\tASSERT_ONLY(int64_t en_left = st_left + (rdlen-1));\n\tint64_t en_right = st_right + (rdlen-1);\n\tint64_t en_right_pad = en_right + pad_right;\n\tASSERT_ONLY(int64_t en_left_pad  = en_left  - pad_left);\n\tASSERT_ONLY(int64_t st_right_pad = st_right + pad_right);\n\tint64_t st_left_pad  = st_left  - pad_left;\n\tassert_leq(st_left, en_left);\n\tassert_geq(en_right, st_right);\n\tassert_leq(st_left_pad, en_left_pad);\n\tassert_geq(en_right_pad, 
st_right_pad);\n\t// We have enough info to deduce where the boundaries of our rectangle\n\t// should be.  Finalize the boundaries, ignoring reference trimming for now\n\tint64_t refl = st_left_pad;\n\tint64_t refr = en_right_pad;\n\tif(trimToRef_) {\n\t\tmaxns = 0;\n\t} else if(maxns == (int64_t)rdlen) {\n\t\tmaxns--;\n\t}\n\t// Trim from the RHS of the rectangle?\n\tif(refr >= reflen + maxns) {\n\t\ttrimr = (size_t)(refr - (reflen + maxns - 1));\n\t}\n\t// Trim from the LHS of the rectangle?\n\tif(refl < -maxns) {\n\t\ttriml = (size_t)(-refl) - (size_t)maxns;\n\t}\n\tsize_t width = (size_t)(refr - refl + 1);\n\trect.refl_pretrim = refl;\n\trect.refr_pretrim = refr;\n\trect.refl  = refl + triml;\n\trect.refr  = refr - trimr;\n\trect.triml = triml;\n\trect.trimr = trimr;\n\trect.maxgap = maxgap;\n\trect.corel = maxgap;\n\trect.corer = width - maxgap - 1; // inclusive\n\tassert(rect.repOk());\n\treturn !rect.entirelyTrimmed();\n}\n\n#ifdef MAIN_DP_FRAMER\n\n#include <iostream>\n\nstatic void testCaseFindMateAnchorLeft(\n\tconst char *testName,\n\tbool trimToRef,\n\tint64_t ll,\n\tint64_t lr,\n\tint64_t rl,\n\tint64_t rr,\n\tsize_t rdlen,\n\tsize_t reflen,\n\tsize_t maxrdgap,\n\tsize_t maxrfgap,\n\tsize_t ex_width,\n\tsize_t ex_solwidth,\n\tsize_t ex_trimup,\n\tsize_t ex_trimdn,\n\tint64_t ex_refl,\n\tint64_t ex_refr,\n\tconst char *ex_st,    // string of '0'/'1' chars\n\tconst char *ex_en)    // string of '0'/'1' chars\n{\n\tcerr << testName << \"...\";\n\tDynProgFramer fr(trimToRef);\n\tsize_t width, solwidth;\n\tint64_t refl, refr;\n\tEList<bool> st, en;\n\tsize_t trimup, trimdn;\n\tsize_t maxhalf = 500;\n\tsize_t maxgaps = 0;\n\tfr.frameFindMateAnchorLeft(\n\t\tll,       // leftmost Watson off for LHS of opp alignment\n\t\tlr,       // rightmost Watson off for LHS of opp alignment\n\t\trl,       // leftmost Watson off for RHS of opp alignment\n\t\trr,       // rightmost Watson off for RHS of opp alignment\n\t\trdlen,    // length of opposite mate\n\t\treflen,   
// length of reference sequence aligned to\n\t\tmaxrdgap, // max # of read gaps permitted in opp mate alignment\n\t\tmaxrfgap, // max # of ref gaps permitted in opp mate alignment\n\t\tmaxns,    // max # Ns permitted\n\t\tmaxhalf,  // max width in either direction\n\t\twidth,    // out: calculated width stored here\n\t\tmaxgaps,  // out: max # gaps\n\t\ttrimup,   // out: number of bases trimmed from upstream end\n\t\ttrimdn,   // out: number of bases trimmed from downstream end\n\t\trefl,     // out: ref pos of upper LHS of parallelogram\n\t\trefr,     // out: ref pos of lower RHS of parallelogram\n\t\tst,       // out: legal starting columns stored here\n\t\ten);      // out: legal ending columns stored here\n\tassert_eq(ex_width, width);\n\tassert_eq(ex_solwidth, solwidth);\n\tassert_eq(ex_trimup, trimup);\n\tassert_eq(ex_trimdn, trimdn);\n\tassert_eq(ex_refl, refl);\n\tassert_eq(ex_refr, refr);\n\tfor(size_t i = 0; i < width; i++) {\n\t\tassert_eq((ex_st[i] == '1'), st[i]);\n\t\tassert_eq((ex_en[i] == '1'), en[i]);\n\t}\n\tcerr << \"PASSED\" << endl;\n}\n\nstatic void testCaseFindMateAnchorRight(\n\tconst char *testName,\n\tbool trimToRef,\n\tint64_t ll,\n\tint64_t lr,\n\tint64_t rl,\n\tint64_t rr,\n\tsize_t rdlen,\n\tsize_t reflen,\n\tsize_t maxrdgap,\n\tsize_t maxrfgap,\n\tsize_t ex_width,\n\tsize_t ex_solwidth,\n\tsize_t ex_trimup,\n\tsize_t ex_trimdn,\n\tint64_t ex_refl,\n\tint64_t ex_refr,\n\tconst char *ex_st,    // string of '0'/'1' chars\n\tconst char *ex_en)    // string of '0'/'1' chars\n{\n\tcerr << testName << \"...\";\n\tDynProgFramer fr(trimToRef);\n\tsize_t width, solwidth;\n\tsize_t maxgaps;\n\tint64_t refl, refr;\n\tEList<bool> st, en;\n\tsize_t trimup, trimdn;\n\tsize_t maxhalf = 500;\n\tfr.frameFindMateAnchorRight(\n\t\tll,       // leftmost Watson off for LHS of opp alignment\n\t\tlr,       // rightmost Watson off for LHS of opp alignment\n\t\trl,       // leftmost Watson off for RHS of opp alignment\n\t\trr,       // rightmost Watson off for 
RHS of opp alignment\n\t\trdlen,    // length of opposite mate\n\t\treflen,   // length of reference sequence aligned to\n\t\tmaxrdgap, // max # of read gaps permitted in opp mate alignment\n\t\tmaxrfgap, // max # of ref gaps permitted in opp mate alignment\n\t\tmaxns,    // max # Ns permitted\n\t\tmaxhalf,  // max width in either direction\n\t\twidth,    // out: calculated width stored here\n\t\tmaxgaps,  // out: calculated max # gaps\n\t\ttrimup,   // out: number of bases trimmed from upstream end\n\t\ttrimdn,   // out: number of bases trimmed from downstream end\n\t\trefl,     // out: ref pos of upper LHS of parallelogram\n\t\trefr,     // out: ref pos of lower RHS of parallelogram\n\t\tst,       // out: legal starting columns stored here\n\t\ten);      // out: legal ending columns stored here\n\tassert_eq(ex_width, width);\n\tassert_eq(ex_trimup, trimup);\n\tassert_eq(ex_trimdn, trimdn);\n\tassert_eq(ex_refl, refl);\n\tassert_eq(ex_refr, refr);\n\tfor(size_t i = 0; i < width; i++) {\n\t\tassert_eq((ex_st[i] == '1'), st[i]);\n\t\tassert_eq((ex_en[i] == '1'), en[i]);\n\t}\n\tcerr << \"PASSED\" << endl;\n}\n\nint main(void) {\n\t\n\t///////////////////////////\n\t//\n\t// ANCHOR ON THE LEFT\n\t//\n\t///////////////////////////\n\n\t//    -------------\n\t//       o     o\n\t//        o     o\n\t//         o     o\n\t//          o     o\n\t//        <<<------->>>\n\t// 012345678901234567890\n\t// 0         1         2\n\ttestCaseFindMateAnchorLeft(\n\t\t\"FindMateAnchorLeft1\",\n\t\tfalse,            // trim to reference\n\t\t3,                // left offset of upper parallelogram extent\n\t\t15,               // right offset of upper parallelogram extent\n\t\t10,               // left offset of lower parallelogram extent\n\t\t16,               // right offset of lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t3,                // max # of read gaps permitted in opp 
mate alignment\n\t\t3,                // max # of ref gaps permitted in opp mate alignment\n\t\t13,               // expected width\n\t\t0,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t3,                // ref offset of upstream column\n\t\t19,               // ref offset of downstream column\n\t\t\"1111111111111\",  // expected starting bools\n\t\t\"0001111111000\"); // expected ending bools\n\n\t//        *******\n\t//     <<===-----\n\t//       o    o\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//         <<=----->>\n\t//            *******\n\t// 012345678901234567890\n\t// 0         1         2\n\ttestCaseFindMateAnchorLeft(\n\t\t\"FindMateAnchorLeft2\",\n\t\tfalse,            // trim to reference\n\t\t9,                // left offset of left upper parallelogram extent\n\t\t14,               // right offset of left upper parallelogram extent\n\t\t10,               // left offset of left lower parallelogram extent\n\t\t15,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t7,                // expected width\n\t\t3,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t7,                // ref offset of upstream column\n\t\t17,               // ref offset of downstream column\n\t\t\"0011111\",        // expected starting bools\n\t\t\"1111100\");       // expected ending bools\n\n\t//        *******\n\t//     <<===--->>\n\t//       o    o\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//           o    o\n\t//         <<=----->>\n\t//            *******\n\t// 
01234567890123456xxxx\n\t// 0         1         2\n\ttestCaseFindMateAnchorLeft(\n\t\t\"FindMateAnchorLeft3\",\n\t\ttrue,             // trim to reference\n\t\t9,                // left offset of left upper parallelogram extent\n\t\t14,               // right offset of left upper parallelogram extent\n\t\t10,               // left offset of left lower parallelogram extent\n\t\t15,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t17,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t7,                // expected width\n\t\t3,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t7,                // ref offset of upstream column\n\t\t17,               // ref offset of downstream column\n\t\t\"0011111\",        // expected starting bools\n\t\t\"1111100\");       // expected ending bools\n\n\t//        ******\n\t//     <<===-----\n\t//       o    o\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//         <<=----=>>\n\t//            ******\n\t// 012345678901234xxxxxx\n\t// 0         1         2\n\ttestCaseFindMateAnchorLeft(\n\t\t\"FindMateAnchorLeft4\",\n\t\ttrue,             // trim to reference\n\t\t9,                // left offset of left upper parallelogram extent\n\t\t14,               // right offset of left upper parallelogram extent\n\t\t10,               // left offset of left lower parallelogram extent\n\t\t15,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t15,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp 
mate alignment\n\t\t6,                // expected width\n\t\t3,                // expected # bases trimmed from upstream end\n\t\t1,                // expected # bases trimmed from downstream end\n\t\t7,                // ref offset of upstream column\n\t\t16,               // ref offset of downstream column\n\t\t\"001111\",         // expected starting bools\n\t\t\"111100\");        // expected ending bools\n\n\t// -1         0         2\n\t//  xxxxxxxxxx012345678xx\n\t//\n\t//           *******\n\t//        <<===-----\n\t//          o    o\n\t//           o    o\n\t//            o    o\n\t//             o    o\n\t//              o    o\n\t//            <<=----->>\n\t//               *******\n\t//                \n\t//  xxxxxxxxxx012345678xx\n\t// -1         0         2\n\ttestCaseFindMateAnchorLeft(\n\t\t\"FindMateAnchorLeft5\",\n\t\ttrue,             // trim to reference\n\t\t1,                // left offset of left upper parallelogram extent\n\t\t7,                // right offset of left upper parallelogram extent\n\t\t2,                // left offset of left lower parallelogram extent\n\t\t7,                // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t9,                // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t7,                // expected width\n\t\t3,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t-1,               // ref offset of upstream column\n\t\t9,                // ref offset of downstream column\n\t\t\"0011111\",        // expected starting bools\n\t\t\"1111100\");       // expected ending bools\n\n\t//   <<<<==-===>>\n\t//       o    o\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//       <<<<------>>\n\t//           
******\n\t// 012345678901234567890\n\t// 0         1         2\n\ttestCaseFindMateAnchorLeft(\n\t\t\"FindMateAnchorLeft6\",\n\t\tfalse,            // trim to reference\n\t\t8,                // left offset of left upper parallelogram extent\n\t\t8,                // right offset of left upper parallelogram extent\n\t\t10,               // left offset of left lower parallelogram extent\n\t\t15,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t4,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t6,                // expected width\n\t\t4,                // expected # bases trimmed from upstream end\n\t\t2,                // expected # bases trimmed from downstream end\n\t\t6,                // ref offset of upstream column\n\t\t15,               // ref offset of downstream column\n\t\t\"001000\",         // expected starting bools\n\t\t\"111111\");        // expected ending bools\n\n\t///////////////////////////\n\t//\n\t// ANCHOR ON THE RIGHT\n\t//\n\t///////////////////////////\n\n\t//        <<<------->>>\n\t//           o     o\n\t//            o     o\n\t//             o     o\n\t//              o     o\n\t//            <<<------->>>\n\t// 012345678901234567890123456789\n\t// 0         1         2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight1\",\n\t\tfalse,            // trim to reference\n\t\t10,               // left offset of left upper parallelogram extent\n\t\t16,               // right offset of left upper parallelogram extent\n\t\t11,               // left offset of left lower parallelogram extent\n\t\t23,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t3,                // 
max # of read gaps permitted in opp mate alignment\n\t\t3,                // max # of ref gaps permitted in opp mate alignment\n\t\t13,               // expected width\n\t\t0,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t7,                // ref offset of upstream column\n\t\t23,               // ref offset of downstream column\n\t\t\"0001111111000\",  // expected starting bools\n\t\t\"1111111111111\"); // expected ending bools\n\n\t// 0         1         2\n\t// 012345678901234567890\n\t//        *******\n\t//     <<------>>\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//           o    o\n\t//         <<===--->>\n\t//            *******\n\t// 012345678901234567890\n\t// 0         1         2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight2\",\n\t\tfalse,            // trim to reference\n\t\t6,                // left offset of left upper parallelogram extent\n\t\t11,               // right offset of left upper parallelogram extent\n\t\t13,               // left offset of left lower parallelogram extent\n\t\t18,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t7,                // expected width\n\t\t3,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t7,                // ref offset of upstream column\n\t\t17,               // ref offset of downstream column\n\t\t\"1111100\",        // expected starting bools\n\t\t\"0011111\");       // expected ending bools\n\n\t// Reference trimming takes off the left_pad of the left mate\n\t//\n\t//             *******\n\t//          
<<------>>\n\t//            o    o\n\t//             o    o\n\t//              o    o\n\t//               o    o\n\t//                o    o\n\t//              <<===--->>\n\t//                 *******\n\t//  0123456789012345678901234567890\n\t// -1         0         1         2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight3\",\n\t\ttrue,             // trim to reference\n\t\t0,                // left offset of left upper parallelogram extent\n\t\t5,                // right offset of left upper parallelogram extent\n\t\t7,                // left offset of left lower parallelogram extent\n\t\t11,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t7,                // expected width\n\t\t3,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t1,                // ref offset of upstream column\n\t\t11,               // ref offset of downstream column\n\t\t\"1111100\",        // expected starting bools\n\t\t\"0011111\");       // expected ending bools\n\n\t// Reference trimming takes off the leftmost 5 positions of the left mate,\n\t// and takes 1 from the right mate\n\t//\n\t//            *****\n\t//       <<------>>\n\t//         o    o\n\t//          o    o\n\t//           o    o\n\t//            o    o\n\t//             o    o\n\t//           <<===--->>\n\t//                *****\n\t//  0987654321012345678901234567890\n\t// -1         0         1         2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight4\",\n\t\ttrue,             // trim to reference\n\t\t-3,               // left offset of left upper parallelogram extent\n\t\t2,                // right offset of left upper 
parallelogram extent\n\t\t4,                // left offset of left lower parallelogram extent\n\t\t10,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t5,                // expected width\n\t\t5,                // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t0,                // ref offset of upstream column\n\t\t8,                // ref offset of downstream column\n\t\t\"11100\",          // expected starting bools\n\t\t\"11111\");         // expected ending bools\n\n\t// Reference trimming takes off the leftmost 5 positions of the left mate,\n\t// and takes 1 from the left of the right mate.  Also, it takes 2 from the\n\t// right of the right mate.\n\t//\n\t//            ***\n\t//       <<------>>\n\t//         o    o\n\t//          o    o\n\t//           o    o\n\t//            o    o\n\t//             o    o\n\t//           <<===--->>\n\t//                ***\n\t//  0987654321012345678901234567890\n\t// -1         0         1         2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight5\",\n\t\ttrue,             // trim to reference\n\t\t-3,               // left offset of left upper parallelogram extent\n\t\t2,                // right offset of left upper parallelogram extent\n\t\t4,                // left offset of left lower parallelogram extent\n\t\t10,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t7,                // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t3,     
           // expected width\n\t\t5,                // expected # bases trimmed from upstream end\n\t\t2,                // expected # bases trimmed from downstream end\n\t\t0,                // ref offset of upstream column\n\t\t6,                // ref offset of downstream column\n\t\t\"111\",            // expected starting bools\n\t\t\"111\");           // expected ending bools\n\n\t//       ******\n\t//     <<------>>>>\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//           o    o\n\t//         <<====-=>>>>\n\t//           ******\n\t// 012345678901234567890\n\t// 0         1         2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight6\",\n\t\tfalse,            // trim to reference\n\t\t6,                // left offset of left upper parallelogram extent\n\t\t11,               // right offset of left upper parallelogram extent\n\t\t14,               // left offset of left lower parallelogram extent\n\t\t14,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t4,                // max # of read gaps permitted in opp mate alignment\n\t\t2,                // max # of ref gaps permitted in opp mate alignment\n\t\t6,                // expected width\n\t\t2,                // expected # bases trimmed from upstream end\n\t\t4,                // expected # bases trimmed from downstream end\n\t\t6,                // ref offset of upstream column\n\t\t15,               // ref offset of downstream column\n\t\t\"111111\",         // expected starting bools\n\t\t\"000010\");        // expected ending bools\n\n\t//         ****\n\t//   <<<<==---->>\n\t//       o    o\n\t//        o    o\n\t//         o    o\n\t//          o    o\n\t//           o    o\n\t//       <<<<====-=>>\n\t//             ****\n\t// 012345678901234567890\n\t// 0         1         
2\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight7\",\n\t\tfalse,            // trim to reference\n\t\t6,                // left offset of left upper parallelogram extent\n\t\t11,               // right offset of left upper parallelogram extent\n\t\t14,               // left offset of left lower parallelogram extent\n\t\t14,               // right offset of left lower parallelogram extent\n\t\t5,                // length of opposite mate\n\t\t30,               // length of reference sequence aligned to\n\t\t2,                // max # of read gaps permitted in opp mate alignment\n\t\t4,                // max # of ref gaps permitted in opp mate alignment\n\t\t4,                // expected width\n\t\t6,                // expected # bases trimmed from upstream end\n\t\t2,                // expected # bases trimmed from downstream end\n\t\t8,                // ref offset of upstream column\n\t\t15,               // ref offset of downstream column\n\t\t\"1111\",           // expected starting bools\n\t\t\"0010\");          // expected ending bools\n\t\n\ttestCaseFindMateAnchorRight(\n\t\t\"FindMateAnchorRight8\",\n\t\ttrue,             // trim to reference\n\t\t-37,              // left offset of left upper parallelogram extent\n\t\t13,               // right offset of left upper parallelogram extent\n\t\t-37,              // left offset of left lower parallelogram extent\n\t\t52,               // right offset of left lower parallelogram extent\n\t\t10,               // length of opposite mate\n\t\t53,               // length of reference sequence aligned to\n\t\t0,                // max # of read gaps permitted in opp mate alignment\n\t\t0,                // max # of ref gaps permitted in opp mate alignment\n\t\t14,               // expected width\n\t\t37,               // expected # bases trimmed from upstream end\n\t\t0,                // expected # bases trimmed from downstream end\n\t\t0,                // ref offset of upstream column\n\t\t22,            
   // ref offset of downstream column\n\t\t\"11111111111111\", // expected starting bools\n\t\t\"11111111111111\");// expected ending bools\n}\n\n#endif /*def MAIN_DP_FRAMER*/\n"
  },
  {
    "path": "dp_framer.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/*\n *  dp_framer.h\n *\n * Classes and routines for framing dynamic programming problems.  There are 2\n * basic types of dynamic programming problems solved in Bowtie 2:\n *\n * 1. Seed extension: we found a seed hit using Burrows-Wheeler techniques and\n *    now we would like to extend it into a full alignment by doing dynamic\n *    programming in the vicinity of the seed hit.\n *\n * 2. Mate finding: we found a full alignment for one mate in a pair and now we\n *    would like to extend it into a full alignment by doing dynamic\n *    programming in the area prescribed by the maximum and minimum fragment\n *    lengths.\n *\n * By \"framing\" the dynamic programming problem, we mean that all of the\n * following DP inputs are calculated:\n *\n * 1. The width of the parallelogram/rectangle to explore.\n * 2. The 0-based offset of the reference position associated with the leftmost\n *    diagonal/column in the parallelogram/rectangle to explore\n * 3. An EList<bool> of length=width encoding which columns the alignment may\n *    start in\n * 4. 
An EList<bool> of length=width encoding which columns the alignment may\n *    end in\n */\n\n#ifndef DP_FRAMER_H_\n#define DP_FRAMER_H_\n\n#include <stdint.h>\n#include \"ds.h\"\n#include \"ref_coord.h\"\n\n/**\n * Describes a dynamic programming rectangle.\n *\n * Only knows about reference offsets, not reference sequences.\n */\nstruct DPRect {\n\n\tDPRect(int cat = 0) /*: st(cat), en(cat)*/ {\n\t\trefl = refr = triml = trimr = corel = corer = 0;\n\t}\n\n\tint64_t refl;         // leftmost ref offset involved post trimming (incl)\n\tint64_t refr;         // rightmost ref offset involved post trimming (incl)\n\n\tint64_t refl_pretrim; // leftmost ref offset involved pre trimming (incl)\n\tint64_t refr_pretrim; // rightmost ref offset involved pre trimming (incl)\n\t\n\tsize_t  triml;        // positions trimmed from LHS\n\tsize_t  trimr;        // positions trimmed from RHS\n\t\n\t// If \"core\" diagonals are specified, then any alignment reported has to\n\t// overlap one of the core diagonals.  This is to avoid the situation where\n\t// an alignment is reported that overlaps a better-scoring alignment that\n\t// falls partially outside the rectangle.  This is used in both seed\n\t// extensions and in mate finding.  Filtering based on the core diagonals\n\t// should happen in the backtrace routine.  I.e. 
it should simply never\n\t// return an alignment that doesn't overlap a core diagonal, even if there\n\t// is such an alignment and it's valid.\n\t\n\tsize_t  corel; // offset of column where leftmost \"core\" diagonal starts\n\tsize_t  corer; // offset of column where rightmost \"core\" diagonal starts\n\t// [corel, corer] is an inclusive range and offsets are with respect to the\n\t// original, untrimmed rectangle.\n\t\n\tsize_t  maxgap; // max # gaps - width of the gap bands\n\t\n\t/**\n\t * Return true iff the combined effect of triml and trimr is to trim away\n\t * the entire rectangle.\n\t */\n\tbool entirelyTrimmed() const {\n\t\tbool tr = refr < refl;\n\t\tASSERT_ONLY(size_t width = (size_t)(refr_pretrim - refl_pretrim + 1));\n\t\tassert(tr == (width <= triml + trimr));\n\t\treturn tr;\n\t}\n\t\n#ifndef NDEBUG\n\tbool repOk() const {\n\t\tassert_geq(corer, corel);\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Set the given interval to the range of diagonals that are \"covered\" by\n\t * this dynamic programming problem.\n\t */\n\tvoid initIval(Interval& iv) {\n\t\tiv.setOff(refl_pretrim + (int64_t)corel);\n\t\tiv.setLen(corer - corel + 1);\n\t}\n};\n\n/**\n * Encapsulates routines for calculating parameters for the various types of\n * dynamic programming problems solved in Bowtie2.\n */\nclass DynProgFramer {\n\npublic:\n\n\tDynProgFramer(bool trimToRef) : trimToRef_(trimToRef) { }\n\n\t/**\n\t * Similar to frameSeedExtensionParallelogram but we're being somewhat more\n\t * inclusive in order to ensure all characters along the \"width\" in the last\n\t * row are exhaustively scored.\n\t */\n\tbool frameSeedExtensionRect(\n\t\tint64_t off,      // ref offset implied by seed hit assuming no gaps\n\t\tsize_t rdlen,     // length of read sequence used in DP table (so len\n\t\t\t\t\t\t  // of +1 nucleotide sequence for colorspace reads)\n\t\tint64_t reflen,   // length of reference sequence aligned to\n\t\tsize_t maxrdgap,  // max # of read gaps permitted in 
opp mate alignment\n\t\tsize_t maxrfgap,  // max # of ref gaps permitted in opp mate alignment\n\t\tint64_t maxns,    // # Ns permitted\n\t\tsize_t maxhalf,   // max width in either direction\n\t\tDPRect& rect);    // out: DP rectangle\n\n\t/**\n\t * Given information about an anchor mate hit, and information deduced by\n\t * PairedEndPolicy about where the opposite mate can begin and start given\n\t * the fragment length range, return parameters for the dynamic programming\n\t * problem to solve.\n\t */\n\tbool frameFindMateRect(\n\t\tbool anchorLeft,  // true iff anchor alignment is to the left\n\t\tint64_t ll,       // leftmost Watson off for LHS of opp alignment\n\t\tint64_t lr,       // rightmost Watson off for LHS of opp alignment\n\t\tint64_t rl,       // leftmost Watson off for RHS of opp alignment\n\t\tint64_t rr,       // rightmost Watson off for RHS of opp alignment\n\t\tsize_t  rdlen,    // length of opposite mate\n\t\tint64_t reflen,   // length of reference sequence aligned to\n\t\tsize_t  maxrdgap, // max # of read gaps permitted in opp mate alignment\n\t\tsize_t  maxrfgap, // max # of ref gaps permitted in opp mate alignment\n\t\tint64_t maxns,    // max # Ns permitted\n\t\tsize_t  maxhalf,  // max width in either direction\n\t\tDPRect& rect)     // out: DP rectangle\n\t\tconst\n\t{\n\t\tif(anchorLeft) {\n\t\t\treturn frameFindMateAnchorLeftRect(\n\t\t\t\tll,\n\t\t\t\tlr,\n\t\t\t\trl,\n\t\t\t\trr,\n\t\t\t\trdlen,\n\t\t\t\treflen,\n\t\t\t\tmaxrdgap,\n\t\t\t\tmaxrfgap,\n\t\t\t\tmaxns,\n\t\t\t\tmaxhalf,\n\t\t\t\trect);\n\t\t} else {\n\t\t\treturn frameFindMateAnchorRightRect(\n\t\t\t\tll,\n\t\t\t\tlr,\n\t\t\t\trl,\n\t\t\t\trr,\n\t\t\t\trdlen,\n\t\t\t\treflen,\n\t\t\t\tmaxrdgap,\n\t\t\t\tmaxrfgap,\n\t\t\t\tmaxns,\n\t\t\t\tmaxhalf,\n\t\t\t\trect);\n\t\t}\n\t}\n\n\t/**\n\t * Given information about an anchor mate hit, and information deduced by\n\t * PairedEndPolicy about where the opposite mate can begin and start given\n\t * the fragment length range, 
return parameters for the dynamic programming\n\t * problem to solve.\n\t */\n\tbool frameFindMateAnchorLeftRect(\n\t\tint64_t ll,       // leftmost Watson off for LHS of opp alignment\n\t\tint64_t lr,       // rightmost Watson off for LHS of opp alignment\n\t\tint64_t rl,       // leftmost Watson off for RHS of opp alignment\n\t\tint64_t rr,       // rightmost Watson off for RHS of opp alignment\n\t\tsize_t  rdlen,    // length of opposite mate\n\t\tint64_t reflen,   // length of reference sequence aligned to\n\t\tsize_t  maxrdgap, // max # of read gaps permitted in opp mate alignment\n\t\tsize_t  maxrfgap, // max # of ref gaps permitted in opp mate alignment\n\t\tint64_t maxns,    // max # Ns permitted in alignment\n\t\tsize_t  maxhalf,  // max width in either direction\n\t\tDPRect& rect)     // out: DP rectangle\n\t\tconst;\n\n\t/**\n\t * Given information about an anchor mate hit, and information deduced by\n\t * PairedEndPolicy about where the opposite mate can begin and start given\n\t * the fragment length range, return parameters for the dynamic programming\n\t * problem to solve.\n\t */\n\tbool frameFindMateAnchorRightRect(\n\t\tint64_t ll,       // leftmost Watson off for LHS of opp alignment\n\t\tint64_t lr,       // rightmost Watson off for LHS of opp alignment\n\t\tint64_t rl,       // leftmost Watson off for RHS of opp alignment\n\t\tint64_t rr,       // rightmost Watson off for RHS of opp alignment\n\t\tsize_t  rdlen,    // length of opposite mate\n\t\tint64_t reflen,   // length of reference sequence aligned to\n\t\tsize_t  maxrdgap, // max # of read gaps permitted in opp mate alignment\n\t\tsize_t  maxrfgap, // max # of ref gaps permitted in opp mate alignment\n\t\tint64_t maxns,    // max # Ns permitted in alignment\n\t\tsize_t  maxhalf,  // max width in either direction\n\t\tDPRect& rect)     // out: DP rectangle\n\t\tconst;\n\nprotected:\n\n\t/**\n\t * Trim the given parallelogram width and reference window so that neither\n\t * overhangs the 
beginning or end of the reference.  This routine returns\n\t * nothing; the overhang at each end is written to trimup and trimdn.\n\t */\n\tvoid trimToRef(\n\t\tsize_t   reflen,  // in: length of reference sequence aligned to\n\t\tint64_t& refl,    // in/out: ref pos of upper LHS of parallelogram\n\t\tint64_t& refr,    // in/out: ref pos of lower RHS of parallelogram\n\t\tsize_t&  trimup,  // out: number of bases trimmed from upstream end\n\t\tsize_t&  trimdn)  // out: number of bases trimmed from downstream end\n\t{\n\t\tif(refl < 0) {\n\t\t\ttrimup = (size_t)(-refl);\n\t\t\t//refl = 0;\n\t\t}\n\t\tif(refr >= (int64_t)reflen) {\n\t\t\ttrimdn = (size_t)(refr - reflen + 1);\n\t\t\t//refr = (int64_t)reflen-1;\n\t\t}\n\t}\n\n\tbool trimToRef_;\n};\n\n#endif /*ndef DP_FRAMER_H_*/\n"
  },
  {
    "path": "ds.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"ds.h\"\n\nMemoryTally gMemTally;\n\n/**\n * Tally a memory allocation of size amt bytes.\n */\nvoid MemoryTally::add(int cat, uint64_t amt) {\n\tThreadSafe ts(&mutex_m);\n\ttots_[cat] += amt;\n\ttot_ += amt;\n\tif(tots_[cat] > peaks_[cat]) {\n\t\tpeaks_[cat] = tots_[cat];\n\t}\n\tif(tot_ > peak_) {\n\t\tpeak_ = tot_;\n\t}\n}\n\n/**\n * Tally a memory free of size amt bytes.\n */\nvoid MemoryTally::del(int cat, uint64_t amt) {\n\tThreadSafe ts(&mutex_m);\n\tassert_geq(tots_[cat], amt);\n\tassert_geq(tot_, amt);\n\ttots_[cat] -= amt;\n\ttot_ -= amt;\n}\n\t\n#ifdef MAIN_DS\n\n#include <limits>\n#include \"random_source.h\"\n\nusing namespace std;\n\nint main(void) {\n\tcerr << \"Test EHeap 1...\";\n\t{\n\t\tEHeap<float> h;\n\t\th.insert(0.5f);  // 1\n\t\th.insert(0.6f);  // 2\n\t\th.insert(0.25f); // 3\n\t\th.insert(0.75f); // 4\n\t\th.insert(0.1f);  // 5\n\t\th.insert(0.9f);  // 6\n\t\th.insert(0.4f);  // 7\n\t\tassert_eq(7, h.size());\n\t\tif(h.pop() != 0.1f) {\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(6, h.size());\n\t\tif(h.pop() != 0.25f) {\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(5, h.size());\n\t\tif(h.pop() != 0.4f) {\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(4, h.size());\n\t\tif(h.pop() != 0.5f) 
{\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(3, h.size());\n\t\tif(h.pop() != 0.6f) {\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(2, h.size());\n\t\tif(h.pop() != 0.75f) {\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(1, h.size());\n\t\tif(h.pop() != 0.9f) {\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(0, h.size());\n\t\tassert(h.empty());\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Test EHeap 2...\";\n\t{\n\t\tEHeap<size_t> h;\n\t\tRandomSource rnd(12);\n\t\tsize_t lim = 2000;\n\t\twhile(h.size() < lim) {\n\t\t\th.insert(rnd.nextU32());\n\t\t}\n\t\tsize_t last = std::numeric_limits<size_t>::max();\n\t\tbool first = true;\n\t\twhile(!h.empty()) {\n\t\t\tsize_t p = h.pop();\n\t\t\tassert(first || p >= last);\n\t\t\tlast = p;\n\t\t\tfirst = false;\n\t\t}\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Test EBitList 1...\";\n\t{\n\t\tEBitList<128> l;\n\t\tassert_eq(0, l.size());\n\t\tassert_eq(std::numeric_limits<size_t>::max(), l.max());\n\t\t\n\t\tassert(!l.test(0));\n\t\tassert(!l.test(1));\n\t\tassert(!l.test(10));\n\t\t\n\t\tfor(int i = 0; i < 3; i++) {\n\t\t\tl.set(10);\n\t\t\tassert(!l.test(0));\n\t\t\tassert(!l.test(1));\n\t\t\tassert(!l.test(9));\n\t\t\tassert(l.test(10));\n\t\t\tassert(!l.test(11));\n\t\t}\n\t\t\n\t\tassert_eq(10, l.max());\n\t\tl.clear();\n\t\tassert(!l.test(10));\n\t\tassert_eq(std::numeric_limits<size_t>::max(), l.max());\n\t\t\n\t\tRandomSource rnd(12);\n\t\tsize_t lim = 2000;\n\t\tfor(size_t i = 0; i < lim; i++) {\n\t\t\tuint32_t ri = rnd.nextU32() % 10000;\n\t\t\tl.set(ri);\n\t\t\tassert(l.test(ri));\n\t\t}\n\t}\n\tcerr << \"PASSED\" << endl;\n}\n\n#endif /*def MAIN_DS*/\n"
  },
  {
    "path": "ds.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef DS_H_\n#define DS_H_\n\n#include <algorithm>\n#include <stdexcept>\n#include <utility>\n#include <stdint.h>\n#include <string.h>\n#include <limits>\n#include \"assert_helpers.h\"\n#include \"threading.h\"\n#include \"random_source.h\"\n#include \"btypes.h\"\n\n/**\n * Tally how much memory is allocated to certain \n */\nclass MemoryTally {\n\npublic:\n\n\tMemoryTally() : tot_(0), peak_(0) {\n\t\tmemset(tots_,  0, 256 * sizeof(uint64_t));\n\t\tmemset(peaks_, 0, 256 * sizeof(uint64_t));\n\t}\n\n\t/**\n\t * Tally a memory allocation of size amt bytes.\n\t */\n\tvoid add(int cat, uint64_t amt);\n\n\t/**\n\t * Tally a memory free of size amt bytes.\n\t */\n\tvoid del(int cat, uint64_t amt);\n\t\n\t/**\n\t * Return the total amount of memory allocated.\n\t */\n\tuint64_t total() { return tot_; }\n\n\t/**\n\t * Return the total amount of memory allocated in a particular\n\t * category.\n\t */\n\tuint64_t total(int cat) { return tots_[cat]; }\n\n\t/**\n\t * Return the peak amount of memory allocated.\n\t */\n\tuint64_t peak() { return peak_; }\n\n\t/**\n\t * Return the peak amount of memory allocated in a particular\n\t * category.\n\t */\n\tuint64_t peak(int cat) { return peaks_[cat]; }\n\n#ifndef 
NDEBUG\n\t/**\n\t * Check that memory tallies are internally consistent;\n\t */\n\tbool repOk() const {\n\t\tuint64_t tot = 0;\n\t\tfor(int i = 0; i < 256; i++) {\n\t\t\tassert_leq(tots_[i], peaks_[i]);\n\t\t\ttot += tots_[i];\n\t\t}\n\t\tassert_eq(tot, tot_);\n\t\treturn true;\n\t}\n#endif\n\nprotected:\n\n\tMUTEX_T mutex_m;\n\tuint64_t tots_[256];\n\tuint64_t tot_;\n\tuint64_t peaks_[256];\n\tuint64_t peak_;\n};\n\nextern MemoryTally gMemTally;\n\n/**\n * A simple fixed-length array of type T, automatically freed in the\n * destructor.\n */\ntemplate<typename T>\nclass AutoArray {\npublic:\n\n\tAutoArray(size_t sz, int cat = 0) : cat_(cat) {\n\t\tt_ = NULL;\n\t\tt_ = new T[sz];\n\t\tgMemTally.add(cat_, sz);\n\t\tmemset(t_, 0, sz * sizeof(T));\n\t\tsz_ = sz;\n\t}\n\t\n\t~AutoArray() {\n\t\tif(t_ != NULL) {\n\t\t\tdelete[] t_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t}\n\t}\n\t\n\tT& operator[](size_t sz) {\n\t\treturn t_[sz];\n\t}\n\t\n\tconst T& operator[](size_t sz) const {\n\t\treturn t_[sz];\n\t}\n\t\n\tsize_t size() const { return sz_; }\n\nprivate:\n\tint cat_;\n\tT *t_;\n\tsize_t sz_;\n};\n\n/**\n * A wrapper for a non-array pointer that associates it with a memory\n * category for tracking purposes and calls delete on it when the\n * PtrWrap is destroyed.\n */\ntemplate<typename T>\nclass PtrWrap {\npublic:\n\n\texplicit PtrWrap(\n\t\tT* p,\n\t\tbool freeable = true,\n\t\tint cat = 0) :\n\t\tcat_(cat),\n\t\tp_(NULL)\n\t{\n\t\tinit(p, freeable);\n\t}\n\n\texplicit PtrWrap(int cat = 0) :\n\t\tcat_(cat),\n\t\tp_(NULL)\n\t{\n\t\treset();\n\t}\n\n\tvoid reset() {\n\t\tfree();\n\t\tinit(NULL);\n\t}\n\n\t~PtrWrap() { free(); }\n\t\n\tvoid init(T* p, bool freeable = true) {\n\t\tassert(p_ == NULL);\n\t\tp_ = p;\n\t\tfreeable_ = freeable;\n\t\tif(p != NULL && freeable_) {\n\t\t\tgMemTally.add(cat_, sizeof(T));\n\t\t}\n\t}\n\t\n\tvoid free() {\n\t\tif(p_ != NULL) {\n\t\t\tif(freeable_) {\n\t\t\t\tdelete p_;\n\t\t\t\tgMemTally.del(cat_, sizeof(T));\n\t\t\t}\n\t\t\tp_ = 
NULL;\n\t\t}\n\t}\n\t\n\tinline T* get() { return p_; }\n\tinline const T* get() const { return p_; }\n\nprivate:\n\tint cat_;\n\tT *p_;\n\tbool freeable_;\n};\n\n/**\n * A wrapper for an array pointer that associates it with a memory\n * category for tracking purposes and calls delete[] on it when the\n * APtrWrap is destroyed.\n */\ntemplate<typename T>\nclass APtrWrap {\npublic:\n\n\texplicit APtrWrap(\n\t\tT* p,\n\t\tsize_t sz,\n\t\tbool freeable = true,\n\t\tint cat = 0) :\n\t\tcat_(cat),\n\t\tp_(NULL)\n\t{\n\t\tinit(p, sz, freeable);\n\t}\n\n\texplicit APtrWrap(int cat = 0) :\n\t\tcat_(cat),\n\t\tp_(NULL)\n\t{\n\t\treset();\n\t}\n\t\n\tvoid reset() {\n\t\tfree();\n\t\tinit(NULL, 0);\n\t}\n\n\t~APtrWrap() { free(); }\n\t\n\tvoid init(T* p, size_t sz, bool freeable = true) {\n\t\tassert(p_ == NULL);\n\t\tp_ = p;\n\t\tsz_ = sz;\n\t\tfreeable_ = freeable;\n\t\tif(p != NULL && freeable_) {\n\t\t\tgMemTally.add(cat_, sizeof(T) * sz_);\n\t\t}\n\t}\n\t\n\tvoid free() {\n\t\tif(p_ != NULL) {\n\t\t\tif(freeable_) {\n\t\t\t\tdelete[] p_;\n\t\t\t\tgMemTally.del(cat_, sizeof(T) * sz_);\n\t\t\t}\n\t\t\tp_ = NULL;\n\t\t}\n\t}\n\t\n\tinline T* get() { return p_; }\n\tinline const T* get() const { return p_; }\n\nprivate:\n\tint cat_;\n\tT *p_;\n\tbool freeable_;\n\tsize_t sz_;\n};\n\n/**\n * An EList<T> is an expandable list with these features:\n *\n *  - Payload type is a template parameter T.\n *  - Initial size can be specified at construction time, otherwise\n *    default of 128 is used.\n *  - When allocated initially or when expanding, the new[] operator is\n *    used, which in turn calls the default constructor for T.\n *  - All copies (e.g. assignment of a const T& to an EList<T> element,\n *    or during expansion) use operator=.\n *  - When the EList<T> is resized to a smaller size (or cleared, which\n *    is like resizing to size 0), the underlying container is not\n *    reshaped.  
Thus, EList<T>s never release memory before\n *    destruction.\n *\n * And these requirements:\n *\n *  - Payload type T must have a default constructor.\n *\n * For efficiency reasons, ELists should not be declared on the stack\n * in often-called worker functions.  Best practice is to declare\n * ELists at a relatively stable layer of the stack (such that it\n * rarely bounces in and out of scope) and let the worker function use\n * it and *expand* it only as needed.  The effect is that only\n * relatively few allocations and copies will be incurred, and they'll\n * occur toward the beginning of the computation before stabilizing at\n * a \"high water mark\" for the remainder of the computation.\n *\n * A word about multidimensional lists.  One way to achieve a\n * multidimensional list is to nest ELists.  This works, but it often\n * involves a lot more calls to the default constructor and to\n * operator=, especially when the outermost EList needs expanding, than\n * some of the alternatives.  One alternative is to use a more specialized\n * container that still uses ELists but knows to use xfer instead of\n * operator= when T=EList.\n *\n * The 'cat_' field encodes a category.  This makes it possible to\n * distinguish between object subgroups in the global memory tally.\n *\n * Memory allocation is lazy.  Allocation is only triggered when the\n * user calls push_back, expand, resize, or another function that\n * increases the size of the list.  
This saves memory and also makes it\n * easier to deal with nested ELists, since the default constructor\n * doesn't set anything in stone.\n */\ntemplate <typename T, int S = 128>\nclass EList {\n\npublic:\n\n\t/**\n\t * Allocate initial default of S elements.\n\t */\n\texplicit EList() :\n\t\tcat_(0), allocCat_(-1), list_(NULL), sz_(S), cur_(0) { }\n\n\t/**\n\t * Allocate initial default of S elements.\n\t */\n\texplicit EList(int cat) :\n\t\tcat_(cat), allocCat_(-1), list_(NULL), sz_(S), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Initially allocate given number of elements; should be > 0.\n\t */\n\texplicit EList(size_t isz, int cat = 0) :\n\t\tcat_(cat), allocCat_(-1), list_(NULL), sz_(isz), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Copy from another EList using operator=.\n\t */\n\tEList(const EList<T, S>& o) :\n\t\tcat_(0), allocCat_(-1), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Copy from another EList using operator=.\n\t */\n\texplicit EList(const EList<T, S>& o, int cat) :\n\t\tcat_(cat), allocCat_(-1), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~EList() { free(); }\n\n\t/**\n\t * Make this object into a copy of o by allocat\n\t */\n\tEList<T, S>& operator=(const EList<T, S>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tif(o.cur_ == 0) {\n\t\t\t// Nothing to copy\n\t\t\tcur_ = 0;\n\t\t\treturn *this;\n\t\t}\n\t\tif(list_ == NULL) {\n\t\t\t// cat_ should already be set\n\t\t\tlazyInit();\n\t\t}\n\t\tif(sz_ < o.cur_) expandNoCopy(o.cur_ + 1);\n\t\tassert_geq(sz_, o.cur_);\n\t\tcur_ = o.cur_;\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\tlist_[i] = o.list_[i];\n\t\t}\n\t\treturn *this;\n\t}\n\t\n\t/**\n\t * Transfer the guts of another EList into this one without using\n\t * operator=, etc.  
We have to set EList o's list_ field to NULL to\n\t * avoid o's destructor from deleting list_ out from under us.\n\t */\n\tvoid xfer(EList<T, S>& o) {\n\t\t// What does it mean to transfer to a different-category list?\n\t\tassert_eq(cat_, o.cat());\n\t\t// Can only transfer into an empty object\n\t\tfree();\n\t\tallocCat_ = cat_;\n\t\tlist_ = o.list_;\n\t\tsz_ = o.sz_;\n\t\tcur_ = o.cur_;\n\t\to.list_ = NULL;\n\t\to.sz_ = o.cur_ = 0;\n\t\to.allocCat_ = -1;\n\t}\n\n\t/**\n\t * Return number of elements.\n\t */\n\tinline size_t size() const { return cur_; }\n\n\t/**\n\t * Return number of elements allocated.\n\t */\n\tinline size_t capacity() const { return sz_; }\n\t\n\t/**\n\t * Return the total size in bytes occupied by this list.\n\t */\n\tsize_t totalSizeBytes() const {\n\t\treturn \t2 * sizeof(int) +\n\t\t        2 * sizeof(size_t) +\n\t\t\t\tcur_ * sizeof(T);\n\t}\n\n\t/**\n\t * Return the total capacity in bytes occupied by this list.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn \t2 * sizeof(int) +\n\t\t        2 * sizeof(size_t) +\n\t\t\t\tsz_ * sizeof(T);\n\t}\n\t\n\t/**\n\t * Ensure that there is sufficient capacity to expand to include\n\t * 'thresh' more elements without having to expand.\n\t */\n\tinline void ensure(size_t thresh) {\n\t\tif(list_ == NULL) lazyInit();\n\t\texpandCopy(cur_ + thresh);\n\t}\n\n\t/**\n\t * Ensure that there is sufficient capacity to include 'newsz' elements.\n\t * If there isn't enough capacity right now, expand capacity to exactly\n\t * equal 'newsz'.\n\t */\n\tinline void reserveExact(size_t newsz) {\n\t\tif(list_ == NULL) lazyInitExact(newsz);\n\t\texpandCopyExact(newsz);\n\t}\n\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tinline bool empty() const { return cur_ == 0; }\n\t\n\t/**\n\t * Return true iff list hasn't been initialized yet.\n\t */\n\tinline bool null() const { return list_ == NULL; }\n\n\t/**\n\t * Add an element to the back and immediately initialize it via\n\t * operator=.\n\t 
*/\n\tvoid push_back(const T& el) {\n\t\tif(list_ == NULL) lazyInit();\n\t\tif(cur_ == sz_) expandCopy(sz_+1);\n\t\tlist_[cur_++] = el;\n\t}\n\n\t/**\n\t * Add an element to the back.  No initialization is done.\n\t */\n\tvoid expand() {\n\t\tif(list_ == NULL) lazyInit();\n\t\tif(cur_ == sz_) expandCopy(sz_+1);\n\t\tcur_++;\n\t}\n\n\t/**\n\t * Set every element in the range [begin, end) to v via operator=.\n\t */\n\tvoid fill(size_t begin, size_t end, const T& v) {\n\t\tassert_leq(begin, end);\n\t\tassert_leq(end, cur_);\n\t\tfor(size_t i = begin; i < end; i++) {\n\t\t\tlist_[i] = v;\n\t\t}\n\t}\n\n\t/**\n\t * Set every element currently in the list to v via operator=.\n\t */\n\tvoid fill(const T& v) {\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\tlist_[i] = v;\n\t\t}\n\t}\n\n\t/**\n\t * Set all bits in specified range of elements in list array to 0.\n\t */\n\tvoid fillZero(size_t begin, size_t end) {\n\t\tassert_leq(begin, end);\n\t\tmemset(&list_[begin], 0, sizeof(T) * (end-begin));\n\t}\n\n\t/**\n\t * Set all bits in the list array to 0.\n\t */\n\tvoid fillZero() {\n\t\tmemset(list_, 0, sizeof(T) * cur_);\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resizeNoCopy(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) expandNoCopy(sz);\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) {\n\t\t\texpandCopy(sz);\n\t\t}\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to exactly sz and set\n\t * cur_ to requested sz.\n\t */\n\tvoid resizeExact(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInitExact(sz);\n\t\tif(sz <= cur_) {\n\t\t\tcur_ 
= sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) expandCopyExact(sz);\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * Erase element at offset idx.\n\t */\n\tvoid erase(size_t idx) {\n\t\tassert_lt(idx, cur_);\n\t\tfor(size_t i = idx; i < cur_-1; i++) {\n\t\t\tlist_[i] = list_[i+1];\n\t\t}\n\t\tcur_--;\n\t}\n\n\t/**\n\t * Erase range of elements starting at offset idx and going for len.\n\t */\n\tvoid erase(size_t idx, size_t len) {\n\t\tassert_geq(len, 0);\n\t\tif(len == 0) {\n\t\t\treturn;\n\t\t}\n\t\tassert_lt(idx, cur_);\n\t\tfor(size_t i = idx; i < cur_-len; i++) {\n\t\t\tlist_[i] = list_[i+len];\n\t\t}\n\t\tcur_ -= len;\n\t}\n\n\t/**\n\t * Insert value 'el' at offset 'idx'\n\t */\n\tvoid insert(const T& el, size_t idx) {\n\t\tif(list_ == NULL) lazyInit();\n\t\tassert_leq(idx, cur_);\n\t\tif(cur_ == sz_) expandCopy(sz_+1);\n\t\tfor(size_t i = cur_; i > idx; i--) {\n\t\t\tlist_[i] = list_[i-1];\n\t\t}\n\t\tlist_[idx] = el;\n\t\tcur_++;\n\t}\n\n\t/**\n\t * Insert contents of list 'l' at offset 'idx'\n\t */\n\tvoid insert(const EList<T>& l, size_t idx) {\n\t\tif(list_ == NULL) lazyInit();\n\t\tassert_lt(idx, cur_);\n\t\tif(l.cur_ == 0) return;\n\t\tif(cur_ + l.cur_ > sz_) expandCopy(cur_ + l.cur_);\n\t\tfor(size_t i = cur_ + l.cur_ - 1; i > idx + (l.cur_ - 1); i--) {\n\t\t\tlist_[i] = list_[i - l.cur_];\n\t\t}\n\t\tfor(size_t i = 0; i < l.cur_; i++) {\n\t\t\tlist_[i+idx] = l.list_[i];\n\t\t}\n\t\tcur_ += l.cur_;\n\t}\n\n\t/**\n\t * Remove an element from the top of the stack.\n\t */\n\tvoid pop_back() {\n\t\tassert_gt(cur_, 0);\n\t\tcur_--;\n\t}\n\n\t/**\n\t * Make the stack empty.\n\t */\n\tvoid clear() {\n\t\tcur_ = 0; // re-use stack memory\n\t\t// Don't clear heap; re-use it\n\t}\n\n\t/**\n\t * Get the element on the top of the stack.\n\t */\n\tinline T& back() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Reverse list elements.\n\t */\n\tvoid reverse() {\n\t\tif(cur_ > 1) {\n\t\t\tsize_t n = cur_ >> 1;\n\t\t\tfor(size_t i = 0; i < n; i++) 
{\n\t\t\t\tT tmp = list_[i];\n\t\t\t\tlist_[i] = list_[cur_ - i - 1];\n\t\t\t\tlist_[cur_ - i - 1] = tmp;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Get the element on the top of the stack, const version.\n\t */\n\tinline const T& back() const {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the frontmost element (bottom of stack).\n\t */\n\tinline T& front() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Get the element on the bottom of the stack, const version.\n\t */\n\tinline const T& front() const {\n\t\t// Index directly; calling front() here would recurse forever\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Return true iff this list and list o contain the same elements in the\n\t * same order according to type T's operator==.\n\t */\n\tbool operator==(const EList<T, S>& o) const {\n\t\tif(size() != o.size()) {\n\t\t\treturn false;\n\t\t}\n\t\tfor(size_t i = 0; i < size(); i++) {\n\t\t\tif(!(get(i) == o.get(i))) {\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Return true iff this list contains all of the elements in o according to\n\t * type T's operator==.\n\t */\n\tbool isSuperset(const EList<T, S>& o) const {\n\t\tif(o.size() > size()) {\n\t\t\t// This can't be a superset if the other set contains more elts\n\t\t\treturn false;\n\t\t}\n\t\t// For each element in o\n\t\tfor(size_t i = 0; i < o.size(); i++) {\n\t\t\tbool inthis = false;\n\t\t\t// Check if it's in this\n\t\t\tfor(size_t j = 0; j < size(); j++) {\n\t\t\t\tif(o[i] == (*this)[j]) {\n\t\t\t\t\tinthis = true;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(!inthis) {\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline T& operator[](size_t i) {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a const reference to the ith element.\n\t */\n\tinline const T& operator[](size_t i) const {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline 
T& get(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const T& get(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tT& getSlow(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tconst T& getSlow(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Sort some of the contents.\n\t */\n\tvoid sortPortion(size_t begin, size_t num) {\n\t\tsortPortion(begin, num, std::less<T>());\n\t}\n\n\ttemplate<class Compare>\n\tvoid sortPortion(size_t begin, size_t num, Compare comp) {\n\t\t\tassert_leq(begin+num, cur_);\n\t\t\tif(num < 2) return;\n\t\t\tstd::sort(list_ + begin, list_ + begin + num, comp);\n\t}\n\n\t/**\n\t * Shuffle a portion of the list.\n\t */\n\tvoid shufflePortion(size_t begin, size_t num, RandomSource& rnd) {\n\t\tassert_leq(begin+num, cur_);\n\t\tif(num < 2) return;\n\t\tsize_t left = num;\n\t\tfor(size_t i = begin; i < begin + num - 1; i++) {\n\t\t\tuint32_t rndi = rnd.nextU32() % left;\n\t\t\tif(rndi > 0) {\n\t\t\t\tstd::swap(list_[i], list_[i + rndi]);\n\t\t\t}\n\t\t\tleft--;\n\t\t}\n\t}\n\t\n\t/**\n\t * Sort contents\n\t */\n\tvoid sort() {\n\t\tsortPortion(0, cur_, std::less<T>());\n\t}\n\n\ttemplate <class Compare>\n\tvoid sort(Compare comp)  {\n\t\tsortPortion(0, cur_, comp);\n\t}\n\n\t/**\n\t * Return true iff every element is < its successor.  
Only operator< is\n\t * used.\n\t */\n\tbool sorted() const {\n\t\tfor(size_t i = 1; i < cur_; i++) {\n\t\t\tif(!(list_[i-1] < list_[i])) {\n\t\t\t\treturn false;\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Delete element at position 'idx'; slide subsequent chars up.\n\t */\n\tvoid remove(size_t idx) {\n\t\tassert_lt(idx, cur_);\n\t\tassert_gt(cur_, 0);\n\t\tfor(size_t i = idx; i < cur_-1; i++) {\n\t\t\tlist_[i] = list_[i+1];\n\t\t}\n\t\tcur_--;\n\t}\n\t\n\t/**\n\t * Return a pointer to the beginning of the buffer.\n\t */\n\tT *ptr() { return list_; }\n\n\t/**\n\t * Return a const pointer to the beginning of the buffer.\n\t */\n\tconst T *ptr() const { return list_; }\n\n\t/**\n\t * Set the memory category for this object.\n\t */\n\tvoid setCat(int cat) {\n\t\t// What does it mean to set the category after the list_ is\n\t\t// already allocated?\n\t\tassert(null());\n\t\tassert_gt(cat, 0); cat_ = cat;\n\t}\n\n\t/**\n\t * Return memory category.\n\t */\n\tint cat() const { return cat_; }\n\n\t/**\n\t * Perform a binary search for the first element that is not less\n\t * than 'el'.  Return cur_ if all elements are less than el.\n\t */\n\tsize_t bsearchLoBound(const T& el) const {\n\t\tsize_t hi = cur_;\n\t\tsize_t lo = 0;\n\t\twhile(true) {\n\t\t\tif(lo == hi) {\n\t\t\t\treturn lo;\n\t\t\t}\n\t\t\tsize_t mid = lo + ((hi-lo)>>1);\n\t\t\tassert_neq(mid, hi);\n\t\t\tif(list_[mid] < el) {\n\t\t\t\tif(lo == mid) {\n\t\t\t\t\treturn hi;\n\t\t\t\t}\n\t\t\t\tlo = mid;\n\t\t\t} else {\n\t\t\t\thi = mid;\n\t\t\t}\n\t\t}\n\t}\n\nprivate:\n\n\t/**\n\t * Initialize memory for EList.\n\t */\n\tvoid lazyInit() {\n\t\tassert(list_ == NULL);\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Initialize exactly the prescribed number of elements for EList.\n\t */\n\tvoid lazyInitExact(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tassert(list_ == NULL);\n\t\tsz_ = sz;\n\t\tlist_ = alloc(sz);\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  
Also,\n\t * tally into the global memory tally.\n\t */\n\tT *alloc(size_t sz) {\n\t\tT* tmp = new T[sz];\n\t\tassert(tmp != NULL);\n\t\tgMemTally.add(cat_, sz);\n\t\tallocCat_ = cat_;\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Free the list_ buffer, if allocated, and subtract its capacity from\n\t * the global memory tally.  Resets capacity and occupancy to 0.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tassert_neq(-1, allocCat_);\n\t\t\tassert_eq(allocCat_, cat_);\n\t\t\tdelete[] list_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t\tsz_ = cur_ = 0;\n\t\t}\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.  Size\n\t * roughly doubles with each expansion.  Copy old contents\n\t * into new buffer using operator=.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\texpandCopyExact(newsz);\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has exactly 'newsz' elements.  Copy\n\t * old contents into new buffer using operator=.\n\t */\n\tvoid expandCopyExact(size_t newsz) {\n\t\tif(newsz <= sz_) return;\n\t\tT* tmp = alloc(newsz);\n\t\tassert(tmp != NULL);\n\t\tsize_t cur = cur_;\n\t\tif(list_ != NULL) {\n\t\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t\t// Note: operator= is used\n\t\t\t\ttmp[i] = list_[i];\n\t\t\t}\n\t\t\tfree();\n\t\t}\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tcur_ = cur;\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Size roughly doubles with each expansion.  Don't copy old\n\t * contents into the new buffer.\n\t */\n\tvoid expandNoCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\texpandNoCopyExact(newsz);\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has exactly 'newsz' elements.  
Don't\n\t * copy old contents into the new buffer.\n\t */\n\tvoid expandNoCopyExact(size_t newsz) {\n\t\tassert(list_ != NULL);\n\t\tassert_gt(newsz, 0);\n\t\tfree();\n\t\tT* tmp = alloc(newsz);\n\t\tassert(tmp != NULL);\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tassert_gt(sz_, 0);\n\t}\n\n\tint cat_;      // memory category, for accounting purposes\n\tint allocCat_; // category at time of allocation\n\tT *list_;      // list pointer, returned from new[]\n\tsize_t sz_;    // capacity\n\tsize_t cur_;   // occupancy (AKA size)\n};\n\n/**\n * An ELList<T> is an expandable list of lists with these features:\n *\n *  - Payload type of the inner list is a template parameter T.\n *  - Initial size can be specified at construction time, otherwise\n *    default of 128 is used.\n *  - When allocated initially or when expanding, the new[] operator is\n *    used, which in turn calls the default constructor for EList<T>.\n *  - Upon expansion, instead of copies, xfer is used.\n *  - When the ELList<T> is resized to a smaller size (or cleared,\n *    which is like resizing to size 0), the underlying container is\n *    not reshaped.  
Thus, ELLists<T>s never release memory before\n *    destruction.\n *\n * And these requirements:\n *\n *  - Payload type T must have a default constructor.\n *\n */\ntemplate <typename T, int S1 = 128, int S2 = 128>\nclass ELList {\n\npublic:\n\n\t/**\n\t * Allocate initial default of 128 elements.\n\t */\n\texplicit ELList(int cat = 0) :\n\t\tcat_(cat), list_(NULL), sz_(S2), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Initially allocate given number of elements; should be > 0.\n\t */\n\texplicit ELList(size_t isz, int cat = 0) :\n\t\tcat_(cat), list_(NULL), sz_(isz), cur_(0)\n\t{\n\t\tassert_gt(isz, 0);\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Copy from another ELList using operator=.\n\t */\n\tELList(const ELList<T, S1, S2>& o) :\n\t\tcat_(0), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Copy from another ELList using operator=.\n\t */\n\texplicit ELList(const ELList<T, S1, S2>& o, int cat) :\n\t\tcat_(cat), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~ELList() { free(); }\n\n\t/**\n\t * Make this object into a copy of o by allocating enough memory to\n\t * fit the number of elements in o (note: the number of elements\n\t * may be substantially less than the memory allocated in o) and\n\t * using operator= to copy them over.\n\t */\n\tELList<T, S1, S2>& operator=(const ELList<T, S1, S2>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tif(list_ == NULL) {\n\t\t\tlazyInit();\n\t\t}\n\t\tif(o.cur_ == 0) {\n\t\t\tcur_ = 0;\n\t\t\treturn *this;\n\t\t}\n\t\tif(sz_ < o.cur_) expandNoCopy(o.cur_ + 1);\n\t\tassert_geq(sz_, o.cur_);\n\t\tcur_ = o.cur_;\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t// Note: using operator=, not xfer\n\t\t\tassert_eq(list_[i].cat(), o.list_[i].cat());\n\t\t\tlist_[i] = o.list_[i];\n\t\t}\n\t\treturn *this;\n\t}\n\t\n\t/**\n\t * Transfer the guts of another EList into this one without using\n\t * operator=, etc.  
We have to set o's list_ field to NULL to\n\t * keep o's destructor from deleting list_ out from under us.\n\t */\n\tvoid xfer(ELList<T, S1, S2>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tlist_ = o.list_; // list_ is an array of EList<T>s\n\t\tsz_   = o.sz_;\n\t\tcur_  = o.cur_;\n\t\to.list_ = NULL;\n\t\to.sz_ = o.cur_ = 0;\n\t}\n\n\t/**\n\t * Return number of elements.\n\t */\n\tinline size_t size() const { return cur_; }\n\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tinline bool empty() const { return cur_ == 0; }\n\n\t/**\n\t * Return true iff list hasn't been initialized yet.\n\t */\n\tinline bool null() const { return list_ == NULL; }\n\n\t/**\n\t * Add an element to the back.  No initialization is done.\n\t */\n\tvoid expand() {\n\t\tif(list_ == NULL) lazyInit();\n\t\tif(cur_ == sz_) expandCopy(sz_+1);\n\t\tcur_++;\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) {\n\t\t\texpandCopy(sz);\n\t\t}\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * Make the stack empty.\n\t */\n\tvoid clear() {\n\t\tcur_ = 0; // re-use stack memory\n\t\t// Don't clear heap; re-use it\n\t}\n\n\t/**\n\t * Get the element on the top of the stack.\n\t */\n\tinline EList<T, S1>& back() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the element on the top of the stack, const version.\n\t */\n\tinline const EList<T, S1>& back() const {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the frontmost element (bottom of stack).\n\t */\n\tinline EList<T, S1>& front() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Get the element on the bottom of the stack, const version.\n\t */\n\tinline const EList<T, S1>& front() const {\n\t\t// Index directly; calling front() here would recurse forever\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Return a reference to the 
ith element.\n\t */\n\tinline EList<T, S1>& operator[](size_t i) {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const EList<T, S1>& operator[](size_t i) const {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline EList<T, S1>& get(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const EList<T, S1>& get(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tEList<T, S1>& getSlow(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tconst EList<T, S1>& getSlow(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a pointer to the beginning of the buffer.\n\t */\n\tEList<T, S1> *ptr() { return list_; }\n\t\n\t/**\n\t * Set the memory category for this object and all children.\n\t */\n\tvoid setCat(int cat) {\n\t\tassert_gt(cat, 0);\n\t\tcat_ = cat;\n\t\tif(cat_ != 0) {\n\t\t\tfor(size_t i = 0; i < sz_; i++) {\n\t\t\t\tassert(list_[i].null());\n\t\t\t\tlist_[i].setCat(cat_);\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Return memory category.\n\t */\n\tint cat() const { return cat_; }\n\nprotected:\n\n\t/**\n\t * Initialize memory for EList.\n\t */\n\tvoid lazyInit() {\n\t\tassert(list_ == NULL);\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  
Also,\n\t * tally into the global memory tally.\n\t */\n\tEList<T, S1> *alloc(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tEList<T, S1> *tmp = new EList<T, S1>[sz];\n\t\tgMemTally.add(cat_, sz);\n\t\tif(cat_ != 0) {\n\t\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\t\tassert(tmp[i].ptr() == NULL);\n\t\t\t\ttmp[i].setCat(cat_);\n\t\t\t}\n\t\t}\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Free the list_ buffer, if allocated, and subtract its capacity from\n\t * the global memory tally.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tdelete[] list_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t}\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Capacity roughly doubles with each expansion.  Move old contents\n\t * into the new buffer using xfer.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tEList<T, S1>* tmp = alloc(newsz);\n\t\tif(list_ != NULL) {\n\t\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t\tassert_eq(cat_, tmp[i].cat());\n\t\t\t\ttmp[i].xfer(list_[i]);\n\t\t\t\tassert_eq(cat_, tmp[i].cat());\n\t\t\t}\n\t\t\tfree();\n\t\t}\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Capacity roughly doubles with each expansion.  
Don't copy old contents over.\n\t */\n\tvoid expandNoCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tfree();\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tEList<T, S1>* tmp = alloc(newsz);\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tassert_gt(sz_, 0);\n\t}\n\n\tint cat_;    // memory category, for accounting purposes\n\tEList<T, S1> *list_; // list pointer, returned from new[]\n\tsize_t sz_;  // capacity\n\tsize_t cur_; // occupancy (AKA size)\n\n};\n\n/**\n * An ELLList<T> is an expandable list of lists of lists with these\n * features:\n *\n *  - Payload type of the innermost list is a template parameter T.\n *  - Initial size can be specified at construction time, otherwise\n *    default of 128 is used.\n *  - When allocated initially or when expanding, the new[] operator is\n *    used, which in turn calls the default constructor for ELList<T>.\n *  - Upon expansion, instead of copies, xfer is used.\n *  - When the ELLList<T> is resized to a smaller size (or cleared,\n *    which is like resizing to size 0), the underlying container is\n *    not reshaped.  
Thus, ELLLists<T>s never release memory before\n *    destruction.\n *\n * And these requirements:\n *\n *  - Payload type T must have a default constructor.\n *\n */\ntemplate <typename T, int S1 = 128, int S2 = 128, int S3 = 128>\nclass ELLList {\n\npublic:\n\n\t/**\n\t * Allocate initial default of 128 elements.\n\t */\n\texplicit ELLList(int cat = 0) :\n\t\tcat_(cat), list_(NULL), sz_(S3), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Initially allocate given number of elements; should be > 0.\n\t */\n\texplicit ELLList(size_t isz, int cat = 0) :\n\t\tcat_(cat), list_(NULL), sz_(isz), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t\tassert_gt(isz, 0);\n\t}\n\n\t/**\n\t * Copy from another ELLList using operator=.\n\t */\n\tELLList(const ELLList<T, S1, S2, S3>& o) :\n\t\tcat_(0), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Copy from another ELLList using operator=.\n\t */\n\texplicit ELLList(const ELLList<T, S1, S2, S3>& o, int cat) :\n\t\tcat_(cat), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~ELLList() { free(); }\n\n\t/**\n\t * Make this object into a copy of o by allocating enough memory to\n\t * fit the number of elements in o (note: the number of elements\n\t * may be substantially less than the memory allocated in o) and\n\t * using operator= to copy them over.\n\t */\n\tELLList<T, S1, S2, S3>& operator=(const ELLList<T, S1, S2, S3>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tif(list_ == NULL) lazyInit();\n\t\tif(o.cur_ == 0) {\n\t\t\tcur_ = 0;\n\t\t\treturn *this;\n\t\t}\n\t\tif(sz_ < o.cur_) expandNoCopy(o.cur_ + 1);\n\t\tassert_geq(sz_, o.cur_);\n\t\tcur_ = o.cur_;\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t// Note: using operator=, not xfer\n\t\t\tassert_eq(list_[i].cat(), o.list_[i].cat());\n\t\t\tlist_[i] = o.list_[i];\n\t\t}\n\t\treturn *this;\n\t}\n\t\n\t/**\n\t * Transfer the guts of another EList into this one without using\n\t * operator=, 
etc.  We have to set o's list_ field to NULL to\n\t * keep o's destructor from deleting list_ out from under us.\n\t */\n\tvoid xfer(ELLList<T, S1, S2, S3>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tlist_ = o.list_; // list_ is an array of ELList<T>s\n\t\tsz_   = o.sz_;\n\t\tcur_  = o.cur_;\n\t\to.list_ = NULL;\n\t\to.sz_ = o.cur_ = 0;\n\t}\n\n\t/**\n\t * Return number of elements.\n\t */\n\tinline size_t size() const { return cur_; }\n\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tinline bool empty() const { return cur_ == 0; }\n\n\t/**\n\t * Return true iff list hasn't been initialized yet.\n\t */\n\tinline bool null() const { return list_ == NULL; }\n\n\t/**\n\t * Add an element to the back.  No initialization is done.\n\t */\n\tvoid expand() {\n\t\tif(list_ == NULL) lazyInit();\n\t\tif(cur_ == sz_) expandCopy(sz_+1);\n\t\tcur_++;\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) expandCopy(sz);\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * Make the stack empty.\n\t */\n\tvoid clear() {\n\t\tcur_ = 0; // re-use stack memory\n\t\t// Don't clear heap; re-use it\n\t}\n\n\t/**\n\t * Get the element on the top of the stack.\n\t */\n\tinline ELList<T, S1, S2>& back() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the element on the top of the stack, const version.\n\t */\n\tinline const ELList<T, S1, S2>& back() const {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the frontmost element (bottom of stack).\n\t */\n\tinline ELList<T, S1, S2>& front() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Get the element on the bottom of the stack, const version.\n\t */\n\tinline const ELList<T, S1, S2>& front() const {\n\t\t// Index directly; calling front() here would recurse forever\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Return a 
reference to the ith element.\n\t */\n\tinline ELList<T, S1, S2>& operator[](size_t i) {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const ELList<T, S1, S2>& operator[](size_t i) const {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline ELList<T, S1, S2>& get(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const ELList<T, S1, S2>& get(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tELList<T, S1, S2>& getSlow(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tconst ELList<T, S1, S2>& getSlow(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a pointer to the beginning of the buffer.\n\t */\n\tELList<T, S1, S2> *ptr() { return list_; }\n\n\t/**\n\t * Set the memory category for this object and all children.\n\t */\n\tvoid setCat(int cat) {\n\t\tassert_gt(cat, 0);\n\t\tcat_ = cat;\n\t\tif(cat_ != 0) {\n\t\t\tfor(size_t i = 0; i < sz_; i++) {\n\t\t\t\tassert(list_[i].null());\n\t\t\t\tlist_[i].setCat(cat_);\n\t\t\t}\n\t\t}\n\t}\n\t\n\t/**\n\t * Return memory category.\n\t */\n\tint cat() const { return cat_; }\n\nprotected:\n\n\t/**\n\t * Initialize memory for EList.\n\t */\n\tvoid lazyInit() {\n\t\tassert(null());\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  
Also,\n\t * tally into the global memory tally.\n\t */\n\tELList<T, S1, S2> *alloc(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tELList<T, S1, S2> *tmp = new ELList<T, S1, S2>[sz];\n\t\tgMemTally.add(cat_, sz);\n\t\tif(cat_ != 0) {\n\t\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\t\tassert(tmp[i].ptr() == NULL);\n\t\t\t\ttmp[i].setCat(cat_);\n\t\t\t}\n\t\t}\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Free the list_ buffer, if allocated, and subtract its capacity from\n\t * the global memory tally.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tdelete[] list_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t}\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Capacity roughly doubles with each expansion.  Move old contents\n\t * into the new buffer using xfer.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tELList<T, S1, S2>* tmp = alloc(newsz);\n\t\tif(list_ != NULL) {\n\t\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t\tassert_eq(cat_, tmp[i].cat());\n\t\t\t\ttmp[i].xfer(list_[i]);\n\t\t\t\tassert_eq(cat_, tmp[i].cat());\n\t\t\t}\n\t\t\tfree();\n\t\t}\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Capacity roughly doubles with each expansion.  
Don't copy old contents over.\n\t */\n\tvoid expandNoCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tfree();\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tELList<T, S1, S2>* tmp = alloc(newsz);\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tassert_gt(sz_, 0);\n\t}\n\n\tint cat_;    // memory category, for accounting purposes\n\tELList<T, S1, S2> *list_; // list pointer, returned from new[]\n\tsize_t sz_;  // capacity\n\tsize_t cur_; // occupancy (AKA size)\n\n};\n\n/**\n * Expandable set using a heap-allocated sorted array.\n *\n * Note that the copy constructor and operator= routines perform\n * shallow copies (w/ memcpy).\n */\ntemplate <typename T>\nclass ESet {\npublic:\n\n\t/**\n\t * Allocate initial default of 128 elements.\n\t */\n\tESet(int cat = 0) :\n\t\tcat_(cat),\n\t\tlist_(NULL),\n\t\tsz_(0),\n\t\tcur_(0)\n\t{\n\t\tif(sz_ > 0) {\n\t\t\tlist_ = alloc(sz_);\n\t\t}\n\t}\n\n\t/**\n\t * Initially allocate given number of elements; should be > 0.\n\t */\n\tESet(size_t isz, int cat = 0) :\n\t\tcat_(cat),\n\t\tlist_(NULL),\n\t\tsz_(isz),\n\t\tcur_(0)\n\t{\n\t\tassert_gt(isz, 0);\n\t\tif(sz_ > 0) {\n\t\t\tlist_ = alloc(sz_);\n\t\t}\n\t}\n\n\t/**\n\t * Copy from another ESet.\n\t */\n\tESet(const ESet<T>& o, int cat = 0) :\n\t\tcat_(cat), list_(NULL)\n\t{\n\t\tassert_eq(cat_, o.cat());\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~ESet() { free(); }\n\n\t/**\n\t * Copy contents of given ESet into this ESet.\n\t */\n\tESet& operator=(const ESet<T>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tsz_ = o.sz_;\n\t\tcur_ = o.cur_;\n\t\tfree();\n\t\tif(sz_ > 0) {\n\t\t\tlist_ = alloc(sz_);\n\t\t\tmemcpy(list_, o.list_, cur_ * sizeof(T));\n\t\t} else {\n\t\t\tlist_ = NULL;\n\t\t}\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Return number of elements.\n\t */\n\tsize_t size() const { return cur_; }\n\n\t/**\n\t * Return the total size in bytes occupied by this set.\n\t */\n\tsize_t totalSizeBytes() const 
{\n\t\treturn sizeof(int) + cur_ * sizeof(T) + 2 * sizeof(size_t);\n\t}\n\n\t/**\n\t * Return the total capacity in bytes occupied by this set.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn sizeof(int) + sz_ * sizeof(T) + 2 * sizeof(size_t);\n\t}\n\t\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tbool empty() const { return cur_ == 0; }\n\n\t/**\n\t * Return true iff list isn't initialized yet.\n\t */\n\tbool null() const { return list_ == NULL; }\n\n\t/**\n\t * Insert a new element into the set in sorted order.\n\t */\n\tbool insert(const T& el) {\n\t\tsize_t i = 0;\n\t\tif(cur_ == 0) {\n\t\t\tinsert(el, 0);\n\t\t\treturn true;\n\t\t}\n\t\tif(cur_ < 16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el);\n\t\t}\n\t\tif(i < cur_ && list_[i] == el) return false;\n\t\tinsert(el, i);\n\t\treturn true;\n\t}\n\n\t/**\n\t * Return true iff this set contains 'el'.\n\t */\n\tbool contains(const T& el) const {\n\t\tif(cur_ == 0) {\n\t\t\treturn false;\n\t\t}\n\t\telse if(cur_ == 1) {\n\t\t\treturn el == list_[0];\n\t\t}\n\t\tsize_t i;\n\t\tif(cur_ < 16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el);\n\t\t}\n\t\treturn i != cur_ && list_[i] == el;\n\t}\n\n\t/**\n\t * Remove element from set.\n\t */\n\tvoid remove(const T& el) {\n\t\tsize_t i;\n\t\tif(cur_ < 16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el);\n\t\t}\n\t\tassert(i != cur_ && list_[i] == el);\n\t\terase(i);\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz <= cur_) return;\n\t\tif(sz_ < sz) expandCopy(sz);\n\t}\n\n\t/**\n\t * Clear set without deallocating (or setting) anything.\n\t */\n\tvoid clear() { cur_ = 0; }\n\n\t/**\n\t * Return memory category.\n\t 
*/\n\tint cat() const { return cat_; }\n\t\n\t/**\n\t * Set the memory category for this object.\n\t */\n\tvoid setCat(int cat) {\n\t\tcat_ = cat;\n\t}\n\n\t/**\n\t * Transfer the guts of another EList into this one without using\n\t * operator=, etc.  We have to set EList o's list_ field to NULL to\n\t * avoid o's destructor from deleting list_ out from under us.\n\t */\n\tvoid xfer(ESet<T>& o) {\n\t\t// What does it mean to transfer to a different-category list?\n\t\tassert_eq(cat_, o.cat());\n\t\t// Can only transfer into an empty object\n\t\tfree();\n\t\tlist_ = o.list_;\n\t\tsz_ = o.sz_;\n\t\tcur_ = o.cur_;\n\t\to.list_ = NULL;\n\t\to.sz_ = o.cur_ = 0;\n\t}\n\n\t/**\n\t * Return a pointer to the beginning of the buffer.\n\t */\n\tT *ptr() { return list_; }\n\n\t/**\n\t * Return a const pointer to the beginning of the buffer.\n\t */\n\tconst T *ptr() const { return list_; }\n\nprivate:\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  Also,\n\t * tally into the global memory tally.\n\t */\n\tT *alloc(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tT *tmp = new T[sz];\n\t\tgMemTally.add(cat_, sz);\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  Also,\n\t * tally into the global memory tally.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tdelete[] list_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t}\n\t}\n\n\t/**\n\t * Simple linear scan that returns the index of the first element\n\t * of list_ that is not less than el, or cur_ if all elements are\n\t * less than el.\n\t */\n\tsize_t scanLoBound(const T& el) const {\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\tif(!(list_[i] < el)) {\n\t\t\t\t// Shouldn't be equal\n\t\t\t\treturn i;\n\t\t\t}\n\t\t}\n\t\treturn cur_;\n\t}\n\n\t/**\n\t * Perform a binary search for the first element that is not less\n\t * than 'el'.  
Return cur_ if all elements are less than el.\n\t */\n\tsize_t bsearchLoBound(const T& el) const {\n\t\tsize_t hi = cur_;\n\t\tsize_t lo = 0;\n\t\twhile(true) {\n\t\t\tif(lo == hi) {\n#ifndef NDEBUG\n\t\t\t\tif((rand() % 10) == 0) {\n\t\t\t\t\tassert_eq(lo, scanLoBound(el));\n\t\t\t\t}\n#endif\n\t\t\t\treturn lo;\n\t\t\t}\n\t\t\tsize_t mid = lo + ((hi-lo)>>1);\n\t\t\tassert_neq(mid, hi);\n\t\t\tif(list_[mid] < el) {\n\t\t\t\tif(lo == mid) {\n#ifndef NDEBUG\n\t\t\t\t\tif((rand() % 10) == 0) {\n\t\t\t\t\t\tassert_eq(hi, scanLoBound(el));\n\t\t\t\t\t}\n#endif\n\t\t\t\t\treturn hi;\n\t\t\t\t}\n\t\t\t\tlo = mid;\n\t\t\t} else {\n\t\t\t\thi = mid;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Return true if sorted, assert otherwise.\n\t */\n\tbool sorted() const {\n\t\tif(cur_ <= 1) return true;\n#ifndef NDEBUG\n\t\tif((rand() % 20) == 0) {\n\t\t\tfor(size_t i = 0; i < cur_-1; i++) {\n\t\t\t\tassert(list_[i] < list_[i+1]);\n\t\t\t}\n\t\t}\n#endif\n\t\treturn true;\n\t}\n\n\t/**\n\t * Insert value 'el' at offset 'idx'.  
It's OK to insert at cur_,\n\t * which is equivalent to appending.\n\t */\n\tvoid insert(const T& el, size_t idx) {\n\t\tassert_leq(idx, cur_);\n\t\tif(cur_ == sz_) {\n\t\t\texpandCopy(sz_+1);\n\t\t\tassert(sorted());\n\t\t}\n\t\tfor(size_t i = cur_; i > idx; i--) {\n\t\t\tlist_[i] = list_[i-1];\n\t\t}\n\t\tlist_[idx] = el;\n\t\tcur_++;\n\t\tassert(sorted());\n\t}\n\n\t/**\n\t * Erase element at offset idx.\n\t */\n\tvoid erase(size_t idx) {\n\t\tassert_lt(idx, cur_);\n\t\tfor(size_t i = idx; i < cur_-1; i++) {\n\t\t\tlist_[i] = list_[i+1];\n\t\t}\n\t\tcur_--;\n\t\tassert(sorted());\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Expansions are quadratic.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) {\n\t\t\tnewsz *= 2;\n\t\t}\n\t\tT* tmp = alloc(newsz);\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\ttmp[i] = list_[i];\n\t\t}\n\t\tfree();\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t}\n\n\tint cat_;    // memory category, for accounting purposes\n\tT *list_;    // list pointer, returned from new[]\n\tsize_t sz_;  // capacity\n\tsize_t cur_; // occupancy (AKA size)\n};\n\ntemplate <typename T, int S = 128>\nclass ELSet {\n\npublic:\n\n\t/**\n\t * Allocate initial default of 128 elements.\n\t */\n\texplicit ELSet(int cat = 0) :\n\t\tcat_(cat), list_(NULL), sz_(S), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Initially allocate given number of elements; should be > 0.\n\t */\n\texplicit ELSet(size_t isz, int cat = 0) :\n\t\tcat_(cat), list_(NULL), sz_(isz), cur_(0)\n\t{\n\t\tassert_gt(isz, 0);\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Copy from another ELList using operator=.\n\t */\n\tELSet(const ELSet<T, S>& o) :\n\t\tcat_(0), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Copy from another ELList using operator=.\n\t */\n\texplicit ELSet(const ELSet<T, S>& o, int cat) :\n\t\tcat_(cat), list_(NULL), 
sz_(0), cur_(0)\n\t{\n\t\t*this = o;\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~ELSet() { free(); }\n\n\t/**\n\t * Make this object into a copy of o by allocating enough memory to\n\t * fit the number of elements in o (note: the number of elements\n\t * may be substantially less than the memory allocated in o) and\n\t * using operator= to copy them over.\n\t */\n\tELSet<T, S>& operator=(const ELSet<T, S>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tif(list_ == NULL) {\n\t\t\tlazyInit();\n\t\t}\n\t\tif(o.cur_ == 0) {\n\t\t\tcur_ = 0;\n\t\t\treturn *this;\n\t\t}\n\t\tif(sz_ < o.cur_) expandNoCopy(o.cur_ + 1);\n\t\tassert_geq(sz_, o.cur_);\n\t\tcur_ = o.cur_;\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t// Note: using operator=, not xfer\n\t\t\tassert_eq(list_[i].cat(), o.list_[i].cat());\n\t\t\tlist_[i] = o.list_[i];\n\t\t}\n\t\treturn *this;\n\t}\n\t\n\t/**\n\t * Transfer the guts of another ESet into this one without using\n\t * operator=, etc.  We have to set ESet o's list_ field to NULL to\n\t * avoid o's destructor from deleting list_ out from under us.\n\t */\n\tvoid xfer(ELSet<T, S>& o) {\n\t\tassert_eq(cat_, o.cat());\n\t\tlist_ = o.list_; // list_ is an array of ESet<T>s\n\t\tsz_   = o.sz_;\n\t\tcur_  = o.cur_;\n\t\to.list_ = NULL;\n\t\to.sz_ = o.cur_ = 0;\n\t}\n\n\t/**\n\t * Return number of elements.\n\t */\n\tinline size_t size() const { return cur_; }\n\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tinline bool empty() const { return cur_ == 0; }\n\n\t/**\n\t * Return true iff list hasn't been initialized yet.\n\t */\n\tinline bool null() const { return list_ == NULL; }\n\n\t/**\n\t * Add an element to the back.  
No initialization is done.\n\t */\n\tvoid expand() {\n\t\tif(list_ == NULL) lazyInit();\n\t\tif(cur_ == sz_) expandCopy(sz_+1);\n\t\tcur_++;\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) {\n\t\t\texpandCopy(sz);\n\t\t}\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * Make the stack empty.\n\t */\n\tvoid clear() {\n\t\tcur_ = 0; // re-use stack memory\n\t\t// Don't clear heap; re-use it\n\t}\n\n\t/**\n\t * Get the element on the top of the stack.\n\t */\n\tinline ESet<T>& back() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the element on the top of the stack, const version.\n\t */\n\tinline const ESet<T>& back() const {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[cur_-1];\n\t}\n\n\t/**\n\t * Get the frontmost element (bottom of stack).\n\t */\n\tinline ESet<T>& front() {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Get the element on the bottom of the stack, const version.\n\t */\n\tinline const ESet<T>& front() const {\n\t\tassert_gt(cur_, 0);\n\t\treturn list_[0];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline ESet<T>& operator[](size_t i) {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const ESet<T>& operator[](size_t i) const {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline ESet<T>& get(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const ESet<T>& get(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  
This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tESet<T>& getSlow(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a reference to the ith element.  This version is not\n\t * inlined, which guarantees we can use it from the debugger.\n\t */\n\tconst ESet<T>& getSlow(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a pointer to the beginning of the buffer.\n\t */\n\tESet<T> *ptr() { return list_; }\n\n\t/**\n\t * Return a const pointer to the beginning of the buffer.\n\t */\n\tconst ESet<T> *ptr() const { return list_; }\n\n\t/**\n\t * Set the memory category for this object and all children.\n\t */\n\tvoid setCat(int cat) {\n\t\tassert_gt(cat, 0);\n\t\tcat_ = cat;\n\t\tif(cat_ != 0) {\n\t\t\tfor(size_t i = 0; i < sz_; i++) {\n\t\t\t\tassert(list_[i].null());\n\t\t\t\tlist_[i].setCat(cat_);\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Return memory category.\n\t */\n\tint cat() const { return cat_; }\n\nprotected:\n\n\t/**\n\t * Initialize memory for ELSet.\n\t */\n\tvoid lazyInit() {\n\t\tassert(list_ == NULL);\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  Also,\n\t * tally into the global memory tally.\n\t */\n\tESet<T> *alloc(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tESet<T> *tmp = new ESet<T>[sz];\n\t\tgMemTally.add(cat_, sz);\n\t\tif(cat_ != 0) {\n\t\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\t\tassert(tmp[i].ptr() == NULL);\n\t\t\t\ttmp[i].setCat(cat_);\n\t\t\t}\n\t\t}\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  Also,\n\t * tally into the global memory tally.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tdelete[] list_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t}\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Expansions are quadratic.  
Copy old contents into new buffer\n\t * using operator=.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tESet<T>* tmp = alloc(newsz);\n\t\tif(list_ != NULL) {\n\t\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t\tassert_eq(cat_, tmp[i].cat());\n\t\t\t\ttmp[i].xfer(list_[i]);\n\t\t\t\tassert_eq(cat_, tmp[i].cat());\n\t\t\t}\n\t\t\tfree();\n\t\t}\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Expansions are quadratic.  Don't copy old contents over.\n\t */\n\tvoid expandNoCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tfree();\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tESet<T>* tmp = alloc(newsz);\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tassert_gt(sz_, 0);\n\t}\n\n\tint cat_;    // memory category, for accounting purposes\n\tESet<T> *list_; // list pointer, returned from new[]\n\tsize_t sz_;  // capacity\n\tsize_t cur_; // occupancy (AKA size)\n\n};\n\n/**\n * Expandable map using a heap-allocated sorted array.\n *\n * Note that the copy constructor and operator= routines perform\n * shallow copies (w/ memcpy).\n */\ntemplate <typename K, typename V>\nclass EMap {\n\npublic:\n\n\t/**\n\t * Allocate initial default of 128 elements.\n\t */\n\tEMap(int cat = 0) :\n\t\tcat_(cat),\n\t\tlist_(NULL),\n\t\tsz_(128),\n\t\tcur_(0)\n\t{\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Initially allocate given number of elements; should be > 0.\n\t */\n\tEMap(size_t isz, int cat = 0) :\n\t\tcat_(cat),\n\t\tlist_(NULL),\n\t\tsz_(isz),\n\t\tcur_(0)\n\t{\n\t\tassert_gt(isz, 0);\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Copy from another ESet.\n\t */\n\tEMap(const EMap<K, V>& o) : list_(NULL) {\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~EMap() { free(); }\n\n\t/**\n\t * Copy contents of given 
ESet into this ESet.\n\t */\n\tEMap& operator=(const EMap<K, V>& o) {\n\t\tsz_ = o.sz_;\n\t\tcur_ = o.cur_;\n\t\tfree();\n\t\tlist_ = alloc(sz_);\n\t\tmemcpy(list_, o.list_, cur_ * sizeof(std::pair<K, V>));\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Return number of elements.\n\t */\n\tsize_t size() const { return cur_; }\n\t\n\t/**\n\t * Return the total size in bytes occupied by this map.\n\t */\n\tsize_t totalSizeBytes() const {\n\t\treturn \tsizeof(int) +\n\t\t        2 * sizeof(size_t) +\n\t\t\t\tcur_ * sizeof(std::pair<K, V>);\n\t}\n\n\t/**\n\t * Return the total capacity in bytes occupied by this map.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn \tsizeof(int) +\n\t\t        2 * sizeof(size_t) +\n\t\t\t\tsz_ * sizeof(std::pair<K, V>);\n\t}\n\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tbool empty() const { return cur_ == 0; }\n\n\t/**\n\t * Insert a new element into the set in sorted order.\n\t */\n\tbool insert(const std::pair<K, V>& el) {\n\t\tsize_t i = 0;\n\t\tif(cur_ == 0) {\n\t\t\tinsert(el, 0);\n\t\t\treturn true;\n\t\t}\n\t\tif(cur_ < 16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el.first);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el.first);\n\t\t}\n\t\tif(list_[i] == el) return false; // already there\n\t\tinsert(el, i);\n\t\treturn true; // not already there\n\t}\n\n\t/**\n\t * Return true iff this set contains 'el'.\n\t */\n\tbool contains(const K& el) const {\n\t\tif(cur_ == 0) return false;\n\t\telse if(cur_ == 1) return el == list_[0].first;\n\t\tsize_t i;\n\t\tif(cur_ < 16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el);\n\t\t}\n\t\treturn i != cur_ && list_[i].first == el;\n\t}\n\n\t/**\n\t * Return true iff this set contains 'el'.\n\t */\n\tbool containsEx(const K& el, size_t& i) const {\n\t\tif(cur_ == 0) return false;\n\t\telse if(cur_ == 1) {\n\t\t\ti = 0;\n\t\t\treturn el == list_[0].first;\n\t\t}\n\t\tif(cur_ < 
16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el);\n\t\t}\n\t\treturn i != cur_ && list_[i].first == el;\n\t}\n\n\t/**\n\t * Remove element from set.\n\t */\n\tvoid remove(const K& el) {\n\t\tsize_t i;\n\t\tif(cur_ < 16) {\n\t\t\t// Linear scan\n\t\t\ti = scanLoBound(el);\n\t\t} else {\n\t\t\t// Binary search\n\t\t\ti = bsearchLoBound(el);\n\t\t}\n\t\tassert(i != cur_ && list_[i].first == el);\n\t\terase(i);\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz <= cur_) return;\n\t\tif(sz_ < sz) expandCopy(sz);\n\t}\n\t\n\t/**\n\t * Get the ith key, value pair in the map.\n\t */\n\tconst std::pair<K, V>& get(size_t i) const {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\t\n\t/**\n\t * Get the ith key, value pair in the map.\n\t */\n\tconst std::pair<K, V>& operator[](size_t i) const {\n\t\treturn get(i);\n\t}\n\n\t/**\n\t * Clear set without deallocating (or setting) anything.\n\t */\n\tvoid clear() { cur_ = 0; }\n\nprivate:\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  Also,\n\t * tally into the global memory tally.\n\t */\n\tstd::pair<K, V> *alloc(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tstd::pair<K, V> *tmp = new std::pair<K, V>[sz];\n\t\tgMemTally.add(cat_, sz);\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Allocate a T array of length sz_ and store in list_.  
Also,\n\t * tally into the global memory tally.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tdelete[] list_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t}\n\t}\n\n\t/**\n\t * Simple linear scan that returns the index of the first element\n\t * of list_ that is not less than el, or cur_ if all elements are\n\t * less than el.\n\t */\n\tsize_t scanLoBound(const K& el) const {\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\tif(!(list_[i].first < el)) {\n\t\t\t\t// Shouldn't be equal\n\t\t\t\treturn i;\n\t\t\t}\n\t\t}\n\t\treturn cur_;\n\t}\n\n\t/**\n\t * Perform a binary search for the first element that is not less\n\t * than 'el'.  Return cur_ if all elements are less than el.\n\t */\n\tsize_t bsearchLoBound(const K& el) const {\n\t\tsize_t hi = cur_;\n\t\tsize_t lo = 0;\n\t\twhile(true) {\n\t\t\tif(lo == hi) {\n#ifndef NDEBUG\n\t\t\t\tif((rand() % 10) == 0) {\n\t\t\t\t\tassert_eq(lo, scanLoBound(el));\n\t\t\t\t}\n#endif\n\t\t\t\treturn lo;\n\t\t\t}\n\t\t\tsize_t mid = lo + ((hi-lo)>>1);\n\t\t\tassert_neq(mid, hi);\n\t\t\tif(list_[mid].first < el) {\n\t\t\t\tif(lo == mid) {\n#ifndef NDEBUG\n\t\t\t\t\tif((rand() % 10) == 0) {\n\t\t\t\t\t\tassert_eq(hi, scanLoBound(el));\n\t\t\t\t\t}\n#endif\n\t\t\t\t\treturn hi;\n\t\t\t\t}\n\t\t\t\tlo = mid;\n\t\t\t} else {\n\t\t\t\thi = mid;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Return true if sorted, assert otherwise.\n\t */\n\tbool sorted() const {\n\t\tif(cur_ <= 1) return true;\n#ifndef NDEBUG\n\t\tfor(size_t i = 0; i < cur_-1; i++) {\n\t\t\tassert(!(list_[i] == list_[i+1]));\n\t\t\tassert(list_[i] < list_[i+1]);\n\t\t}\n#endif\n\t\treturn true;\n\t}\n\n\t/**\n\t * Insert value 'el' at offset 'idx'.  
It's OK to insert at cur_,\n\t * which is equivalent to appending.\n\t */\n\tvoid insert(const std::pair<K, V>& el, size_t idx) {\n\t\tassert_leq(idx, cur_);\n\t\tif(cur_ == sz_) {\n\t\t\texpandCopy(sz_+1);\n\t\t}\n\t\tfor(size_t i = cur_; i > idx; i--) {\n\t\t\tlist_[i] = list_[i-1];\n\t\t}\n\t\tlist_[idx] = el;\n\t\tassert(idx == cur_ || list_[idx] < list_[idx+1]);\n\t\tcur_++;\n\t\tassert(sorted());\n\t}\n\n\t/**\n\t * Erase element at offset idx.\n\t */\n\tvoid erase(size_t idx) {\n\t\tassert_lt(idx, cur_);\n\t\tfor(size_t i = idx; i < cur_-1; i++) {\n\t\t\tlist_[i] = list_[i+1];\n\t\t}\n\t\tcur_--;\n\t\tassert(sorted());\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Expansions are quadratic.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = sz_ * 2;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\tstd::pair<K, V>* tmp = alloc(newsz);\n\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\ttmp[i] = list_[i];\n\t\t}\n\t\tfree();\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t}\n\n\tint cat_;    // memory category, for accounting purposes\n\tstd::pair<K, V> *list_; // list pointer, returned from new[]\n\tsize_t sz_;  // capacity\n\tsize_t cur_; // occupancy (AKA size)\n};\n\n/**\n * A class that allows callers to create objects that are referred to by ID.\n * Objects should not be referred to via pointers or references, since they\n * are stored in an expandable buffer that might be resized and thereby moved\n * to another address.\n */\ntemplate <typename T, int S = 128>\nclass EFactory {\n\npublic:\n\n\texplicit EFactory(size_t isz, int cat = 0) : l_(isz, cat) { }\n\t\n\texplicit EFactory(int cat = 0) : l_(cat) { }\n\t\n\t/**\n\t * Clear the list.\n\t */\n\tvoid clear() {\n\t\tl_.clear();\n\t}\n\t\n\t/**\n\t * Add one additional item to the list and return its ID.\n\t */\n\tsize_t alloc() {\n\t\tl_.expand();\n\t\treturn l_.size()-1;\n\t}\n\t\n\t/**\n\t * Return the number of items in the 
list.\n\t */\n\tsize_t size() const {\n\t\treturn l_.size();\n\t}\n\n\t/**\n\t * Return the total size in bytes occupied by this factory.\n\t */\n\tsize_t totalSizeBytes() const {\n\t\treturn l_.totalSizeBytes();\n\t}\n\n\t/**\n\t * Return the total capacity in bytes occupied by this factory.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn \tl_.totalCapacityBytes();\n\t}\n    \n    /**\n     * Resize the list.\n     */\n    void resize(size_t sz) {\n        l_.resize(sz);\n    }\n\n\t/**\n\t * Return true iff the list is empty.\n\t */\n\tbool empty() const {\n\t\treturn size() == 0;\n\t}\n\t\n\t/**\n\t * Shrink the list such that the topmost (most recently allocated) element\n\t * is removed.\n\t */\n\tvoid pop() {\n\t\tl_.resize(l_.size()-1);\n\t}\n\t\n\t/**\n\t * Return mutable list item at offset 'off'\n\t */\n\tT& operator[](size_t off) {\n\t\treturn l_[off];\n\t}\n\n\t/**\n\t * Return immutable list item at offset 'off'\n\t */\n\tconst T& operator[](size_t off) const {\n\t\treturn l_[off];\n\t}\n\nprotected:\n\n\tEList<T, S> l_;\n};\n\n/**\n * An expandable bit vector based on EList\n */\ntemplate <int S = 128>\nclass EBitList {\n\npublic:\n\n\texplicit EBitList(size_t isz, int cat = 0) : l_(isz, cat) { reset(); }\n\t\n\t//explicit EBitList(int cat = 0) : l_(cat) { reset(); }\n\n\t/**\n\t * Reset to empty state.\n\t */\n\tvoid clear() {\n\t\treset();\n\t}\n\t\n\t/**\n\t * Reset to empty state.\n\t */\n\tvoid reset() {\n\t\tl_.clear();\n\t\tmax_ = std::numeric_limits<size_t>::max();\n\t}\n\n\t/**\n\t * Set the bit at offset 'off', growing the list if necessary.\n\t */\n\tvoid set(size_t off) {\n\t\tresize(off);\n\t\tl_[off >> 3] |= (1 << (off & 7));\n\t\tif(off > max_ || max_ == std::numeric_limits<size_t>::max()) {\n\t\t\tmax_ = off;\n\t\t}\n\t}\n\n\t/**\n\t * Return true iff the bit at offset 'off' is set.\n\t */\n\tbool test(size_t off) const {\n\t\tif((size_t)(off >> 3) >= l_.size()) {\n\t\t\treturn false;\n\t\t}\n\t\treturn (l_[off >> 3] & (1 << (off & 7))) != 0;\n\t}\n\t\n\t/**\n\t * Return size of the underlying 
byte array.\n\t */\n\tsize_t size() const {\n\t\treturn l_.size();\n\t}\n\t\n\t/**\n\t * Resize so that the bit at offset 'off' is addressable.  Newly\n\t * added bytes are zeroed.\n\t */\n\tvoid resize(size_t off) {\n\t\tif((size_t)(off >> 3) >= l_.size()) {\n\t\t\tsize_t oldsz = l_.size();\n\t\t\tl_.resize((off >> 3) + 1);\n\t\t\tfor(size_t i = oldsz; i < l_.size(); i++) {\n\t\t\t\tl_[i] = 0;\n\t\t\t}\n\t\t}\n\t}\n\t\n\t/**\n\t * Return the offset of the highest set bit, or\n\t * std::numeric_limits<size_t>::max() if no bit has been set.\n\t */\n\tsize_t max() const {\n\t\treturn max_;\n\t}\n\nprotected:\n\n\tEList<uint8_t, S> l_;\n\tsize_t max_;\n};\n\n/**\n * Implements a min-heap.\n */\ntemplate <typename T, int S = 128>\nclass EHeap {\npublic:\n\n\t/**\n\t * Add the element to the next available leaf position and percolate up.\n\t */\n\tvoid insert(T o) {\n\t\tsize_t pos = l_.size();\n\t\tl_.push_back(o);\n\t\twhile(pos > 0) {\n\t\t\tsize_t parent = (pos-1) >> 1;\n\t\t\tif(l_[pos] < l_[parent]) {\n\t\t\t\tT tmp(l_[pos]);\n\t\t\t\tl_[pos] = l_[parent];\n\t\t\t\tl_[parent] = tmp;\n\t\t\t\tpos = parent;\n\t\t\t} else break;\n\t\t}\n\t\tassert(repOk());\n\t}\n\t\n\t/**\n\t * Return the topmost (minimum) element without removing it.\n\t */\n\tT top() {\n\t\tassert_gt(l_.size(), 0);\n\t\treturn l_[0];\n\t}\n\t\n\t/**\n\t * Remove and return the topmost (minimum) element.\n\t */\n\tT pop() {\n\t\tassert_gt(l_.size(), 0);\n\t\tT ret = l_[0];\n\t\tl_[0] = l_[l_.size()-1];\n\t\tl_.resize(l_.size()-1);\n\t\tsize_t cur = 0;\n\t\twhile(true) {\n\t\t\tsize_t c1 = ((cur+1) << 1) - 1;\n\t\t\tsize_t c2 = c1 + 1;\n\t\t\tif(c2 < l_.size()) {\n\t\t\t\tif(l_[c1] < l_[cur] && l_[c1] <= l_[c2]) {\n\t\t\t\t\tT tmp(l_[c1]);\n\t\t\t\t\tl_[c1] = l_[cur];\n\t\t\t\t\tl_[cur] = tmp;\n\t\t\t\t\tcur = c1;\n\t\t\t\t} else if(l_[c2] < l_[cur]) {\n\t\t\t\t\tT tmp(l_[c2]);\n\t\t\t\t\tl_[c2] = l_[cur];\n\t\t\t\t\tl_[cur] = tmp;\n\t\t\t\t\tcur = c2;\n\t\t\t\t} else {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t} else if(c1 < l_.size()) {\n\t\t\t\tif(l_[c1] < l_[cur]) {\n\t\t\t\t\tT tmp(l_[c1]);\n\t\t\t\t\tl_[c1] = l_[cur];\n\t\t\t\t\tl_[cur] = tmp;\n\t\t\t\t\tcur = 
c1;\n\t\t\t\t} else {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t\tassert(repOk());\n\t\treturn ret;\n\t}\n\t\n\t/**\n\t * Return number of elements in the heap.\n\t */\n\tsize_t size() const {\n\t\treturn l_.size();\n\t}\n\n\t/**\n\t * Return the total size in bytes occupied by this heap.\n\t */\n\tsize_t totalSizeBytes() const {\n\t\treturn \tl_.totalSizeBytes();\n\t}\n\n\t/**\n\t * Return the total capacity in bytes occupied by this heap.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn \tl_.totalCapacityBytes();\n\t}\n\t\n\t/**\n\t * Return true when heap is empty.\n\t */\n\tbool empty() const {\n\t\treturn l_.empty();\n\t}\n\t\n\t/**\n\t * Return element at offset i.\n\t */\n\tconst T& operator[](size_t i) const {\n\t\treturn l_[i];\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that heap property holds.\n\t */\n\tbool repOk() const {\n\t\tif(empty()) return true;\n\t\treturn repOkNode(0);\n\t}\n\n\t/**\n\t * Check that heap property holds at and below this node.\n\t */\n\tbool repOkNode(size_t cur) const {\n        size_t c1 = ((cur+1) << 1) - 1;\n        size_t c2 = c1 + 1;\n\t\tif(c1 < l_.size()) {\n\t\t\tassert_leq(l_[cur], l_[c1]);\n\t\t}\n\t\tif(c2 < l_.size()) {\n\t\t\tassert_leq(l_[cur], l_[c2]);\n\t\t}\n\t\tif(c2 < l_.size()) {\n\t\t\treturn repOkNode(c1) && repOkNode(c2);\n\t\t} else if(c1 < l_.size()) {\n\t\t\treturn repOkNode(c1);\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Clear the heap so that it's empty.\n\t */\n\tvoid clear() {\n\t\tl_.clear();\n\t}\n\nprotected:\n\n\tEList<T, S> l_;\n};\n\n/**\n * Dispenses pages of memory for all the lists in the cache, including\n * the sequence-to-range map, the range list, the edits list, and the\n * offsets list.  
All lists contend for the same pool of memory.\n */\nclass Pool {\npublic:\n\tPool(\n\t\tuint64_t bytes,\n\t\tuint32_t pagesz,\n\t\tint cat = 0) :\n\t\tcat_(cat),\n\t\tcur_(0),\n\t\tbytes_(bytes),\n\t\tpagesz_(pagesz),\n\t\tpages_(cat)\n\t{\n\t\tfor(size_t i = 0; i < ((bytes+pagesz-1)/pagesz); i++) {\n\t\t\tpages_.push_back(new uint8_t[pagesz]);\n\t\t\tgMemTally.add(cat, pagesz);\n\t\t\tassert(pages_.back() != NULL);\n\t\t}\n\t\tassert(repOk());\n\t}\n\t\n\t/**\n\t * Free each page.\n\t */\n\t~Pool() {\n\t\tfor(size_t i = 0; i < pages_.size(); i++) {\n\t\t\tassert(pages_[i] != NULL);\n\t\t\tdelete[] pages_[i];\n\t\t\tgMemTally.del(cat_, pagesz_);\n\t\t}\n\t}\n\n\t/**\n\t * Allocate one page, or return NULL if no pages are left.\n\t */\n\tuint8_t * alloc() {\n\t\tassert(repOk());\n\t\tif(cur_ == pages_.size()) return NULL;\n\t\treturn pages_[cur_++];\n\t}\n\n\t/**\n\t * Return true iff every page has been handed out.\n\t */\n\tbool full() { return cur_ == pages_.size(); }\n\n\t/**\n\t * Clear the pool so that no pages are considered allocated.\n\t */\n\tvoid clear() {\n\t\tcur_ = 0;\n\t\tassert(repOk());\n\t}\n\n\t/**\n\t * Intended to release pages back to the pool; currently a no-op.\n\t */\n\tvoid free() {\n\t\t// Currently a no-op because the only freeing method supported\n\t\t// now is to clear the entire pool\n\t}\n\n#ifndef NDEBUG\n\t/**\n\t * Check that pool is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert_leq(cur_, pages_.size());\n\t\tassert(!pages_.empty());\n\t\tassert_gt(bytes_, 0);\n\t\tassert_gt(pagesz_, 0);\n\t\treturn true;\n\t}\n#endif\n\npublic:\n\tint             cat_;    // memory category, for accounting purposes\n\tuint32_t        cur_;    // next page to hand out\n\tconst uint64_t  bytes_;  // total bytes in the pool\n\tconst uint32_t  pagesz_; // size of a single page\n\tEList<uint8_t*> pages_;  // the pages themselves\n};\n\n/**\n * An expandable list backed by a pool.\n */\ntemplate<typename T, int S>\nclass PList {\n\n#define PLIST_PER_PAGE (S / sizeof(T))\n\npublic:\n\t/**\n\t * Initialize the current-edit 
pointer to 0 and set the number of\n\t * edits per memory page.\n\t */\n\tPList(int cat = 0) :\n\t\tcur_(0),\n\t\tcurPage_(0),\n\t\tpages_(cat) { }\n\n\t/**\n\t * Add 1 object to the list.\n\t */\n\tbool add(Pool& p, const T& o) {\n\t\tassert(repOk());\n\t\tif(!ensure(p, 1)) return false;\n\t\tif(cur_ == PLIST_PER_PAGE) {\n\t\t\tcur_ = 0;\n\t\t\tcurPage_++;\n\t\t}\n\t\tassert_lt(curPage_, pages_.size());\n\t\tassert(repOk());\n\t\tassert_lt(cur_, PLIST_PER_PAGE);\n\t\tpages_[curPage_][cur_++] = o;\n\t\treturn true;\n\t}\n\n\t/**\n\t * Add a list of objects to the list.\n\t */\n\tbool add(Pool& p, const EList<T>& os) {\n\t\tif(!ensure(p, os.size())) return false;\n\t\tfor(size_t i = 0; i < os.size(); i++) {\n\t\t\tif(cur_ == PLIST_PER_PAGE) {\n\t\t\t\tcur_ = 0;\n\t\t\t\tcurPage_++;\n\t\t\t}\n\t\t\tassert_lt(curPage_, pages_.size());\n\t\t\tassert(repOk());\n\t\t\tassert_lt(cur_, PLIST_PER_PAGE);\n\t\t\tpages_[curPage_][cur_++] = os[i];\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Add a list of objects to the list.\n\t */\n\tbool copy(\n\t\tPool& p,\n\t\tconst PList<T, S>& src,\n\t\tsize_t i,\n\t\tsize_t len)\n\t{\n\t\tif(!ensure(p, src.size())) return false;\n\t\tfor(size_t i = 0; i < src.size(); i++) {\n\t\t\tif(cur_ == PLIST_PER_PAGE) {\n\t\t\t\tcur_ = 0;\n\t\t\t\tcurPage_++;\n\t\t\t}\n\t\t\tassert_lt(curPage_, pages_.size());\n\t\t\tassert(repOk());\n\t\t\tassert_lt(cur_, PLIST_PER_PAGE);\n\t\t\tpages_[curPage_][cur_++] = src[i];\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Add 'num' objects, all equal to 'o' to the list.\n\t */\n\tbool addFill(Pool& p, size_t num, const T& o) {\n\t\tif(!ensure(p, num)) return false;\n\t\tfor(size_t i = 0; i < num; i++) {\n\t\t\tif(cur_ == PLIST_PER_PAGE) {\n\t\t\t\tcur_ = 0;\n\t\t\t\tcurPage_++;\n\t\t\t}\n\t\t\tassert_lt(curPage_, pages_.size());\n\t\t\tassert(repOk());\n\t\t\tassert_lt(cur_, PLIST_PER_PAGE);\n\t\t\tpages_[curPage_][cur_++] = o;\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Free all pages associated with the 
list.\n\t */\n\tvoid clear() {\n\t\tpages_.clear();\n\t\tcur_ = curPage_ = 0;\n\t}\n\n#ifndef NDEBUG\n\t/**\n\t * Check that list is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert(pages_.size() == 0 || curPage_ < pages_.size());\n\t\tassert_leq(cur_, PLIST_PER_PAGE);\n\t\treturn true;\n\t}\n#endif\n\n\t/**\n\t * Return the number of elements in the list.\n\t */\n\tsize_t size() const {\n\t\treturn curPage_ * PLIST_PER_PAGE + cur_;\n\t}\n\t\n\t/**\n\t * Return true iff the PList has no elements.\n\t */\n\tbool empty() const {\n\t\treturn size() == 0;\n\t}\n\n\t/**\n\t * Get the ith element added to the list.\n\t */\n\tinline const T& getConst(size_t i) const {\n\t\tassert_lt(i, size());\n\t\tsize_t page = i / PLIST_PER_PAGE;\n\t\tsize_t elt = i % PLIST_PER_PAGE;\n\t\treturn pages_[page][elt];\n\t}\n\n\t/**\n\t * Get the ith element added to the list.\n\t */\n\tinline T& get(size_t i) {\n\t\tassert_lt(i, size());\n\t\tsize_t page = i / PLIST_PER_PAGE;\n\t\tsize_t elt = i % PLIST_PER_PAGE;\n\t\tassert_lt(page, pages_.size());\n\t\tassert(page < pages_.size()-1 || elt < cur_);\n\t\treturn pages_[page][elt];\n\t}\n\t\n\t/**\n\t * Get the most recently added element.\n\t */\n\tinline T& back() {\n\t\tsize_t page = (size()-1) / PLIST_PER_PAGE;\n\t\tsize_t elt = (size()-1) % PLIST_PER_PAGE;\n\t\tassert_lt(page, pages_.size());\n\t\tassert(page < pages_.size()-1 || elt < cur_);\n\t\treturn pages_[page][elt];\n\t}\n\t\n\t/**\n\t * Get const version of the most recently added element.\n\t */\n\tinline const T& back() const {\n\t\tsize_t page = (size()-1) / PLIST_PER_PAGE;\n\t\tsize_t elt = (size()-1) % PLIST_PER_PAGE;\n\t\tassert_lt(page, pages_.size());\n\t\tassert(page < pages_.size()-1 || elt < cur_);\n\t\treturn pages_[page][elt];\n\t}\n\n\t/**\n\t * Get the element most recently added to the list.\n\t */\n\tT& last() {\n\t\tassert(!pages_.empty());\n\t\tassert_gt(PLIST_PER_PAGE, 0);\n\t\tif(cur_ == 0) {\n\t\t\tassert_gt(pages_.size(), 1);\n\t\t\treturn 
pages_[pages_.size()-2][PLIST_PER_PAGE-1];\n\t\t} else {\n\t\t\treturn pages_.back()[cur_-1];\n\t\t}\n\t}\n\n\t/**\n\t * Return true iff 'num' additional objects will fit in the pages\n\t * allocated to the list.  If more pages are needed, they are\n\t * added if possible.\n\t */\n\tbool ensure(Pool& p, size_t num) {\n\t\tassert(repOk());\n\t\tif(num == 0) return true;\n\t\t// Allocation of the first page\n\t\tif(pages_.size() == 0) {\n\t\t\tif(expand(p) == NULL) {\n\t\t\t\treturn false;\n\t\t\t}\n\t\t\tassert_eq(1, pages_.size());\n\t\t}\n\t\tsize_t cur = cur_;\n\t\tsize_t curPage = curPage_;\n\t\twhile(cur + num > PLIST_PER_PAGE) {\n\t\t\tassert_lt(curPage, pages_.size());\n\t\t\tif(curPage == pages_.size()-1 && expand(p) == NULL) {\n\t\t\t\treturn false;\n\t\t\t}\n\t\t\tnum -= (PLIST_PER_PAGE - cur);\n\t\t\tcur = 0;\n\t\t\tcurPage++;\n\t\t}\n\t\treturn true;\n\t}\n\nprotected:\n\n\t/**\n\t * Expand our page supply by 1\n\t */\n\tT* expand(Pool& p) {\n\t\tT* newpage = (T*)p.alloc();\n\t\tif(newpage == NULL) {\n\t\t\treturn NULL;\n\t\t}\n\t\tpages_.push_back(newpage);\n\t\treturn pages_.back();\n\t}\n\n\tsize_t       cur_;     // current elt within page\n\tsize_t       curPage_; // current page\n\tEList<T*>    pages_;   // the pages\n};\n\n/**\n * A slice of an EList.\n */\ntemplate<typename T, int S>\nclass EListSlice {\n\npublic:\n\tEListSlice() :\n\t\ti_(0),\n\t\tlen_(0),\n\t\tlist_()\n\t{ }\n\n\tEListSlice(\n\t\tEList<T, S>& list,\n\t\tsize_t i,\n\t\tsize_t len) :\n\t\ti_(i),\n\t\tlen_(len),\n\t\tlist_(&list)\n\t{ }\n\t\n\t/**\n\t * Initialize from a piece of another PListSlice.\n\t */\n\tvoid init(const EListSlice<T, S>& sl, size_t first, size_t last) {\n\t\tassert_gt(last, first);\n\t\tassert_leq(last - first, sl.len_);\n\t\ti_ = sl.i_ + first;\n\t\tlen_ = last - first;\n\t\tlist_ = sl.list_;\n\t}\n\t\n\t/**\n\t * Reset state to be empty.\n\t */\n\tvoid reset() {\n\t\ti_ = len_ = 0;\n\t\tlist_ = NULL;\n\t}\n\t\n\t/**\n\t * Get the ith element of the 
slice.\n\t */\n\tinline const T& get(size_t i) const {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i + i_);\n\t}\n\n\t/**\n\t * Get the ith element of the slice.\n\t */\n\tinline T& get(size_t i) {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i + i_);\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline T& operator[](size_t i) {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i + i_);\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const T& operator[](size_t i) const {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i + i_);\n\t}\n\n\t/**\n\t * Return true iff this slice is initialized.\n\t */\n\tbool valid() const {\n\t\treturn len_ != 0;\n\t}\n\t\n\t/**\n\t * Return number of elements in the slice.\n\t */\n\tsize_t size() const {\n\t\treturn len_;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Ensure that the PListSlice is internally consistent and\n\t * consistent with the backing PList.\n\t */\n\tbool repOk() const {\n\t\tassert_leq(i_ + len_, list_->size());\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Return true iff this slice refers to the same slice of the same\n\t * list as the given slice.\n\t */\n\tbool operator==(const EListSlice& sl) const {\n\t\treturn i_ == sl.i_ && len_ == sl.len_ && list_ == sl.list_;\n\t}\n\n\t/**\n\t * Return false iff this slice refers to the same slice of the same\n\t * list as the given slice.\n\t */\n\tbool operator!=(const EListSlice& sl) const {\n\t\treturn !(*this == sl);\n\t}\n\t\n\t/**\n\t * Set the length.  This could leave things inconsistent (e.g. 
could\n\t * include elements that fall off the end of list_).\n\t */\n\tvoid setLength(size_t nlen) {\n\t\tlen_ = (uint32_t)nlen;\n\t}\n\t\nprotected:\n\tsize_t i_;\n\tsize_t len_;\n\tEList<T, S>* list_;\n};\n\n/**\n * A slice of a PList.\n */\ntemplate<typename T, int S>\nclass PListSlice {\n\npublic:\n\tPListSlice() :\n\t\ti_(0),\n\t\tlen_(0),\n\t\tlist_()\n\t{ }\n\n\tPListSlice(\n\t\tPList<T, S>& list,\n\t\tTIndexOffU i,\n\t\tTIndexOffU len) :\n\t\ti_(i),\n\t\tlen_(len),\n\t\tlist_(&list)\n\t{ }\n\t\n\t/**\n\t * Initialize from a piece of another PListSlice.\n\t */\n\tvoid init(const PListSlice<T, S>& sl, size_t first, size_t last) {\n\t\tassert_gt(last, first);\n\t\tassert_leq(last - first, sl.len_);\n\t\ti_ = (uint32_t)(sl.i_ + first);\n\t\tlen_ = (uint32_t)(last - first);\n\t\tlist_ = sl.list_;\n\t}\n\t\n\t/**\n\t * Reset state to be empty.\n\t */\n\tvoid reset() {\n\t\ti_ = len_ = 0;\n\t\tlist_ = NULL;\n\t}\n\t\n\t/**\n\t * Get the ith element of the slice.\n\t */\n\tinline const T& get(size_t i) const {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i+i_);\n\t}\n\n\t/**\n\t * Get the ith element of the slice.\n\t */\n\tinline T& get(size_t i) {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i+i_);\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline T& operator[](size_t i) {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i+i_);\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline const T& operator[](size_t i) const {\n\t\tassert(valid());\n\t\tassert_lt(i, len_);\n\t\treturn list_->get(i+i_);\n\t}\n\n\t/**\n\t * Return true iff this slice is initialized.\n\t */\n\tbool valid() const {\n\t\treturn len_ != 0;\n\t}\n\t\n\t/**\n\t * Return number of elements in the slice.\n\t */\n\tsize_t size() const {\n\t\treturn len_;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Ensure that the PListSlice is internally consistent and\n\t * consistent with the backing 
PList.\n\t */\n\tbool repOk() const {\n\t\tassert_leq(i_ + len_, list_->size());\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Return true iff this slice refers to the same slice of the same\n\t * list as the given slice.\n\t */\n\tbool operator==(const PListSlice& sl) const {\n\t\treturn i_ == sl.i_ && len_ == sl.len_ && list_ == sl.list_;\n\t}\n\n\t/**\n\t * Return false iff this slice refers to the same slice of the same\n\t * list as the given slice.\n\t */\n\tbool operator!=(const PListSlice& sl) const {\n\t\treturn !(*this == sl);\n\t}\n\t\n\t/**\n\t * Set the length.  This could leave things inconsistent (e.g. could\n\t * include elements that fall off the end of list_).\n\t */\n\tvoid setLength(size_t nlen) {\n\t\tlen_ = (uint32_t)nlen;\n\t}\n\t\nprotected:\n\tuint32_t i_;\n\tuint32_t len_;\n\tPList<T, S>* list_;\n};\n\n/**\n * A Red-Black tree node.  Links to parent & left and right children.\n * Key and Payload are of types K and P.  Node total ordering is based\n * on K's total ordering.  K must implement <, == and > operators.\n */\ntemplate<typename K, typename P> // K=key, P=payload\nclass RedBlackNode {\n\n\ttypedef RedBlackNode<K,P> TNode;\n\npublic:\n\tTNode *parent;  // parent\n\tTNode *left;    // left child\n\tTNode *right;   // right child\n\tbool   red;     // true -> red, false -> black\n\tK      key;     // key, for ordering\n\tP      payload; // payload (i.e. value)\n\n\t/**\n\t * Return the parent of this node's parent, or NULL if none exists.\n\t */\n\tRedBlackNode *grandparent() {\n\t\treturn parent != NULL ? parent->parent : NULL;\n\t}\n\n\t/**\n\t * Return the sibling of this node's parent, or NULL if none exists.\n\t */\n\tRedBlackNode *uncle() {\n\t\tif(parent == NULL) return NULL; // no parent\n\t\tif(parent->parent == NULL) return NULL; // parent has no siblings\n\t\treturn (parent->parent->left == parent) ? 
parent->parent->right : parent->parent->left;\n\t}\n\t\n\t/**\n\t * Return true iff this node is its parent's left child.\n\t */\n\tbool isLeftChild() const { assert(parent != NULL); return parent->left == this; }\n\n\t/**\n\t * Return true iff this node is its parent's right child.\n\t */\n\tbool isRightChild() const { assert(parent != NULL); return parent->right == this; }\n\n\t/**\n\t * Replace this node's child 'ol' with 'nw'.  'ol' must currently be\n\t * either the left or the right child of this node.\n\t */\n\tvoid replaceChild(RedBlackNode* ol, RedBlackNode* nw) {\n\t\tif(left == ol) {\n\t\t\tleft = nw;\n\t\t} else {\n\t\t\tassert(right == ol);\n\t\t\tright = nw;\n\t\t}\n\t}\n\n\t/**\n\t * Return the number of non-null children this node has.\n\t */\n\tint numChildren() const {\n\t\treturn ((left != NULL) ? 1 : 0) + ((right != NULL) ? 1 : 0);\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that node is internally consistent.\n\t */ \n\tbool repOk() const {\n\t\tif(parent != NULL) {\n\t\t\tassert(parent->left == this || parent->right == this);\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\n\t/**\n\t * True -> my key is less than the given node's key.\n\t */\n\tbool operator<(const TNode& o) const { return key < o.key; }\n\n\t/**\n\t * True -> my key is greater than the given node's key.\n\t */\n\tbool operator>(const TNode& o) const { return key > o.key; }\n\n\t/**\n\t * True -> my key equals the given node's key.\n\t */\n\tbool operator==(const TNode& o) const { return key == o.key; }\n\n\t/**\n\t * True -> my key is less than the given key.\n\t */\n\tbool operator<(const K& okey) const { return key < okey; }\n\n\t/**\n\t * True -> my key is greater than the given key.\n\t */\n\tbool operator>(const K& okey) const { return key > okey; }\n\n\t/**\n\t * True -> my key is equal to the given key.\n\t */\n\tbool operator==(const K& okey) const { return key == okey; }\n};\n\n/**\n * A Red-Black tree that associates keys (of type K) with payloads (of\n * type P).  
Red-Black trees are self-balancing and guarantee that the\n * tree is always \"balanced\" to a factor of 2, i.e., the longest\n * root-to-leaf path is never more than twice as long as the shortest\n * root-to-leaf path.\n */\ntemplate<typename K, typename P> // K=key, P=payload\nclass RedBlack {\n\n\ttypedef RedBlackNode<K,P> TNode;\n\npublic:\n    /**\n\t * Initialize an empty tree and set the number of nodes that fit in\n\t * one memory page.\n\t */\n\tRedBlack(uint32_t pageSz, int cat = 0) :\n\t\tperPage_(pageSz/sizeof(TNode)), pages_(cat) { clear(); }\n\n\t/**\n\t * Given a key, find the red-black node corresponding to it, if one\n\t * exists.\n\t */\n\tinline TNode* lookup(const K& key) const {\n\t\tTNode* cur = root_;\n\t\twhile(cur != NULL) {\n\t\t\tif((*cur) == key) return cur;\n\t\t\tif((*cur) < key) {\n\t\t\t\tcur = cur->right;\n\t\t\t} else {\n\t\t\t\tcur = cur->left;\n\t\t\t}\n\t\t}\n\t\treturn NULL;\n\t}\n\n\t/**\n\t * Add a new key as a node in the red-black tree.\n\t */\n\tTNode* add(\n\t\tPool& p,      // in: pool for memory pages\n\t\tconst K& key, // in: key to insert\n\t\tbool* added)  // out: if non-NULL, set to true iff key was newly added\n\t{\n\t\t// Look for key; if it's not there, get its parent\n\t\tTNode* cur = root_;\n\t\tassert(root_ == NULL || !root_->red);\n\t\tTNode* parent = NULL;\n\t\tbool leftChild = true;\n\t\twhile(cur != NULL) {\n\t\t\tif((*cur) == key) {\n\t\t\t\t// Found it; break out of loop with cur != NULL\n\t\t\t\tbreak;\n\t\t\t}\n\t\t\tparent = cur;\n\t\t\tif((*cur) < key) {\n\t\t\t\tif((cur = cur->right) == NULL) {\n\t\t\t\t\t// Fell off the bottom of the tree as the right\n\t\t\t\t\t// child of 'parent'\n\t\t\t\t\tleftChild = false;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tif((cur = cur->left) == NULL) {\n\t\t\t\t\t// Fell off the bottom of the tree as the left\n\t\t\t\t\t// child of 'parent'\n\t\t\t\t\tleftChild = true;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(cur != NULL) {\n\t\t\t// Found an entry; assert if we 
weren't supposed to\n\t\t\tif(added != NULL) *added = false;\n\t\t} else {\n\t\t\tassert(root_ == NULL || !root_->red);\n\t\t\tif(!addNode(p, cur)) {\n\t\t\t\t// Exhausted memory\n\t\t\t\treturn NULL;\n\t\t\t}\n\t\t\tassert(cur != NULL);\n\t\t\tassert(cur != root_);\n\t\t\tassert(cur != parent);\n\t\t\t// Initialize new node\n\t\t\tcur->key = key;\n\t\t\tcur->left = cur->right = NULL;\n\t\t\tcur->red = true; // red until proven black\n\t\t\tkeys_++;\n\t\t\tif(added != NULL) *added = true;\n\t\t\t// Put it where we know it should go\n\t\t\taddNode(cur, parent, leftChild);\n\t\t}\n\t\treturn cur; // return the added or found node\n\t}\n\n#ifndef NDEBUG\n\t/**\n\t * Check that list is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert(curPage_ == 0 || curPage_ < pages_.size());\n\t\tassert_leq(cur_, perPage_);\n\t\tassert(root_ == NULL || !root_->red);\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Clear all state.\n\t */\n\tvoid clear() {\n\t\tcur_ = curPage_ = 0;\n\t\troot_ = NULL;\n\t\tkeys_ = 0;\n\t\tintenseRepOkCnt_ = 0;\n\t\tpages_.clear();\n\t}\n\t\n\t/**\n\t * Return number of keys added.\n\t */\n\tsize_t size() const {\n\t\treturn keys_;\n\t}\n\t\n\t/**\n\t * Return true iff there are no keys in the map.\n\t */\n\tbool empty() const {\n\t\treturn keys_ == 0;\n\t}\n\n\t/**\n\t * Add another node and return a pointer to it in 'node'.  A new\n\t * page is allocated if necessary.  
If the allocation fails, false\n\t * is returned.\n\t */\n\tbool addNode(Pool& p, TNode*& node) {\n\t\tassert_leq(cur_, perPage_);\n\t\tassert(repOk());\n\t\tassert(this != NULL);\n\t\t// Allocation of the first page\n\t\tif(pages_.size() == 0) {\n\t\t\tif(addPage(p) == NULL) {\n\t\t\t\tnode = NULL;\n\t\t\t\treturn false;\n\t\t\t}\n\t\t\tassert_eq(1, pages_.size());\n\t\t}\n\t\tif(cur_ == perPage_) {\n\t\t\tassert_lt(curPage_, pages_.size());\n\t\t\tif(curPage_ == pages_.size()-1 && addPage(p) == NULL) {\n\t\t\t\tnode = NULL;\n\t\t\t\treturn false;\n\t\t\t}\n\t\t\tcur_ = 0;\n\t\t\tcurPage_++;\n\t\t}\n\t\tassert_lt(cur_, perPage_);\n\t\tassert_lt(curPage_, pages_.size());\n\t\tnode = &pages_[curPage_][cur_];\n\t\tassert(node != NULL);\n\t\tcur_++;\n\t\treturn true;\n\t}\n    \n    const TNode* root() const { return root_; }\n\nprotected:\n\n#ifndef NDEBUG\n\t/**\n\t * Check specifically that the red-black invariants are satisfied.\n\t */\n\tbool redBlackRepOk(TNode* n) {\n\t\tif(n == NULL) return true;\n\t\tif(++intenseRepOkCnt_ < 500) return true;\n\t\tintenseRepOkCnt_ = 0;\n\t\tint minNodes = -1; // min # nodes along any n->leaf path\n\t\tint maxNodes = -1; // max # nodes along any n->leaf path\n\t\t// The number of black nodes along paths from n to leaf\n\t\t// (must be same for all paths)\n\t\tint blackConst = -1;\n\t\tsize_t nodesTot = 0;\n\t\tredBlackRepOk(\n\t\t\tn,\n\t\t\t1, /* 1 node so far */\n\t\t\tn->red ? 
0 : 1, /* black nodes so far */\n\t\t\tblackConst,\n\t\t\tminNodes,\n\t\t\tmaxNodes,\n\t\t\tnodesTot);\n\t\tif(n == root_) {\n\t\t\tassert_eq(nodesTot, keys_);\n\t\t}\n\t\tassert_gt(minNodes, 0);\n\t\tassert_gt(maxNodes, 0);\n\t\tassert_leq(maxNodes, 2*minNodes);\n\t\treturn true;\n\t}\n\n\t/**\n\t * Recursive helper: check that the red-black invariants are\n\t * satisfied in the subtree rooted at n.\n\t */\n\tbool redBlackRepOk(\n\t\tTNode* n,\n\t\tint nodes,\n\t\tint black,\n\t\tint& blackConst,\n\t\tint& minNodes,\n\t\tint& maxNodes,\n\t\tsize_t& nodesTot) const\n\t{\n\t\tassert_gt(black, 0);\n\t\tnodesTot++; // count this node\n\t\tif(n->left == NULL) {\n\t\t\tif(blackConst == -1) blackConst = black;\n\t\t\tassert_eq(black, blackConst);\n\t\t\tif(nodes+1 > maxNodes) maxNodes = nodes+1;\n\t\t\tif(nodes+1 < minNodes || minNodes == -1) minNodes = nodes+1;\n\t\t} else {\n\t\t\tif(n->red) assert(!n->left->red); // Red can't be child of a red\n\t\t\tredBlackRepOk(\n\t\t\t\tn->left,                         // next node\n\t\t\t\tnodes + 1,                       // # nodes so far on path\n\t\t\t\tblack + (n->left->red ? 0 : 1),  // # black so far on path\n\t\t\t\tblackConst,                      // invariant # black nodes on root->leaf path\n\t\t\t\tminNodes,                        // min root->leaf len so far\n\t\t\t\tmaxNodes,                        // max root->leaf len so far\n\t\t\t\tnodesTot);                       // tot nodes so far\n\t\t}\n\t\tif(n->right == NULL) {\n\t\t\tif(blackConst == -1) blackConst = black;\n\t\t\tassert_eq(black, blackConst);\n\t\t\tif(nodes+1 > maxNodes) maxNodes = nodes+1;\n\t\t\tif(nodes+1 < minNodes || minNodes == -1) minNodes = nodes+1;\n\t\t} else {\n\t\t\tif(n->red) assert(!n->right->red); // Red can't be child of a red\n\t\t\tredBlackRepOk(\n\t\t\t\tn->right,                        // next node\n\t\t\t\tnodes + 1,                       // # nodes so far on path\n\t\t\t\tblack + (n->right->red ? 
0 : 1), // # black so far on path\n\t\t\t\tblackConst,                      // invariant # black nodes on root->leaf path\n\t\t\t\tminNodes,                        // min root->leaf len so far         \n\t\t\t\tmaxNodes,                        // max root->leaf len so far\n\t\t\t\tnodesTot);                       // tot nodes so far\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\n\t/**\n\t * Rotate to the left such that n is replaced by its right child\n\t * w/r/t n's current parent.\n\t */\n\tvoid leftRotate(TNode* n) {\n\t\tTNode* r = n->right;\n\t\tassert(n->repOk());\n\t\tassert(r->repOk());\n\t\tn->right = r->left;\n\t\tif(n->right != NULL) {\n\t\t\tn->right->parent = n;\n\t\t\tassert(n->right->repOk());\n\t\t}\n\t\tr->parent = n->parent;\n\t\tn->parent = r;\n\t\tr->left = n;\n\t\tif(r->parent != NULL) {\n\t\t\tr->parent->replaceChild(n, r);\n\t\t}\n\t\tif(root_ == n) root_ = r;\n\t\tassert(!root_->red);\n\t\tassert(n->repOk());\n\t\tassert(r->repOk());\n\t}\n\n\t/**\n\t * Rotate to the right such that n is replaced by its left child\n\t * w/r/t n's current parent.  
n moves down to the right and loses\n\t * its left child, while its former left child moves up and gains a\n\t * right child.\n\t */\n\tvoid rightRotate(TNode* n) {\n\t\tTNode* r = n->left;\n\t\tassert(n->repOk());\n\t\tassert(r->repOk());\n\t\tn->left = r->right;\n\t\tif(n->left != NULL) {\n\t\t\tn->left->parent = n;\n\t\t\tassert(n->left->repOk());\n\t\t}\n\t\tr->parent = n->parent;\n\t\tn->parent = r;\n\t\tr->right = n;\n\t\tif(r->parent != NULL) {\n\t\t\tr->parent->replaceChild(n, r);\n\t\t}\n\t\tif(root_ == n) root_ = r;\n\t\tassert(!root_->red);\n\t\tassert(n->repOk());\n\t\tassert(r->repOk());\n\t}\n\n\t/**\n\t * Add a node to the red-black tree, maintaining the red-black\n\t * invariants.\n\t */\n\tvoid addNode(TNode* n, TNode* parent, bool leftChild) {\n\t\tassert(n != NULL);\n\t\tif(parent == NULL) {\n\t\t\t// Case 1: inserted at root\n\t\t\troot_ = n;\n\t\t\troot_->red = false; // root must be black\n\t\t\tn->parent = NULL;\n\t\t\tassert(redBlackRepOk(root_));\n\t\t\tassert(n->repOk());\n\t\t} else {\n\t\t\tassert(!root_->red);\n\t\t\t// Add new node to tree\n\t\t\tif(leftChild) {\n\t\t\t\tassert(parent->left == NULL);\n\t\t\t\tparent->left = n;\n\t\t\t} else {\n\t\t\t\tassert(parent->right == NULL);\n\t\t\t\tparent->right = n;\n\t\t\t}\n\t\t\tn->parent = parent;\n\t\t\tint thru = 0;\n\t\t\twhile(true) {\n\t\t\t\tthru++;\n\t\t\t\tparent = n->parent;\n\t\t\t\tif(parent != NULL) assert(parent->repOk());\n\t\t\t\tif(parent == NULL && n->red) {\n\t\t\t\t\tn->red = false;\n\t\t\t\t}\n\t\t\t\tif(parent == NULL || !parent->red) {\n\t\t\t\t\tassert(redBlackRepOk(root_));\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tTNode* uncle = n->uncle();\n\t\t\t\tTNode* gparent = n->grandparent();\n\t\t\t\tassert(gparent != NULL); // if parent is red, grandparent must exist\n\t\t\t\tbool uncleRed = (uncle != NULL ? 
uncle->red : false);\n\t\t\t\tif(uncleRed) {\n\t\t\t\t\t// Parent is red, uncle is red; recursive case\n\t\t\t\t\tassert(uncle != NULL);\n\t\t\t\t\tparent->red = uncle->red = false;\n\t\t\t\t\tgparent->red = true;\n\t\t\t\t\tn = gparent;\n\t\t\t\t\tcontinue;\n\t\t\t\t} else {\n\t\t\t\t\tif(parent->isLeftChild()) {\n\t\t\t\t\t\t// Parent is red, uncle is black, parent is\n\t\t\t\t\t\t// left child\n\t\t\t\t\t\tif(!n->isLeftChild()) {\n\t\t\t\t\t\t\tn = parent;\n\t\t\t\t\t\t\tleftRotate(n);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tn = n->parent;\n\t\t\t\t\t\tn->red = false;\n\t\t\t\t\t\tn->parent->red = true;\n\t\t\t\t\t\trightRotate(n->parent);\n\t\t\t\t\t\tassert(redBlackRepOk(n));\n\t\t\t\t\t\tassert(redBlackRepOk(root_));\n\t\t\t\t\t} else {\n\t\t\t\t\t\t// Parent is red, uncle is black, parent is\n\t\t\t\t\t\t// right child.\n\t\t\t\t\t\tif(!n->isRightChild()) {\n\t\t\t\t\t\t\tn = parent;\n\t\t\t\t\t\t\trightRotate(n);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tn = n->parent;\n\t\t\t\t\t\tn->red = false;\n\t\t\t\t\t\tn->parent->red = true;\n\t\t\t\t\t\tleftRotate(n->parent);\n\t\t\t\t\t\tassert(redBlackRepOk(n));\n\t\t\t\t\t\tassert(redBlackRepOk(root_));\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t\tassert(redBlackRepOk(root_));\n\t}\n\n\t/**\n\t * Expand our page supply by 1\n\t */\n\tTNode* addPage(Pool& p) {\n\t\tTNode *n = (TNode *)p.alloc();\n\t\tif(n != NULL) {\n\t\t\tpages_.push_back(n);\n\t\t}\n\t\treturn n;\n\t}\n\n\tsize_t        keys_;    // number of keys so far\n\tsize_t        cur_;     // current elt within page\n\tsize_t        curPage_; // current page\n\tconst size_t  perPage_; // # edits fitting in a page\n\tTNode*        root_;    // root node\n\tEList<TNode*> pages_;   // the pages\n\tint intenseRepOkCnt_;   // counter for the computationally intensive repOk function\n};\n\n/**\n * For assembling doubly-linked lists of Edits.\n */\ntemplate <typename T>\nstruct DoublyLinkedList {\n\t\n\tDoublyLinkedList() : payload(), prev(NULL), next(NULL) { 
}\n\t\n\t/**\n\t * Add all elements in the doubly-linked list to the provided EList.\n\t */\n\tvoid toList(EList<T>& l) {\n\t\t// Add this and all subsequent elements\n\t\tDoublyLinkedList<T> *cur = this;\n\t\twhile(cur != NULL) {\n\t\t\tl.push_back(cur->payload);\n\t\t\tcur = cur->next;\n\t\t}\n\t\t// Add all previous elements\n\t\tcur = prev;\n\t\twhile(cur != NULL) {\n\t\t\tl.push_back(cur->payload);\n\t\t\tcur = cur->prev;\n\t\t}\n\t}\n\t\n\tT                    payload;\n\tDoublyLinkedList<T> *prev;\n\tDoublyLinkedList<T> *next;\n};\n\ntemplate <typename T1, typename T2>\nstruct Pair {\n\tT1 a;\n\tT2 b;\n\n\tPair() : a(), b() { }\n\t\n\tPair(\n\t\tconst T1& a_,\n\t\tconst T2& b_) { a = a_; b = b_; }\n\n\tbool operator==(const Pair& o) const {\n\t\treturn a == o.a && b == o.b;\n\t}\n\t\n\tbool operator<(const Pair& o) const {\n\t\tif(a < o.a) return true;\n\t\tif(a > o.a) return false;\n\t\tif(b < o.b) return true;\n\t\treturn false;\n\t}\n};\n\ntemplate <typename T1, typename T2, typename T3>\nstruct Triple {\n\tT1 a;\n\tT2 b;\n\tT3 c;\n\n\tTriple() : a(), b(), c() { }\n\n\tTriple(\n\t\tconst T1& a_,\n\t\tconst T2& b_,\n\t\tconst T3& c_) { a = a_; b = b_; c = c_; }\n\n\tbool operator==(const Triple& o) const {\n\t\treturn a == o.a && b == o.b && c == o.c;\n\t}\n\t\n\tbool operator<(const Triple& o) const {\n\t\tif(a < o.a) return true;\n\t\tif(a > o.a) return false;\n\t\tif(b < o.b) return true;\n\t\tif(b > o.b) return false;\n\t\tif(c < o.c) return true;\n\t\treturn false;\n\t}\n};\n\ntemplate <typename T1, typename T2, typename T3, typename T4>\nstruct Quad {\n\n\tQuad() : a(), b(), c(), d() { }\n\n\tQuad(\n\t\tconst T1& a_,\n\t\tconst T2& b_,\n\t\tconst T3& c_,\n\t\tconst T4& d_) { a = a_; b = b_; c = c_; d = d_; }\n\n\tQuad(\n\t\tconst T1& a_,\n\t\tconst T1& b_,\n\t\tconst T1& c_,\n\t\tconst T1& d_)\n\t{\n\t\tinit(a_, b_, c_, d_);\n\t}\n\t\n\tvoid init(\n\t\tconst T1& a_,\n\t\tconst T1& b_,\n\t\tconst T1& c_,\n\t\tconst T1& d_)\n\t{\n\t\ta = a_; b = b_; c 
= c_; d = d_;\n\t}\n\n\tbool operator==(const Quad& o) const {\n\t\treturn a == o.a && b == o.b && c == o.c && d == o.d;\n\t}\n\t\n\tbool operator<(const Quad& o) const {\n\t\tif(a < o.a) return true;\n\t\tif(a > o.a) return false;\n\t\tif(b < o.b) return true;\n\t\tif(b > o.b) return false;\n\t\tif(c < o.c) return true;\n\t\tif(c > o.c) return false;\n\t\tif(d < o.d) return true;\n\t\treturn false;\n\t}\n\n\tT1 a;\n\tT2 b;\n\tT3 c;\n\tT4 d;\n};\n\n/**\n * A node in a singly-linked list; holds a payload and a pointer to the\n * next node.\n */\ntemplate <typename T>\nstruct LinkedEListNode {\n\t\n\tLinkedEListNode() : payload(), next(NULL) { }\n\t\t\n\tT                  payload;\n\tLinkedEListNode<T> *next;\n};\n\n/**\n * A free list of LinkedEListNodes.  new_node() hands out a previously\n * deleted node if one is available and allocates a fresh one otherwise;\n * delete_node() returns a node to the free list for later reuse.\n */\ntemplate <typename T>\nstruct LinkedEList {\n\t\n\tLinkedEList() : head(NULL) {\n        ASSERT_ONLY(num_allocated = 0);\n        ASSERT_ONLY(num_new_node = 0);\n        ASSERT_ONLY(num_delete_node = 0);\n    }\n    \n    ~LinkedEList() {\n        ASSERT_ONLY(size_t num_deallocated = 0);\n        while(head != NULL) {\n            LinkedEListNode<T>* next = head->next;\n            delete head;\n            ASSERT_ONLY(num_deallocated++);\n            head = next;\n        }\n        // daehwan - for debugging purposes\n        // assert_eq(num_allocated, num_deallocated);\n    }\n    \n    LinkedEListNode<T>* new_node() {\n        ASSERT_ONLY(num_new_node++);\n        LinkedEListNode<T> *result = NULL;\n        if(head == NULL) {\n            head = new LinkedEListNode<T>();\n            head->next = NULL;\n            ASSERT_ONLY(num_allocated++);\n        }\n        assert(head != NULL);\n        result = head;\n        head = head->next;\n        assert(result != NULL);\n        return result;\n    }\n    \n    void delete_node(LinkedEListNode<T> *node) {\n        ASSERT_ONLY(num_delete_node++);\n        assert(node != NULL);\n        // check that this node is not already on the free list.\n#ifndef NDEBUG\n        LinkedEListNode<T> *temp = head;\n 
       while(temp != NULL) {\n            assert(temp != node);\n            temp = temp->next;\n        }\n#endif\n        node->next = head;\n        head = node;\n    }\n    \n\tLinkedEListNode<T> *head;\n    \n    ASSERT_ONLY(size_t num_allocated);\n    ASSERT_ONLY(size_t num_new_node);\n    ASSERT_ONLY(size_t num_delete_node);\n};\n\n\n#endif /* DS_H_ */\n"
  },
  {
    "path": "edit.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include \"edit.h\"\n\nusing namespace std;\n\n/**\n * Print a single edit to a std::ostream.  Format is\n * (pos):(ref chr)>(read chr).  Where 'pos' is an offset from the 5'\n * end of the read, and the ref and read chrs are expressed w/r/t the\n * Watson strand.\n */\nostream& operator<< (ostream& os, const Edit& e) {\n    if(e.type != EDIT_TYPE_SPL) {\n        os << e.pos << \":\" << (char)e.chr << \">\" << (char)e.qchr;\n    } else {\n        os << e.pos << \":\" << e.splLen;\n    }\n\n\treturn os;\n}\n\n/**\n * Print a list of edits to a std::ostream, separated by commas.\n */\nvoid Edit::print(ostream& os, const EList<Edit>& edits, char delim) {\n\tfor(size_t i = 0; i < edits.size(); i++) {\n\t\tos << edits[i];\n\t\tif(i < edits.size()-1) os << delim;\n\t}\n}\n\n/**\n * Flip all the edits.pos fields so that they're with respect to\n * the other end of the read (of length 'sz').\n */\nvoid Edit::invertPoss(\n\tEList<Edit>& edits,\n\tsize_t sz,\n\tsize_t ei,\n\tsize_t en,\n\tbool sort)\n{\n\t// Invert elements\n\tsize_t ii = 0;\n\tfor(size_t i = ei; i < ei + en/2; i++) {\n\t\tEdit tmp = edits[i];\n\t\tedits[i] = edits[ei + en - ii - 1];\n\t\tedits[ei + en - ii - 1] = 
tmp;\n\t\tii++;\n\t}\n\tfor(size_t i = ei; i < ei + en; i++) {\n\t\tassert(edits[i].pos < sz ||\n\t\t\t   (edits[i].isReadGap() && edits[i].pos == sz));\n\t\t// Adjust pos\n        if(edits[i].isReadGap() || edits[i].isSpliced()) {\n            edits[i].pos = (uint32_t)(sz - edits[i].pos);\n        } else {\n            edits[i].pos = (uint32_t)(sz - edits[i].pos - 1);\n        }\n\t\t// Adjust pos2\n\t\tif(edits[i].isReadGap()) {\n\t\t\tint64_t pos2diff = (int64_t)(uint64_t)edits[i].pos2 - (int64_t)((uint64_t)std::numeric_limits<uint32_t>::max() >> 1);\n\t\t\tint64_t pos2new = (int64_t)(uint64_t)edits[i].pos2 - 2*pos2diff;\n\t\t\tassert(pos2diff == 0 || (uint32_t)pos2new != (std::numeric_limits<uint32_t>::max() >> 1));\n\t\t\tedits[i].pos2 = (uint32_t)pos2new;\n\t\t}\n\t}\n\tif(sort) {\n\t\t// Edits might not necessarily be in same order after inversion\n\t\tedits.sortPortion(ei, en);\n#ifndef NDEBUG\n\t\tfor(size_t i = ei + 1; i < ei + en; i++) {\n\t\t\tassert_geq(edits[i].pos, edits[i-1].pos);\n\t\t}\n#endif\n\t}\n}\n\n/**\n * For now, we pretend that the alignment is in the forward orientation\n * and that the Edits are listed from left- to right-hand side.\n */\nvoid Edit::printQAlign(\n\tstd::ostream& os,\n\tconst BTDnaString& read,\n\tconst EList<Edit>& edits)\n{\n\tprintQAlign(os, \"\", read, edits);\n}\n\n/**\n * For now, we pretend that the alignment is in the forward orientation\n * and that the Edits are listed from left- to right-hand side.\n */\nvoid Edit::printQAlignNoCheck(\n\tstd::ostream& os,\n\tconst BTDnaString& read,\n\tconst EList<Edit>& edits)\n{\n\tprintQAlignNoCheck(os, \"\", read, edits);\n}\n\n/**\n * For now, we pretend that the alignment is in the forward orientation\n * and that the Edits are listed from left- to right-hand side.\n */\nvoid Edit::printQAlign(\n\tstd::ostream& os,\n\tconst char *prefix,\n\tconst BTDnaString& read,\n\tconst EList<Edit>& edits)\n{\n\tsize_t eidx = 0;\n\tos << prefix;\n\t// Print read\n\tfor(size_t i = 0; 
i < read.length(); i++) {\n\t\tbool del = false, mm = false;\n\t\twhile(eidx < edits.size() && edits[eidx].pos == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\tos << '-';\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tdel = true;\n\t\t\t\tassert_eq((int)edits[eidx].qchr, read.toChar(i));\n\t\t\t\tos << read.toChar(i);\n\t\t\t} else {\n\t\t\t\tmm = true;\n\t\t\t\tassert(edits[eidx].isMismatch());\n\t\t\t\tassert_eq((int)edits[eidx].qchr, read.toChar(i));\n\t\t\t\tos << (char)edits[eidx].qchr;\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) os << read.toChar(i);\n\t}\n\tos << endl;\n\tos << prefix;\n\teidx = 0;\n\t// Print match bars\n\tfor(size_t i = 0; i < read.length(); i++) {\n\t\tbool del = false, mm = false;\n\t\twhile(eidx < edits.size() && edits[eidx].pos == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\tos << ' ';\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tdel = true;\n\t\t\t\tos << ' ';\n\t\t\t} else {\n\t\t\t\tmm = true;\n\t\t\t\tassert(edits[eidx].isMismatch());\n\t\t\t\tos << ' ';\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) os << '|';\n\t}\n\tos << endl;\n\tos << prefix;\n\teidx = 0;\n\t// Print reference\n\tfor(size_t i = 0; i < read.length(); i++) {\n\t\tbool del = false, mm = false;\n\t\twhile(eidx < edits.size() && edits[eidx].pos == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\tos << (char)edits[eidx].chr;\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tdel = true;\n\t\t\t\tos << '-';\n\t\t\t} else {\n\t\t\t\tmm = true;\n\t\t\t\tassert(edits[eidx].isMismatch());\n\t\t\t\tos << (char)edits[eidx].chr;\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) os << read.toChar(i);\n\t}\n\tos << endl;\n}\n\n/**\n * For now, we pretend that the alignment is in the forward orientation\n * and that the Edits are listed from left- to right-hand side.\n */\nvoid Edit::printQAlignNoCheck(\n\tstd::ostream& os,\n\tconst char *prefix,\n\tconst BTDnaString& read,\n\tconst EList<Edit>& edits)\n{\n\tsize_t eidx = 0;\n\tos << 
prefix;\n\t// Print read\n\tfor(size_t i = 0; i < read.length(); i++) {\n\t\tbool del = false, mm = false;\n\t\twhile(eidx < edits.size() && edits[eidx].pos == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\tos << '-';\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tdel = true;\n\t\t\t\tos << read.toChar(i);\n\t\t\t} else {\n\t\t\t\tmm = true;\n\t\t\t\tos << (char)edits[eidx].qchr;\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) os << read.toChar(i);\n\t}\n\tos << endl;\n\tos << prefix;\n\teidx = 0;\n\t// Print match bars\n\tfor(size_t i = 0; i < read.length(); i++) {\n\t\tbool del = false, mm = false;\n\t\twhile(eidx < edits.size() && edits[eidx].pos == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\tos << ' ';\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tdel = true;\n\t\t\t\tos << ' ';\n\t\t\t} else {\n\t\t\t\tmm = true;\n\t\t\t\tos << ' ';\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) os << '|';\n\t}\n\tos << endl;\n\tos << prefix;\n\teidx = 0;\n\t// Print reference\n\tfor(size_t i = 0; i < read.length(); i++) {\n\t\tbool del = false, mm = false;\n\t\twhile(eidx < edits.size() && edits[eidx].pos == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\tos << (char)edits[eidx].chr;\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tdel = true;\n\t\t\t\tos << '-';\n\t\t\t} else {\n\t\t\t\tmm = true;\n\t\t\t\tos << (char)edits[eidx].chr;\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) os << read.toChar(i);\n\t}\n\tos << endl;\n}\n\n/**\n * Sort the edits in the provided list.\n */\nvoid Edit::sort(EList<Edit>& edits) {\n\tedits.sort(); // simple!\n}\n\n/**\n * Given a read string and some edits, generate and append the corresponding\n * reference string to 'ref'.  If read aligned to the Watson strand, the caller\n * should pass the original read sequence and original edits.  
If a read\n * aligned to the Crick strand, the caller should pass the reverse complement\n * of the read and a version of the edits list that has had Edit::invertPoss\n * called on it to cause edits to be listed in 3'-to-5' order.\n */\nvoid Edit::toRef(\n\tconst BTDnaString& read,\n\tconst EList<Edit>& edits,\n\tBTDnaString& ref,\n\tbool fw,\n\tsize_t trim5,\n\tsize_t trim3)\n{\n\t// edits should be sorted\n\tsize_t eidx = 0;\n\t// Build the reference string\n\tconst size_t rdlen = read.length();\n\tsize_t trimBeg = fw ? trim5 : trim3;\n\tsize_t trimEnd = fw ? trim3 : trim5;\n\tassert(Edit::repOk(edits, read, fw, trim5, trim3));\n\tif(!fw) {\n\t\tinvertPoss(const_cast<EList<Edit>&>(edits), read.length()-trimBeg-trimEnd, false);\n\t}\n\tfor(size_t i = 0; i < rdlen; i++) {\n\t\tASSERT_ONLY(int c = read[i]);\n\t\tassert_range(0, 4, c);\n\t\tbool del = false, mm = false;\n\t\tbool append = i >= trimBeg && rdlen - i - 1 >= trimEnd;\n\t\tbool appendIns = i >= trimBeg && rdlen - i >= trimEnd;\n\t\twhile(eidx < edits.size() && edits[eidx].pos+trimBeg == i) {\n\t\t\tif(edits[eidx].isReadGap()) {\n\t\t\t\t// Inserted characters come before the position's\n\t\t\t\t// character\n\t\t\t\tif(appendIns) {\n\t\t\t\t\tref.appendChar((char)edits[eidx].chr);\n\t\t\t\t}\n\t\t\t} else if(edits[eidx].isRefGap()) {\n\t\t\t\tassert_eq(\"ACGTN\"[c], edits[eidx].qchr);\n\t\t\t\tdel = true;\n\t\t\t} else if(edits[eidx].isMismatch()){\n\t\t\t\tmm = true;\n\t\t\t\tassert(edits[eidx].qchr != edits[eidx].chr || edits[eidx].qchr == 'N');\n\t\t\t\tassert_eq(\"ACGTN\"[c], edits[eidx].qchr);\n\t\t\t\tif(append) {\n\t\t\t\t\tref.appendChar((char)edits[eidx].chr);\n\t\t\t\t}\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t\tif(!del && !mm) {\n\t\t\tif(append) {\n\t\t\t\tref.append(read[i]);\n\t\t\t}\n\t\t}\n\t}\n\tif(trimEnd == 0) {\n\t\twhile(eidx < edits.size()) {\n\t\t\tassert_gt(rdlen, edits[eidx].pos);\n\t\t\tif(edits[eidx].isReadGap()) 
{\n\t\t\t\tref.appendChar((char)edits[eidx].chr);\n\t\t\t}\n\t\t\teidx++;\n\t\t}\n\t}\n\tif(!fw) {\n\t\tinvertPoss(const_cast<EList<Edit>&>(edits), read.length()-trimBeg-trimEnd, false);\n\t}\n}\n\n#ifndef NDEBUG\n/**\n * Check that the edit is internally consistent.\n */\nbool Edit::repOk() const {\n    assert(inited());\n\t// Ref and read characters cannot be the same unless they're both Ns\n    if(type != EDIT_TYPE_SPL) {\n        assert(qchr != chr || qchr == 'N');\n        // Type must match characters\n        assert(isRefGap() ||  chr != '-');\n        assert(isReadGap() || qchr != '-');\n        assert(!isMismatch() || (qchr != '-' && chr != '-'));\n    } else {\n        assert_gt(splLen, 0);\n    }\n\treturn true;\n}\n\n/**\n * Given a list of edits and a DNA string representing the query\n * sequence, check that the edits are consistent with respect to the\n * query.\n */\nbool Edit::repOk(\n\tconst EList<Edit>& edits,\n\tconst BTDnaString& s,\n\tbool fw,\n\tsize_t trimBeg,\n\tsize_t trimEnd)\n{\n\tif(!fw) {\n\t\tinvertPoss(const_cast<EList<Edit>&>(edits), s.length()-trimBeg-trimEnd, false);\n\t\tswap(trimBeg, trimEnd);\n\t}\n\tfor(size_t i = 0; i < edits.size(); i++) {\n\t\tconst Edit& e = edits[i];\n\t\tsize_t pos = e.pos;\n\t\tif(i > 0) {\n\t\t\tassert_geq(pos, edits[i-1].pos);\n\t\t}\n\t\tbool del = false, mm = false;\n\t\twhile(i < edits.size() && edits[i].pos == pos) {\n\t\t\tconst Edit& ee = edits[i];\n\t\t\tassert_lt(ee.pos, s.length());\n            if(ee.type != EDIT_TYPE_SPL) {\n                if(ee.qchr != '-') {\n                    assert(ee.isRefGap() || ee.isMismatch());\n                    assert_eq((int)ee.qchr, s.toChar(ee.pos+trimBeg));\n                }\n            }\n\t\t\tif(ee.isMismatch()) {\n\t\t\t\tassert(!mm);\n\t\t\t\tmm = true;\n\t\t\t\tassert(!del);\n\t\t\t} else if(ee.isReadGap()) {\n\t\t\t\tassert(!mm);\n\t\t\t} else if(ee.isRefGap()) {\n\t\t\t\tassert(!mm);\n\t\t\t\tassert(!del);\n\t\t\t\tdel = true;\n\t\t\t} else 
if(ee.isSpliced()) {\n                \n            }\n\t\t\ti++;\n\t\t}\n\t}\n\tif(!fw) {\n\t\tinvertPoss(const_cast<EList<Edit>&>(edits), s.length()-trimBeg-trimEnd, false);\n\t}\n\treturn true;\n}\n#endif\n\n/**\n * Merge second argument into the first.  Assume both are sorted to\n * begin with.\n */\nvoid Edit::merge(EList<Edit>& dst, const EList<Edit>& src) {\n\tsize_t di = 0, si = 0;\n\t// Stop as soon as either list is exhausted so src[si] is never read\n\t// out of bounds\n\twhile(di < dst.size() && si < src.size()) {\n\t\tif(src[si].pos < dst[di].pos) {\n\t\t\tdst.insert(src[si], di);\n\t\t\tsi++; di++;\n\t\t} else if(src[si].pos == dst[di].pos) {\n\t\t\t// There can be two inserts at a given position, but we\n\t\t\t// can't merge them because there's no way to know their\n\t\t\t// order\n\t\t\tassert(src[si].isReadGap() != dst[di].isReadGap());\n\t\t\tif(src[si].isReadGap()) {\n\t\t\t\tdst.insert(src[si], di);\n\t\t\t\tsi++; di++;\n\t\t\t} else if(dst[di].isReadGap()) {\n\t\t\t\tdi++;\n\t\t\t}\n\t\t} else {\n\t\t\t// src[si] belongs after dst[di]; advance through dst\n\t\t\tdi++;\n\t\t}\n\t}\n\t// Any remaining src edits lie past the end of dst\n\twhile(si < src.size()) dst.push_back(src[si++]);\n}\n\n/**\n * Clip off some of the low-numbered positions.\n */\nvoid Edit::clipLo(EList<Edit>& ed, size_t len, size_t amt) {\n\tsize_t nrm = 0;\n\tfor(size_t i = 0; i < ed.size(); i++) {\n\t\tassert_lt(ed[i].pos, len);\n\t\tif(ed[i].pos < amt) {\n\t\t\tnrm++;\n\t\t} else {\n\t\t\t// Shift the surviving edits down by the clipped amount\n\t\t\ted[i].pos -= (uint32_t)amt;\n\t\t}\n\t}\n\ted.erase(0, nrm);\n}\n\n/**\n * Clip off some of the high-numbered positions.\n */\nvoid Edit::clipHi(EList<Edit>& ed, size_t len, size_t amt) {\n\tassert_leq(amt, len);\n\tsize_t max = len - amt;\n\tsize_t nrm = 0;\n\tfor(size_t i = 0; i < ed.size(); i++) {\n\t\tsize_t ii = ed.size() - i - 1;\n\t\tassert_lt(ed[ii].pos, len);\n\t\tif(ed[ii].pos > max) {\n\t\t\tnrm++;\n\t\t} else if(ed[ii].pos == max && !ed[ii].isReadGap()) {\n\t\t\tnrm++;\n\t\t} else {\n\t\t\tbreak;\n\t\t}\n\t}\n\ted.resize(ed.size() - nrm);\n}\n"
  },
  {
    "path": "edit.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef EDIT_H_\n#define EDIT_H_\n\n#include <iostream>\n#include <stdint.h>\n#include <limits>\n#include \"assert_helpers.h\"\n#include \"filebuf.h\"\n#include \"sstring.h\"\n#include \"ds.h\"\n\n/**\n * 3 types of edits; mismatch (substitution), insertion in the\n * reference, deletion in the reference.\n */\nenum {\n\tEDIT_TYPE_READ_GAP = 1,\n\tEDIT_TYPE_REF_GAP,\n\tEDIT_TYPE_MM,\n\tEDIT_TYPE_SNP,\n    EDIT_TYPE_SPL, // splicing of pre-messenger RNAs into messenger RNAs\n};\n\nenum {\n    EDIT_SPL_UNKNOWN = 1,\n    EDIT_SPL_FW,\n    EDIT_SPL_RC\n};\n\n/**\n * Encapsulates an edit between the read sequence and the reference sequence.\n * We obey a few conventions when populating its fields.  The fields are:\n *\n * \tuint8_t  chr;  // reference character involved (for subst and ins)\n *  uint8_t  qchr; // read character involved (for subst and del)\n *  uint8_t  type; // 1 -> mm, 2 -> SNP, 3 -> ins, 4 -> del\n *  uint32_t pos;  // position w/r/t search root\n *\n * One convention is that pos is always an offset w/r/t the 5' end of the read.\n *\n * Another is that chr and qchr are expressed in terms of the nucleotides on\n * the forward version of the read.  
So if we're aligning the reverse\n * complement of the read, and an A in the reverse complement mismatches a C in\n * the reference, chr should be G and qchr should be T.\n */\nstruct Edit {\n\n\tEdit() { reset(); }\n\n\tEdit(\n\t\tuint32_t po,\n\t\tint ch,\n\t\tint qc,\n\t\tint ty,\n\t\tbool chrs = true)\n\t{\n\t\tinit(po, ch, qc, ty, chrs);\n\t}\n    \n    Edit(\n         uint32_t po,\n         int ch,\n         int qc,\n         int ty,\n         uint32_t sl,\n         uint8_t sdir,\n         bool knowns,\n         bool chrs = true)\n\t{\n\t\tinit(po, ch, qc, ty, sl, sdir, knowns, chrs);\n\t}\n\t\n    /**\n     * Reset Edit to uninitialized state.\n     */\n\tvoid reset() {\n\t\tpos = pos2 = std::numeric_limits<uint32_t>::max();\n\t\tchr = qchr = type = 0;\n        splLen = 0;\n        splDir = EDIT_SPL_UNKNOWN;\n        knownSpl = false;\n\t}\n\t\n    /**\n     * Return true iff the Edit is initialized.\n     */\n\tbool inited() const {\n\t\treturn pos != std::numeric_limits<uint32_t>::max();\n\t}\n\t\n    /**\n     * Initialize a new Edit.\n     */\n\tvoid init(\n\t\tuint32_t po,\n\t\tint ch,\n\t\tint qc,\n\t\tint ty,\n\t\tbool chrs = true)\n\t{\n\t\tchr = ch;\n\t\tqchr = qc;\n\t\ttype = ty;\n        splLen = 0;\n        splDir = EDIT_SPL_UNKNOWN;\n\t\tpos = po;\n\t\tif(qc == '-') {\n\t\t\t// Read gap\n\t\t\tpos2 = std::numeric_limits<uint32_t>::max() >> 1;\n\t\t} else {\n\t\t\tpos2 = std::numeric_limits<uint32_t>::max();\n\t\t}\n\t\tif(!chrs) {\n\t\t\tassert_range(0, 4, (int)chr);\n\t\t\tassert_range(0, 4, (int)qchr);\n\t\t\tchr = \"ACGTN\"[chr];\n\t\t\tqchr = \"ACGTN\"[qchr];\n\t\t}\n#ifndef NDEBUG\n        if(type != EDIT_TYPE_SPL) {\n            assert_in(chr, \"ACMGRSVTWYHKDBN-\");\n            assert_in(qchr, \"ACGTN-\");\n            assert(chr != qchr || chr == 'N');\n        }\n#endif\n\t\tassert(inited());\n\t}\n    \n    /**\n     * Initialize a new Edit.\n     */\n\tvoid init(\n              uint32_t po,\n              int ch,\n              int 
qc,\n              int ty,\n              uint32_t sl,\n              uint32_t sdir,\n              bool knowns,\n              bool chrs = true)\n\t{\n        assert_eq(ty, EDIT_TYPE_SPL);\n        init(po, ch, qc, ty, chrs);\n        splLen = sl;\n        splDir = sdir;\n        knownSpl = knowns;\n\t}\n\t\n\t/**\n\t * Return true iff one part of the edit or the other has an 'N'.\n\t */\n\tbool hasN() const {\n\t\tassert(inited());\n\t\treturn chr == 'N' || qchr == 'N';\n\t}\n\n\t/**\n\t * Edit less-than overload.\n\t */\n\tint operator< (const Edit &rhs) const {\n\t\tassert(inited());\n\t\tif(pos  < rhs.pos) return 1;\n\t\tif(pos  > rhs.pos) return 0;\n\t\tif(pos2 < rhs.pos2) return 1;\n\t\tif(pos2 > rhs.pos2) return 0;\n\t\tif(type < rhs.type) return 1;\n\t\tif(type > rhs.type) return 0;\n\t\tif(chr  < rhs.chr) return 1;\n\t\tif(chr  > rhs.chr) return 0;\n\t\treturn (qchr < rhs.qchr)? 1 : 0;\n\t}\n\n\t/**\n\t * Edit equals overload.\n\t */\n\tint operator== (const Edit &rhs) const {\n\t\tassert(inited());\n\t\treturn(pos  == rhs.pos &&\n\t\t\t   pos2 == rhs.pos2 &&\n\t\t\t   chr  == rhs.chr &&\n\t\t\t   qchr == rhs.qchr &&\n\t\t\t   type == rhs.type &&\n               splLen == rhs.splLen &&\n               splDir == rhs.splDir /* &&\n               knownSpl == rhs.knownSpl */);\n\t}\n\n\t/**\n\t * Return true iff this Edit is an initialized insertion.\n\t */\n\tbool isReadGap() const {\n\t\tassert(inited());\n\t\treturn type == EDIT_TYPE_READ_GAP;\n\t}\n\n\t/**\n\t * Return true iff this Edit is an initialized deletion.\n\t */\n\tbool isRefGap() const {\n\t\tassert(inited());\n\t\treturn type == EDIT_TYPE_REF_GAP;\n\t}\n\n\t/**\n\t * Return true if this Edit is either an initialized deletion or an\n\t * initialized insertion.\n\t */\n\tbool isGap() const {\n\t\tassert(inited());\n\t\treturn (type == EDIT_TYPE_REF_GAP || type == EDIT_TYPE_READ_GAP);\n\t}\n    \n    bool isSpliced() const {\n        assert(inited());\n        return type == EDIT_TYPE_SPL;\n    
}\n\t\n\t/**\n\t * Return the number of gaps in the given edit list.\n\t */\n\tstatic size_t numGaps(const EList<Edit>& es) {\n\t\tsize_t gaps = 0;\n\t\tfor(size_t i = 0; i < es.size(); i++) {\n\t\t\tif(es[i].isGap()) gaps++;\n\t\t}\n\t\treturn gaps;\n\t}\n\n\t/**\n\t * Return true iff this Edit is an initialized mismatch.\n\t */\n\tbool isMismatch() const {\n\t\tassert(inited());\n\t\treturn type == EDIT_TYPE_MM;\n\t}\n\n\t/**\n\t * Sort the edits in the provided list.\n\t */\n\tstatic void sort(EList<Edit>& edits);\n\n\t/**\n\t * Flip all the edits.pos fields so that they're with respect to\n\t * the other end of the read (of length 'sz').\n\t */\n\tstatic void invertPoss(\n\t\tEList<Edit>& edits,\n\t\tsize_t sz,\n\t\tsize_t ei,\n\t\tsize_t en,\n\t\tbool sort = false);\n\n\t/**\n\t * Flip all the edits.pos fields so that they're with respect to\n\t * the other end of the read (of length 'sz').\n\t */\n\tstatic void invertPoss(EList<Edit>& edits, size_t sz, bool sort = false) {\n\t\tinvertPoss(edits, sz, 0, edits.size(), sort);\n\t}\n\t\n\t/**\n\t * Clip off some of the low-numbered positions.\n\t */\n\tstatic void clipLo(EList<Edit>& edits, size_t len, size_t amt);\n\n\t/**\n\t * Clip off some of the high-numbered positions.\n\t */\n\tstatic void clipHi(EList<Edit>& edits, size_t len, size_t amt);\n\n\t/**\n\t * Given a read string and some edits, generate and append the\n\t * corresponding reference string to 'ref'.\n\t */\n\tstatic void toRef(\n\t\tconst BTDnaString& read,\n\t\tconst EList<Edit>& edits,\n\t\tBTDnaString& ref,\n\t\tbool fw = true,\n\t\tsize_t trim5 = 0,\n\t\tsize_t trim3 = 0);\n\n\t/**\n\t * Given a string and its edits with respect to some other string,\n\t * print the alignment between the strings with the strings stacked\n\t * vertically, with vertical bars denoting matches.\n\t */\n\tstatic void printQAlign(\n\t\tstd::ostream& os,\n\t\tconst BTDnaString& read,\n\t\tconst EList<Edit>& edits);\n\n\t/**\n\t * Given a string and its edits with 
respect to some other string,\n\t * print the alignment between the strings with the strings stacked\n\t * vertically, with vertical bars denoting matches.  Add 'prefix'\n\t * before each line of output.\n\t */\n\tstatic void printQAlign(\n\t\tstd::ostream& os,\n\t\tconst char *prefix,\n\t\tconst BTDnaString& read,\n\t\tconst EList<Edit>& edits);\n\n\t/**\n\t * Given a string and its edits with respect to some other string,\n\t * print the alignment between the strings with the strings stacked\n\t * vertically, with vertical bars denoting matches.\n\t */\n\tstatic void printQAlignNoCheck(\n\t\tstd::ostream& os,\n\t\tconst BTDnaString& read,\n\t\tconst EList<Edit>& edits);\n\n\t/**\n\t * Given a string and its edits with respect to some other string,\n\t * print the alignment between the strings with the strings stacked\n\t * vertically, with vertical bars denoting matches.  Add 'prefix'\n\t * before each line of output.\n\t */\n\tstatic void printQAlignNoCheck(\n\t\tstd::ostream& os,\n\t\tconst char *prefix,\n\t\tconst BTDnaString& read,\n\t\tconst EList<Edit>& edits);\n\n#ifndef NDEBUG\n\tbool repOk() const;\n\n\t/**\n\t * Given a list of edits and a DNA string representing the query\n\t * sequence, check that the edits are consistent with respect to the\n\t * query.\n\t */\n\tstatic bool repOk(\n\t\tconst EList<Edit>& edits,\n\t\tconst BTDnaString& s,\n\t\tbool fw = true,\n\t\tsize_t trim5 = 0,\n\t\tsize_t trim3 = 0);\n#endif\n\n\tuint8_t  chr;  // reference character involved (for subst and ins)\n\tuint8_t  qchr; // read character involved (for subst and del)\n\tuint8_t  type; // EDIT_TYPE_*: 1 -> ins (read gap), 2 -> del (ref gap), 3 -> mm, 4 -> SNP, 5 -> splice\n\tuint32_t pos;  // position w/r/t search root\n\tuint32_t pos2; // Second int to take into account when sorting.  
Useful for\n\t               // sorting read gap edits that are all part of the same long\n\t\t\t\t   // gap.\n    \n    uint32_t splLen; // skip over the genome due to an intron\n    uint8_t  splDir;\n    bool     knownSpl;\n    \n    int64_t  donor_seq;\n    int64_t  acceptor_seq;\n\n\tfriend std::ostream& operator<< (std::ostream& os, const Edit& e);\n\n\t/**\n\t * Print a comma-separated list of Edits to given output stream.\n\t */\n\tstatic void print(\n\t\tstd::ostream& os,\n\t\tconst EList<Edit>& edits,\n\t\tchar delim = '\\t');\n\n\t/**\n\t * Merge second argument into the first.  Assume both are sorted to\n\t * begin with.\n\t */\n\tstatic void merge(EList<Edit>& dst, const EList<Edit>& src);\n};\n\n#endif /* EDIT_H_ */\n"
  },
  {
    "path": "endian_swap.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ENDIAN_SWAP_H\n#define ENDIAN_SWAP_H\n\n#include <stdint.h>\n#include <inttypes.h>\n\n/**\n * Return true iff the machine running this program is big-endian.\n */\nstatic inline bool currentlyBigEndian() {\n\tstatic uint8_t endianCheck[] = {1, 0, 0, 0};\n\treturn *((uint32_t*)endianCheck) != 1;\n}\n\n/**\n * Return copy of uint32_t argument with byte order reversed.\n */\nstatic inline uint16_t endianSwapU16(uint16_t u) {\n\tuint16_t tmp = 0;\n\ttmp |= ((u >> 8) & (0xff << 0));\n\ttmp |= ((u << 8) & (0xff << 8));\n\treturn tmp;\n}\n\n/**\n * Return copy of uint32_t argument with byte order reversed.\n */\nstatic inline uint32_t endianSwapU32(uint32_t u) {\n\tuint32_t tmp = 0;\n\ttmp |= ((u >> 24) & (0xff <<  0));\n\ttmp |= ((u >>  8) & (0xff <<  8));\n\ttmp |= ((u <<  8) & (0xff << 16));\n\ttmp |= ((u << 24) & (0xff << 24));\n\treturn tmp;\n}\n\n/**\n * Return copy of uint64_t argument with byte order reversed.\n */\nstatic inline uint64_t endianSwapU64(uint64_t u) {\n\tuint64_t tmp = 0;\n\ttmp |= ((u >> 56) & (0xffull <<  0));\n\ttmp |= ((u >> 40) & (0xffull <<  8));\n\ttmp |= ((u >> 24) & (0xffull << 16));\n\ttmp |= ((u >>  8) & (0xffull << 24));\n\ttmp |= ((u <<  8) & (0xffull << 32));\n\ttmp |= ((u 
<< 24) & (0xffull << 40));\n\ttmp |= ((u << 40) & (0xffull << 48));\n\ttmp |= ((u << 56) & (0xffull << 56));\n\treturn tmp;\n}\n\n/**\n * Return copy of uint_t argument with byte order reversed.\n */\ntemplate <typename index_t>\nstatic inline index_t endianSwapIndex(index_t u) {\n\tif(sizeof(index_t) == 8) {\n\t\treturn (index_t)endianSwapU64(u);\n\t} else if(sizeof(index_t) == 4) {\n\t\treturn endianSwapU32((uint32_t)u);\n\t} else {\n\t\treturn endianSwapU16(u);\n\t}\n}\n\n/**\n * Return copy of int16_t argument with byte order reversed.\n */\nstatic inline int16_t endianSwapI16(int16_t i) {\n\tint16_t tmp = 0;\n\ttmp |= ((i >> 8) & (0xff << 0));\n\ttmp |= ((i << 8) & (0xff << 8));\n\treturn tmp;\n}\n\n/**\n * Convert uint16_t argument to the specified endianness.  It's assumed\n * that u currently has the endianness of the current machine.\n */\nstatic inline uint16_t endianizeU16(uint16_t u, bool toBig) {\n\tif(toBig == currentlyBigEndian()) {\n\t\treturn u;\n\t}\n\treturn endianSwapU16(u);\n}\n\n/**\n * Convert int16_t argument to the specified endianness.  It's assumed\n * that u currently has the endianness of the current machine.\n */\nstatic inline int16_t endianizeI16(int16_t i, bool toBig) {\n\tif(toBig == currentlyBigEndian()) {\n\t\treturn i;\n\t}\n\treturn endianSwapI16(i);\n}\n\n/**\n * Return copy of int32_t argument with byte order reversed.\n */\nstatic inline int32_t endianSwapI32(int32_t i) {\n\tint32_t tmp = 0;\n\ttmp |= ((i >> 24) & (0xff <<  0));\n\ttmp |= ((i >>  8) & (0xff <<  8));\n\ttmp |= ((i <<  8) & (0xff << 16));\n\ttmp |= ((i << 24) & (0xff << 24));\n\treturn tmp;\n}\n\n/**\n * Convert uint32_t argument to the specified endianness.  It's assumed\n * that u currently has the endianness of the current machine.\n */\nstatic inline uint32_t endianizeU32(uint32_t u, bool toBig) {\n\tif(toBig == currentlyBigEndian()) {\n\t\treturn u;\n\t}\n\treturn endianSwapU32(u);\n}\n\n/**\n * Convert int32_t argument to the specified endianness.  
It's assumed\n * that i currently has the endianness of the current machine.\n */\nstatic inline int32_t endianizeI32(int32_t i, bool toBig) {\n\tif(toBig == currentlyBigEndian()) {\n\t\treturn i;\n\t}\n\treturn endianSwapI32(i);\n}\n\n/**\n * Convert index_t argument to the specified endianness.  It's assumed\n * that u currently has the endianness of the current machine.\n */\ntemplate <typename index_t>\nindex_t endianizeIndex(index_t u, bool toBig) {\n\tif(toBig == currentlyBigEndian()) {\n\t\treturn u;\n\t}\n\treturn endianSwapIndex(u);\n}\n\n#endif\n"
  },
  {
    "path": "evaluation/centrifuge_evaluate.py",
    "content": "#!/usr/bin/env python\n\nimport sys, os, subprocess, inspect\nimport platform, multiprocessing\nimport string, re\nfrom datetime import datetime, date, time\nimport copy\nfrom argparse import ArgumentParser, FileType\n\n\n\"\"\"\n\"\"\"\ndef read_taxonomy_tree(tax_file):\n    taxonomy_tree = {}\n    for line in tax_file:\n        fields = line.strip().split('\\t')\n        assert len(fields) == 5\n        tax_id, parent_tax_id, rank = fields[0], fields[2], fields[4]\n        assert tax_id not in taxonomy_tree\n        taxonomy_tree[tax_id] = [parent_tax_id, rank]        \n    return taxonomy_tree\n\n\n\"\"\"\n\"\"\"\ndef compare_scm(centrifuge_out, true_out, taxonomy_tree, rank):\n    ancestors = set()\n    for tax_id in taxonomy_tree.keys():\n        if tax_id in ancestors:\n            continue\n        while True:\n            parent_tax_id, cur_rank = taxonomy_tree[tax_id]\n            if parent_tax_id in ancestors:\n                break\n            if tax_id == parent_tax_id:\n                break\n            tax_id = parent_tax_id\n            ancestors.add(tax_id)\n\n    db_dic = {}\n    first = True\n    for line in open(centrifuge_out):\n        if first:\n            first = False\n            continue\n        read_name, seq_id, tax_id, score, _, _, _, _ = line.strip().split('\\t')\n        # Traverse up taxonomy tree to match the given rank parameter\n        rank_tax_id = tax_id\n        if rank != \"strain\":\n            while True:\n                if tax_id not in taxonomy_tree:\n                    rank_tax_id = \"\"\n                    break\n                parent_tax_id, cur_rank = taxonomy_tree[tax_id]\n                if cur_rank == rank:\n                    rank_tax_id = tax_id\n                    break\n                if tax_id == parent_tax_id:\n                    rank_tax_id = \"\"\n                    break\n                tax_id = parent_tax_id\n        else:\n            assert rank == \"strain\"\n            
if tax_id in ancestors:\n                continue\n\n        if rank_tax_id == \"\":\n            continue            \n        if read_name not in db_dic:\n            db_dic[read_name] = set()\n        db_dic[read_name].add(rank_tax_id)\n\n    classified, unclassified, unique_classified = 0, 0, 0\n    for line in open(true_out):\n        if line.startswith('@'):\n            continue\n        \n        read_name, tax_id = line.strip().split('\\t')[:2]\n        # Traverse up taxonomy tree to match the given rank parameter\n        rank_tax_id = tax_id\n        if rank != \"strain\":\n            while True:\n                if tax_id not in taxonomy_tree:\n                    rank_tax_id = \"\"\n                    break\n                parent_tax_id, cur_rank = taxonomy_tree[tax_id]\n                if cur_rank == rank:\n                    rank_tax_id = tax_id\n                    break\n                if tax_id == parent_tax_id:\n                    rank_tax_id = \"\"\n                    break\n                tax_id = parent_tax_id\n        if rank_tax_id == \"\":\n            continue\n        if read_name not in db_dic:\n            unclassified += 1\n            continue\n\n        maps = db_dic[read_name]\n        if rank_tax_id in maps:\n            classified += 1\n            if len(maps) == 1:\n                unique_classified += 1\n        else:\n            unclassified += 1\n\n    raw_unique_classified = 0\n    for value in db_dic.values():\n        if len(value) == 1:\n            raw_unique_classified += 1\n    return classified, unique_classified, unclassified, len(db_dic), raw_unique_classified\n\n\n\"\"\"\n\"\"\"\ndef compare_abundance(centrifuge_out, true_out, taxonomy_tree, debug):\n    db_dic = {}\n    first = True\n    for line in open(centrifuge_out):\n        if first:\n            first = False\n            continue\n        genome_name, tax_id, tax_rank, genome_len, num_reads, num_unique_reads, abundance = 
line.strip().split('\\t')\n        db_dic[tax_id] = float(abundance)\n\n    SSR = 0.0 # Sum of squared residuals\n    first = True\n    for line in open(true_out):\n        # Skip the header line\n        if first:\n            first = False\n            continue\n        \n        tax_id, genome_len, num_reads, abundance, genome_name = line.strip().split('\\t')\n\n        # daehwan - for debugging purposes\n        \"\"\"\n        cur_tax_id = tax_id\n        while True:\n            if cur_tax_id not in taxonomy_tree:\n                break\n            parent_tax_id, rank = taxonomy_tree[cur_tax_id]\n            print \"%s: %s\" % (cur_tax_id, rank)\n            if cur_tax_id == parent_tax_id:\n                break\n            cur_tax_id = parent_tax_id\n        print\n        print\n        \"\"\"\n        \n        abundance = float(abundance)\n        if tax_id in db_dic:\n            SSR += (abundance - db_dic[tax_id]) ** 2\n            if debug:\n                print >> sys.stderr, \"\\t\\t\\t\\t{:<10}: {:.6} vs. {:.6} (truth vs. 
centrifuge)\".format(tax_id, abundance, db_dic[tax_id])\n        else:\n            SSR += (abundance) ** 2\n\n    return SSR\n\n\n\"\"\"\ne.g.\n     sqlite3 analysis.db --header --separator $'\\t' \"select * from Classification;\"\n\"\"\"\ndef sql_execute(sql_db, sql_query):\n    sql_cmd = [\n        \"sqlite3\", sql_db,\n        \"-separator\", \"\\t\",\n        \"%s;\" % sql_query\n        ]\n    # print >> sys.stderr, sql_cmd\n    sql_process = subprocess.Popen(sql_cmd, stdout=subprocess.PIPE)\n    output = sql_process.communicate()[0][:-1]\n    return output\n\n\n\"\"\"\n\"\"\"\ndef create_sql_db(sql_db):\n    if os.path.exists(sql_db):\n        print >> sys.stderr, sql_db, \"already exists!\"\n        return\n    \n    columns = [\n        [\"id\", \"integer primary key autoincrement\"],\n        [\"centrifutgeIndex\", \"text\"],\n        [\"readBase\", \"text\"],\n        [\"readType\", \"text\"],\n        [\"program\", \"text\"],\n        [\"version\", \"text\"],\n        [\"numFragments\", \"integer\"],\n        [\"strain_classified\", \"integer\"],\n        [\"strain_uniqueclassified\", \"integer\"],\n        [\"strain_unclassified\", \"integer\"],\n        [\"species_classified\", \"integer\"],\n        [\"species_uniqueclassified\", \"integer\"],\n        [\"species_unclassified\", \"integer\"],\n        [\"genus_classified\", \"integer\"],\n        [\"genus_uniqueclassified\", \"integer\"],\n        [\"genus_unclassified\", \"integer\"],\n        [\"family_classified\", \"integer\"],\n        [\"family_uniqueclassified\", \"integer\"],\n        [\"family_unclassified\", \"integer\"],\n        [\"order_classified\", \"integer\"],\n        [\"order_uniqueclassified\", \"integer\"],\n        [\"order_unclassified\", \"integer\"],\n        [\"class_classified\", \"integer\"],\n        [\"class_uniqueclassified\", \"integer\"],\n        [\"class_unclassified\", \"integer\"],\n        [\"phylum_classified\", \"integer\"],\n        
[\"phylum_uniqueclassified\", \"integer\"],\n        [\"phylum_unclassified\", \"integer\"],\n        [\"time\", \"real\"],\n        [\"host\", \"text\"],\n        [\"created\", \"text\"],\n        [\"cmd\", \"text\"]\n        ]\n    \n    sql_create_table = \"CREATE TABLE Classification (\"\n    for i in range(len(columns)):\n        name, type = columns[i]\n        if i != 0:\n            sql_create_table += \", \"\n        sql_create_table += (\"%s %s\" % (name, type))\n    sql_create_table += \");\"\n    sql_execute(sql_db, sql_create_table)\n\n\n\"\"\"\n\"\"\"\ndef write_analysis_data(sql_db, genome_name, database_name):\n    if not os.path.exists(sql_db):\n        return\n\n    \"\"\"\n    programs = []\n    sql_aligners = \"SELECT aligner FROM ReadCosts GROUP BY aligner\"\n    output = sql_execute(sql_db, sql_aligners)\n    aligners = output.split()\n\n    can_read_types = [\"all\", \"M\", \"2M_gt_15\", \"2M_8_15\", \"2M_1_7\", \"gt_2M\"]    \n    tmp_read_types = []\n    sql_types = \"SELECT type FROM ReadCosts GROUP BY type\"\n    output = sql_execute(sql_db, sql_types)\n    tmp_read_types = output.split()\n\n    read_types = []\n    for read_type in can_read_types:\n        if read_type in tmp_read_types:\n            read_types.append(read_type)\n\n    for paired in [False, True]:\n        database_fname = genome_name + \"_\" + database_name\n        if paired:\n            end_type = \"paired\"\n            database_fname += \"_paired\"\n        else:\n            end_type = \"single\"\n            database_fname += \"_single\"\n        database_fname += \".analysis\"\n        database_file = open(database_fname, \"w\")\n        print >> database_file, \"end_type\\ttype\\taligner\\tnum_reads\\ttime\\tmapped_reads\\tunique_mapped_reads\\tunmapped_reads\\tmapping_point\\ttrue_gtf_junctions\\ttemp_junctions\\ttemp_gtf_junctions\"\n        for aligner in aligners:\n            for read_type in read_types:\n                sql_row = \"SELECT end_type, type, 
aligner, num_reads, time, mapped_reads, unique_mapped_reads, unmapped_reads, mapping_point, true_gtf_junctions, temp_junctions, temp_gtf_junctions FROM ReadCosts\"\n                sql_row += \" WHERE genome = '%s' and head = '%s' and aligner = '%s' and type = '%s' and end_type = '%s' ORDER BY created DESC LIMIT 1\" % (genome_name, database_name, aligner, read_type, end_type)\n                output = sql_execute(sql_db, sql_row)\n                if output:\n                    print >> database_file, output\n\n        database_file.close()\n    \"\"\"\n\n\n\"\"\"\n\"\"\"\ndef evaluate(index_base,\n             index_base_for_read,\n             num_fragment,\n             paired,\n             error_rate,\n             ranks,\n             programs,\n             runtime_only,\n             sql,\n             verbose,\n             debug):\n    # Current script directory\n    curr_script = os.path.realpath(inspect.getsourcefile(evaluate))\n    path_base = os.path.dirname(curr_script)\n\n    sql_db_name = \"analysis.db\"\n    if not os.path.exists(sql_db_name):\n        create_sql_db(sql_db_name)\n\n    num_cpus = multiprocessing.cpu_count()\n    if num_cpus > 8:\n        num_threads = min(8, num_cpus)\n        desktop = False\n    else:\n        num_threads = min(3, num_cpus)\n        desktop = True\n\n    def check_files(fnames):\n        for fname in fnames:\n            if not os.path.exists(fname):\n                return False\n        return True\n\n    # Check if indexes exists, otherwise create indexes\n    index_path = \"%s/indexes/Centrifuge\" % path_base\n    if not os.path.exists(path_base + \"/indexes\"):\n        os.mkdir(path_base + \"/indexes\")\n    if not os.path.exists(index_path):\n        os.mkdir(index_path)\n    index_fnames = [\"%s/%s.%d.cf\" % (index_path, index_base, i+1) for i in range(3)]\n    if not check_files(index_fnames):\n        print >> sys.stderr, \"Downloading indexes: %s\" % (\"index\")\n        os.system(\"cd %s; wget 
ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/%s.tar.gz; tar xvzf %s.tar.gz; rm %s.tar.gz; ln -s %s/%s* .; cd -\" % \\\n                      (index_path, index_base, index_base, index_base, index_base, index_base))\n        assert check_files(index_fnames)        \n\n    # Read taxonomic IDs\n    centrifuge_inspect = os.path.join(path_base, \"../centrifuge-inspect\")\n    tax_ids = set()\n    tax_cmd = [centrifuge_inspect,\n               \"--conversion-table\",\n               \"%s/%s\" % (index_path, index_base_for_read)]\n    tax_proc = subprocess.Popen(tax_cmd, stdout=subprocess.PIPE)\n    for line in tax_proc.stdout:\n        _, tax_id = line.strip().split()\n        tax_ids.add(tax_id)\n    tax_ids = list(tax_ids)\n\n    # Read taxonomy tree\n    tax_tree_cmd = [centrifuge_inspect,\n                    \"--taxonomy-tree\",\n                    \"%s/%s\" % (index_path, index_base_for_read)]    \n    tax_tree_proc = subprocess.Popen(tax_tree_cmd, stdout=subprocess.PIPE, stderr=open(\"/dev/null\", 'w'))\n    taxonomy_tree = read_taxonomy_tree(tax_tree_proc.stdout)\n\n    compressed = (index_base.find(\"compressed\") != -1) or (index_base_for_read.find(\"compressed\") != -1)\n\n    # Check if simulated reads exist, otherwise simulate reads\n    read_path = \"%s/reads\" % path_base\n    if not os.path.exists(read_path):\n        os.mkdir(read_path)\n    read_base = \"%s_%dM\" % (index_base_for_read, num_fragment / 1000000)\n    if error_rate > 0.0:\n        read_base += \"%.2fe\" % error_rate\n\n    read1_fname = \"%s/%s_1.fa\" % (read_path, read_base)\n    read2_fname = \"%s/%s_2.fa\" % (read_path, read_base)\n    truth_fname = \"%s/%s.truth\" % (read_path, read_base)\n    scm_fname = \"%s/%s.scm\" % (read_path, read_base)\n    read_fnames = [read1_fname, read2_fname, truth_fname, scm_fname]\n    if not check_files(read_fnames):\n        print >> sys.stderr, \"Simulating reads %s_1.fa %s_2.fa ...\" % (read_base, read_base)\n        centrifuge_simulate = 
os.path.join(path_base, \"centrifuge_simulate_reads.py\")\n        simulate_cmd = [centrifuge_simulate,\n                        \"--num-fragment\", str(num_fragment)]\n        if error_rate > 0.0:\n            simulate_cmd += [\"--error-rate\", str(error_rate)]\n        simulate_cmd += [\"%s/%s\" % (index_path, index_base_for_read),\n                         \"%s/%s\" % (read_path, read_base)]\n        \n        simulate_proc = subprocess.Popen(simulate_cmd, stdout=open(\"/dev/null\", 'w'))\n        simulate_proc.communicate()\n        assert check_files(read_fnames)\n\n    if runtime_only:\n        verbose = True\n\n    if paired:\n        base_fname = read_base + \"_paired\"\n    else:\n        base_fname = read_base + \"_single\"\n\n    print >> sys.stderr, \"Database: %s\" % (index_base)\n    if paired:\n        print >> sys.stderr, \"\\t%d million pairs\" % (num_fragment / 1000000)\n    else:\n        print >> sys.stderr, \"\\t%d million reads\" % (num_fragment / 1000000)\n\n    program_bin_base = \"%s/..\" % path_base\n    def get_program_version(program, version):\n        version = \"\"\n        if program == \"centrifuge\":\n            if version:\n                cmd = [\"%s/%s_%s/%s\" % (program_bin_base, program, version, program)]\n            else:\n                cmd = [\"%s/%s\" % (program_bin_base, program)]\n            cmd += [\"--version\"]                    \n            cmd_process = subprocess.Popen(cmd, stdout=subprocess.PIPE)\n            version = cmd_process.communicate()[0][:-1].split(\"\\n\")[0]\n            version = version.split()[-1]\n        else:\n            assert False\n\n        return version\n\n    def get_program_cmd(program, version, read1_fname, read2_fname, out_fname):\n        cmd = []\n        if program == \"centrifuge\":\n            if version:\n                cmd = [\"%s/centrifuge_%s/centrifuge\" % (program_bin_base, version)]\n            else:\n                cmd = [\"%s/centrifuge\" % 
(program_bin_base)]\n            cmd += [\"-f\",\n                    \"-p\", str(num_threads),\n                    \"%s/%s\" % (index_path, index_base)]\n            # cmd += [\"-k\", \"5\"]\n            # cmd += [\"--no-traverse\"]\n            if paired:\n                cmd += [\"-1\", read1_fname,\n                        \"-2\", read2_fname]\n            else:\n                cmd += [\"-U\", read1_fname]                        \n        else:\n            assert False\n\n        return cmd\n\n    init_time = {\"centrifuge\" : 0.0}\n    for program, version in programs:\n        program_name = program\n        if version:\n            program_name += (\"_%s\" % version)\n\n        print >> sys.stderr, \"\\t%s\\t%s\" % (program_name, str(datetime.now()))\n        if paired:\n            program_dir = program_name + \"_paired\"\n        else:\n            program_dir = program_name + \"_single\"\n            \n        if not os.path.exists(program_dir):\n            os.mkdir(program_dir)\n        os.chdir(program_dir)\n\n        out_fname = \"centrifuge.output\"\n        if runtime_only:\n            out_fname = \"/dev/null\"\n\n        if os.path.exists(out_fname):\n            continue\n\n        # Classify all reads\n        program_cmd = get_program_cmd(program, version, read1_fname, read2_fname, out_fname)\n        start_time = datetime.now()\n        if verbose:\n            print >> sys.stderr, \"\\t\", start_time, \" \".join(program_cmd)\n        if program in [\"centrifuge\"]:\n            proc = subprocess.Popen(program_cmd, stdout=open(out_fname, \"w\"), stderr=subprocess.PIPE)\n        else:\n            proc = subprocess.Popen(program_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n        proc.communicate()\n        finish_time = datetime.now()\n        duration = finish_time - start_time\n        assert program in init_time\n        duration = duration.total_seconds() - init_time[program]\n        if duration < 0.1:\n            duration = 
0.1\n        if verbose:\n            print >> sys.stderr, \"\\t\", finish_time, \"finished:\", duration            \n\n        results = {\"strain\"  : [0, 0, 0],\n                   \"species\" : [0, 0, 0],\n                   \"genus\"   : [0, 0, 0],\n                   \"family\"  : [0, 0, 0],\n                   \"order\"   : [0, 0, 0],\n                   \"class\"   : [0, 0, 0],\n                   \"phylum\"  : [0, 0, 0]}\n        for rank in ranks:\n            if runtime_only:\n                break\n            if compressed and rank == \"strain\":\n                continue\n\n            classified, unique_classified, unclassified, raw_classified, raw_unique_classified = \\\n                compare_scm(out_fname, scm_fname, taxonomy_tree, rank)\n            results[rank] = [classified, unique_classified, unclassified]\n            num_cases = classified + unclassified\n            # if rank == \"strain\":\n            #    assert num_cases == num_fragment\n\n            print >> sys.stderr, \"\\t\\t%s\" % rank\n            print >> sys.stderr, \"\\t\\t\\tsensitivity: {:,} / {:,} ({:.2%})\".format(classified, num_cases, float(classified) / num_cases)\n            print >> sys.stderr, \"\\t\\t\\tprecision  : {:,} / {:,} ({:.2%})\".format(classified, raw_classified, float(classified) / raw_classified)\n            print >> sys.stderr, \"\\n\\t\\t\\tfor uniquely classified \",\n            if paired:\n                print >> sys.stderr, \"pairs\"\n            else:\n                print >> sys.stderr, \"reads\"\n            print >> sys.stderr, \"\\t\\t\\t\\t\\tsensitivity: {:,} / {:,} ({:.2%})\".format(unique_classified, num_cases, float(unique_classified) / num_cases)\n            print >> sys.stderr, \"\\t\\t\\t\\t\\tprecision  : {:,} / {:,} ({:.2%})\".format(unique_classified, raw_unique_classified, float(unique_classified) / raw_unique_classified)\n\n            # Calculate sum of squared residuals in abundance\n            if rank == \"strain\":\n   
             abundance_SSR = compare_abundance(\"centrifuge_report.tsv\", truth_fname, taxonomy_tree, debug)\n                print >> sys.stderr, \"\\t\\t\\tsum of squared residuals in abundance: {}\".format(abundance_SSR)\n\n        if runtime_only:\n            os.chdir(\"..\")\n            continue\n\n        if sql and os.path.exists(\"../\" + sql_db_name):\n            if paired:\n                end_type = \"paired\"\n            else:\n                end_type = \"single\"\n            sql_insert = \"INSERT INTO \\\"Classification\\\" VALUES(NULL, '%s', '%s', '%s', '%s', '%s', %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %f, '%s', datetime('now', 'localtime'), '%s');\" % \\\n                (index_base, read_base, end_type, program_name, get_program_version(program, version), num_fragment, \\\n                     results[\"strain\"][0],  results[\"strain\"][1],  results[\"strain\"][2], \\\n                     results[\"species\"][0], results[\"species\"][1], results[\"species\"][2], \\\n                     results[\"genus\"][0],   results[\"genus\"][1],   results[\"genus\"][2], \\\n                     results[\"family\"][0],  results[\"family\"][1],  results[\"family\"][2], \\\n                     results[\"order\"][0],   results[\"order\"][1],   results[\"order\"][2], \\\n                     results[\"class\"][0],   results[\"class\"][1],   results[\"class\"][2], \\\n                     results[\"phylum\"][0],  results[\"phylum\"][1],  results[\"phylum\"][2], \\\n                     duration, platform.node(), \" \".join(program_cmd))\n            sql_execute(\"../\" + sql_db_name, sql_insert)     \n\n \n        os.system(\"touch done\")\n        os.chdir(\"..\")\n\n        \"\"\"\n        if os.path.exists(sql_db_name):\n            write_analysis_data(sql_db_name, genome, data_base)\n        \"\"\"\n        \n\nif __name__ == \"__main__\":\n    parser = ArgumentParser(\n        description='Centrifuge 
evaluation')\n    parser.add_argument(\"index_base\",\n                        nargs='?',\n                        type=str,\n                        help='Centrifuge index')\n    parser.add_argument(\"--index-base-for-read\",\n                        dest=\"index_base_for_read\",\n                        type=str,\n                        default=\"\",\n                        help='index base for read (default same as index base)')    \n    parser.add_argument(\"--num-fragment\",\n                        dest=\"num_fragment\",\n                        action='store',\n                        type=int,\n                        default=1,\n                        help='Number of fragments in millions (default: 1)')\n    parser.add_argument(\"--paired\",\n                        dest='paired',\n                        action='store_true',\n                        help='Paired-end reads')\n    parser.add_argument(\"--error-rate\",\n                        dest='error_rate',\n                        action='store',\n                        type=float,\n                        default=0.0,\n                        help='per-base sequencing error rate (%%) (default: 0.0)')\n    rank_list_default = \"strain,species,genus,family,order,class,phylum\"\n    parser.add_argument(\"--rank-list\",\n                        dest=\"ranks\",\n                        type=str,\n                        default=rank_list_default,\n                        help=\"A comma-separated list of ranks (default: %s)\" % rank_list_default)\n    parser.add_argument(\"--program-list\",\n                        dest=\"programs\",\n                        type=str,\n                        default=\"centrifuge\",\n                        help=\"A comma-separated list of aligners (default: centrifuge)\")\n    parser.add_argument(\"--runtime-only\",\n                        dest='runtime_only',\n                        action='store_true',\n                        help='Just check runtime without 
evaluation')    \n    parser.add_argument(\"--no-sql\",\n                        dest='sql',\n                        action='store_false',\n                        help='Do not write results into a sqlite database')\n    parser.add_argument(\"-v\", \"--verbose\",\n                        dest='verbose',\n                        action='store_true',\n                        help='also print some statistics to stderr')\n    parser.add_argument(\"--debug\",\n                        dest='debug',\n                        action='store_true',\n                        help='Debug')\n\n    args = parser.parse_args()\n    if not args.index_base:\n        parser.print_help()\n        sys.exit(1)\n    if args.index_base_for_read == \"\":\n        args.index_base_for_read = args.index_base\n    ranks = args.ranks.split(',')\n    programs = []\n    for program in args.programs.split(','):\n        # Split name_version into [name, version] at the first underscore only,\n        # so that every entry unpacks into exactly two values\n        if '_' in program:\n            programs.append(program.split('_', 1))\n        else:\n            programs.append([program, \"\"])\n            \n    evaluate(args.index_base,\n             args.index_base_for_read,\n             args.num_fragment * 1000000,\n             args.paired,\n             args.error_rate,\n             ranks,\n             programs,\n             args.runtime_only,\n             args.sql,\n             args.verbose,\n             args.debug)\n"
  },
  {
    "path": "evaluation/centrifuge_simulate_reads.py",
    "content": "#!/usr/bin/env python\n\n#\n# Copyright 2015, Daehwan Kim <infphilo@gmail.com>\n#\n# This file is part of HISAT 2.\n#\n# HISAT 2 is free software: you can redistribute it and/or modify\n# it under the terms of the GNU General Public License as published by\n# the Free Software Foundation, either version 3 of the License, or\n# (at your option) any later version.\n#\n# HISAT 2 is distributed in the hope that it will be useful,\n# but WITHOUT ANY WARRANTY; without even the implied warranty of\n# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n# GNU General Public License for more details.\n#\n# You should have received a copy of the GNU General Public License\n# along with HISAT 2.  If not, see <http://www.gnu.org/licenses/>.\n#\n\nimport sys, os, subprocess, inspect\nimport math, random, re\nfrom collections import defaultdict, Counter\nfrom argparse import ArgumentParser, FileType\n\n\n\"\"\"\n\"\"\"\ndef reverse_complement(seq):\n    result = \"\"\n    for nt in seq:\n        base = nt\n        if nt == 'A':\n            base = 'T'\n        elif nt == 'a':\n            base = 't'\n        elif nt == 'C':\n            base = 'G'\n        elif nt == 'c':\n            base = 'g'\n        elif nt == 'G':\n            base = 'C'\n        elif nt == 'g':\n            base = 'c'\n        elif nt == 'T':\n            base = 'A'\n        elif nt == 't':\n            base = 'a'\n        \n        result = base + result\n    \n    return result\n\n\n\"\"\"\n\"\"\"\ndef get_genome_seq_id(genome_name):\n    genome_seq_id = genome_name.split()[0]\n    if len(genome_seq_id.split('|')) >= 2:\n        genome_seq_id = '|'.join(genome_seq_id.split('|')[:2])\n    return genome_seq_id\n    \n\n\"\"\"\nRandom source for sequencing errors\n\"\"\"\nclass ErrRandomSource:\n    def __init__(self, prob = 0.0, size = 1 << 20):\n        self.size = size\n        self.rands = []\n        for i in range(self.size):\n            if random.random() < prob:\n         
       self.rands.append(1)\n            else:\n                self.rands.append(0)\n        self.cur = 0\n        \n    def getRand(self):\n        assert self.cur < len(self.rands)\n        rand = self.rands[self.cur]\n        self.cur = (self.cur + 1) % len(self.rands)\n        return rand\n\n\n\"\"\"\nRead a FASTA file of genomes and concatenate sequences by taxonomy ID\n\"\"\"\ndef read_genomes(genomes_file, seq2taxID):\n    genome_dic = {}    \n    tax_id, sequence = \"\", \"\"\n    for line in genomes_file:\n        if line[0] == \">\":\n            if tax_id and sequence:\n                # genome_dic is keyed by taxonomy ID, not by sequence ID\n                if tax_id in genome_dic:\n                    genome_dic[tax_id] += sequence\n                else:\n                    genome_dic[tax_id] = sequence\n            \n            genome_name = line[1:-1]\n            genome_seq_id = get_genome_seq_id(genome_name)\n            assert genome_seq_id in seq2taxID\n            tax_id = seq2taxID[genome_seq_id]\n            sequence = \"\"\n        else:\n            sequence += line[:-1]\n\n    if tax_id and sequence:\n        if tax_id in genome_dic:\n            genome_dic[tax_id] += sequence\n        else:\n            genome_dic[tax_id] = sequence\n    \n    return genome_dic\n\n\n\"\"\"\nRead exons from a GTF file and group them into transcripts and genes\n\"\"\"\ndef read_transcript(genomes_seq, gtf_file, frag_len):\n    genes = defaultdict(list)\n    transcripts = {}\n\n    # Parse valid exon lines from the GTF file into a dict by transcript_id\n    for line in gtf_file:\n        line = line.strip()\n        if not line or line.startswith('#'):\n            continue\n        if '#' in line:\n            line = line.split('#')[0].strip()\n        try:\n            chrom, source, feature, left, right, score, \\\n                strand, frame, values = line.split('\\t')\n        except ValueError:\n            continue\n        if not chrom in genomes_seq:\n            continue\n        \n        # Zero-based offset\n        left, right = int(left) - 1, int(right) - 1\n        if feature != 'exon' or left >= right:\n            continue\n\n        
values_dict = {}\n        for attr in values.split(';')[:-1]:\n            attr, _, val = attr.strip().partition(' ')\n            values_dict[attr] = val.strip('\"')\n\n        if 'gene_id' not in values_dict or \\\n                'transcript_id' not in values_dict:\n            continue\n\n        transcript_id = values_dict['transcript_id']\n        if transcript_id not in transcripts:\n            transcripts[transcript_id] = [chrom, strand, [[left, right]]]\n            genes[values_dict['gene_id']].append(transcript_id)\n        else:\n            transcripts[transcript_id][2].append([left, right])\n\n    # Sort exons and merge where separating introns are <=5 bps\n    for tran, [chr, strand, exons] in transcripts.items():\n            exons.sort()\n            tmp_exons = [exons[0]]\n            for i in range(1, len(exons)):\n                if exons[i][0] - tmp_exons[-1][1] <= 5:\n                    tmp_exons[-1][1] = exons[i][1]\n                else:\n                    tmp_exons.append(exons[i])\n            transcripts[tran] = [chr, strand, tmp_exons]\n\n    tmp_transcripts = {}\n    for tran, [chr, strand, exons] in transcripts.items():\n        exon_lens = [e[1] - e[0] + 1 for e in exons]\n        transcript_len = sum(exon_lens)\n        if transcript_len >= frag_len:\n            tmp_transcripts[tran] = [chr, strand, transcript_len, exons]\n\n    transcripts = tmp_transcripts\n\n    return genes, transcripts\n    \n\n\"\"\"\n\"\"\"\ndef generate_rna_expr_profile(expr_profile_type, num_transcripts = 10000):\n    # Modelling and simulating generic RNA-Seq experiments with the flux simulator\n    # http://nar.oxfordjournals.org/content/suppl/2012/06/29/gks666.DC1/nar-02667-n-2011-File002.pdf\n    def calc_expr(x, a):\n        x, a, b = float(x), 9500.0, 9500.0\n        k = -0.6\n        return (x**k) * math.exp(x/a * (x/b)**2)\n    \n    expr_profile = [0.0] * num_transcripts\n    for i in range(len(expr_profile)):\n        if expr_profile_type == 
\"flux\":\n            expr_profile[i] = calc_expr(i + 1, num_transcripts)\n        elif expr_profile_type == \"constant\":\n            expr_profile[i] = 1.0\n        else:\n            assert False\n\n    expr_sum = sum(expr_profile)\n    expr_profile = [expr_profile[i] / expr_sum for i in range(len(expr_profile))]\n    assert abs(sum(expr_profile) - 1.0) < 0.001\n    return expr_profile\n\n\n\"\"\"\n\"\"\"\ndef generate_dna_expr_profile(expr_profile_type, num_genomes):\n    # Modelling and simulating generic RNA-Seq experiments with the flux simulator\n    # http://nar.oxfordjournals.org/content/suppl/2012/06/29/gks666.DC1/nar-02667-n-2011-File002.pdf\n    def calc_expr(x, a):\n        x, a, b = float(x), 9500.0, 9500.0\n        k = -0.6\n        return (x**k) * math.exp(x/a * (x/b)**2)\n    \n    expr_profile = [0.0] * num_genomes\n    for i in range(len(expr_profile)):\n        if expr_profile_type == \"flux\":\n            expr_profile[i] = calc_expr(i + 1, num_genomes)\n        elif expr_profile_type == \"constant\":\n            expr_profile[i] = 1.0\n        else:\n            assert False\n\n    expr_sum = sum(expr_profile)\n    expr_profile = [expr_profile[i] / expr_sum for i in range(len(expr_profile))]\n    assert abs(sum(expr_profile) - 1.0) < 0.001\n    return expr_profile\n\n\n\"\"\"\n\"\"\"\ndef getSamAlignment(dna, exons, genome_seq, trans_seq, frag_pos, read_len, err_rand_src, max_mismatch):\n    # Find the genomic position for frag_pos and exon number\n    tmp_frag_pos, tmp_read_len = frag_pos, read_len\n    pos, cigars, cigar_descs = exons[0][0], [], []\n    e_pos = 0\n    prev_e = None\n    for e_i in range(len(exons)):\n        e = exons[e_i]\n        if prev_e:\n            i_len = e[0] - prev_e[1] - 1\n            pos += i_len\n        e_len = e[1] - e[0] + 1\n        if e_len <= tmp_frag_pos:\n            tmp_frag_pos -= e_len\n            pos += e_len\n        else:\n            pos += tmp_frag_pos\n            e_pos = tmp_frag_pos\n      
      break                        \n        prev_e = e\n\n    # Define Cigar and its descriptions\n    assert e_i < len(exons)\n    e_len = exons[e_i][1] - exons[e_i][0] + 1\n    assert e_pos < e_len\n    cur_pos = pos\n    match_len = 0\n    prev_e = None\n    mismatch, remain_trans_len = 0, len(trans_seq) - (frag_pos + read_len)\n    assert remain_trans_len >= 0\n    for e_i in range(e_i, len(exons)):\n        e = exons[e_i]\n        if prev_e:\n            i_len = e[0] - prev_e[1] - 1\n            cur_pos += i_len\n            cigars.append((\"{}N\".format(i_len)))\n            cigar_descs.append([])\n        tmp_e_left = e_left = e[0] + e_pos\n        e_pos = 0\n\n        # Simulate mismatches due to sequencing errors\n        mms = []\n        for i in range(e_left, min(e[1], e_left + tmp_read_len - 1)):\n            if err_rand_src.getRand() == 1:\n                assert i < len(genome_seq)\n                err_base = \"A\"\n                rand = random.randint(0, 2)\n                if genome_seq[i] == \"A\":\n                    err_base = \"GCT\"[rand]\n                elif genome_seq[i] == \"C\":\n                    err_base = \"AGT\"[rand]\n                elif genome_seq[i] == \"G\":\n                    err_base = \"ACT\"[rand]\n                else:\n                    err_base = \"ACG\"[rand]                    \n                mms.append([\"\", \"single\", i, err_base])\n\n        tmp_diffs = mms\n        def diff_sort(a , b):\n            return a[2] - b[2]\n\n        tmp_diffs = sorted(tmp_diffs, cmp=diff_sort)\n        diffs = []\n        if len(tmp_diffs) > 0:\n            diffs = tmp_diffs[:1]\n            for diff in tmp_diffs[1:]:\n                _, tmp_type, tmp_pos, tmp_data = diff\n                _, prev_type, prev_pos, prev_data = diffs[-1]\n                if prev_type == \"deletion\":\n                    prev_pos += prev_data\n                if tmp_pos <= prev_pos:\n                    continue\n                
diffs.append(diff)\n\n        cigar_descs.append([])\n        prev_diff = None\n        for diff in diffs:\n            diff_id, diff_type, diff_pos, diff_data = diff\n            if prev_diff:\n                prev_diff_id, prev_diff_type, prev_diff_pos, prev_diff_data = prev_diff\n                if prev_diff_type == \"deletion\":\n                    prev_diff_pos += prev_diff_data\n                assert prev_diff_pos < diff_pos\n            diff_pos2 = diff_pos\n            if diff_type == \"deletion\":\n                diff_pos2 += diff_data\n            if e_left + tmp_read_len - 1 < diff_pos2 or e[1] < diff_pos2:\n                break            \n            if diff_type == \"single\":\n                if diff_id == \"\" and mismatch >= max_mismatch:\n                    continue                \n                cigar_descs[-1].append([diff_pos - tmp_e_left, diff_data, diff_id])\n                tmp_e_left = diff_pos + 1\n                if diff_id == \"\":\n                    mismatch += 1\n            elif diff_type == \"deletion\":\n                if len(cigars) <= 0:\n                    continue\n                del_len = diff_data\n                if remain_trans_len < del_len:\n                    continue\n                remain_trans_len -= del_len\n                if diff_pos - e_left > 0:\n                    cigars.append(\"{}M\".format(diff_pos - e_left))\n                    cigar_descs[-1].append([diff_pos - tmp_e_left, \"\", \"\"])\n                    cigar_descs.append([])\n                cigars.append(\"{}D\".format(del_len))\n                cigar_descs[-1].append([0, del_len, diff_id])\n                cigar_descs.append([])\n                tmp_read_len -= (diff_pos - e_left)\n                e_left = tmp_e_left = diff_pos + del_len\n            elif diff_type == \"insertion\":\n                if len(cigars) > 0:\n                    ins_len = len(diff_data)\n                    if e_left + tmp_read_len - 1 < diff_pos + 
ins_len:\n                        break\n                    if diff_pos - e_left > 0:\n                        cigars.append(\"{}M\".format(diff_pos - e_left))\n                        cigar_descs[-1].append([diff_pos - tmp_e_left, \"\", \"\"])\n                        cigar_descs.append([])\n                    cigars.append(\"{}I\".format(ins_len))\n                    cigar_descs[-1].append([0, diff_data, diff_id])\n                    cigar_descs.append([])\n                    tmp_read_len -= (diff_pos - e_left)\n                    tmp_read_len -= ins_len\n                    e_left = tmp_e_left = diff_pos\n            else:\n                assert False\n            prev_diff = diff\n\n        e_right = min(e[1], e_left + tmp_read_len - 1)\n        e_len = e_right - e_left + 1\n        remain_e_len = e_right - tmp_e_left + 1\n        if remain_e_len > 0:\n            cigar_descs[-1].append([remain_e_len, \"\", \"\"])\n        if e_len < tmp_read_len:\n            tmp_read_len -= e_len\n            cigars.append((\"{}M\".format(e_len)))\n        else:\n            assert e_len == tmp_read_len\n            cigars.append((\"{}M\".format(tmp_read_len)))\n            tmp_read_len = 0\n            break\n        prev_e = e\n\n    # Define MD, XM, NM, Zs, read_seq\n    MD, XM, NM, Zs, read_seq = \"\", 0, 0, \"\", \"\"\n    assert len(cigars) == len(cigar_descs)\n    MD_match_len, Zs_match_len = 0, 0\n    cur_trans_pos = frag_pos\n    for c in range(len(cigars)):\n        cigar = cigars[c]\n        cigar_len, cigar_op = int(cigar[:-1]), cigar[-1]\n        cigar_desc = cigar_descs[c]\n        if cigar_op == 'N':\n            continue\n        if cigar_op == 'M':\n            for add_match_len, alt_base, snp_id in cigar_desc:\n                MD_match_len += add_match_len\n                Zs_match_len += add_match_len\n                assert cur_trans_pos + add_match_len <= len(trans_seq)\n                read_seq += 
trans_seq[cur_trans_pos:cur_trans_pos+add_match_len]\n                cur_trans_pos += add_match_len\n                if alt_base != \"\":\n                    if MD_match_len > 0:\n                        MD += (\"{}\".format(MD_match_len))\n                        MD_match_len = 0\n                    MD += trans_seq[cur_trans_pos]\n                    if snp_id != \"\":\n                        if Zs != \"\":\n                            Zs += \",\"\n                        Zs += (\"{}|S|{}\".format(Zs_match_len, snp_id))\n                        Zs_match_len = 0\n                    else:\n                        Zs_match_len += 1\n                    if snp_id == \"\":\n                        XM += 1\n                        NM += 1\n                    read_seq += alt_base\n                    cur_trans_pos += 1\n        elif cigar_op == 'D':\n            assert len(cigar_desc) == 1\n            add_match_len, del_len, snp_id = cigar_desc[0]\n            MD_match_len += add_match_len\n            Zs_match_len += add_match_len\n            if MD_match_len > 0:\n                MD += (\"{}\".format(MD_match_len))\n                MD_match_len = 0\n            MD += (\"^{}\".format(trans_seq[cur_trans_pos:cur_trans_pos+cigar_len]))\n            read_seq += trans_seq[cur_trans_pos:cur_trans_pos+add_match_len]\n            if Zs != \"\":\n                Zs += \",\"\n            Zs += (\"{}|D|{}\".format(Zs_match_len, cigar_desc[0][-1]))\n            Zs_match_len = 0\n            cur_trans_pos += cigar_len\n        elif cigar_op == 'I':\n            assert len(cigar_desc) == 1\n            add_match_len, ins_seq, snp_id = cigar_desc[0]\n            ins_len = len(ins_seq)\n            MD_match_len += add_match_len\n            Zs_match_len += add_match_len\n            read_seq += trans_seq[cur_trans_pos:cur_trans_pos+add_match_len]\n            read_seq += ins_seq\n            if Zs != \"\":\n                Zs += \",\"\n            Zs += 
(\"{}|I|{}\".format(Zs_match_len, cigar_desc[0][-1]))\n            Zs_match_len = 0\n        else:\n            assert False\n\n    if MD_match_len > 0:\n        MD += (\"{}\".format(MD_match_len))\n\n    if len(read_seq) != read_len:\n        print >> sys.stderr, \"read length differs:\", len(read_seq), \"vs.\", read_len\n        print >> sys.stderr, pos, \"\".join(cigars), cigar_descs, MD, XM, NM, Zs\n        assert False\n\n    return pos, cigars, cigar_descs, MD, XM, NM, Zs, read_seq\n\n\n\"\"\"\n\"\"\"\ncigar_re = re.compile('\\d+\\w')\ndef samRepOk(genome_seq, read_seq, chr, pos, cigar, XM, NM, MD, Zs, max_mismatch):\n    assert chr in genome_seq\n    chr_seq = genome_seq[chr]\n    assert pos < len(chr_seq)\n\n    # Calculate XM and NM based on Cigar and Zs\n    cigars = cigar_re.findall(cigar)\n    cigars = [[int(cigars[i][:-1]), cigars[i][-1]] for i in range(len(cigars))]\n    ref_pos, read_pos = pos, 0\n    ann_ref_seq, ann_ref_rel, ann_read_seq, ann_read_rel = [], [], [], []\n    for i in range(len(cigars)):\n        cigar_len, cigar_op = cigars[i]\n        if cigar_op == \"M\":\n            partial_ref_seq = chr_seq[ref_pos:ref_pos+cigar_len]\n            partial_read_seq = read_seq[read_pos:read_pos+cigar_len]\n            assert len(partial_ref_seq) == len(partial_read_seq)\n            ann_ref_seq += list(partial_ref_seq)\n            ann_read_seq += list(partial_read_seq)\n            for j in range(len(partial_ref_seq)):\n                if partial_ref_seq[j] == partial_read_seq[j]:\n                    ann_ref_rel.append(\"=\")\n                    ann_read_rel.append(\"=\")\n                else:\n                    ann_ref_rel.append(\"X\")\n                    ann_read_rel.append(\"X\")\n            ref_pos += cigar_len\n            read_pos += cigar_len\n        elif cigar_op == \"D\":\n            partial_ref_seq = chr_seq[ref_pos:ref_pos+cigar_len]\n            ann_ref_rel += list(partial_ref_seq)\n            ann_ref_seq += 
list(partial_ref_seq)\n            ann_read_rel += ([\"-\"] * cigar_len)\n            ann_read_seq += ([\"-\"] * cigar_len)\n            ref_pos += cigar_len\n        elif cigar_op == \"I\":\n            partial_read_seq = read_seq[read_pos:read_pos+cigar_len]\n            ann_ref_rel += ([\"-\"] * cigar_len)\n            ann_ref_seq += ([\"-\"] * cigar_len)\n            ann_read_rel += list(partial_read_seq)\n            ann_read_seq += list(partial_read_seq) \n            read_pos += cigar_len\n        elif cigar_op == \"N\":\n            ref_pos += cigar_len\n        else:\n            assert False\n    \n    assert len(ann_ref_seq) == len(ann_read_seq)\n    assert len(ann_ref_seq) == len(ann_ref_rel)\n    assert len(ann_ref_seq) == len(ann_read_rel)\n    ann_Zs_seq = [\"0\" for i in range(len(ann_ref_seq))]\n\n    Zss, Zs_i, snp_pos_add = [], 0, 0\n    if Zs != \"\":\n        Zss = Zs.split(',')\n        Zss = [zs.split('|') for zs in Zss]\n\n    ann_read_pos = 0\n    for zs in Zss:\n        zs_pos, zs_type, zs_id = zs\n        zs_pos = int(zs_pos)\n        for i in range(zs_pos):\n            while ann_read_rel[ann_read_pos] == '-':\n                ann_read_pos += 1\n            ann_read_pos += 1\n        if zs_type == \"S\":\n            ann_Zs_seq[ann_read_pos] = \"1\"\n            ann_read_pos += 1\n        elif zs_type == \"D\":\n            while ann_read_rel[ann_read_pos] == '-':\n                ann_Zs_seq[ann_read_pos] = \"1\"\n                ann_read_pos += 1\n        elif zs_type == \"I\":\n            while ann_ref_rel[ann_read_pos] == '-':\n                ann_Zs_seq[ann_read_pos] = \"1\"\n                ann_read_pos += 1\n        else:\n            assert False\n\n    tMD, tXM, tNM = \"\", 0, 0\n    match_len = 0\n    i = 0\n    while i < len(ann_ref_seq):\n        if ann_ref_rel[i] == \"=\":\n            assert ann_read_rel[i] == \"=\"\n            match_len += 1\n            i += 1\n            continue\n        assert ann_read_rel[i] != 
\"=\"\n        if ann_ref_rel[i] == \"X\" and ann_read_rel[i] == \"X\":\n            if match_len > 0:\n                tMD += (\"{}\".format(match_len))\n                match_len = 0\n            tMD += ann_ref_seq[i]\n            if ann_Zs_seq[i] == \"0\":\n                tXM += 1\n                tNM += 1\n            i += 1\n        else:\n            assert ann_ref_rel[i] == \"-\" or ann_read_rel[i] == \"-\"\n            if ann_ref_rel[i] == '-':\n                while ann_ref_rel[i] == '-':\n                    if ann_Zs_seq[i] == \"0\":\n                        tNM += 1\n                    i += 1\n            else:\n                assert ann_read_rel[i] == '-'\n                del_seq = \"\"\n                while  ann_read_rel[i] == '-':\n                    del_seq += ann_ref_seq[i]\n                    if ann_Zs_seq[i] == \"0\":\n                        tNM += 1\n                    i += 1\n                if match_len > 0:\n                    tMD += (\"{}\".format(match_len))\n                    match_len = 0\n                tMD += (\"^{}\".format(del_seq))\n\n    if match_len > 0:\n        tMD += (\"{}\".format(match_len))\n\n    if tMD != MD or tXM != XM or tNM != NM or XM > max_mismatch or XM != NM:\n        print >> sys.stderr, chr, pos, cigar, MD, XM, NM, Zs\n        print >> sys.stderr, tMD, tXM, tNM\n        assert False\n        \n        \n\"\"\"\n\"\"\"\ndef simulate_reads(index_fname, base_fname, \\\n                       dna, paired_end, read_len, frag_len, \\\n                       num_frag, expr_profile_type, error_rate, max_mismatch, \\\n                       random_seed, sanity_check, verbose):\n    random.seed(random_seed)\n    \n    # Current script directory\n    curr_script = os.path.realpath(inspect.getsourcefile(simulate_reads))\n    ex_path = os.path.dirname(curr_script)\n    centrifuge_inspect = os.path.join(ex_path, \"../centrifuge-inspect\")\n\n    err_rand_src = ErrRandomSource(error_rate / 100.0)\n    \n    if 
read_len > frag_len:\n        frag_len = read_len\n\n    # Read taxonomic IDs\n    seq2taxID = {}\n    tax_cmd = [centrifuge_inspect,\n               \"--conversion-table\",\n               index_fname]\n    tax_proc = subprocess.Popen(tax_cmd, stdout=subprocess.PIPE)\n    for line in tax_proc.stdout:\n        seq_id, tax_id = line.strip().split()\n        seq2taxID[seq_id] = tax_id\n\n    # Read names\n    names = {}\n    name_cmd = [centrifuge_inspect,\n                \"--name-table\",\n                index_fname]\n    name_proc = subprocess.Popen(name_cmd, stdout=subprocess.PIPE)\n    for line in name_proc.stdout:\n        tax_id, name = line.strip().split('\\t')\n        names[tax_id] = name\n\n    # Genome sizes\n    sizes = {}\n    size_cmd = [centrifuge_inspect,\n                \"--size-table\",\n                index_fname]\n    size_proc = subprocess.Popen(size_cmd, stdout=subprocess.PIPE)\n    for line in size_proc.stdout:\n        tax_id, size = line.strip().split('\\t')\n        sizes[tax_id] = int(size)\n\n    # Read genome sequences into memory\n    genomes_fname = index_fname + \".fa\"\n    if not os.path.exists(genomes_fname):\n        print >> sys.stderr, \"Extracting genomes from Centrifuge index to %s, which may take a few hours ...\"  % (genomes_fname)\n        extract_cmd = [centrifuge_inspect,\n                       index_fname]\n        extract_proc = subprocess.Popen(extract_cmd, stdout=open(genomes_fname, 'w'))\n        extract_proc.communicate()\n    genome_seqs = read_genomes(open(genomes_fname), seq2taxID)\n\n    if dna:\n        genes, transcripts = {}, {}\n    else:\n        genes, transcripts = read_transcript(genome_seqs, gtf_file, frag_len)\n        \n    if sanity_check:\n        sanity_check_input(genome_seqs, genes, transcripts, frag_len)\n\n    if dna:\n        expr_profile = generate_dna_expr_profile(expr_profile_type, min(len(genome_seqs), 100))\n    else:\n        num_transcripts = min(len(transcripts), 10000)\n        
expr_profile = generate_rna_expr_profile(expr_profile_type, num_transcripts)\n\n    expr_profile = [int(expr_profile[i] * num_frag) for i in range(len(expr_profile))]\n    assert num_frag >= sum(expr_profile)\n    while sum(expr_profile) < num_frag:\n        for i in range(min(num_frag - sum(expr_profile), len(expr_profile))):\n            expr_profile[i] += 1\n    assert num_frag == sum(expr_profile)\n\n    if dna:\n        genome_ids = genome_seqs.keys()\n    else:\n        transcript_ids = transcripts.keys()\n        random.shuffle(transcript_ids)\n        assert len(transcript_ids) >= len(expr_profile)\n\n    # Truth table\n    truth_file = open(base_fname + \".truth\", \"w\")\n    print >> truth_file, \"taxID\\tgenomeLen\\tnumReads\\tabundance\\tname\"\n    truth_list = []\n    normalized_sum = 0.0\n    debug_num_frag = 0\n    for t in range(len(expr_profile)):\n        t_num_frags = expr_profile[t]\n        if dna:\n            tax_id = genome_ids[t]\n        else:\n            transcript_id = transcript_ids[t]\n            chr, strand, transcript_len, exons = transcripts[transcript_id]\n        assert tax_id in genome_seqs and tax_id in sizes\n        genome_len = sizes[tax_id]\n        raw_abundance = float(t_num_frags)/num_frag\n        normalized_sum += (raw_abundance / genome_len)\n        truth_list.append([tax_id, genome_len, t_num_frags, raw_abundance])\n        debug_num_frag += t_num_frags\n    assert debug_num_frag == num_frag\n    for truth in truth_list:\n        tax_id, genome_len, t_num_frags, raw_abundance = truth\n        can_tax_id = tax_id\n        if '.' 
in can_tax_id:\n            can_tax_id = can_tax_id.split('.')[0]\n        name = \"N/A\"        \n        if can_tax_id in names:\n            name = names[can_tax_id]\n        abundance = raw_abundance / genome_len / normalized_sum\n        print >> truth_file, \"{}\\t{}\\t{}\\t{:.6}\\t{}\".format(tax_id, genome_len, t_num_frags, abundance, name)\n    truth_file.close()\n\n    # Sequence Classification Map (SCM) - something I made up ;-)\n    scm_file = open(base_fname + \".scm\", \"w\")\n\n    # Write SCM header\n    print >> scm_file, \"@HD\\tVN:1.0\\tSO:unsorted\"\n    for tax_id in genome_seqs.keys():\n        name = \"\"\n        if tax_id in names:\n            name = names[tax_id]\n        print >> scm_file, \"@SQ\\tTID:%s\\tSN:%s\\tLN:%d\" % (tax_id, name, len(genome_seqs[tax_id]))\n\n    read_file = open(base_fname + \"_1.fa\", \"w\")\n    if paired_end:\n        read2_file = open(base_fname + \"_2.fa\", \"w\")\n\n    cur_read_id = 1\n    for t in range(len(expr_profile)):\n        t_num_frags = expr_profile[t]\n        if dna:\n            tax_id = genome_ids[t]\n            print >> sys.stderr, \"TaxID: %s, num fragments: %d\" % (tax_id, t_num_frags)\n        else:\n            transcript_id = transcript_ids[t]\n            chr, strand, transcript_len, exons = transcripts[transcript_id]\n            print >> sys.stderr, transcript_id, t_num_frags\n\n        genome_seq = genome_seqs[tax_id]\n        genome_len = len(genome_seq)\n        if dna:\n            t_seq = genome_seq\n            exons = [[0, genome_len - 1]]\n        else:            \n            t_seq = \"\"\n            for e in exons:\n                assert e[0] < e[1]\n                t_seq += genome_seq[e[0]:e[1]+1]\n            assert len(t_seq) == transcript_len\n            \n        for f in range(t_num_frags):\n            if dna:\n                while True:\n                    frag_pos = random.randint(0, genome_len - frag_len)\n                    if 'N' not in 
genome_seq[frag_pos:frag_pos + frag_len]:\n                        break\n            else:\n                frag_pos = random.randint(0, transcript_len - frag_len)\n\n            pos, cigars, cigar_descs, MD, XM, NM, Zs, read_seq = getSamAlignment(dna, exons, genome_seq, t_seq, frag_pos, read_len, err_rand_src, max_mismatch)\n            pos2, cigars2, cigar2_descs, MD2, XM2, NM2, Zs2, read2_seq = getSamAlignment(dna, exons, genome_seq, t_seq, frag_pos+frag_len-read_len, read_len, err_rand_src, max_mismatch)\n            cigar_str, cigar2_str = \"\".join(cigars), \"\".join(cigars2)\n            if sanity_check:\n                samRepOk(genome_seq, read_seq, chr, pos, cigar_str, XM, NM, MD, Zs, max_mismatch)\n                samRepOk(genome_seq, read2_seq, chr, pos2, cigar2_str, XM2, NM2, MD2, Zs2, max_mismatch)\n\n            if Zs != \"\":\n                Zs = (\"\\tZs:Z:{}\".format(Zs))\n            if Zs2 != \"\":\n                Zs2 = (\"\\tZs:Z:{}\".format(Zs2))\n            \n            if dna:\n                XS, TI = \"\", \"\"                \n            else:\n                XS = \"\\tXS:A:{}\".format(strand)\n                TI = \"\\tTI:Z:{}\".format(transcript_id)                \n\n            print >> read_file, \">{}\".format(cur_read_id)\n            print >> read_file, read_seq\n            output = \"{}\\t{}\\t{}\\t{}\\tNM:i:{}\\tMD:Z:{}\".format(cur_read_id, tax_id, pos + 1, cigar_str, NM, MD)\n            if paired_end:\n                print >> read2_file, \">{}\".format(cur_read_id)\n                print >> read2_file, reverse_complement(read2_seq)\n                output += \"\\t{}\\t{}\\tNM2:i:{}\\tMD2:Z:{}\".format(pos2 + 1, cigar2_str, NM2, MD2)\n            print >> scm_file, output\n                \n            cur_read_id += 1\n            \n    scm_file.close()\n    read_file.close()\n    if paired_end:\n        read2_file.close()\n\n\nif __name__ == '__main__':\n    parser = ArgumentParser(\n        description='Simulate 
reads from Centrifuge index')\n    parser.add_argument('index_fname',\n                        nargs='?',\n                        type=str,\n                        help='Centrifuge index')\n    \"\"\"\n    parser.add_argument('gtf_file',\n                        nargs='?',\n                        type=FileType('r'),\n                        help='input GTF file')\n    \"\"\"\n    parser.add_argument('base_fname',\n                        nargs='?',\n                        type=str,\n                        help='output base filename')\n    parser.add_argument('--rna',\n                        dest='dna',\n                        action='store_false',\n                        default=True,\n                        help='RNA-seq reads (default: DNA-seq reads)')\n    parser.add_argument('--single-end',\n                        dest='paired_end',\n                        action='store_false',\n                        default=True,\n                        help='single-end reads (default: paired-end reads)')\n    parser.add_argument('-r', '--read-length',\n                        dest='read_len',\n                        action='store',\n                        type=int,\n                        default=100,\n                        help='read length (default: 100)')\n    parser.add_argument('-f', '--fragment-length',\n                        dest='frag_len',\n                        action='store',\n                        type=int,\n                        default=250,\n                        help='fragment length (default: 250)')\n    parser.add_argument('-n', '--num-fragment',\n                        dest='num_frag',\n                        action='store',\n                        type=int,\n                        default=1000000,\n                        help='number of fragments (default: 1000000)')\n    parser.add_argument('-e', '--expr-profile',\n                        dest='expr_profile',\n                        action='store',\n                       
 type=str,\n                        default='flux',\n                        help='expression profile: flux or constant (default: flux)')\n    parser.add_argument('--error-rate',\n                        dest='error_rate',\n                        action='store',\n                        type=float,\n                        default=0.0,\n                        help='per-base sequencing error rate (%%) (default: 0.0)')\n    parser.add_argument('--max-mismatch',\n                        dest='max_mismatch',\n                        action='store',\n                        type=int,\n                        default=3,\n                        help='max mismatches due to sequencing errors (default: 3)')\n    parser.add_argument('--random-seed',\n                        dest='random_seed',\n                        action='store',\n                        type=int,\n                        default=0,\n                        help='random seeding value (default: 0)')\n    parser.add_argument('--sanity-check',\n                        dest='sanity_check',\n                        action='store_true',\n                        help='sanity check')\n    parser.add_argument('-v', '--verbose',\n                        dest='verbose',\n                        action='store_true',\n                        help='also print some statistics to stderr')\n    parser.add_argument('--version', \n                        action='version',\n                        version='%(prog)s 2.0.0-alpha')\n    args = parser.parse_args()\n    if not args.index_fname:\n        parser.print_help()\n        exit(1)\n    if not args.dna:\n        print >> sys.stderr, \"Error: --rna is not implemented.\"\n        exit(1)\n    # if args.dna:\n    #    args.expr_profile = \"constant\"\n    simulate_reads(args.index_fname, args.base_fname, \\\n                       args.dna, args.paired_end, args.read_len, args.frag_len, \\\n                       args.num_frag, args.expr_profile, args.error_rate, 
args.max_mismatch, \\\n                       args.random_seed, args.sanity_check, args.verbose)\n"
  },
  {
    "path": "evaluation/test/abundance.Rmd",
    "content": "---\ntitle: \"Centrifuge abundance\"\n# author: \"Daehwan Kim\"\ndate: \"August 15, 2016\"\noutput: html_document\n---\n\n```{r setup}\nlibrary(ggplot2)\n```\n\n```{r}\n\nabundance.cmp <- read.delim(\"abundance_k5.cmp\", stringsAsFactors = FALSE)\nlevels <- c(\"species\", \"genus\", \"family\", \"order\", \"class\", \"phylum\")\nabundance.cmp$rank <- factor(abundance.cmp$rank, levels = levels)\nabundance.cmp$log_true <- log(abundance.cmp$true)\nabundance.cmp$log_calc <- log(abundance.cmp$calc)\nhead(abundance.cmp)\n\nggplot(abundance.cmp) + \n  geom_histogram(aes(x = log_true), binwidth = 0.2) +\n  xlab(\"abundance (log_truth)\") +\n  facet_wrap(~ rank, scales = \"free_y\")\n\nggplot(abundance.cmp) + \n  geom_histogram(aes(x = log_calc), binwidth = 0.2) +\n  xlab(\"abundance (log_calc)\") +\n  facet_wrap(~ rank, scales = \"free_y\")\n\nggplot(abundance.cmp) + \n  geom_point(aes(x = true, y = calc), size = 0.7) +\n  xlab(\"abundance (truth)\") + ylab(\"abundance (centrifuge)\") +\n  facet_wrap(~ rank)  + \n  # geom_text(aes(x = true, y = calc, label = name),\n  #          check_overlap = TRUE, hjust = 0, nudge_x = 0.01) +\n  geom_abline(slope = 1, color = \"red\")\n\nggplot(abundance.cmp) + \n  geom_point(aes(x = log_true, y = log_calc), size = 0.7) +\n  xlab(\"abundance (log_truth)\") + ylab(\"abundance (log_centrifuge)\") +\n  facet_wrap(~ rank)  + \n  geom_abline(slope = 1, color = \"red\")\n\nfor(level in levels) {\n  print(level)\n  abundance.cmp.rank <- abundance.cmp[abundance.cmp$rank==level,]\n  for(method in c(\"pearson\", \"spearman\", \"kendall\")) {\n    print(paste(' ', method))\n    print(paste('  ', cor(abundance.cmp.rank$true, abundance.cmp.rank$calc, method=method)))\n  }\n}\n```"
  },
  {
    "path": "evaluation/test/centrifuge_evaluate_mason.py",
    "content": "#!/usr/bin/env python\n\nimport sys, os, subprocess, inspect\nimport platform, multiprocessing\nimport string, re\nfrom datetime import datetime, date, time\nimport copy\nfrom argparse import ArgumentParser, FileType\n\n\n\"\"\"\n\"\"\"\ndef read_taxonomy_tree(tax_file):\n    taxonomy_tree = {}\n    for line in tax_file:\n        fields = line.strip().split('\\t')\n        assert len(fields) == 5\n        tax_id, parent_tax_id, rank = fields[0], fields[2], fields[4]\n        assert tax_id not in taxonomy_tree\n        taxonomy_tree[tax_id] = [parent_tax_id, rank]        \n    return taxonomy_tree\n\n\n\"\"\"\n\"\"\"\ndef compare_scm(centrifuge_out, true_out, taxonomy_tree, rank):\n    higher_ranked = {}\n        \n    ancestors = set()\n    for tax_id in taxonomy_tree.keys():\n        if tax_id in ancestors:\n            continue\n        while True:\n            parent_tax_id, cur_rank = taxonomy_tree[tax_id]\n            if parent_tax_id in ancestors:\n                break\n            if tax_id == parent_tax_id:\n                break\n            tax_id = parent_tax_id\n            ancestors.add(tax_id)\n\n    db_dic = {}\n    first = True\n    for line in open(centrifuge_out):\n        if first:\n            first = False\n            continue\n        read_name, seq_id, tax_id, score, _, _, _, _ = line.strip().split('\\t')\n\n        # Traverse up taxonomy tree to match the given rank parameter\n        rank_tax_id = tax_id\n        if rank != \"strain\":\n            while True:\n                if tax_id not in taxonomy_tree:\n                    rank_tax_id = \"\"\n                    break\n                parent_tax_id, cur_rank = taxonomy_tree[tax_id]\n                if cur_rank == rank:\n                    rank_tax_id = tax_id\n                    break\n                if tax_id == parent_tax_id:\n                    rank_tax_id = \"\"\n                    break\n                tax_id = parent_tax_id\n        else:\n            
assert rank == \"strain\"\n            if tax_id in ancestors:\n                continue\n\n        if rank_tax_id == \"\":\n            # higher_ranked[read_name] = True            \n            continue\n        \n        if read_name not in db_dic:\n            db_dic[read_name] = set()\n        db_dic[read_name].add(rank_tax_id)\n\n    classified, unclassified, unique_classified = 0, 0, 0\n    for line in open(true_out):\n        if line.startswith('@'):\n            continue\n\n        fields = line.strip().split('\\t')\n        if len(fields) != 3:\n            print >> sys.stderr, \"Warning: %s missing\" % (line.strip())\n            continue\n        read_name, tax_id = fields[1:3] \n        # Traverse up taxonomy tree to match the given rank parameter\n        rank_tax_id = tax_id\n        if rank != \"strain\":\n            while True:\n                if tax_id not in taxonomy_tree:\n                    rank_tax_id = \"\"\n                    break\n                parent_tax_id, cur_rank = taxonomy_tree[tax_id]\n                if cur_rank == rank:\n                    rank_tax_id = tax_id\n                    break\n                if tax_id == parent_tax_id:\n                    rank_tax_id = \"\"\n                    break\n                tax_id = parent_tax_id\n        if rank_tax_id == \"\":\n            continue\n        if read_name not in db_dic:\n            unclassified += 1\n            continue\n\n        maps = db_dic[read_name]\n        if rank_tax_id in maps:\n            classified += 1\n            if len(maps) == 1 and read_name not in higher_ranked:\n                unique_classified += 1\n        else:\n            unclassified += 1\n            # daehwan - for debugging purposes\n            # print read_name\n\n    raw_unique_classified = 0\n    for read_name, maps in db_dic.items():\n        if len(maps) == 1 and read_name not in higher_ranked:\n            raw_unique_classified += 1\n    return classified, unique_classified, 
unclassified, len(db_dic), raw_unique_classified\n\n\n\"\"\"\n\"\"\"\ndef evaluate(index_base,\n             ranks,\n             verbose,\n             debug):\n    # Current script directory\n    curr_script = os.path.realpath(inspect.getsourcefile(evaluate))\n    path_base = os.path.dirname(curr_script) + \"/..\"\n\n    def check_files(fnames):\n        for fname in fnames:\n            if not os.path.exists(fname):\n                return False\n        return True\n\n    # Check if indexes exists, otherwise create indexes\n    index_path = \"%s/indexes/Centrifuge\" % path_base\n    # index_path = \".\"\n    if not os.path.exists(path_base + \"/indexes\"):\n        os.mkdir(path_base + \"/indexes\")\n    if not os.path.exists(index_path):\n        os.mkdir(index_path)\n    index_fnames = [\"%s/%s.%d.cf\" % (index_path, index_base, i+1) for i in range(3)]\n    assert check_files(index_fnames)\n\n    # Read taxonomic IDs\n    centrifuge_inspect = os.path.join(path_base, \"../centrifuge-inspect\")\n    tax_ids = set()\n    tax_cmd = [centrifuge_inspect,\n               \"--conversion-table\",\n               \"%s/%s\" % (index_path, \"b+h+v\")]\n    tax_proc = subprocess.Popen(tax_cmd, stdout=subprocess.PIPE)\n    for line in tax_proc.stdout:\n        _, tax_id = line.strip().split()\n        tax_ids.add(tax_id)\n    tax_ids = list(tax_ids)\n\n    # Read taxonomic tree\n    tax_tree_cmd = [centrifuge_inspect,\n                    \"--taxonomy-tree\",\n                    \"%s/%s\" % (index_path, \"b+h+v\")]    \n    tax_tree_proc = subprocess.Popen(tax_tree_cmd, stdout=subprocess.PIPE, stderr=open(\"/dev/null\", 'w'))\n    taxonomy_tree = read_taxonomy_tree(tax_tree_proc.stdout)\n\n    compressed = (index_base.find(\"compressed\") != -1) or (index_base == \"centrifuge_Dec_Bonly\")\n\n    read_fname = \"centrifuge_data/bacteria_sim10K/bacteria_sim10K.fa\"\n    scm_fname = \"centrifuge_data/bacteria_sim10K/bacteria_sim10K.truth_species\"\n    read_fnames = 
[read_fname, scm_fname]\n\n    program_bin_base = \"%s/..\" % path_base\n    centrifuge_cmd = [\"%s/centrifuge\" % program_bin_base,\n                      # \"-k\", \"20\",\n                      # \"--min-hitlen\", \"15\",\n                      \"-f\",\n                      \"-p\", \"1\",\n                      \"%s/%s\" % (index_path, index_base),\n                      read_fname]\n\n    if verbose:\n        print >> sys.stderr, ' '.join(centrifuge_cmd)\n\n    out_fname = \"centrifuge.output\"\n    proc = subprocess.Popen(centrifuge_cmd, stdout=open(out_fname, \"w\"), stderr=subprocess.PIPE)\n    proc.communicate()\n\n    results = {\"strain\"  : [0, 0, 0],\n               \"species\" : [0, 0, 0],\n               \"genus\"   : [0, 0, 0],\n               \"family\"  : [0, 0, 0],\n               \"order\"   : [0, 0, 0],\n               \"class\"   : [0, 0, 0],\n               \"phylum\"  : [0, 0, 0]}\n    for rank in ranks:\n        if compressed and rank == \"strain\":\n            continue\n\n        classified, unique_classified, unclassified, raw_classified, raw_unique_classified = \\\n            compare_scm(out_fname, scm_fname, taxonomy_tree, rank)\n        results[rank] = [classified, unique_classified, unclassified]\n        num_cases = classified + unclassified\n        # if rank == \"strain\":\n        #    assert num_cases == num_fragment\n\n        print >> sys.stderr, \"\\t\\t%s\" % rank\n        print >> sys.stderr, \"\\t\\t\\tsensitivity: {:,} / {:,} ({:.2%})\".format(classified, num_cases, float(classified) / num_cases)\n        print >> sys.stderr, \"\\t\\t\\tprecision  : {:,} / {:,} ({:.2%})\".format(classified, raw_classified, float(classified) / raw_classified)\n        print >> sys.stderr, \"\\n\\t\\t\\tfor uniquely classified \"\n        print >> sys.stderr, \"\\t\\t\\t\\t\\tsensitivity: {:,} / {:,} ({:.2%})\".format(unique_classified, num_cases, float(unique_classified) / num_cases)\n        print >> sys.stderr, 
\"\\t\\t\\t\\t\\tprecision  : {:,} / {:,} ({:.2%})\".format(unique_classified, raw_unique_classified, float(unique_classified) / raw_unique_classified)\n\n        # Calculate sum of squared residuals in abundance\n        \"\"\"\n        if rank == \"strain\":\n            abundance_SSR = compare_abundance(\"centrifuge_report.tsv\", truth_fname, taxonomy_tree, debug)\n            print >> sys.stderr, \"\\t\\t\\tsum of squared residuals in abundance: {}\".format(abundance_SSR)\n        \"\"\"\n\n    # calculate true abundance\n    true_abundance = {}\n    total_sum = 0.0\n    num_genomes, num_species = 0, 0\n    for line in open(\"abundance.txt\"):\n        seqID, taxID, genomeSize, reads, reads10K, genomeName = line.strip().split(',')[:6]\n        genomeSize, reads, reads10K = int(genomeSize), int(reads), int(reads10K)\n        if reads <= 0:\n            continue\n        num_genomes += 1\n        while True:\n            if taxID not in taxonomy_tree:\n                rank_taxID = \"\"\n                break\n            parent_taxID, rank = taxonomy_tree[taxID]\n            if rank == \"species\":\n                rank_taxID = taxID\n                break\n            if taxID == parent_taxID:\n                rank_taxID = \"\"\n                break\n            taxID = parent_taxID\n        if rank_taxID == \"\":\n            continue\n        assert rank == \"species\"\n        num_species += 1\n        total_sum += (reads / float(genomeSize))\n        if rank_taxID not in true_abundance:\n            true_abundance[rank_taxID] = 0.0\n        true_abundance[rank_taxID] += (reads / float(genomeSize))\n    for taxID, reads in true_abundance.items():\n        true_abundance[taxID] /= total_sum\n\n    print >> sys.stderr, \"number of genomes:\", num_genomes\n    print >> sys.stderr, \"number of species:\", num_species\n    print >> sys.stderr, \"number of uniq species:\", len(true_abundance)\n\n    read_fname = 
\"centrifuge_data/bacteria_sim10M/bacteria_sim10M.fa\"\n    summary_fname = \"centrifuge.summary\"\n    centrifuge_cmd = [\"%s/centrifuge\" % program_bin_base,\n                      # \"-k\", \"20\",\n                      # \"--min-hitlen\", \"15\",\n                      \"--report-file\", summary_fname,\n                      \"-f\",\n                      \"-p\", \"3\",\n                      \"%s/%s\" % (index_path, index_base),\n                      read_fname]\n\n    if verbose:\n        print >> sys.stderr, ' '.join(centrifuge_cmd)\n\n    out_fname = \"centrifuge.output\"\n    proc = subprocess.Popen(centrifuge_cmd, stdout=open(out_fname, \"w\"), stderr=subprocess.PIPE)\n    proc.communicate()\n\n    calc_abundance = {}\n    for taxID in true_abundance.keys():\n        calc_abundance[taxID] = 0.0\n    first = True\n    for line in open(summary_fname):\n        if first:\n            first = False\n            continue\n        name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance = line.strip().split(\"\\t\")\n        genomeSize, numReads, numUniqueReads, abundance = int(genomeSize), int(numReads), int(numUniqueReads), float(abundance)\n        calc_abundance[taxID] = abundance\n\n        # DK - for debugging purposes\n        \"\"\"\n        if taxID in true_abundance:\n            print >> sys.stderr, \"%s: %.6f vs. 
%.6f\" % (taxID, abundance, true_abundance[taxID])\n        \"\"\"\n\n    abundance_file = open(\"abundance.cmp\", 'w')\n    print >> abundance_file, \"taxID\\ttrue\\tcalc\\trank\"\n    for rank in ranks:\n        if rank == \"strain\":\n            continue\n        true_abundance_rank, calc_abundance_rank = {}, {}\n        for taxID in true_abundance.keys():\n            assert taxID in calc_abundance\n            rank_taxID = taxID\n            while True:\n                if rank_taxID not in taxonomy_tree:\n                    rank_taxID = \"\"\n                    break\n                parent_taxID, cur_rank = taxonomy_tree[rank_taxID]\n                if cur_rank == rank:\n                    break\n                if rank_taxID == parent_taxID:\n                    rank_taxID = \"\"\n                    break\n                rank_taxID = parent_taxID\n            if rank_taxID not in true_abundance_rank:\n                true_abundance_rank[rank_taxID] = 0.0\n                calc_abundance_rank[rank_taxID] = 0.0\n            true_abundance_rank[rank_taxID] += true_abundance[taxID]\n            calc_abundance_rank[rank_taxID] += calc_abundance[taxID]\n\n        ssr = 0.0 # Sum of Squared Residuals\n        for taxID in true_abundance_rank.keys():\n            assert taxID in calc_abundance_rank\n            ssr += (true_abundance_rank[taxID] - calc_abundance_rank[taxID]) ** 2\n            print >> abundance_file, \"%s\\t%.6f\\t%.6f\\t%s\" % (taxID, true_abundance_rank[taxID], calc_abundance_rank[taxID], rank)\n        print >> sys.stderr, \"%s) Sum of squared residuals: %.6f\" % (rank, ssr)\n    abundance_file.close()\n\n\n\nif __name__ == \"__main__\":\n    parser = ArgumentParser(\n        description='Centrifuge evaluation on Mason simulated reads')\n    parser.add_argument(\"--index-base\",\n                        type=str,\n                        default=\"b_compressed\",\n                        help='Centrifuge index such as b_compressed, b+h+v, 
and centrifuge_Dec_Bonly (default: b_compressed)')\n    rank_list_default = \"strain,species,genus,family,order,class,phylum\"\n    parser.add_argument(\"--rank-list\",\n                        dest=\"ranks\",\n                        type=str,\n                        default=rank_list_default,\n                        help=\"A comma-separated list of ranks (default: %s)\" % rank_list_default)\n    parser.add_argument('-v', '--verbose',\n                        dest='verbose',\n                        action='store_true',\n                        help='also print some statistics to stderr')\n    parser.add_argument('--debug',\n                        dest='debug',\n                        action='store_true',\n                        help='Debug')\n\n    args = parser.parse_args()\n    if not args.index_base:\n        parser.print_help()\n        exit(1)\n    ranks = args.ranks.split(',')\n    evaluate(args.index_base,\n             ranks,\n             args.verbose,\n             args.debug)\n"
  },
  {
    "path": "example/reads/input.fa",
    "content": ">C_1\nGATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT\n>C_2\nGATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT\n>C_3\nGATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT\n>C_4\nGATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT\n>1_1\nGGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTC\n>1_2\nACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTC\n>2_1\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\n>2_2\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\n>2_3\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\n>2_4\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\n>2_5\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\n>2_6\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\n"
  },
  {
    "path": "example/reference/gi_to_tid.dmp",
    "content": "gi|4\t9646\ngi|7\t9913\n"
  },
  {
    "path": "example/reference/names.dmp",
    "content": "1\t|\tall\t|\t\t|\tsynonym\t|\n1\t|\troot\t|\t\t|\tscientific name\t|\n2759\t|\tEucarya\t|\t\t|\tsynonym\t|\n2759\t|\tEucaryotae\t|\t\t|\tsynonym\t|\n2759\t|\tEukarya\t|\t\t|\tsynonym\t|\n2759\t|\tEukaryota\t|\t\t|\tscientific name\t|\n2759\t|\tEukaryotae\t|\t\t|\tsynonym\t|\n2759\t|\teucaryotes\t|\t\t|\tgenbank common name\t|\n2759\t|\teukaryotes\t|\t\t|\tcommon name\t|\n2759\t|\teukaryotes\t|\teukaryotes<blast2759>\t|\tblast name\t|\n6072\t|\tEumetazoa\t|\t\t|\tscientific name\t|\n7711\t|\tChordata\t|\t\t|\tscientific name\t|\n7711\t|\tchordates\t|\t\t|\tgenbank common name\t|\n7711\t|\tchordates\t|\tchordates<blast7711>\t|\tblast name\t|\n7742\t|\tVertebrata\t|\tVertebrata <Metazoa>\t|\tscientific name\t|\n7742\t|\tVertebrata Cuvier, 1812\t|\t\t|\tauthority\t|\n7742\t|\tvertebrates\t|\t\t|\tgenbank common name\t|\n7742\t|\tvertebrates\t|\tvertebrates<blast7742>\t|\tblast name\t|\n7776\t|\tGnathostomata\t|\tGnathostomata <vertebrate>\t|\tscientific name\t|\n7776\t|\tjawed vertebrates\t|\t\t|\tgenbank common name\t|\n8287\t|\tSarcopterygii\t|\t\t|\tscientific name\t|\n9347\t|\tEutheria\t|\t\t|\tscientific name\t|\n9347\t|\tPlacentalia\t|\t\t|\tsynonym\t|\n9347\t|\teutherian mammals\t|\t\t|\tcommon name\t|\n9347\t|\tplacental mammals\t|\t\t|\tcommon name\t|\n9347\t|\tplacentals\t|\t\t|\tgenbank common name\t|\n9347\t|\tplacentals\t|\tplacentals <blast9347>\t|\tblast name\t|\n9632\t|\tUrsidae\t|\t\t|\tscientific name\t|\n9632\t|\tbears\t|\t\t|\tgenbank common name\t|\n9645\t|\tAiluropoda\t|\t\t|\tscientific name\t|\n9646\t|\tAiluropoda melanoleuca\t|\t\t|\tscientific name\t|\n9646\t|\tAiluropoda melanoleuca (David, 1869)\t|\t\t|\tauthority\t|\n9646\t|\tAiluropoda melanoleura\t|\t\t|\tmisspelling\t|\n9646\t|\tgiant panda\t|\t\t|\tgenbank common name\t|\n9845\t|\tArtiodactyla\t|\tArtiodactyla <Ruminantia>\t|\tin-part\t|\n9845\t|\tRuminantia\t|\t\t|\tscientific name\t|\n9895\t|\tBovidae\t|\t\t|\tscientific name\t|\n9903\t|\tBos\t|\t\t|\tscientific 
name\t|\n9903\t|\toxen, cattle\t|\t\t|\tgenbank common name\t|\n9913\t|\tBos Tauurus\t|\t\t|\tmisspelling\t|\n9913\t|\tBos bovis\t|\t\t|\tsynonym\t|\n9913\t|\tBos primigenius taurus\t|\t\t|\tsynonym\t|\n9913\t|\tBos taurus\t|\t\t|\tscientific name\t|\n9913\t|\tBos taurus Linnaeus, 1758\t|\t\t|\tauthority\t|\n9913\t|\tBovidae sp. Adi Nefas\t|\t\t|\tincludes\t|\n9913\t|\tbovine\t|\t\t|\tcommon name\t|\n9913\t|\tcattle\t|\t\t|\tgenbank common name\t|\n9913\t|\tcow\t|\t\t|\tcommon name\t|\n9913\t|\tdomestic cattle\t|\t\t|\tcommon name\t|\n9913\t|\tdomestic cow\t|\t\t|\tcommon name\t|\n27592\t|\tBovinae\t|\t\t|\tscientific name\t|\n32523\t|\tTetrapoda\t|\t\t|\tscientific name\t|\n32523\t|\ttetrapods\t|\t\t|\tgenbank common name\t|\n32524\t|\tAmniota\t|\t\t|\tscientific name\t|\n32524\t|\tamniotes\t|\t\t|\tgenbank common name\t|\n32525\t|\tTheria\t|\tTheria <Mammalia>\t|\tscientific name\t|\n32525\t|\tTheria Parker & Haswell, 1897\t|\t\t|\tauthority\t|\n33154\t|\tFungi/Metazoa group\t|\t\t|\tsynonym\t|\n33154\t|\tOpisthokonta\t|\t\t|\tscientific name\t|\n33154\t|\tOpisthokonta Cavalier-Smith 1987\t|\t\t|\tauthority\t|\n33154\t|\topisthokonts\t|\t\t|\tsynonym\t|\n33208\t|\tAnimalia\t|\t\t|\tsynonym\t|\n33208\t|\tMetazoa\t|\t\t|\tscientific name\t|\n33208\t|\tanimals\t|\t\t|\tblast name\t|\n33208\t|\tmetazoans\t|\t\t|\tgenbank common name\t|\n33208\t|\tmulticellular animals\t|\t\t|\tcommon name\t|\n33213\t|\tBilateria\t|\t\t|\tscientific name\t|\n33511\t|\tDeuterostomia\t|\t\t|\tscientific name\t|\n33511\t|\tdeuterostomes\t|\t\t|\tcommon name\t|\n33554\t|\tCarnivora\t|\t\t|\tscientific name\t|\n33554\t|\tcarnivores\t|\t\t|\tgenbank common name\t|\n33554\t|\tcarnivores\t|\tcarnivores <blast33554>\t|\tblast name\t|\n35500\t|\tPecora\t|\t\t|\tscientific name\t|\n40674\t|\tMammalia\t|\t\t|\tscientific name\t|\n40674\t|\tmammals\t|\t\t|\tgenbank common name\t|\n40674\t|\tmammals\t|\tmammals<blast40674>\t|\tblast name\t|\n89593\t|\tCraniata\t|\tCraniata <chordata>\t|\tscientific 
name\t|\n91561\t|\tCetartiodactyla\t|\t\t|\tscientific name\t|\n91561\t|\teven-toed ungulates\t|\t\t|\tblast name\t|\n91561\t|\twhales, hippos, ruminants, pigs, camels etc.\t|\t\t|\tgenbank common name\t|\n117570\t|\tTeleostomi\t|\t\t|\tscientific name\t|\n117571\t|\tEuteleostomi\t|\t\t|\tscientific name\t|\n117571\t|\tbony vertebrates\t|\t\t|\tgenbank common name\t|\n131567\t|\tbiota\t|\t\t|\tsynonym\t|\n131567\t|\tcellular organisms\t|\t\t|\tscientific name\t|\n314145\t|\tLaurasiatheria\t|\t\t|\tscientific name\t|\n379584\t|\tCaniformia\t|\t\t|\tscientific name\t|\n1338369\t|\tDipnotetrapodomorpha\t|\t\t|\tscientific name\t|\n1437010\t|\tBoreoeutheria\t|\t\t|\tscientific name\t|\n1437010\t|\tBoreotheria\t|\t\t|\tsynonym\t|\n"
  },
  {
    "path": "example/reference/nodes.dmp",
    "content": "1\t|\t1\t|\tno rank\n2759\t|\t131567\t|\tsuperkingdom\n6072\t|\t33208\t|\tno rank\n7711\t|\t33511\t|\tphylum\n7742\t|\t89593\t|\tno rank\n7776\t|\t7742\t|\tno rank\n8287\t|\t117571\t|\tno rank\n9347\t|\t32525\t|\tno rank\n9632\t|\t379584\t|\tfamily\n9645\t|\t9632\t|\tgenus\n9646\t|\t9645\t|\tspecies\n9845\t|\t91561\t|\tsuborder\n9895\t|\t35500\t|\tfamily\n9903\t|\t27592\t|\tgenus\n9913\t|\t9903\t|\tspecies\n27592\t|\t9895\t|\tsubfamily\n32523\t|\t1338369\t|\tno rank\n32524\t|\t32523\t|\tno rank\n32525\t|\t40674\t|\tno rank\n33154\t|\t2759\t|\tno rank\n33208\t|\t33154\t|\tkingdom\n33213\t|\t6072\t|\tno rank\n33511\t|\t33213\t|\tno rank\n33554\t|\t314145\t|\torder\n35500\t|\t9845\t|\tinfraorder\n40674\t|\t32524\t|\tclass\n89593\t|\t7711\t|\tsubphylum\n91561\t|\t314145\t|\tno rank\n117570\t|\t7776\t|\tno rank\n117571\t|\t117570\t|\tno rank\n131567\t|\t1\t|\tno rank\n314145\t|\t1437010\t|\tsuperorder\n379584\t|\t33554\t|\tsuborder\n1338369\t|\t8287\t|\tno rank\n1437010\t|\t9347\t|\tno rank\n"
  },
  {
    "path": "example/reference/test.fa",
    "content": ">gi|4|emb|X17276.1| Giant Panda satellite 1 DNA\nGGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTC\nACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTC\nGATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT\nCGCGACCACGTTCCCTCATGTTTCCCTATTAACGAAGGGTGATGATAGTGCTAAGACGGTCCCTGTACGGTGTTGTTTCT\nGACAGACGTGTTTTGGGCCTTTTCGTTCCATTGCCGCCAGCAGTTTTGACAGGATTTCCCCAGGGAGCAAACTTTTCGAT\nGGAAACGGGTTTTGGCCGAATTGTCTTTCTCAGTGCTGTGTTCGTCGTGTTTCACTCACGGTACCAAAACACCTTGATTA\nTTGTTCCACCCTCCATAAGGCCGTCGTGACTTCAAGGGCTTTCCCCTCAAACTTTGTTTCTTGGTTCTACGGGCTG\n>gi|7|emb|X51700.1| Bos taurus mRNA for bone Gla protein\nGTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGCCCTGCTGGCCCTGGCCACACTCTGCCTCGC\nTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG\nGATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT\nTGGTGAAGAGACTCAGGCGCTACCTGGACCACTGGCTGGGAGCCCCAGCCCCCTACCCAGATCCGCTGGAGCCCAAGAGG\nGAGGTGTGTGAGCTCAACCCTGACTGTGACGAGCTAGCTGACCACATCGGCTTCCAGGAAGCCTATCGGCGCTTCTACGG\nCCCAGTCTAGAGCTTGCAGCCCTGCCCACCTGGCTGGCAGCCCCCAGCTCTGGCTTCTCTCCAGGACCCCTCCCCTCCCC\nGTCATCCCCGCTGCTCTAGAATAAACTCCAGAAGAGG\n"
  },
  {
    "path": "fast_mutex.h",
    "content": "/* -*- mode: c++; tab-width: 2; indent-tabs-mode: nil; -*-\nCopyright (c) 2010-2012 Marcus Geelnard\n\nThis software is provided 'as-is', without any express or implied\nwarranty. In no event will the authors be held liable for any damages\narising from the use of this software.\n\nPermission is granted to anyone to use this software for any purpose,\nincluding commercial applications, and to alter it and redistribute it\nfreely, subject to the following restrictions:\n\n    1. The origin of this software must not be misrepresented; you must not\n    claim that you wrote the original software. If you use this software\n    in a product, an acknowledgment in the product documentation would be\n    appreciated but is not required.\n\n    2. Altered source versions must be plainly marked as such, and must not be\n    misrepresented as being the original software.\n\n    3. This notice may not be removed or altered from any source\n    distribution.\n*/\n\n#ifndef _FAST_MUTEX_H_\n#define _FAST_MUTEX_H_\n\n/// @file\n\n// Which platform are we on?\n#if !defined(_TTHREAD_PLATFORM_DEFINED_)\n  #if defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)\n    #define _TTHREAD_WIN32_\n  #else\n    #define _TTHREAD_POSIX_\n  #endif\n  #define _TTHREAD_PLATFORM_DEFINED_\n#endif\n\n// Check if we can support the assembly language level implementation (otherwise\n// revert to the system API)\n#if (defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))) || \\\n    (defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))) || \\\n    (defined(__GNUC__) && (defined(__ppc__)))\n  #define _FAST_MUTEX_ASM_\n#else\n  #define _FAST_MUTEX_SYS_\n#endif\n\n#if defined(_TTHREAD_WIN32_)\n  #ifndef WIN32_LEAN_AND_MEAN\n    #define WIN32_LEAN_AND_MEAN\n    #define __UNDEF_LEAN_AND_MEAN\n  #endif\n  #include <windows.h>\n  #ifdef __UNDEF_LEAN_AND_MEAN\n    #undef WIN32_LEAN_AND_MEAN\n    #undef __UNDEF_LEAN_AND_MEAN\n  #endif\n#else\n  #ifdef _FAST_MUTEX_ASM_\n  
  #include <sched.h>\n  #else\n    #include <pthread.h>\n  #endif\n#endif\n\nnamespace tthread {\n\n/// Fast mutex class.\n/// This is a mutual exclusion object for synchronizing access to shared\n/// memory areas for several threads. It is similar to the tthread::mutex class,\n/// but instead of using system level functions, it is implemented as an atomic\n/// spin lock with very low CPU overhead.\n///\n/// The \\c fast_mutex class is NOT compatible with the \\c condition_variable\n/// class (however, it IS compatible with the \\c lock_guard class). It should\n/// also be noted that the \\c fast_mutex class typically does not provide\n/// as accurate thread scheduling as a the standard \\c mutex class does.\n///\n/// Because of the limitations of the class, it should only be used in\n/// situations where the mutex needs to be locked/unlocked very frequently.\n///\n/// @note The \"fast\" version of this class relies on inline assembler language,\n/// which is currently only supported for 32/64-bit Intel x86/AMD64 and\n/// PowerPC architectures on a limited number of compilers (GNU g++ and MS\n/// Visual C++).\n/// For other architectures/compilers, system functions are used instead.\nclass fast_mutex {\n  public:\n    /// Constructor.\n#if defined(_FAST_MUTEX_ASM_)\n    fast_mutex() : mLock(0) {}\n#else\n    fast_mutex()\n    {\n  #if defined(_TTHREAD_WIN32_)\n      InitializeCriticalSection(&mHandle);\n  #elif defined(_TTHREAD_POSIX_)\n      pthread_mutex_init(&mHandle, NULL);\n  #endif\n    }\n#endif\n\n#if !defined(_FAST_MUTEX_ASM_)\n    /// Destructor.\n    ~fast_mutex()\n    {\n  #if defined(_TTHREAD_WIN32_)\n      DeleteCriticalSection(&mHandle);\n  #elif defined(_TTHREAD_POSIX_)\n      pthread_mutex_destroy(&mHandle);\n  #endif\n    }\n#endif\n\n    /// Lock the mutex.\n    /// The method will block the calling thread until a lock on the mutex can\n    /// be obtained. 
The mutex remains locked until \\c unlock() is called.\n    /// @see lock_guard\n    inline void lock()\n    {\n#if defined(_FAST_MUTEX_ASM_)\n      bool gotLock;\n      do {\n        gotLock = try_lock();\n        if(!gotLock)\n        {\n  #if defined(_TTHREAD_WIN32_)\n          Sleep(0);\n  #elif defined(_TTHREAD_POSIX_)\n          sched_yield();\n  #endif\n        }\n      } while(!gotLock);\n#else\n  #if defined(_TTHREAD_WIN32_)\n      EnterCriticalSection(&mHandle);\n  #elif defined(_TTHREAD_POSIX_)\n      pthread_mutex_lock(&mHandle);\n  #endif\n#endif\n    }\n\n    /// Try to lock the mutex.\n    /// The method will try to lock the mutex. If it fails, the function will\n    /// return immediately (non-blocking).\n    /// @return \\c true if the lock was acquired, or \\c false if the lock could\n    /// not be acquired.\n    inline bool try_lock()\n    {\n#if defined(_FAST_MUTEX_ASM_)\n      int oldLock;\n  #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))\n      asm volatile (\n        \"movl $1,%%eax\\n\\t\"\n        \"xchg %%eax,%0\\n\\t\"\n        \"movl %%eax,%1\\n\\t\"\n        : \"=m\" (mLock), \"=m\" (oldLock)\n        :\n        : \"%eax\", \"memory\"\n      );\n  #elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))\n      int *ptrLock = &mLock;\n      __asm {\n        mov eax,1\n        mov ecx,ptrLock\n        xchg eax,[ecx]\n        mov oldLock,eax\n      }\n  #elif defined(__GNUC__) && (defined(__ppc__))\n      int newLock = 1;\n      asm volatile (\n        \"\\n1:\\n\\t\"\n        \"lwarx  %0,0,%1\\n\\t\"\n        \"cmpwi  0,%0,0\\n\\t\"\n        \"bne-   2f\\n\\t\"\n        \"stwcx. %2,0,%1\\n\\t\"\n        \"bne-   1b\\n\\t\"\n        \"isync\\n\"\n        \"2:\\n\\t\"\n        : \"=&r\" (oldLock)\n        : \"r\" (&mLock), \"r\" (newLock)\n        : \"cr0\", \"memory\"\n      );\n  #endif\n      return (oldLock == 0);\n#else\n  #if defined(_TTHREAD_WIN32_)\n      return TryEnterCriticalSection(&mHandle) ? 
true : false;\n  #elif defined(_TTHREAD_POSIX_)\n      return (pthread_mutex_trylock(&mHandle) == 0) ? true : false;\n  #endif\n#endif\n    }\n\n    /// Unlock the mutex.\n    /// If any threads are waiting for the lock on this mutex, one of them will\n    /// be unblocked.\n    inline void unlock()\n    {\n#if defined(_FAST_MUTEX_ASM_)\n  #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))\n      asm volatile (\n        \"movl $0,%%eax\\n\\t\"\n        \"xchg %%eax,%0\\n\\t\"\n        : \"=m\" (mLock)\n        :\n        : \"%eax\", \"memory\"\n      );\n  #elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))\n      int *ptrLock = &mLock;\n      __asm {\n        mov eax,0\n        mov ecx,ptrLock\n        xchg eax,[ecx]\n      }\n  #elif defined(__GNUC__) && (defined(__ppc__))\n      asm volatile (\n        \"sync\\n\\t\"  // Replace with lwsync where possible?\n        : : : \"memory\"\n      );\n      mLock = 0;\n  #endif\n#else\n  #if defined(_TTHREAD_WIN32_)\n      LeaveCriticalSection(&mHandle);\n  #elif defined(_TTHREAD_POSIX_)\n      pthread_mutex_unlock(&mHandle);\n  #endif\n#endif\n    }\n\n  private:\n#if defined(_FAST_MUTEX_ASM_)\n    int mLock;\n#else\n  #if defined(_TTHREAD_WIN32_)\n    CRITICAL_SECTION mHandle;\n  #elif defined(_TTHREAD_POSIX_)\n    pthread_mutex_t mHandle;\n  #endif\n#endif\n};\n\n}\n\n#endif // _FAST_MUTEX_H_\n\n"
  },
  {
    "path": "filebuf.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef FILEBUF_H_\n#define FILEBUF_H_\n\n#include <iostream>\n#include <fstream>\n#include <string>\n#include <stdio.h>\n#include <string.h>\n#include <stdint.h>\n#include <stdexcept>\n#include \"assert_helpers.h\"\n\n/**\n * Simple, fast helper for determining if a character is a newline.\n */\nstatic inline bool isnewline(int c) {\n\treturn c == '\\r' || c == '\\n';\n}\n\n/**\n * Simple, fast helper for determining if a character is a non-newline\n * whitespace character.\n */\nstatic inline bool isspace_notnl(int c) {\n\treturn isspace(c) && !isnewline(c);\n}\n\n/**\n * Simple wrapper for a FILE*, istream or ifstream that reads it in chunks\n * using fread and keeps those chunks in a buffer.  
It also services calls to\n * get(), peek() and gets() from the buffer, reading in additional chunks when\n * necessary.\n *\n * Helper functions do things like parse strings, numbers, and FASTA records.\n *\n *\n */\nclass FileBuf {\npublic:\n\tFileBuf() {\n\t\tinit();\n\t}\n\n\tFileBuf(FILE *in) {\n\t\tinit();\n\t\t_in = in;\n\t\tassert(_in != NULL);\n\t}\n\n\tFileBuf(std::ifstream *inf) {\n\t\tinit();\n\t\t_inf = inf;\n\t\tassert(_inf != NULL);\n\t}\n\n\tFileBuf(std::istream *ins) {\n\t\tinit();\n\t\t_ins = ins;\n\t\tassert(_ins != NULL);\n\t}\n\n\t/**\n\t * Return true iff there is a stream ready to read.\n\t */\n\tbool isOpen() {\n\t\treturn _in != NULL || _inf != NULL || _ins != NULL;\n\t}\n\n\t/**\n\t * Close the input stream (if that's possible)\n\t */\n\tvoid close() {\n\t\tif(_in != NULL && _in != stdin) {\n\t\t\tfclose(_in);\n\t\t} else if(_inf != NULL) {\n\t\t\t_inf->close();\n\t\t} else {\n\t\t\t// can't close _ins\n\t\t}\n\t}\n\n\t/**\n\t * Get the next character of input and advance.\n\t */\n\tint get() {\n\t\tassert(_in != NULL || _inf != NULL || _ins != NULL);\n\t\tint c = peek();\n\t\tif(c != -1) {\n\t\t\t_cur++;\n\t\t\tif(_lastn_cur < LASTN_BUF_SZ) _lastn_buf[_lastn_cur++] = c;\n\t\t}\n\t\treturn c;\n\t}\n\n\t/**\n\t * Return true iff all input is exhausted.\n\t */\n\tbool eof() {\n\t\treturn (_cur == _buf_sz) && _done;\n\t}\n\n\t/**\n\t * Initialize the buffer with a new C-style file.\n\t */\n\tvoid newFile(FILE *in) {\n\t\t_in = in;\n\t\t_inf = NULL;\n\t\t_ins = NULL;\n\t\t_cur = BUF_SZ;\n\t\t_buf_sz = BUF_SZ;\n\t\t_done = false;\n\t}\n\n\t/**\n\t * Initialize the buffer with a new ifstream.\n\t */\n\tvoid newFile(std::ifstream *__inf) {\n\t\t_in = NULL;\n\t\t_inf = __inf;\n\t\t_ins = NULL;\n\t\t_cur = BUF_SZ;\n\t\t_buf_sz = BUF_SZ;\n\t\t_done = false;\n\t}\n\n\t/**\n\t * Initialize the buffer with a new istream.\n\t */\n\tvoid newFile(std::istream *__ins) {\n\t\t_in = NULL;\n\t\t_inf = NULL;\n\t\t_ins = __ins;\n\t\t_cur = BUF_SZ;\n\t\t_buf_sz 
= BUF_SZ;\n\t\t_done = false;\n\t}\n\n\t/**\n\t * Restore state as though we just started reading the input\n\t * stream.\n\t */\n\tvoid reset() {\n\t\tif(_inf != NULL) {\n\t\t\t_inf->clear();\n\t\t\t_inf->seekg(0, std::ios::beg);\n\t\t} else if(_ins != NULL) {\n\t\t\t_ins->clear();\n\t\t\t_ins->seekg(0, std::ios::beg);\n\t\t} else {\n\t\t\trewind(_in);\n\t\t}\n\t\t_cur = BUF_SZ;\n\t\t_buf_sz = BUF_SZ;\n\t\t_done = false;\n\t}\n\n\t/**\n\t * Peek at the next character of the input stream without\n\t * advancing.  Typically we can simply read it from the buffer.\n\t * Occasionally we'll need to read in a new buffer's worth of data.\n\t */\n\tint peek() {\n\t\tassert(_in != NULL || _inf != NULL || _ins != NULL);\n\t\tassert_leq(_cur, _buf_sz);\n\t\tif(_cur == _buf_sz) {\n\t\t\tif(_done) {\n\t\t\t\t// We already exhausted the input stream\n\t\t\t\treturn -1;\n\t\t\t}\n\t\t\t// Read a new buffer's worth of data\n\t\t\telse {\n\t\t\t\t// Get the next chunk\n\t\t\t\tif(_inf != NULL) {\n\t\t\t\t\t_inf->read((char*)_buf, BUF_SZ);\n\t\t\t\t\t_buf_sz = _inf->gcount();\n\t\t\t\t} else if(_ins != NULL) {\n\t\t\t\t\t_ins->read((char*)_buf, BUF_SZ);\n\t\t\t\t\t_buf_sz = _ins->gcount();\n\t\t\t\t} else {\n\t\t\t\t\tassert(_in != NULL);\n\t\t\t\t\t_buf_sz = fread(_buf, 1, BUF_SZ, _in);\n\t\t\t\t}\n\t\t\t\t_cur = 0;\n\t\t\t\tif(_buf_sz == 0) {\n\t\t\t\t\t// Exhausted, and we have nothing to return to the\n\t\t\t\t\t// caller\n\t\t\t\t\t_done = true;\n\t\t\t\t\treturn -1;\n\t\t\t\t} else if(_buf_sz < BUF_SZ) {\n\t\t\t\t\t// Exhausted\n\t\t\t\t\t_done = true;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\treturn (int)_buf[_cur];\n\t}\n\n\t/**\n\t * Store a string of characters from the input file into 'buf',\n\t * until we see a newline, EOF, or until 'len' characters have been\n\t * read.\n\t */\n\tsize_t gets(char *buf, size_t len) {\n\t\tsize_t stored = 0;\n\t\twhile(true) {\n\t\t\tint c = get();\n\t\t\tif(c == -1) {\n\t\t\t\t// End-of-file\n\t\t\t\tbuf[stored] = '\\0';\n\t\t\t\treturn 
stored;\n\t\t\t}\n\t\t\tif(stored == len-1 || isnewline(c)) {\n\t\t\t\t// End of string\n\t\t\t\tbuf[stored] = '\\0';\n\t\t\t\t// Skip over all end-of-line characters\n\t\t\t\tint pc = peek();\n\t\t\t\twhile(isnewline(pc)) {\n\t\t\t\t\tget(); // discard\n\t\t\t\t\tpc = peek();\n\t\t\t\t}\n\t\t\t\t// Next get() will be after all newline characters\n\t\t\t\treturn stored;\n\t\t\t}\n\t\t\tbuf[stored++] = (char)c;\n\t\t}\n\t}\n\n\t/**\n\t * Store up to 'len' characters from the input file into 'buf',\n\t * stopping early only at EOF.  Unlike gets(), newlines are copied and\n\t * 'buf' is not NUL-terminated.  Return the number of characters stored.\n\t */\n\tsize_t get(char *buf, size_t len) {\n\t\tsize_t stored = 0;\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tint c = get();\n\t\t\tif(c == -1) return i;\n\t\t\tbuf[stored++] = (char)c;\n\t\t}\n\t\treturn len;\n\t}\n\n\tstatic const size_t LASTN_BUF_SZ = 8 * 1024;\n\n\t/**\n\t * Keep get()ing characters until a non-whitespace character (or\n\t * -1) is reached, and return it.\n\t */\n\tint getPastWhitespace() {\n\t\tint c;\n\t\twhile(isspace(c = get()) && c != -1);\n\t\treturn c;\n\t}\n\n\t/**\n\t * Keep get()ing characters until we've passed over the next\n\t * string of newline characters (\\r's and \\n's) or -1 is reached,\n\t * and return it.\n\t */\n\tint getPastNewline() {\n\t\tint c = get();\n\t\twhile(!isnewline(c) && c != -1) c = get();\n\t\twhile(isnewline(c)) c = get();\n\t\tassert_neq(c, '\\r');\n\t\tassert_neq(c, '\\n');\n\t\treturn c;\n\t}\n\n\t/**\n\t * Keep get()ing characters until we've passed over the next\n\t * string of newline characters (\\r's and \\n's) or -1 is reached,\n\t * and return it.\n\t */\n\tint peekPastNewline() {\n\t\tint c = peek();\n\t\twhile(!isnewline(c) && c != -1) c = get();\n\t\twhile(isnewline(c)) c = get();\n\t\tassert_neq(c, '\\r');\n\t\tassert_neq(c, '\\n');\n\t\treturn c;\n\t}\n\n\t/**\n\t * Keep peek()ing then get()ing characters until the next return\n\t * from peek() is just after the last newline of the line.\n\t */\n\tint 
peekUptoNewline() {\n\t\tint c = peek();\n\t\twhile(!isnewline(c) && c != -1) {\n\t\t\tget(); c = peek();\n\t\t}\n\t\twhile(isnewline(c)) {\n\t\t\tget();\n\t\t\tc = peek();\n\t\t}\n\t\tassert_neq(c, '\\r');\n\t\tassert_neq(c, '\\n');\n\t\treturn c;\n\t}\n\t\n\t/**\n\t * Parse a FASTA record.  Append name characters to 'name' and append\n\t * all sequence characters to 'seq'.  If gotCaret is true, assume the\n\t * file cursor has already moved just past the starting '>' character.\n\t */\n\ttemplate <typename TNameStr, typename TSeqStr>\n\tvoid parseFastaRecord(\n\t\tTNameStr& name,\n\t\tTSeqStr&  seq,\n\t\tbool      gotCaret = false)\n\t{\n\t\tint c;\n\t\tif(!gotCaret) {\n\t\t\t// Skip over caret and non-newline whitespace\n\t\t\tc = peek();\n\t\t\twhile(isspace_notnl(c) || c == '>') { get(); c = peek(); }\n\t\t} else {\n\t\t\t// Skip over non-newline whitespace\n\t\t\tc = peek();\n\t\t\twhile(isspace_notnl(c)) { get(); c = peek(); }\n\t\t}\n\t\tsize_t namecur = 0, seqcur = 0;\n\t\t// c is the first character of the fasta name record, or is the first\n\t\t// newline character if the name record is empty\n\t\twhile(!isnewline(c) && c != -1) {\n\t\t\tname[namecur++] = c; get(); c = peek();\n\t\t}\n\t\t// sequence consists of all the non-whitespace characters between here\n\t\t// and the next caret\n\t\twhile(true) {\n\t\t\t// skip over whitespace\n\t\t\twhile(isspace(c)) { get(); c = peek(); }\n\t\t\t// if we see caret or EOF, break\n\t\t\tif(c == '>' || c == -1) break;\n\t\t\t// append and continue\n\t\t\tseq[seqcur++] = c;\n\t\t\tget(); c = peek();\n\t\t}\n\t}\n\n\t/**\n\t * Parse a FASTA record and return its length.  
If gotCaret is true,\n\t * assuming the file cursor has already moved just past the starting '>'\n\t * character.\n\t */\n\tvoid parseFastaRecordLength(\n\t\tsize_t&   nameLen,\n\t\tsize_t&   seqLen,\n\t\tbool      gotCaret = false)\n\t{\n\t\tint c;\n\t\tnameLen = seqLen = 0;\n\t\tif(!gotCaret) {\n\t\t\t// Skip over caret and non-newline whitespace\n\t\t\tc = peek();\n\t\t\twhile(isspace_notnl(c) || c == '>') { get(); c = peek(); }\n\t\t} else {\n\t\t\t// Skip over non-newline whitespace\n\t\t\tc = peek();\n\t\t\twhile(isspace_notnl(c)) { get(); c = peek(); }\n\t\t}\n\t\t// c is the first character of the fasta name record, or is the first\n\t\t// newline character if the name record is empty\n\t\twhile(!isnewline(c) && c != -1) {\n\t\t\tnameLen++; get(); c = peek();\n\t\t}\n\t\t// sequence consists of all the non-whitespace characters between here\n\t\t// and the next caret\n\t\twhile(true) {\n\t\t\t// skip over whitespace\n\t\t\twhile(isspace(c)) { get(); c = peek(); }\n\t\t\t// if we see caret or EOF, break\n\t\t\tif(c == '>' || c == -1) break;\n\t\t\t// append and continue\n\t\t\tseqLen++;\n\t\t\tget(); c = peek();\n\t\t}\n\t}\n\n\t/**\n\t * Reset to the beginning of the last-N-chars buffer.\n\t */\n\tvoid resetLastN() {\n\t\t_lastn_cur = 0;\n\t}\n\n\t/**\n\t * Copy the last several characters in the last-N-chars buffer\n\t * (since the last reset) into the provided buffer.\n\t */\n\tsize_t copyLastN(char *buf) {\n\t\tmemcpy(buf, _lastn_buf, _lastn_cur);\n\t\treturn _lastn_cur;\n\t}\n\n\t/**\n\t * Get const pointer to the last-N-chars buffer.\n\t */\n\tconst char *lastN() const {\n\t\treturn _lastn_buf;\n\t}\n\n\t/**\n\t * Get current size of the last-N-chars buffer.\n\t */\n\tsize_t lastNLen() const {\n\t\treturn _lastn_cur;\n\t}\n\nprivate:\n\n\tvoid init() {\n\t\t_in = NULL;\n\t\t_inf = NULL;\n\t\t_ins = NULL;\n\t\t_cur = _buf_sz = BUF_SZ;\n\t\t_done = false;\n\t\t_lastn_cur = 0;\n\t\t// no need to clear _buf[]\n\t}\n\n\tstatic const size_t BUF_SZ = 256 * 
1024;\n\tFILE     *_in;\n\tstd::ifstream *_inf;\n\tstd::istream  *_ins;\n\tsize_t    _cur;\n\tsize_t    _buf_sz;\n\tbool      _done;\n\tuint8_t   _buf[BUF_SZ]; // (large) input buffer\n\tsize_t    _lastn_cur;\n\tchar      _lastn_buf[LASTN_BUF_SZ]; // buffer of the last N chars dispensed\n};\n\n/**\n * Wrapper for a buffered output stream that writes bitpairs.\n */\nclass BitpairOutFileBuf {\npublic:\n\t/**\n\t * Open a new output stream to a file with given name.\n\t */\n\tBitpairOutFileBuf(const char *in) : bpPtr_(0), cur_(0) {\n\t\tassert(in != NULL);\n\t\tout_ = fopen(in, \"wb\");\n\t\tif(out_ == NULL) {\n\t\t\tstd::cerr << \"Error: Could not open bitpair-output file \" << in << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tmemset(buf_, 0, BUF_SZ);\n\t}\n\n\t/**\n\t * Write a single bitpair into the buf.  Flush the buffer if it's\n\t * full.\n\t */\n\tvoid write(int bp) {\n\t\tassert_lt(bp, 4);\n\t\tassert_geq(bp, 0);\n\t\tbuf_[cur_] |= (bp << bpPtr_);\n\t\tif(bpPtr_ == 6) {\n\t\t\tbpPtr_ = 0;\n\t\t\tcur_++;\n\t\t\tif(cur_ == BUF_SZ) {\n\t\t\t\t// Flush the buffer\n\t\t\t\tif(!fwrite((const void *)buf_, BUF_SZ, 1, out_)) {\n\t\t\t\t\tstd::cerr << \"Error writing to the reference index file (.4.ebwt)\" << std::endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\t// Reset to beginning of the buffer\n\t\t\t\tcur_ = 0;\n\t\t\t}\n\t\t\t// Initialize next octet to 0\n\t\t\tbuf_[cur_] = 0;\n\t\t} else {\n\t\t\tbpPtr_ += 2;\n\t\t}\n\t}\n\n\t/**\n\t * Write any remaining bitpairs and then close the input\n\t */\n\tvoid close() {\n\t\tif(cur_ > 0 || bpPtr_ > 0) {\n\t\t\tif(bpPtr_ == 0) cur_--;\n\t\t\tif(!fwrite((const void *)buf_, cur_ + 1, 1, out_)) {\n\t\t\t\tstd::cerr << \"Error writing to the reference index file (.4.ebwt)\" << std::endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\t\tfclose(out_);\n\t}\nprivate:\n\tstatic const size_t BUF_SZ = 128 * 1024;\n\tFILE    *out_;\n\tint      bpPtr_;\n\tsize_t   cur_;\n\tchar     buf_[BUF_SZ]; // (large) input buffer\n};\n\n/**\n * Wrapper for a 
buffered output stream that writes characters and\n * other data types.  This class is *not* synchronized; the caller is\n * responsible for synchronization.\n */\nclass OutFileBuf {\n\npublic:\n\n\t/**\n\t * Open a new output stream to a file with given name.\n\t */\n\tOutFileBuf(const std::string& out, bool binary = false) :\n\t\tname_(out.c_str()), cur_(0), closed_(false)\n\t{\n\t\tout_ = fopen(out.c_str(), binary ? \"wb\" : \"w\");\n\t\tif(out_ == NULL) {\n\t\t\tstd::cerr << \"Error: Could not open alignment output file \" << out.c_str() << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tif(setvbuf(out_, NULL, _IOFBF, 10* 1024* 1024)) \n\t\t\tstd::cerr << \"Warning: Could not allocate the proper buffer size for output file stream. \" << std::endl;\n\t}\n\n\t/**\n\t * Open a new output stream to a file with given name.\n\t */\n\tOutFileBuf(const char *out, bool binary = false) :\n\t\tname_(out), cur_(0), closed_(false)\n\t{\n\t\tassert(out != NULL);\n\t\tout_ = fopen(out, binary ? \"wb\" : \"w\");\n\t\tif(out_ == NULL) {\n\t\t\tstd::cerr << \"Error: Could not open alignment output file \" << out << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t}\n\n\t/**\n\t * Open a new output stream to standard out.\n\t */\n\tOutFileBuf() : name_(\"cout\"), cur_(0), closed_(false) {\n\t\tout_ = stdout;\n\t}\n\t\n\t/**\n\t * Close buffer when object is destroyed.\n\t */\n\t~OutFileBuf() { close(); }\n\n\t/**\n\t * Open a new output stream to a file with given name.\n\t */\n\tvoid setFile(const char *out, bool binary = false) {\n\t\tassert(out != NULL);\n\t\tout_ = fopen(out, binary ? 
\"wb\" : \"w\");\n\t\tif(out_ == NULL) {\n\t\t\tstd::cerr << \"Error: Could not open alignment output file \" << out << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t\treset();\n\t}\n\n\t/**\n\t * Write a single character into the write buffer and, if\n\t * necessary, flush.\n\t */\n\tvoid write(char c) {\n\t\tassert(!closed_);\n\t\tif(cur_ == BUF_SZ) flush();\n\t\tbuf_[cur_++] = c;\n\t}\n\n\t/**\n\t * Write a c++ string to the write buffer and, if necessary, flush.\n\t */\n\tvoid writeString(const std::string& s) {\n\t\tassert(!closed_);\n\t\tsize_t slen = s.length();\n\t\tif(cur_ + slen > BUF_SZ) {\n\t\t\tif(cur_ > 0) flush();\n\t\t\tif(slen >= BUF_SZ) {\n\t\t\t\tfwrite(s.c_str(), slen, 1, out_);\n\t\t\t} else {\n\t\t\t\tmemcpy(&buf_[cur_], s.data(), slen);\n\t\t\t\tassert_eq(0, cur_);\n\t\t\t\tcur_ = slen;\n\t\t\t}\n\t\t} else {\n\t\t\tmemcpy(&buf_[cur_], s.data(), slen);\n\t\t\tcur_ += slen;\n\t\t}\n\t\tassert_leq(cur_, BUF_SZ);\n\t}\n\n\t/**\n\t * Write a c++ string to the write buffer and, if necessary, flush.\n\t */\n\ttemplate<typename T>\n\tvoid writeString(const T& s) {\n\t\tassert(!closed_);\n\t\tsize_t slen = s.length();\n\t\tif(cur_ + slen > BUF_SZ) {\n\t\t\tif(cur_ > 0) flush();\n\t\t\tif(slen >= BUF_SZ) {\n\t\t\t\tfwrite(s.toZBuf(), slen, 1, out_);\n\t\t\t} else {\n\t\t\t\tmemcpy(&buf_[cur_], s.toZBuf(), slen);\n\t\t\t\tassert_eq(0, cur_);\n\t\t\t\tcur_ = slen;\n\t\t\t}\n\t\t} else {\n\t\t\tmemcpy(&buf_[cur_], s.toZBuf(), slen);\n\t\t\tcur_ += slen;\n\t\t}\n\t\tassert_leq(cur_, BUF_SZ);\n\t}\n\n\t/**\n\t * Write a c++ string to the write buffer and, if necessary, flush.\n\t */\n\tvoid writeChars(const char * s, size_t len) {\n\t\tassert(!closed_);\n\t\tif(cur_ + len > BUF_SZ) {\n\t\t\tif(cur_ > 0) flush();\n\t\t\tif(len >= BUF_SZ) {\n\t\t\t\tfwrite(s, len, 1, out_);\n\t\t\t} else {\n\t\t\t\tmemcpy(&buf_[cur_], s, len);\n\t\t\t\tassert_eq(0, cur_);\n\t\t\t\tcur_ = len;\n\t\t\t}\n\t\t} else {\n\t\t\tmemcpy(&buf_[cur_], s, len);\n\t\t\tcur_ += 
len;\n\t\t}\n\t\tassert_leq(cur_, BUF_SZ);\n\t}\n\n\t/**\n\t * Write a 0-terminated C string to the output stream.\n\t */\n\tvoid writeChars(const char * s) {\n\t\twriteChars(s, strlen(s));\n\t}\n\n\t/**\n\t * Write any remaining bitpairs and then close the input\n\t */\n\tvoid close() {\n\t\tif(closed_) return;\n\t\tif(cur_ > 0) flush();\n\t\tclosed_ = true;\n\t\tif(out_ != stdout) {\n\t\t\tfclose(out_);\n\t\t}\n\t}\n\n\t/**\n\t * Reset so that the next write is as though it's the first.\n\t */\n\tvoid reset() {\n\t\tcur_ = 0;\n\t\tclosed_ = false;\n\t}\n\n\tvoid flush() {\n\t\tif(!fwrite((const void *)buf_, cur_, 1, out_)) {\n\t\t\tstd::cerr << \"Error while flushing and closing output\" << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tcur_ = 0;\n\t}\n\n\t/**\n\t * Return true iff this stream is closed.\n\t */\n\tbool closed() const {\n\t\treturn closed_;\n\t}\n\n\t/**\n\t * Return the filename.\n\t */\n\tconst char *name() {\n\t\treturn name_;\n\t}\n\nprivate:\n\n\tstatic const size_t BUF_SZ = 16 * 1024;\n\n\tconst char *name_;\n\tFILE       *out_;\n\tsize_t      cur_;\n\tchar        buf_[BUF_SZ]; // (large) input buffer\n\tbool        closed_;\n};\n\n#endif /*ndef FILEBUF_H_*/\n"
  },
  {
    "path": "formats.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef FORMATS_H_\n#define FORMATS_H_\n\n#include <iostream>\n\n/**\n * File-format constants and names\n */\n\nenum file_format {\n\tFASTA = 1,\n\tFASTA_CONT,\n\tFASTQ,\n\tTAB_MATE5,\n\tTAB_MATE6,\n\tRAW,\n\tCMDLINE,\n\tQSEQ,\n    SRA_FASTA,\n    SRA_FASTQ\n};\n\nstatic const std::string file_format_names[] = {\n\t\"Invalid!\",\n\t\"FASTA\",\n\t\"FASTA sampling\",\n\t\"FASTQ\",\n\t\"Tabbed mated\",\n\t\"Raw\",\n\t\"Command line\",\n\t\"Chain file\",\n\t\"Random\",\n\t\"Qseq\",\n    \"SRA_FASTA\",\n    \"SRA_FASTQ\"\n};\n\n#endif /*FORMATS_H_*/\n"
  },
  {
    "path": "functions.sh",
    "content": "#!/bin/bash\n\nfunction check_or_mkdir {\n    echo -n \"Creating $1 ... \" >&2\n    if [[ -d $1 && ! -n `find $1 -prune -empty -type d` ]]; then\n        echo \"Directory exists - skipping it!\" >&2\n        return `false`\n    else \n        echo \"Done\" >&2\n        mkdir -p $1\n        return `true`\n    fi\n}\n\nfunction check_or_mkdir_no_fail {\n    echo -n \"Creating $1 ... \" >&2\n    if [[ -d $1 && ! -n `find $1 -prune -empty -type d` ]]; then\n        echo \"Directory exists already! Continuing\" >&2\n        return `true`\n    else \n        echo \"Done\" >&2\n        mkdir -p $1\n        return `true`\n    fi\n}\n\n\n\n## Functions\nfunction validate_url(){\n  if [[ `wget --reject=\"index.html*\" -S --spider $1  2>&1 | egrep 'HTTP/1.1 200 OK|File .* exists.'` ]]; then echo \"true\"; fi\n}\nexport -f validate_url\n\nfunction c_echo() {\n        printf \"\\033[34m$*\\033[0m\\n\"\n}\n\nprogressfilt () {\n    # from http://stackoverflow.com/a/4687912/299878\n    local flag=false c count cr=$'\\r' nl=$'\\n'\n    while IFS='' read -d '' -rn 1 c\n    do\n        if $flag\n        then\n            printf '%c' \"$c\"\n        else\n            if [[ $c != $cr && $c != $nl ]]\n            then\n                count=0\n            else\n                ((count++))\n                if ((count > 1))\n                then\n                    flag=true\n                fi\n            fi\n        fi\n    done\n}\n"
  },
  {
    "path": "group_walk.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"group_walk.h\"\n"
  },
  {
    "path": "group_walk.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/*\n * group_walk.h\n *\n * Classes and routines for walking a set of BW ranges backwards from the edge\n * of a seed hit with the goal of resolving the offset of each row in each\n * range.  Here \"offset\" means offset into the concatenated string of all\n * references.  The main class is 'GroupWalk' and an important helper is\n * 'GWState'.\n *\n * For each combination of seed offset and orientation, there is an associated\n * QVal.  Each QVal describes a (possibly empty) set of suffix array ranges.\n * Call these \"seed range sets.\"  Each range in the set is \"backed\" by a range\n * of the salist, represented as a PListSlice. Such a range is the origin of a\n * walk.\n *\n * When an offset is resolved, it is entered into the salist via the\n * PListSlice.  Note that other routines in this same thread might also be\n * setting elements of the salist, so routines here should expect that elements\n * can go from unresolved to resolved at any time.\n *\n * What bookkeeping do we have to do as we walk?  Before the first step, we\n * convert the initial QVal into a list of SATuples; the SATuples are our link\n * to the corresponding ranges in the suffix array.  
The list of SATuples is\n * then converted to a list of GWState objects; these keep track of where we\n * are in our walk (e.g. what 'top' and 'bot' are, how many steps we have gone,\n * etc) as well as how the elements in the current range correspond to elements\n * from the original range.\n *\n * The user asks the GroupWalk to resolve another offset by calling advance().\n * advance() can be called in various ways:\n *\n * (a) The user can request that the GroupWalk proceed until a\n *     *particular* element is resolved, then return that resolved\n *     element.  Other elements may be resolved along the way, but\n *     those results are buffered and may be dispensed in future calls\n *     to advance().\n *\n * (b) The user can request that the GroupWalk select an as-yet-\n *     unreported element at random and proceed until that element\n *     is resolved and report it.  Again, other elements may be\n *     resolved along the way but they are buffered.\n *\n * (c) The user can request that the GroupWalk resolve elements in a\n *     particular BW range (with a particular offset and orientation)\n *     in an order of its choosing.  The GroupWalk in this case\n *     attempts to resolve as many offsets as possible as quickly as\n *     possible, and returns them as soon as they're found.  The res_\n *     buffer is used in this case.\n *\n * (d) Like (c) but resolving elements at a particular offset and\n *     orientation instead of at a specific BW range.  The res_ buffer\n *     is used in this case, since there's a chance that the \n *\n * There are simple ways to heuristically reduce the problem size while\n * maintaining randomness.  For instance, the user can put a ceiling on the\n * number of elements that we walk from any given seed offset or range.\n * We can then trim away random subranges to reduce the size of the\n * problem.  
There is no need for the caller to do this for us.\n */\n\n#ifndef GROUP_WALK_H_\n#define GROUP_WALK_H_\n\n#include <stdint.h>\n#include <limits>\n#include \"ds.h\"\n#include \"bt2_idx.h\"\n#include \"read.h\"\n#include \"reference.h\"\n#include \"mem_ids.h\"\n\n/**\n * Encapsulate an SA range and an associated list of slots where the resolved\n * offsets can be placed.\n */\ntemplate<typename T>\nclass SARangeWithOffs {\n\npublic:\n\n\tSARangeWithOffs() { reset(); };\n\n\tSARangeWithOffs(TIndexOffU tf, size_t len, const T& o) {\n\t\tinit(tf, len, o);\n\t}\n\t\n\tvoid init(TIndexOffU tf, size_t len_, const T& o) {\n\t\ttopf = tf; len = len_, offs = o;\n\t}\n\n\t/**\n\t * Reset to uninitialized state.\n\t */\n\tvoid reset() { topf = std::numeric_limits<TIndexOffU>::max(); }\n\t\n\t/**\n\t * Return true if this is initialized.\n\t */\n\tbool inited() const {\n\t\treturn topf != std::numeric_limits<TIndexOffU>::max();\n\t}\n\t\n\t/**\n\t * Return the number of times this reference substring occurs in the\n\t * reference, which is also the size of the 'offs' TSlice.\n\t */\n\tsize_t size() const { return offs.size(); }\n\n\tTIndexOffU topf; // top in BWT index\n\tsize_t    len;  // length of the reference sequence involved\n\tT         offs; // offsets\n};\n\n/**\n * A group of per-thread state that can be shared between all the GroupWalks\n * used in that thread.\n */\ntemplate <typename index_t>\nstruct GroupWalkState {\n\n\tGroupWalkState(int cat) : map(cat) {\n\t\tmasks[0].setCat(cat);\n\t\tmasks[1].setCat(cat);\n\t\tmasks[2].setCat(cat);\n\t\tmasks[3].setCat(cat);\n\t}\n\n\tEList<bool> masks[4];      // temporary list for masks; used in GWState\n\tEList<index_t, 16> map;   // temporary list of GWState maps\n};\n\n/**\n * Encapsulates counters that encode how much work the walk-left logic\n * has done.\n */\nstruct WalkMetrics {\n\n\tWalkMetrics() {\n\t    reset();\n\t}\n\n\t/**\n\t * Sum each across this object and 'm'.  
This is the only safe way\n\t * to update a WalkMetrics shared by many threads.\n\t */\n\tvoid merge(const WalkMetrics& m, bool getLock = false) {\n\t\tThreadSafe ts(&mutex_m, getLock);\n\t\tbwops += m.bwops;\n\t\tbranches += m.branches;\n\t\tresolves += m.resolves;\n\t\trefresolves += m.refresolves;\n\t\treports += m.reports;\n\t}\n\t\n\t/**\n\t * Set all to 0.\n\t */\n\tvoid reset() {\n\t\tbwops = branches = resolves = refresolves = reports = 0;\n\t}\n\n\tuint64_t bwops;       // Burrows-Wheeler operations\n\tuint64_t branches;    // BW range branch-offs\n\tuint64_t resolves;    // # offs resolved with BW walk-left\n\tuint64_t refresolves; // # resolutions caused by reference scanning\n\tuint64_t reports;     // # offs reported (1 can be reported many times)\n\tMUTEX_T mutex_m;\n};\n\n/**\n * Coordinates for a BW element that the GroupWalk might resolve.\n */\ntemplate <typename index_t>\nstruct GWElt {\n\n\tGWElt() { reset(); }\n\t\n\t/**\n\t * Reset GWElt to uninitialized state.\n\t */\n\tvoid reset() {\n\t\toffidx = range = elt = len = (index_t)0xffffffff;\n\t\tfw = false;\n\t}\n\n\t/**\n\t * Initialize this GWElt.\n\t */\n\tvoid init(\n\t\tindex_t oi,\n\t\tbool f,\n\t\tindex_t r,\n\t\tindex_t e,\n\t\tindex_t l)\n\t{\n\t\toffidx = oi;\n\t\tfw = f;\n\t\trange = r;\n\t\telt = e;\n\t\tlen = l;\n\t}\n\n\t/**\n\t * Return true iff this GWElt and the given GWElt refer to the same\n\t * element.\n\t */\n\tbool operator==(const GWElt& o) const {\n\t\treturn offidx == o.offidx &&\n\t\t       fw == o.fw &&\n\t\t       range == o.range &&\n\t\t       elt == o.elt &&\n\t\t       len == o.len;\n\t}\n\t\n\t/**\n\t * Return true iff this GWElt and the given GWElt refer to\n\t * different elements.\n\t */\n\tbool operator!=(const GWElt& o) const {\n\t\treturn !(*this == o);\n\t}\n\n\tindex_t offidx; // seed offset index\n\tbool    fw;     // strand\n\tindex_t range;  // range\n\tindex_t elt;    // element\n\tindex_t len;    // length\n};\n\n/**\n * A record encapsulating 
the result of looking up one BW element in\n * the Bowtie index.\n */\ntemplate <typename index_t>\nstruct WalkResult {\n\n\tWalkResult() { reset(); }\n\t\n\t/**\n\t * Reset WalkResult to uninitialized state.\n\t */\n\tvoid reset() {\n\t\telt.reset();\n\t\tbwrow = toff = (index_t)OFF_MASK;\n\t}\n\n\t/**\n\t * Initialize this WalkResult.\n\t */\n\tvoid init(\n\t\tindex_t oi,  // seed offset index\n\t\tbool f,      // strand\n\t\tindex_t r,   // range\n\t\tindex_t e,   // element\n\t\tindex_t bwr, // BW row\n\t\tindex_t len, // length\n\t\tindex_t to)  // text offset\n\t{\n\t\telt.init(oi, f, r, e, len);\n\t\tbwrow = bwr;\n\t\ttoff = to;\n\t}\n\n\tGWElt<index_t> elt;   // element resolved\n\tindex_t        bwrow; // SA row resolved\n\tindex_t        toff;  // resolved offset from SA sample\n};\n\n/**\n * A GW hit encapsulates an SATuple describing a reference substring\n * in the cache, along with a bool indicating whether each element of\n * the hit has been reported yet.\n */\ntemplate<typename index_t, typename T>\nclass GWHit {\n\npublic:\n\tGWHit() :\n\t\tfmap(0, GW_CAT),\n\t\toffidx((index_t)OFF_MASK),\n\t\tfw(false),\n\t\trange((index_t)OFF_MASK),\n\t\tlen((index_t)OFF_MASK),\n\t\treported_(0, GW_CAT),\n\t\tnrep_(0)\n\t{\n\t\tassert(repOkBasic());\n\t}\n\n\t/**\n\t * Initialize with a new SA range.  
Resolve the done vector so that\n\t * there's one bool per suffix array element.\n\t */\n\tvoid init(\n\t\tSARangeWithOffs<T>& sa,\n\t\tindex_t oi,\n\t\tbool f,\n\t\tindex_t r)\n\t{\n\t\tnrep_ = 0;\n\t\toffidx = oi;\n\t\tfw = f;\n\t\trange = r;\n\t\tlen = (index_t)sa.len;\n\t\treported_.resize(sa.offs.size());\n\t\treported_.fill(false);\n\t\tfmap.resize(sa.offs.size());\n\t\tfmap.fill(make_pair((index_t)OFF_MASK, (index_t)OFF_MASK));\n\t}\n\t\n\t/**\n\t * Clear contents of sat and done.\n\t */\n\tvoid reset() {\n\t\treported_.clear();\n\t\tfmap.clear();\n\t\tnrep_ = 0;\n\t\toffidx = (index_t)OFF_MASK;\n\t\tfw = false;\n\t\trange = (index_t)OFF_MASK;\n\t\tlen = (index_t)OFF_MASK;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that GWHit is internally consistent.  If a pointer to an\n\t * EList of GWStates is given, we assume that it is the EList\n\t * corresponding to this GWHit and check whether the forward and\n\t * reverse mappings match up for the as-yet-unresolved elements.\n\t */\n\tbool repOk(const SARangeWithOffs<T>& sa) const {\n\t\tassert_eq(reported_.size(), sa.offs.size());\n\t\tassert_eq(fmap.size(), sa.offs.size());\n\t\t// Shouldn't be any repeats among as-yet-unresolveds\n\t\tsize_t nrep = 0;\n\t\tfor(size_t i = 0; i < fmap.size(); i++) {\n\t\t\tif(reported_[i]) nrep++;\n\t\t\tif(sa.offs[i] != (index_t)OFF_MASK) {\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tfor(size_t j = i+1; j < fmap.size(); j++) {\n\t\t\t\tif(sa.offs[j] != (index_t)OFF_MASK) {\n\t\t\t\t\tcontinue;\n\t\t\t\t}\n\t\t\t\tassert(fmap[i] != fmap[j]);\n\t\t\t}\n\t\t}\n\t\tassert_eq(nrep_, nrep);\n\t\treturn true;\n\t}\n\n\t/**\n\t * Return true iff this GWHit is not obviously corrupt.\n\t */\n\tbool repOkBasic() {\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Set the ith element to be reported.\n\t */\n\tvoid setReported(index_t i) {\n\t\tassert(!reported_[i]);\n\t\tassert_lt(i, reported_.size());\n\t\treported_[i] = true;\n\t\tnrep_++;\n\t}\n\t\n\t/**\n\t * Return true iff element i has been 
reported.\n\t */\n\tbool reported(index_t i) const {\n\t\tassert_lt(i, reported_.size());\n\t\treturn reported_[i];\n\t}\n\t\n\t/**\n\t * Return true iff all elements have been reported.\n\t */\n\tbool done() const {\n\t\tassert_leq(nrep_, reported_.size());\n\t\treturn nrep_ == reported_.size();\n\t}\n\n\tEList<std::pair<index_t, index_t>, 16> fmap; // forward map; to GWState & elt\n\tindex_t offidx; // offset idx\n\tbool fw;         // orientation\n\tindex_t range;  // original range index\n\tindex_t len;    // length of hit\n\nprotected:\n\n\tEList<bool, 16> reported_; // per-elt bool indicating whether it's been reported\n\tindex_t nrep_;\n};\n\n/**\n * Encapsulates the progress made along a particular path from the original\n * range.\n */\ntemplate<typename index_t, typename T>\nclass GWState {\n\t\npublic:\n\n\tGWState() : map_(0, GW_CAT) {\n\t\treset(); assert(repOkBasic());\n\t}\n\t\n\t/**\n\t * Initialize this GWState with new ebwt, top, bot, step, and sat.\n\t *\n\t * We assume map is already set up.\n\t *\n\t * Returns true iff at least one elt was resolved.\n\t */\n\ttemplate<int S>\n\tpair<int, int> init(\n\t\tconst Ebwt<index_t>& ebwt,    // index to walk left in\n\t\tconst BitPairReference& ref,  // bitpair-encoded reference\n\t\tSARangeWithOffs<T>& sa,       // SA range with offsets\n\t\tEList<GWState, S>& sts,       // EList of GWStates for range being advanced\n\t\tGWHit<index_t, T>& hit,       // Corresponding hit structure\n\t\tindex_t range,                // which range is this?\n\t\tbool reportList,              // if true, \"report\" resolved offsets immediately by adding them to 'res' list\n\t\tEList<WalkResult<index_t>, 16>* res,   // EList where resolved offsets should be appended\n\t\tindex_t tp,                   // top of range at this step\n\t\tindex_t bt,                   // bot of range at this step\n\t\tindex_t st,                   // # steps taken to get to this step\n\t\tWalkMetrics& met)\n\t{\n\t\tassert_gt(bt, 
tp);\n\t\tassert_lt(range, sts.size());\n\t\ttop = tp;\n\t\tbot = bt;\n\t\tstep = (int)st;\n\t\tassert(!inited_);\n\t\tASSERT_ONLY(inited_ = true);\n\t\tASSERT_ONLY(lastStep_ = step-1);\n\t\treturn init(ebwt, ref, sa, sts, hit, range, reportList, res, met);\n\t}\n\n\t/**\n\t * Initialize this GWState.\n\t *\n\t * We assume map is already set up, and that 'step' is equal to the\n\t * number of steps taken to get to the new top/bot pair *currently*\n\t * in the top and bot fields.\n\t *\n\t * Returns a pair of numbers, the first being the number of\n\t * resolved but unreported offsets found during this advance, the\n\t * second being the number of as-yet-unresolved offsets.\n\t */\n\ttemplate<int S>\n\tpair<int, int> init(\n\t\tconst Ebwt<index_t>& ebwt,    // forward Bowtie index\n\t\tconst BitPairReference& ref,  // bitpair-encoded reference\n\t\tSARangeWithOffs<T>& sa,       // SA range with offsets\n\t\tEList<GWState, S>& st,        // EList of GWStates for advancing range\n\t\tGWHit<index_t, T>& hit,       // Corresponding hit structure\n\t\tindex_t range,                // range being inited\n\t\tbool reportList,              // report resolutions, adding to 'res' list?\n\t\tEList<WalkResult<index_t>, 16>* res,   // EList to append resolutions\n\t\tWalkMetrics& met)             // update these metrics\n\t{\n\t\tassert(inited_);\n\t\tassert_eq(step, lastStep_+1);\n\t\tASSERT_ONLY(lastStep_++);\n\t\tassert_leq((index_t)step, ebwt.eh().len());\n\t\tassert_lt(range, st.size());\n\t\tpair<int, int> ret = make_pair(0, 0);\n\t\tindex_t trimBegin = 0, trimEnd = 0;\n\t\tbool empty = true; // assume all resolved until proven otherwise\n\t\t// Commit new information, if any, to the PListSlide.  
Also,\n\t\t// trim and check if we're done.\n\t\tfor(size_t i = mapi_; i < map_.size(); i++) {\n\t\t\tbool resolved = (off(i, sa) != (index_t)OFF_MASK);\n\t\t\tif(!resolved) {\n\t\t\t\t// Elt not resolved yet; try to resolve it now\n\t\t\t\tindex_t bwrow = (index_t)(top - mapi_ + i);\n\t\t\t\tindex_t toff = ebwt.tryOffset(bwrow);\n\t\t\t\tASSERT_ONLY(index_t origBwRow = sa.topf + map(i));\n\t\t\t\tassert_eq(bwrow, ebwt.walkLeft(origBwRow, step));\n\t\t\t\tif(toff != (index_t)OFF_MASK) {\n\t\t\t\t\t// Yes, toff was resolvable\n\t\t\t\t\tassert_eq(toff, ebwt.getOffset(bwrow));\n\t\t\t\t\tmet.resolves++;\n#ifdef CENTRIFUGE\n#else\n\t\t\t\t\ttoff += step;\n                    assert_eq(toff, ebwt.getOffset(origBwRow));\n#endif\n\t\t\t\t\tsetOff(i, toff, sa, met);\n\t\t\t\t\tif(!reportList) ret.first++;\n#if 0\n// used to be #ifndef NDEBUG, but since we no longer require that the reference\n// string info be included, this is no longer relevant.\n\n\t\t\t\t\t// Sanity check that the reference characters under this\n\t\t\t\t\t// hit match the seed characters in hit.satup->key.seq.\n\t\t\t\t\t// This is NOT a check that we associated the exact right\n\t\t\t\t\t// text offset with the BW row.  
This is an important\n\t\t\t\t\t// distinction because when resolved offsets are filled in\n\t\t\t\t\t// via reference scanning, they are not necessarily the\n\t\t\t\t\t// exact right text offsets to associate with the\n\t\t\t\t\t// respective BW rows but they WILL all be correct w/r/t\n\t\t\t\t\t// the reference sequence underneath, which is what really\n\t\t\t\t\t// matters here.\n\t\t\t\t\tindex_t tidx = (index_t)OFF_MASK, tof, tlen;\n\t\t\t\t\tbool straddled = false;\n\t\t\t\t\tebwt.joinedToTextOff(\n\t\t\t\t\t\thit.len, // length of seed\n\t\t\t\t\t\ttoff,    // offset in joined reference string\n\t\t\t\t\t\ttidx,    // reference sequence id\n\t\t\t\t\t\ttof,     // offset in reference coordinates\n\t\t\t\t\t\ttlen,    // length of reference sequence\n\t\t\t\t\t\ttrue,    // don't reject straddlers\n\t\t\t\t\t\tstraddled);\n\t\t\t\t\tif(tidx != (index_t)OFF_MASK &&\n\t\t\t\t\t   hit.satup->key.seq != std::numeric_limits<uint64_t>::max())\n\t\t\t\t\t{\n\t\t\t\t\t\t// key: 2-bit characters packed into a 64-bit word with\n\t\t\t\t\t\t// the least significant bitpair corresponding to the\n\t\t\t\t\t\t// rightmost character on the Watson reference strand.\n\t\t\t\t\t\tuint64_t key = hit.satup->key.seq;\n\t\t\t\t\t\tfor(int64_t j = tof + hit.len-1; j >= tof; j--) {\n\t\t\t\t\t\t\t// Get next reference base to the left\n\t\t\t\t\t\t\tint c = ref.getBase(tidx, j);\n\t\t\t\t\t\t\tassert_range(0, 3, c);\n\t\t\t\t\t\t\t// Must equal least significant bitpair of key\n\t\t\t\t\t\t\tif(c != (int)(key & 3)) {\n\t\t\t\t\t\t\t\t// Oops; when we jump to the piece of the\n\t\t\t\t\t\t\t\t// reference where the seed hit is, it doesn't\n\t\t\t\t\t\t\t\t// match the seed hit.  
Before dying, check\n\t\t\t\t\t\t\t\t// whether we have the right spot in the joined\n\t\t\t\t\t\t\t\t// reference string\n\t\t\t\t\t\t\t\tSString<char> jref;\n\t\t\t\t\t\t\t\tebwt.restore(jref);\n\t\t\t\t\t\t\t\tuint64_t key2 = hit.satup->key.seq;\n\t\t\t\t\t\t\t\tfor(int64_t k = toff + hit.len-1; k >= toff; k--) {\n\t\t\t\t\t\t\t\t\tint c = jref[k];\n\t\t\t\t\t\t\t\t\tassert_range(0, 3, c);\n\t\t\t\t\t\t\t\t\tassert_eq(c, (int)(key2 & 3));\n\t\t\t\t\t\t\t\t\tkey2 >>= 2;\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tassert(false);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tkey >>= 2;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n#endif\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Is the element resolved?  We ask this regardless of how it was\n\t\t\t// resolved (whether this function did it just now, whether it did\n\t\t\t// it a while ago, or whether some other function outside GroupWalk\n\t\t\t// did it).\n\t\t\tif(off(i, sa) != (index_t)OFF_MASK) {\n\t\t\t\tif(reportList && !hit.reported(map(i))) {\n\t\t\t\t\t// Report it\n\t\t\t\t\tindex_t toff = off(i, sa);\n\t\t\t\t\tassert(res != NULL);\n\t\t\t\t\tres->expand();\n\t\t\t\t\tindex_t origBwRow = sa.topf + map(i);\n\t\t\t\t\tres->back().init(\n\t\t\t\t\t\thit.offidx, // offset idx\n\t\t\t\t\t\thit.fw,     // orientation\n\t\t\t\t\t\thit.range,  // original range index\n\t\t\t\t\t\tmap(i),     // original element offset\n\t\t\t\t\t\torigBwRow,  // BW row resolved\n\t\t\t\t\t\thit.len,    // hit length\n\t\t\t\t\t\ttoff);      // text offset\n\t\t\t\t\thit.setReported(map(i));\n\t\t\t\t\tmet.reports++;\n\t\t\t\t}\n\t\t\t\t// Offset resolved\n\t\t\t\tif(empty) {\n\t\t\t\t\t// Haven't seen a non-empty entry yet, so we\n\t\t\t\t\t// can trim this from the beginning.\n\t\t\t\t\ttrimBegin++;\n\t\t\t\t} else {\n\t\t\t\t\ttrimEnd++;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// Offset not yet resolved\n\t\t\t\tret.second++;\n\t\t\t\ttrimEnd = 0;\n\t\t\t\tempty = false;\n\t\t\t\t// Set the forward map in the corresponding GWHit\n\t\t\t\t// object to point to the appropriate 
element of our\n\t\t\t\t// range\n\t\t\t\tassert_geq(i, mapi_);\n\t\t\t\tindex_t bmap = map(i);\n\t\t\t\thit.fmap[bmap].first = range;\n\t\t\t\thit.fmap[bmap].second = (index_t)i;\n#ifndef NDEBUG\n\t\t\t\tfor(size_t j = 0; j < bmap; j++) {\n\t\t\t\t\tif(sa.offs[j] == (index_t)OFF_MASK &&\n\t\t\t\t\t   hit.fmap[j].first == range)\n\t\t\t\t\t{\n\t\t\t\t\t\tassert_neq(i, hit.fmap[j].second);\n\t\t\t\t\t}\n\t\t\t\t}\n#endif\n\t\t\t}\n\t\t}\n\t\t// Trim from beginning\n\t\tassert_geq(trimBegin, 0);\n\t\tmapi_ += trimBegin;\n\t\ttop += trimBegin;\n\t\tif(trimEnd > 0) {\n\t\t\t// Trim from end\n\t\t\tmap_.resize(map_.size() - trimEnd);\n\t\t\tbot -= trimEnd;\n\t\t}\n\t\tif(empty) {\n\t\t\tassert(done());\n#ifndef NDEBUG\n\t\t\t// If range is done, all elements from map should be\n\t\t\t// resolved\n\t\t\tfor(size_t i = mapi_; i < map_.size(); i++) {\n\t\t\t\tassert_neq((index_t)OFF_MASK, off(i, sa));\n\t\t\t}\n\t\t\t// If this range is done, then it should be the case that\n\t\t\t// all elements in the corresponding GWHit that point to\n\t\t\t// this range are resolved.\n\t\t\tfor(size_t i = 0; i < hit.fmap.size(); i++) {\n\t\t\t\tif(sa.offs[i] == (index_t)OFF_MASK) {\n\t\t\t\t\tassert_neq(range, hit.fmap[i].first);\n\t\t\t\t}\n\t\t\t}\n#endif\n\t\t\treturn ret;\n\t\t} else {\n\t\t\tassert(!done());\n\t\t}\n\t\t// Is there a dollar sign in the middle of the range?\n\t\tassert_neq(top, ebwt._zOff);\n\t\tassert_neq(bot-1, ebwt._zOff);\n\t\tif(ebwt._zOff > top && ebwt._zOff < bot-1) {\n\t\t\t// Yes, the dollar sign is in the middle of this range.  We\n\t\t\t// must split it into the two ranges on either side of the\n\t\t\t// dollar.  
Let 'bot' and 'top' delimit the portion of the\n\t\t\t// range prior to the dollar.\n\t\t\tindex_t oldbot = bot;\n\t\t\tbot = ebwt._zOff;\n\t\t\t// Note: might be able to do additional trimming off the\n\t\t\t// end.\n\t\t\t// Create a new range for the portion after the dollar.\n\t\t\tst.expand();\n\t\t\tst.back().reset();\n\t\t\tindex_t ztop = ebwt._zOff+1;\n\t\t\tst.back().initMap(oldbot - ztop);\n\t\t\tassert_eq((index_t)map_.size(), oldbot-top+mapi_);\n\t\t\tfor(index_t i = ztop; i < oldbot; i++) {\n\t\t\t\tst.back().map_[i - ztop] = map(i-top+mapi_);\n\t\t\t}\n\t\t\tmap_.resize(bot - top + mapi_);\n\t\t\tst.back().init(\n\t\t\t\tebwt,\n\t\t\t\tref,\n\t\t\t\tsa,\n\t\t\t\tst,\n\t\t\t\thit,\n\t\t\t\t(index_t)st.size()-1,\n\t\t\t\treportList,\n\t\t\t\tres,\n\t\t\t\tztop,\n\t\t\t\toldbot,\n\t\t\t\tstep,\n\t\t\t\tmet);\n\t\t}\n\t\tassert_gt(bot, top);\n\t\t// Prepare SideLocus's for next step\n\t\tif(bot-top > 1) {\n\t\t\tSideLocus<index_t>::initFromTopBot(top, bot, ebwt.eh(), ebwt.ebwt(), tloc, bloc);\n\t\t\tassert(tloc.valid()); assert(tloc.repOk(ebwt.eh()));\n\t\t\tassert(bloc.valid()); assert(bloc.repOk(ebwt.eh()));\n\t\t} else {\n\t\t\ttloc.initFromRow(top, ebwt.eh(), ebwt.ebwt());\n\t\t\tassert(tloc.valid()); assert(tloc.repOk(ebwt.eh()));\n\t\t\tbloc.invalidate();\n\t\t}\n\t\treturn ret;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check if this GWP is internally consistent.\n\t */\n\tbool repOk(\n\t\tconst Ebwt<index_t>& ebwt,\n\t\tGWHit<index_t, T>& hit,\n\t\tindex_t range) const\n\t{\n\t\tassert(done() || bot > top);\n\t\tassert(doneResolving(hit) || (tloc.valid() && tloc.repOk(ebwt.eh())));\n\t\tassert(doneResolving(hit) || bot == top+1 || (bloc.valid() && bloc.repOk(ebwt.eh())));\n\t\tassert_eq(map_.size()-mapi_, bot-top);\n\t\t// Make sure that 'done' is compatible with whether we have >=\n\t\t// 1 elements left to resolve.\n\t\tint left = 0;\n\t\tfor(size_t i = mapi_; i < map_.size(); i++) {\n\t\t\tASSERT_ONLY(index_t row = (index_t)(top + i - 
mapi_));\n\t\t\tASSERT_ONLY(index_t origRow = hit.satup->topf + map(i));\n\t\t\tassert(step == 0 || row != origRow);\n\t\t\tassert_eq(row, ebwt.walkLeft(origRow, step));\n\t\t\tassert_lt(map_[i], hit.satup->offs.size());\n\t\t\tif(off(i, hit) == (index_t)OFF_MASK) left++;\n\t\t}\n\t\tassert(repOkMapRepeats());\n\t\tassert(repOkMapInclusive(hit, range));\n\t\treturn true;\n\t}\n\t\n\t/**\n\t * Return true iff this GWState is not obviously corrupt.\n\t */\n\tbool repOkBasic() {\n\t\tassert_geq(bot, top);\n\t\treturn true;\n\t}\n\n\t/**\n\t * Check that the fmap elements pointed to by our map_ include all\n\t * of the fmap elements that point to this range.\n\t */\n\tbool repOkMapInclusive(GWHit<index_t, T>& hit, index_t range) const {\n\t\tfor(size_t i = 0; i < hit.fmap.size(); i++) {\n\t\t\tif(hit.satup->offs[i] == (index_t)OFF_MASK) {\n\t\t\t\tif(range == hit.fmap[i].first) {\n\t\t\t\t\tASSERT_ONLY(bool found = false);\n\t\t\t\t\tfor(size_t j = mapi_; j < map_.size(); j++) {\n\t\t\t\t\t\tif(map(j) == i) {\n\t\t\t\t\t\t\tASSERT_ONLY(found = true);\n\t\t\t\t\t\t\tbreak;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tassert(found);\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n\t\n\t/**\n\t * Check that no two elements in map_ are the same.\n\t */\n\tbool repOkMapRepeats() const {\n\t\tfor(size_t i = mapi_; i < map_.size(); i++) {\n\t\t\tfor(size_t j = i+1; j < map_.size(); j++) {\n\t\t\t\tassert_neq(map_[i], map_[j]);\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Return the offset currently assigned to the ith element.  
If it\n\t * has not yet been resolved, return 0xffffffff.\n\t */\n\tindex_t off(\n\t\t\t\tindex_t i,\n\t\t\t\tconst SARangeWithOffs<T>& sa)\n\t{\n\t\tassert_geq(i, mapi_);\n\t\tassert_lt(i, map_.size());\n\t\tassert_lt(map_[i], sa.offs.size());\n\t\treturn sa.offs.get(map_[i]);\n\t}\n\n\t/**\n\t * Return the offset of the element within the original range's\n\t * PListSlice that the ith element of this range corresponds to.\n\t */\n\tindex_t map(index_t i) const {\n\t\tassert_geq(i, mapi_);\n\t\tassert_lt(i, map_.size());\n\t\treturn map_[i];\n\t}\n\n\t/**\n\t * Return the offset of the first untrimmed offset in the map.\n\t */\n\tindex_t mapi() const {\n\t\treturn mapi_;\n\t}\n\n\t/**\n\t * Return number of active elements in the range being tracked by\n\t * this GWState.\n\t */\n\tindex_t size() const {\n\t\treturn map_.size() - mapi_;\n\t}\n\t\n\t/**\n\t * Return true iff all elements in this leaf range have been\n\t * resolved.\n\t */\n\tbool done() const {\n\t\treturn size() == 0;\n\t}\n\n\t/**\n\t * Set the PListSlice element that corresponds to the ith element\n\t * of 'map' to the specified offset.\n\t */\n\tvoid setOff(\n\t\tindex_t i,\n\t\tindex_t off,\n\t\tSARangeWithOffs<T>& sa,\n\t\tWalkMetrics& met)\n\t{\n\t\tassert_lt(i + mapi_, map_.size());\n\t\tassert_lt(map_[i + mapi_], sa.offs.size());\n\t\tsize_t saoff = map_[i + mapi_];\n\t\tsa.offs[saoff] = off;\n\t\tassert_eq(off, sa.offs[saoff]);\n\t}\n\n\t/**\n\t * Advance this GWState by one step (i.e. one BW operation).  In\n\t * the event of a \"split\", more elements are added to the EList\n\t * 'st', which must have room for at least 3 more elements without\n\t * needing another expansion.  
If an expansion of 'st' is\n\t * triggered, this GWState object becomes invalid.\n\t *\n\t * Returns a pair of numbers, the first being the number of\n\t * resolved but unreported offsets found during this advance, the\n\t * second being the number of as-yet-unresolved offsets.\n\t */\n\ttemplate <int S>\n\tpair<int, int> advance(\n\t\tconst Ebwt<index_t>& ebwt,   // the forward Bowtie index, for stepping left\n\t\tconst BitPairReference& ref, // bitpair-encoded reference\n\t\tSARangeWithOffs<T>& sa,      // SA range with offsets\n\t\tGWHit<index_t, T>& hit,      // the associated GWHit object\n\t\tindex_t range,               // which range is this?\n\t\tbool reportList,             // if true, \"report\" resolved offsets immediately by adding them to 'res' list\n\t\tEList<WalkResult<index_t>, 16>* res,  // EList where resolved offsets should be appended\n\t\tEList<GWState, S>& st,       // EList of GWStates for range being advanced\n\t\tGroupWalkState<index_t>& gws,         // temporary storage for masks\n\t\tWalkMetrics& met,\n\t\tPerReadMetrics& prm)\n\t{\n\t\tASSERT_ONLY(index_t origTop = top);\n\t\tASSERT_ONLY(index_t origBot = bot);\n\t\tassert_geq(step, 0);\n\t\tassert_eq(step, lastStep_);\n\t\tassert_geq(st.capacity(), st.size() + 4);\n\t\tassert(tloc.valid()); assert(tloc.repOk(ebwt.eh()));\n\t\tassert_eq(bot-top, (index_t)(map_.size()-mapi_));\n\t\tpair<int, int> ret = make_pair(0, 0);\n\t\tassert_eq(top, tloc.toBWRow());\n\t\tif(bloc.valid()) {\n\t\t\t// Still multiple elements being tracked\n\t\t\tassert_lt(top+1, bot);\n\t\t\tindex_t upto[4], in[4];\n\t\t\tupto[0] = in[0] = upto[1] = in[1] =\n\t\t\tupto[2] = in[2] = upto[3] = in[3] = 0;\n\t\t\tassert_eq(bot, bloc.toBWRow());\n\t\t\tmet.bwops++;\n\t\t\tprm.nExFmops++;\n\t\t\t// Assert that there's not a dollar sign in the middle of\n\t\t\t// this range\n\t\t\tassert(bot <= ebwt._zOff || top > ebwt._zOff);\n\t\t\tebwt.mapLFRange(tloc, bloc, bot-top, upto, in, gws.masks);\n#ifndef NDEBUG\n\t\t\tfor(int i 
= 0; i < 4; i++) {\n\t\t\t  assert_eq(bot-top, (index_t)(gws.masks[i].size()));\n\t\t\t}\n#endif\n\t\t\tbool first = true;\n\t\t\tASSERT_ONLY(index_t sum = 0);\n\t\t\tindex_t newtop = 0, newbot = 0;\n\t\t\tgws.map.clear();\n\t\t\tfor(int i = 0; i < 4; i++) {\n\t\t\t\tif(in[i] > 0) {\n\t\t\t\t\t// Non-empty range resulted\n\t\t\t\t\tif(first) {\n\t\t\t\t\t\t// For the first one, \n\t\t\t\t\t\tfirst = false;\n\t\t\t\t\t\tnewtop = upto[i];\n\t\t\t\t\t\tnewbot = newtop + in[i];\n\t\t\t\t\t\tassert_leq(newbot-newtop, bot-top);\n\t\t\t\t\t\t// Range narrowed so we have to look at the masks\n\t\t\t\t\t\tfor(size_t j = 0; j < gws.masks[i].size(); j++) {\n\t\t\t\t\t\t\tassert_lt(j+mapi_, map_.size());\n\t\t\t\t\t\t\tif(gws.masks[i][j]) {\n\t\t\t\t\t\t\t\tgws.map.push_back(map_[j+mapi_]);\n\t\t\t\t\t\t\t\tassert(gws.map.size() <= 1 || gws.map.back() != gws.map[gws.map.size()-2]);\n#ifndef NDEBUG\n\t\t\t\t\t\t\t\t// If this element is not yet resolved,\n\t\t\t\t\t\t\t\t// then check that it really is the\n\t\t\t\t\t\t\t\t// expected number of steps to the left\n\t\t\t\t\t\t\t\t// of the corresponding element in the\n\t\t\t\t\t\t\t\t// root range\n\t\t\t\t\t\t\t\tassert_lt(gws.map.back(), sa.size());\n\t\t\t\t\t\t\t\tif(sa.offs[gws.map.back()] == (index_t)OFF_MASK) {\n\t\t\t\t\t\t\t\t\tassert_eq(newtop + gws.map.size() - 1,\n\t\t\t\t\t\t\t\t\t\t\t  ebwt.walkLeft(sa.topf + gws.map.back(), step+1));\n\t\t\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n \t\t\t\t\t\tassert_eq(newbot-newtop, (index_t)(gws.map.size()));\n\t\t\t\t\t} else {\n\t\t\t\t\t\t// For each beyond the first, create a new\n\t\t\t\t\t\t// GWState and add it to the GWState list. 
\n\t\t\t\t\t\t// NOTE: this can cause the underlying list to\n\t\t\t\t\t\t// be expanded which in turn might leave 'st'\n\t\t\t\t\t\t// pointing to bad memory.\n\t\t\t\t\t\tst.expand();\n\t\t\t\t\t\tst.back().reset();\n\t\t\t\t\t\tindex_t ntop = upto[i];\n\t\t\t\t\t\tindex_t nbot = ntop + in[i];\n\t\t\t\t\t\tassert_lt(nbot-ntop, bot-top);\n\t\t\t\t\t\tst.back().mapi_ = 0;\n\t\t\t\t\t\tst.back().map_.clear();\n\t\t\t\t\t\tmet.branches++;\n\t\t\t\t\t\t// Range narrowed so we have to look at the masks\n\t\t\t\t\t\tfor(size_t j = 0; j < gws.masks[i].size(); j++) {\n\t\t\t\t\t\t\tif(gws.masks[i][j]) st.back().map_.push_back(map_[j+mapi_]);\n\t\t\t\t\t\t}\n\t\t\t\t\t\tpair<int, int> rret =\n\t\t\t\t\t\tst.back().init(\n\t\t\t\t\t\t\tebwt,        // forward Bowtie index\n\t\t\t\t\t\t\tref,         // bitpair-encoded reference\n\t\t\t\t\t\t\tsa,          // SA range with offsets\n\t\t\t\t\t\t\tst,          // EList of all GWStates associated with original range\n\t\t\t\t\t\t\thit,         // associated GWHit object\n\t\t\t\t\t\t\t(index_t)st.size()-1, // range offset\n\t\t\t\t\t\t\treportList,  // if true, report hits to 'res' list\n\t\t\t\t\t\t\tres,         // report hits here if reportList is true\n\t\t\t\t\t\t\tntop,        // BW top of new range\n\t\t\t\t\t\t\tnbot,        // BW bot of new range\n\t\t\t\t\t\t\tstep+1,      // # steps taken to get to this new range\n\t\t\t\t\t\t\tmet);        // update these metrics\n\t\t\t\t\t\tret.first += rret.first;\n\t\t\t\t\t\tret.second += rret.second;\n\t\t\t\t\t}\n\t\t\t\t\tASSERT_ONLY(sum += in[i]);\n\t\t\t\t}\n\t\t\t}\n\t\t\tmapi_ = 0;\n\t\t\tassert_eq(bot-top, sum);\n\t\t\tassert_gt(newbot, newtop);\n\t\t\tassert_leq(newbot-newtop, bot-top);\n\t\t\tassert(top != newtop || bot != newbot);\n\t\t\t//assert(!(newtop < top && newbot > top));\n\t\t\ttop = newtop;\n\t\t\tbot = newbot;\n\t\t\tif(!gws.map.empty()) {\n\t\t\t\tmap_ = gws.map;\n\t\t\t}\n\t\t\t//assert(repOkMapRepeats());\n\t\t\t//assert(repOkMapInclusive(hit, 
range));\n\t\t\tassert_eq(bot-top, (index_t)map_.size());\n\t\t} else {\n\t\t\t// Down to one element\n\t\t\tassert_eq(bot, top+1);\n\t\t\tassert_eq(1, map_.size()-mapi_);\n\t\t\t// Sets top, returns char walked through (which we ignore)\n\t\t\tASSERT_ONLY(index_t oldtop = top);\n\t\t\tmet.bwops++;\n\t\t\tprm.nExFmops++;\n\t\t\tebwt.mapLF1(top, tloc);\n\t\t\tassert_neq(top, oldtop);\n\t\t\tbot = top+1;\n\t\t\tif(mapi_ > 0) {\n\t\t\t\tmap_[0] = map_[mapi_];\n\t\t\t\tmapi_ = 0;\n\t\t\t}\n\t\t\tmap_.resize(1);\n\t\t}\n\t\tassert(top != origTop || bot != origBot);\n\t\tstep++;\n\t\tassert_gt(step, 0);\n\t\tassert_leq((index_t)step, ebwt.eh().len());\n\t\tpair<int, int> rret =\n\t\tinit<S>(\n\t\t\tebwt,       // forward Bowtie index\n\t\t\tref,        // bitpair-encoded reference\n\t\t\tsa,         // SA range with offsets\n\t\t\tst,         // EList of all GWStates associated with original range\n\t\t\thit,        // associated GWHit object\n\t\t\trange,      // range offset\n\t\t\treportList, // if true, report hits to 'res' list\n\t\t\tres,        // report hits here if reportList is true\n\t\t\tmet);       // update these metrics\n\t\tret.first += rret.first;\n\t\tret.second += rret.second;\n\t\treturn ret;\n\t}\n\n\t/**\n\t * Clear all state in preparation for the next walk.\n\t */\n\tvoid reset() {\n\t\ttop = bot = step = mapi_ = 0;\n\t\tASSERT_ONLY(lastStep_ = -1);\n\t\tASSERT_ONLY(inited_ = false);\n\t\ttloc.invalidate();\n\t\tbloc.invalidate();\n\t\tmap_.clear();\n\t}\n\t\n\t/**\n\t * Resize the map_ field to the given size and initialize it to the\n\t * identity mapping (element i maps to i).\n\t */\n\tvoid initMap(size_t newsz) {\n\t\tmapi_ = 0;\n\t\tmap_.resize(newsz);\n\t\tfor(size_t i = 0; i < newsz; i++) {\n\t\t\tmap_[i] = (index_t)i;\n\t\t}\n\t}\n\n\t/**\n\t * Return true iff all rows corresponding to this GWState have been\n\t * resolved and reported.\n\t */\n\tbool doneReporting(const GWHit<index_t, T>& hit) const {\n\t\tfor(size_t i = mapi_; i < map_.size(); i++) {\n\t\t\tif(!hit.reported(map(i))) return 
false;\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Return true iff all rows corresponding to this GWState have been\n\t * resolved (but not necessarily reported).\n\t */\n\tbool doneResolving(const SARangeWithOffs<T>& sa) const {\n\t\tfor(size_t i = mapi_; i < map_.size(); i++) {\n\t\t\tif(sa.offs[map(i)] == (index_t)OFF_MASK) return false;\n\t\t}\n\t\treturn true;\n\t}\n\n\tSideLocus<index_t> tloc;      // SideLocus for top\n\tSideLocus<index_t> bloc;      // SideLocus for bottom\n\tindex_t            top;       // top elt of range in BWT\n\tindex_t            bot;       // bot elt of range in BWT\n\tint                step;      // how many steps have we walked to the left so far\n\nprotected:\n\t\n\tASSERT_ONLY(bool inited_);\n\tASSERT_ONLY(int lastStep_);\n\tEList<index_t, 16> map_; // which elts in range 'range' we're tracking\n\tindex_t mapi_;           // first untrimmed element of map\n};\n\ntemplate<typename index_t, typename T, int S>\nclass GroupWalk2S {\npublic:\n\ttypedef EList<GWState<index_t, T>, S> TStateV;\n\n\tGroupWalk2S() : st_(8, GW_CAT) {\n\t\treset();\n\t}\n\t\n\t/**\n\t * Reset the GroupWalk in preparation for the next SeedResults.\n\t */\n\tvoid reset() {\n\t\telt_ = rep_ = 0;\n\t\tASSERT_ONLY(inited_ = false);\n\t}\n\n\t/**\n\t * Initialize a new group walk w/r/t a QVal object.\n\t */\n\tvoid init(\n\t\tconst Ebwt<index_t>& ebwtFw, // forward Bowtie index for walking left\n\t\tconst BitPairReference& ref, // bitpair-encoded reference\n\t\tSARangeWithOffs<T>& sa,      // SA range with offsets\n\t\tRandomSource& rnd,           // pseudo-random generator for sampling rows\n\t\tWalkMetrics& met)            // update metrics here\n\t{\n\t\treset();\n#ifndef NDEBUG\n\t\tinited_ = true;\n#endif\n\t\t// Init GWHit\n\t\thit_.init(sa, 0, false, 0);\n\t\t// Init corresponding GWState\n\t\tst_.resize(1);\n\t\tst_.back().reset();\n\t\tassert(st_.back().repOkBasic());\n\t\tindex_t top = sa.topf;\n\t\tindex_t bot = (index_t)(top + 
sa.size());\n\t\tst_.back().initMap(bot-top);\n\t\tst_.ensure(4);\n\t\tst_.back().init(\n\t\t\tebwtFw,             // Bowtie index\n\t\t\tref,                // bitpair-encoded reference\n\t\t\tsa,                 // SA range with offsets\n\t\t\tst_,                // EList<GWState>\n\t\t\thit_,               // GWHit\n\t\t\t0,                  // range 0\n\t\t\tfalse,              // put resolved elements into res_?\n\t\t\tNULL,               // put resolved elements here\n\t\t\ttop,                // BW row at top\n\t\t\tbot,                // BW row at bot\n\t\t\t0,                  // # steps taken\n\t\t\tmet);               // update metrics here\n\t\telt_ += sa.size();\n\t\tassert(hit_.repOk(sa));\n\t}\n\n\t//\n\t// ELEMENT-BASED\n\t//\n\n\t/**\n\t * Advance the GroupWalk until all elements have been resolved.\n\t * FIXME FB: Commented as the types of advanceElements do not correlate with the types of the function definition.\n\t */\n//\tvoid resolveAll(WalkMetrics& met, PerReadMetrics& prm) {\n//\t\tWalkResult<index_t> res; // ignore results for now\n//\t\tfor(size_t i = 0; i < elt_; i++) {\n//\t\t\tadvanceElement((index_t)i, res, met, prm);\n//\t\t}\n//\t}\n\n\t/**\n\t * Advance the GroupWalk until the specified element has been\n\t * resolved.\n\t */\n\tbool advanceElement(\n\t\tindex_t elt,                  // element within the range\n\t\tconst Ebwt<index_t>& ebwtFw,  // forward Bowtie index for walking left\n\t\tconst BitPairReference& ref,  // bitpair-encoded reference\n\t\tSARangeWithOffs<T>& sa,       // SA range with offsets\n\t\tGroupWalkState<index_t>& gws, // GroupWalk state; scratch space\n\t\tWalkResult<index_t>& res,     // put the result here\n\t\tWalkMetrics& met,             // metrics\n\t\tPerReadMetrics& prm)          // per-read metrics\n\t{\n\t\tassert(inited_);\n\t\tassert(!done());\n\t\tassert(hit_.repOk(sa));\n\t\tassert_lt(elt, sa.size()); // elt must fall within range\n\t\t// Until we've resolved our element of 
interest...\n\t\twhile(sa.offs[elt] == (index_t)OFF_MASK) {\n\t\t\t// Get the GWState that contains our element of interest\n\t\t\tsize_t range = hit_.fmap[elt].first;\n\t\t\tst_.ensure(4);\n\t\t\tGWState<index_t, T>& st = st_[range];\n\t\t\tassert(!st.doneResolving(sa));\n\t\t\t// Returns a pair of numbers, the first being the number of\n\t\t\t// resolved but unreported offsets found during this advance, the\n\t\t\t// second being the number of as-yet-unresolved offsets.\n\t\t\tst.advance(\n\t\t\t\tebwtFw,\n\t\t\t\tref,\n\t\t\t\tsa,\n\t\t\t\thit_,\n\t\t\t\t(index_t)range,\n\t\t\t\tfalse,\n\t\t\t\tNULL,\n\t\t\t\tst_,\n\t\t\t\tgws,\n\t\t\t\tmet,\n\t\t\t\tprm);\n\t\t\tassert(sa.offs[elt] != (index_t)OFF_MASK ||\n\t\t\t       !st_[hit_.fmap[elt].first].doneResolving(sa));\n\t\t}\n\t\tassert_neq((index_t)OFF_MASK, sa.offs[elt]);\n\t\t// Report it!\n\t\tif(!hit_.reported(elt)) {\n\t\t\thit_.setReported(elt);\n\t\t}\n\t\tmet.reports++;\n\t\tres.init(\n\t\t\t0,              // seed offset\n\t\t\tfalse,          // orientation\n\t\t\t0,              // range\n\t\t\telt,            // element\n\t\t\tsa.topf + elt,  // bw row\n\t\t\t(index_t)sa.len, // length of hit\n\t\t\tsa.offs[elt]);  // resolved text offset\n\t\trep_++;\n\t\treturn true;\n\t}\n\n\t/**\n\t * Return true iff all elements have been resolved and reported.\n\t */\n\tbool done() const { return rep_ == elt_; }\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that GroupWalk is internally consistent.\n\t */\n\tbool repOk(const SARangeWithOffs<T>& sa) const {\n\t\tassert(hit_.repOk(sa));\n\t\tassert_leq(rep_, elt_);\n\t\t// This is a lot of work\n\t\tsize_t resolved = 0, reported = 0;\n\t\t// For each element\n\t\tconst size_t sz = sa.size();\n\t\tfor(size_t m = 0; m < sz; m++) {\n\t\t\t// Is it resolved?\n\t\t\tif(sa.offs[m] != (index_t)OFF_MASK) {\n\t\t\t\tresolved++;\n\t\t\t} else {\n\t\t\t\tassert(!hit_.reported(m));\n\t\t\t}\n\t\t\t// Is it reported?\n\t\t\tif(hit_.reported(m)) 
{\n\t\t\t\treported++;\n\t\t\t}\n\t\t\tassert_geq(resolved, reported);\n\t\t}\n\t\tassert_geq(resolved, reported);\n\t\tassert_eq(rep_, reported);\n\t\tassert_eq(elt_, sz);\n\t\treturn true;\n\t}\n#endif\n\n\t/**\n\t * Return the number of BW elements that we can resolve.\n\t */\n\tindex_t numElts() const { return elt_; }\n\t\n\t/**\n\t * Return the size occupied by this GroupWalk and all its constituent\n\t * objects.\n\t */\n\tsize_t totalSizeBytes() const {\n\t\treturn 2 * sizeof(size_t) + st_.totalSizeBytes() + sizeof(GWHit<index_t, T>);\n\t}\n\t/**\n\t * Return the capacity of this GroupWalk and all its constituent objects.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn 2 * sizeof(size_t) + st_.totalCapacityBytes() + sizeof(GWHit<index_t, T>);\n\t}\n\t\n#ifndef NDEBUG\n\tbool initialized() const { return inited_; }\n#endif\n\t\nprotected:\n\n\tASSERT_ONLY(bool inited_);    // initialized?\n\t\n\tindex_t elt_;    // # BW elements under the control of the GroupWalk\n\tindex_t rep_;    // # BW elements reported\n\n\t// For each orientation and seed offset, keep a GWState object that\n\t// holds the state of the walk so far.\n\tTStateV st_;\n\n\t// For each orientation and seed offset, keep an EList of GWHit.\n\tGWHit<index_t, T> hit_;\n};\n\n#endif /*GROUP_WALK_H_*/\n"
  },
  {
    "path": "hi_aligner.h",
    "content": "/*\n * Copyright 2014, Daehwan Kim <infphilo@gmail.com>\n *\n * This file is part of HISAT.\n *\n * HISAT is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * HISAT is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with HISAT.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef HI_ALIGNER_H_\n#define HI_ALIGNER_H_\n\n#include <iostream>\n#include <utility>\n#include <limits>\n#include \"qual.h\"\n#include \"ds.h\"\n#include \"sstring.h\"\n#include \"alphabet.h\"\n#include \"edit.h\"\n#include \"read.h\"\n// Threading is necessary to synchronize the classes that dump\n// intermediate alignment results to files.  
Otherwise, all data herein\n// is constant and shared, or per-thread.\n#include \"threading.h\"\n#include \"aligner_result.h\"\n#include \"scoring.h\"\n#include \"mem_ids.h\"\n#include \"simple_func.h\"\n#include \"group_walk.h\"\n\n/**\n * Hit types for BWTHit class below\n * Three hit types to anchor a read on the genome\n *\n */\nenum {\n    CANDIDATE_HIT = 1,\n    PSEUDOGENE_HIT,\n    ANCHOR_HIT,\n};\n\n/**\n * Simple struct for holding a partial alignment for the read\n * The alignment locations are represented by FM offsets [top, bot),\n * and later genomic offsets are calculated when necessary\n */\ntemplate <typename index_t>\nstruct BWTHit {\n\t\n\tBWTHit() { reset(); }\n\t\n\tvoid reset() {\n\t\t_top = _bot = 0;\n\t\t_fw = true;\n\t\t_bwoff = (index_t)OFF_MASK;\n\t\t_len = 0;\n\t\t_coords.clear();\n        _anchor_examined = false;\n        _hit_type = CANDIDATE_HIT;\n\t}\n\t\n\tvoid init(\n\t\t\t  index_t top,\n\t\t\t  index_t bot,\n  \t\t\t  bool fw,\n\t\t\t  uint32_t bwoff,\n\t\t\t  uint32_t len,\n              index_t hit_type = CANDIDATE_HIT)\n\t{\n\t\t_top = top;\n        _bot = bot;\n\t\t_fw = fw;\n\t\t_bwoff = bwoff;\n\t\t_len = len;\n        _coords.clear();\n        _anchor_examined = false;\n        _hit_type = hit_type;\n\t}\n    \n    bool hasGenomeCoords() const { return !_coords.empty(); }\n\t\n\t/**\n\t * Return true iff there is no hit.\n\t */\n\tbool empty() const {\n\t\treturn _bot <= _top;\n\t}\n\t\n\t/**\n\t * Higher score = higher priority.\n\t */\n\tbool operator<(const BWTHit& o) const {\n\t\treturn _len > o._len;\n\t}\n\t\n\t/**\n\t * Return the size of the alignments SA ranges.\n\t */\n\tindex_t size() const {\n        assert_leq(_top, _bot);\n        return _bot - _top;\n    }\n    \n    index_t len() const {\n        // assert_gt(_len, 0);\n        return _len;\n    }\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that hit is sane w/r/t read.\n\t */\n\tbool repOk(const Read& rd) const {\n\t\tassert_gt(_bot, 
_top);\n\t\tassert_neq(_bwoff, (index_t)OFF_MASK);\n\t\tassert_gt(_len, 0);\n\t\treturn true;\n\t}\n#endif\n\t\n\tindex_t         _top;               // start of the range in the FM index\n\tindex_t         _bot;               // end of the range in the FM index\n\tbool            _fw;                // whether read is forward or reverse complemented\n\tindex_t         _bwoff;             // current base of a read to search from the right end\n\tindex_t         _len;               // read length\n\t\n    EList<Coord>    _coords;            // genomic offsets corresponding to [_top, _bot)\n    \n    bool            _anchor_examined;   // whether or not this hit is examined\n    index_t         _hit_type;          // hit type (anchor hit, pseudogene hit, or candidate hit)\n};\n\n\n/**\n * Simple struct for holding alignments for the read\n * The alignments are represented by chains of BWTHits\n */\ntemplate <typename index_t>\nstruct ReadBWTHit {\n\t\n\tReadBWTHit() { reset(); }\n\t\n\tvoid reset() {\n        _fw = true;\n\t\t_len = 0;\n        _cur = 0;\n        _done = false;\n        _numPartialSearch = 0;\n        _numUniqueSearch = 0;\n        _partialHits.clear();\n\t}\n\n\tvoid init(\n\t\t\t  bool fw,\n              index_t len)\n\t{\n        _fw = fw;\n        assert_gt(len, 0);\n        _len = len;\n        _cur = 0;\n        _done = false;\n        _numPartialSearch = 0;\n        _numUniqueSearch = 0;\n        _partialHits.clear();\n\t}\n    \n    bool done() {\n#ifndef NDEBUG\n        assert_gt(_len, 0);\n        if(_cur >= _len) {\n            assert(_done);\n        }\n#endif\n        return _done;\n    }\n    \n    void done(bool done) {\n        // assert(!_done);\n        assert(done);\n        _done = done;\n    }\n    \n    index_t len() const { return _len; }\n    index_t cur() const { return _cur; }\n    \n    size_t  offsetSize()             { return _partialHits.size(); }\n    size_t  numPartialSearch()       { return _numPartialSearch; }\n    
size_t  numActualPartialSearch()\n    {\n        assert_leq(_numUniqueSearch, _numPartialSearch);\n        return _numPartialSearch - _numUniqueSearch;\n    }\n    \n    // Width (number of SA rows) of the partial hit at the given index\n    index_t width(index_t offset_) {\n        assert_lt(offset_, _partialHits.size());\n        return _partialHits[offset_].size();\n    }\n    \n    bool hasGenomeCoords(index_t offset_) {\n        assert_lt(offset_, _partialHits.size());\n        index_t width_ = width(offset_);\n        if(width_ == 0) {\n            return true;\n        } else {\n            return _partialHits[offset_].hasGenomeCoords();\n        }\n    }\n    \n    bool hasAllGenomeCoords() {\n        if(_cur < _len) return false;\n        if(_partialHits.size() <= 0) return false;\n        for(size_t oi = 0; oi < _partialHits.size(); oi++) {\n            if(!_partialHits[oi].hasGenomeCoords())\n                return false;\n        }\n        return true;\n    }\n    \n    /**\n     * Return the width of the narrowest non-empty SA range among the\n     * partial hits (ties go to the longer hit) and store its index in\n     * 'offset'.  Returns OFF_MASK, leaving 'offset' untouched, if every\n     * partial hit is empty.\n     */\n    index_t minWidth(index_t& offset) const {\n        index_t minWidth_ = (index_t)OFF_MASK;\n        index_t minWidthLen_ = 0;\n        for(size_t oi = 0; oi < _partialHits.size(); oi++) {\n            const BWTHit<index_t>& hit = _partialHits[oi];\n            if(hit.empty()) continue;\n            // if(!hit.hasGenomeCoords()) continue;\n            assert_gt(hit.size(), 0);\n            if((minWidth_ > hit.size()) ||\n               (minWidth_ == hit.size() && minWidthLen_ < hit.len())) {\n                minWidth_ = hit.size();\n                minWidthLen_ = hit.len();\n                offset = (index_t)oi;\n            }\n        }\n        return minWidth_;\n    }\n    \n    // add policy for calculating a search score\n    int64_t searchScore(index_t minK) {\n        int64_t score = 0;\n        const int64_t penaltyPerOffset = minK * minK;\n        for(size_t i = 0; i < _partialHits.size(); i++) {\n            index_t len = _partialHits[i]._len;\n            score += (len * len);\n        }\n        \n        
assert_geq(_numPartialSearch, _partialHits.size());\n        index_t actualPartialSearch = numActualPartialSearch();\n        score -= (actualPartialSearch * penaltyPerOffset);\n        score -= (1 << (actualPartialSearch << 1));\n        return score;\n    }\n    \n    BWTHit<index_t>& getPartialHit(index_t offset_) {\n        assert_lt(offset_, _partialHits.size());\n        return _partialHits[offset_];\n    }\n    \n    bool adjustOffset(index_t minK) {\n        assert_gt(_partialHits.size(), 0);\n        const BWTHit<index_t>& hit = _partialHits.back();\n        if(hit.len() >= minK + 3) {\n            return false;\n        }\n        assert_geq(_cur, hit.len());\n        index_t origCur = _cur - hit.len();\n        _cur = origCur + max(hit.len(), minK + 1) - minK;\n        _partialHits.pop_back();\n        return true;\n    }\n    \n    void setOffset(index_t offset) {\n        //assert_lt(offset, _len); //FIXME: assertion fails as offset == _len\n        _cur = offset;\n    }\n    \n#ifndef NDEBUG\n\t/**\n\t */\n\tbool repOk() const {\n        for(size_t i = 0; i < _partialHits.size(); i++) {\n            if(i == 0) {\n                assert_geq(_partialHits[i]._bwoff, 0);\n            }\n            \n            if(i + 1 < _partialHits.size()) {\n                assert_leq(_partialHits[i]._bwoff + _partialHits[i]._len, _partialHits[i+1]._bwoff);\n            } else {\n                assert_eq(i+1, _partialHits.size());\n                assert_eq(_partialHits[i]._bwoff + _partialHits[i]._len, _cur);\n            }\n        }\n\t\treturn true;\n\t}\n#endif\n\t\n\tbool     _fw;\n\tindex_t  _len;\n    index_t  _cur;\n    bool     _done;\n    index_t  _numPartialSearch;\n    index_t  _numUniqueSearch;\n    index_t  _cur_local;\n    \n    EList<BWTHit<index_t> >  _partialHits;\n};\n\n\n/**\n * this is per-thread data, which are shared by GenomeHit classes\n * the main purpose of this struct is to avoid extensive use of memory related functions\n * such as new 
and delete - those are really slow and lock based\n */\ntemplate <typename index_t>\nstruct SharedTempVars {\n    SStringExpandable<char> raw_refbuf;\n    SStringExpandable<char> raw_refbuf2;\n    EList<int64_t> temp_scores;\n    EList<int64_t> temp_scores2;\n    ASSERT_ONLY(SStringExpandable<uint32_t> destU32);\n    \n    ASSERT_ONLY(BTDnaString editstr);\n    ASSERT_ONLY(BTDnaString partialseq);\n    ASSERT_ONLY(BTDnaString refstr);\n    ASSERT_ONLY(EList<index_t> reflens);\n    ASSERT_ONLY(EList<index_t> refoffs);\n    \n    LinkedEList<EList<Edit> > raw_edits;\n};\n\n/**\n * GenomeHit represents read alignment or alignment of a part of a read\n * Two GenomeHits that represents alignments of different parts of a read\n * can be combined together.  Also, GenomeHit can be extended in both directions.\n */\ntemplate <typename index_t>\nstruct GenomeHit {\n\t\n\tGenomeHit() :\n    _fw(false),\n    _rdoff((index_t)OFF_MASK),\n    _len((index_t)OFF_MASK),\n    _trim5(0),\n    _trim3(0),\n    _tidx((index_t)OFF_MASK),\n    _toff((index_t)OFF_MASK),\n    _edits(NULL),\n    _score(MIN_I64),\n    _hitcount(1),\n    _edits_node(NULL),\n    _sharedVars(NULL)\n    {\n    }\n    \n    GenomeHit(const GenomeHit& otherHit) :\n    _fw(false),\n    _rdoff((index_t)OFF_MASK),\n    _len((index_t)OFF_MASK),\n    _trim5(0),\n    _trim3(0),\n    _tidx((index_t)OFF_MASK),\n    _toff((index_t)OFF_MASK),\n    _edits(NULL),\n    _score(MIN_I64),\n    _hitcount(1),\n    _edits_node(NULL),\n    _sharedVars(NULL)\n    {\n        init(otherHit._fw,\n             otherHit._rdoff,\n             otherHit._len,\n             otherHit._trim5,\n             otherHit._trim3,\n             otherHit._tidx,\n             otherHit._toff,\n             *(otherHit._sharedVars),\n             otherHit._edits,\n             otherHit._score,\n             otherHit._splicescore);\n    }\n    \n    GenomeHit<index_t>& operator=(const GenomeHit<index_t>& otherHit) {\n        if(this == &otherHit) return 
*this;\n        init(otherHit._fw,\n             otherHit._rdoff,\n             otherHit._len,\n             otherHit._trim5,\n             otherHit._trim3,\n             otherHit._tidx,\n             otherHit._toff,\n             *(otherHit._sharedVars),\n             otherHit._edits,\n             otherHit._score,\n             otherHit._splicescore);\n        \n        return *this;\n    }\n    \n    ~GenomeHit() {\n        if(_edits_node != NULL) {\n            assert(_edits != NULL);\n            assert(_sharedVars != NULL);\n            _sharedVars->raw_edits.delete_node(_edits_node);\n            _edits = NULL;\n            _edits_node = NULL;\n            _sharedVars = NULL;\n        }\n    }\n\t\n\tvoid init(\n              bool                      fw,\n\t\t\t  index_t                   rdoff,\n\t\t\t  index_t                   len,\n              index_t                   trim5,\n              index_t                   trim3,\n              index_t                   tidx,\n              index_t                   toff,\n              SharedTempVars<index_t>&  sharedVars,\n              EList<Edit>*              edits = NULL,\n              int64_t                   score = 0,\n              double                    splicescore = 0.0)\n\t{\n\t\t_fw = fw;\n\t\t_rdoff = rdoff;\n\t\t_len = len;\n        _trim5 = trim5;\n        _trim3 = trim3;\n        _tidx = tidx;\n        _toff = toff;\n\t\t_score = score;\n        _splicescore = splicescore;\n        \n        assert(_sharedVars == NULL || _sharedVars == &sharedVars);\n        _sharedVars = &sharedVars;\n        if(_edits == NULL) {\n            assert(_edits_node == NULL);\n            _edits_node = _sharedVars->raw_edits.new_node();\n            assert(_edits_node != NULL);\n            _edits = &(_edits_node->payload);\n        }\n        assert(_edits != NULL);\n        _edits->clear();\n        \n        if(edits != NULL) *_edits = *edits;\n        _hitcount = 1;\n\t}\n    \n    bool inited() const 
{\n        return _len >= 0 && _len < (index_t)OFF_MASK;\n    }\n    \n    index_t rdoff() const { return _rdoff; }\n    index_t len()   const { return _len; }\n    index_t trim5() const { return _trim5; }\n    index_t trim3() const { return _trim3; }\n    \n    void trim5(index_t trim5) { _trim5 = trim5; }\n    void trim3(index_t trim3) { _trim3 = trim3; }\n    \n    index_t ref()    const { return _tidx; }\n    index_t refoff() const { return _toff; }\n    bool    fw()     const { return _fw; }\n    \n    index_t hitcount() const { return _hitcount; }\n    \n    /**\n     * Leftmost coordinate\n     */\n    Coord coord() const {\n        return Coord(_tidx, _toff, _fw);\n    }\n    \n    const EList<Edit>& edits() const { return *_edits; }\n    \n    bool operator== (const GenomeHit<index_t>& other) const {\n        if(_fw != other._fw ||\n           _rdoff != other._rdoff ||\n           _len != other._len ||\n           _tidx != other._tidx ||\n           _toff != other._toff ||\n           _trim5 != other._trim5 ||\n           _trim3 != other._trim3) {\n            return false;\n        }\n        \n        if(_edits->size() != other._edits->size()) return false;\n        for(index_t i = 0; i < _edits->size(); i++) {\n            if(!((*_edits)[i] == (*other._edits)[i])) return false;\n        }\n        // daehwan - this may not be true when some splice sites are provided from outside\n        // assert_eq(_score, other._score);\n        return true;\n    }\n    \n    bool contains(const GenomeHit<index_t>& other) const {\n        return (*this) == other;\n    }\n\n\n#ifndef NDEBUG\n\t/**\n\t * Check that hit is sane w/r/t read.\n\t */\n\tbool repOk(const Read& rd, const BitPairReference& ref);\n#endif\n    \npublic:\n\tbool            _fw;\n\tindex_t         _rdoff;\n\tindex_t         _len;\n    index_t         _trim5;\n    index_t         _trim3;\n    \n    index_t         _tidx;\n    index_t         _toff;\n\tEList<Edit>*    _edits;\n    int64_t         
_score;\n    double          _splicescore;\n    \n    index_t         _hitcount;  // for selection purposes\n    \n    LinkedEListNode<EList<Edit> >*  _edits_node;\n    SharedTempVars<index_t>* _sharedVars;\n};\n\n\n#ifndef NDEBUG\n/**\n * Check that hit is sane w/r/t read.\n */\ntemplate <typename index_t>\nbool GenomeHit<index_t>::repOk(const Read& rd, const BitPairReference& ref)\n{\n    assert(_sharedVars != NULL);\n    SStringExpandable<char>& raw_refbuf = _sharedVars->raw_refbuf;\n    SStringExpandable<uint32_t>& destU32 = _sharedVars->destU32;\n    \n    BTDnaString& editstr = _sharedVars->editstr;\n    BTDnaString& partialseq = _sharedVars->partialseq;\n    BTDnaString& refstr = _sharedVars->refstr;\n    EList<index_t>& reflens = _sharedVars->reflens;\n    EList<index_t>& refoffs = _sharedVars->refoffs;\n    \n    editstr.clear(); partialseq.clear(); refstr.clear();\n    reflens.clear(); refoffs.clear();\n    \n    const BTDnaString& seq = _fw ? rd.patFw : rd.patRc;\n    partialseq.install(seq.buf() + this->_rdoff, (size_t)this->_len);\n    Edit::toRef(partialseq, *_edits, editstr);\n    \n    index_t refallen = 0;\n    int64_t reflen = 0;\n    int64_t refoff = this->_toff;\n    refoffs.push_back(refoff);\n    size_t eidx = 0;\n    for(size_t i = 0; i < _len; i++, reflen++, refoff++) {\n        while(eidx < _edits->size() && (*_edits)[eidx].pos == i) {\n            const Edit& edit = (*_edits)[eidx];\n            if(edit.isReadGap()) {\n                reflen++;\n                refoff++;\n            } else if(edit.isRefGap()) {\n                reflen--;\n                refoff--;\n            }\n            if(edit.isSpliced()) {\n                assert_gt(reflen, 0);\n                refallen += reflen;\n                reflens.push_back((index_t)reflen);\n                reflen = 0;\n                refoff += edit.splLen;\n                assert_gt(refoff, 0);\n                refoffs.push_back((index_t)refoff);\n            }\n            eidx++;\n    
    }\n    }\n    assert_gt(reflen, 0);\n    refallen += (index_t)reflen;\n    reflens.push_back(reflen);\n    assert_gt(reflens.size(), 0);\n    assert_gt(refoffs.size(), 0);\n    assert_eq(reflens.size(), refoffs.size());\n    refstr.clear();\n    for(index_t i = 0; i < reflens.size(); i++) {\n        assert_gt(reflens[i], 0);\n        if(i > 0) {\n            assert_gt(refoffs[i], refoffs[i-1]);\n        }\n        raw_refbuf.resize(reflens[i] + 16);\n        raw_refbuf.clear();\n        int off = ref.getStretch(\n                                 reinterpret_cast<uint32_t*>(raw_refbuf.wbuf()),\n                                 (size_t)this->_tidx,\n                                 (size_t)max<TRefOff>(refoffs[i], 0),\n                                 reflens[i],\n                                 destU32);\n        assert_leq(off, 16);\n        for(index_t j = 0; j < reflens[i]; j++) {\n            char rfc = *(raw_refbuf.buf()+off+j);\n            refstr.append(rfc);\n        }\n    }\n    if(refstr != editstr) {\n        cerr << \"Decoded nucleotides and edits don't match reference:\" << endl;\n        //cerr << \"           score: \" << score.score()\n        //<< \" (\" << gaps << \" gaps)\" << endl;\n        cerr << \"           edits: \";\n        Edit::print(cerr, *_edits);\n        cerr << endl;\n        cerr << \"    decoded nucs: \" << partialseq << endl;\n        cerr << \"     edited nucs: \" << editstr << endl;\n        cerr << \"  reference nucs: \" << refstr << endl;\n        assert(0);\n    }\n\n    return true;\n}\n#endif\n\n\n/**\n * Encapsulates counters that measure how much work has been done by\n * hierarchical indexing\n */\nstruct HIMetrics {\n    \n\tHIMetrics() : mutex_m() {\n\t    reset();\n\t}\n    \n\tvoid reset() {\n\t\tanchoratts = 0;\n        localatts = 0;\n        localindexatts = 0;\n        localextatts = 0;\n        localsearchrecur = 0;\n        globalgenomecoords = 0;\n        localgenomecoords = 0;\n\t}\n\t\n\tvoid init(\n  
            uint64_t localatts_,\n              uint64_t anchoratts_,\n              uint64_t localindexatts_,\n              uint64_t localextatts_,\n              uint64_t localsearchrecur_,\n              uint64_t globalgenomecoords_,\n              uint64_t localgenomecoords_)\n\t{\n        localatts = localatts_;\n        anchoratts = anchoratts_;\n        localindexatts = localindexatts_;\n        localextatts = localextatts_;\n        localsearchrecur = localsearchrecur_;\n        globalgenomecoords = globalgenomecoords_;\n        localgenomecoords = localgenomecoords_;\n    }\n\t\n\t/**\n\t * Merge (add) the counters in the given HIMetrics object into this\n\t * object.  This is the only safe way to update a HIMetrics shared\n\t * by multiple threads; pass getLock = true in that case so that the\n\t * mutex is held during the update.\n\t */\n\tvoid merge(const HIMetrics& r, bool getLock = false) {\n        ThreadSafe ts(&mutex_m, getLock);\n        localatts += r.localatts;\n        anchoratts += r.anchoratts;\n        localindexatts += r.localindexatts;\n        localextatts += r.localextatts;\n        localsearchrecur += r.localsearchrecur;\n        globalgenomecoords += r.globalgenomecoords;\n        localgenomecoords += r.localgenomecoords;\n    }\n\t   \n    uint64_t localatts;      // # attempts of local search\n    uint64_t anchoratts;     // # attempts of anchor search\n    uint64_t localindexatts; // # attempts of local index search\n    uint64_t localextatts;   // # attempts of extension search\n    uint64_t localsearchrecur;\n    uint64_t globalgenomecoords;\n    uint64_t localgenomecoords;\n\t\n\tMUTEX_T mutex_m;\n};\n\n/**\n * With hierarchical indexing, SplicedAligner provides several alignment\n * strategies that enable effective alignment of RNA-seq reads\n */\ntemplate <typename index_t, typename local_index_t>\nclass HI_Aligner {\n\npublic:\n\t\n\t/**\n\t * Initialize with index.\n\t */\n\tHI_Aligner(\n               const Ebwt<index_t>& ebwt,\n               bool secondary = false,\n               bool local = 
false,\n               uint64_t threads_rids_mindist = 0,\n               bool no_spliced_alignment = false) :\n    _secondary(secondary),\n    _local(local),\n    _gwstate(GW_CAT),\n    _gwstate_local(GW_CAT),\n    _thread_rids_mindist(threads_rids_mindist),\n    _no_spliced_alignment(no_spliced_alignment)\n    {\n        index_t genomeLen = ebwt.eh().len();\n        _minK = 0;\n        while(genomeLen > 0) {\n            genomeLen >>= 2;\n            _minK++;\n        }\n        _minK_local = 8;\n    }\n    \n    HI_Aligner() {\n    }\n    \n    /**\n     */\n    void initRead(Read *rd, bool nofw, bool norc, TAlScore minsc, TAlScore maxpen, bool rightendonly = false) {\n        assert(rd != NULL);\n        _rds[0] = rd;\n        _rds[1] = NULL;\n\t\t_paired = false;\n        _rightendonly = rightendonly;\n        _nofw[0] = nofw;\n        _nofw[1] = true;\n        _norc[0] = norc;\n        _norc[1] = true;\n        _minsc[0] = minsc;\n        _minsc[1] = OFF_MASK;\n        _maxpen[0] = maxpen;\n        _maxpen[1] = OFF_MASK;\n        for(size_t fwi = 0; fwi < 2; fwi++) {\n            bool fw = (fwi == 0);\n            _hits[0][fwi].init(fw, _rds[0]->length());\n        }\n        _genomeHits.clear();\n        _concordantPairs.clear();\n        _hits_searched[0].clear();\n        assert(!_paired);\n    }\n    \n    /**\n     */\n    void initReads(Read *rds[2], bool nofw[2], bool norc[2], TAlScore minsc[2], TAlScore maxpen[2]) {\n        assert(rds[0] != NULL && rds[1] != NULL);\n\t\t_paired = true;\n        _rightendonly = false;\n        for(size_t rdi = 0; rdi < 2; rdi++) {\n            _rds[rdi] = rds[rdi];\n            _nofw[rdi] = nofw[rdi];\n            _norc[rdi] = norc[rdi];\n            _minsc[rdi] = minsc[rdi];\n            _maxpen[rdi] = maxpen[rdi];\n            for(size_t fwi = 0; fwi < 2; fwi++) {\n                bool fw = (fwi == 0);\n\t\t        _hits[rdi][fwi].init(fw, _rds[rdi]->length());\n            }\n            
_hits_searched[rdi].clear();\n        }\n        _genomeHits.clear();\n        _concordantPairs.clear();\n        assert(_paired);\n        assert(!_rightendonly);\n    }\n    \n    /**\n     * Aligns a read or a pair\n     * This function is called once per read or pair\n     */\n    virtual\n    int go(\n           const Scoring&           sc,\n           const Ebwt<index_t>&     ebwtFw,\n           const Ebwt<index_t>&     ebwtBw,\n           const BitPairReference&  ref,\n           WalkMetrics&             wlm,\n           PerReadMetrics&          prm,\n           HIMetrics&               him,\n\t\t   SpeciesMetrics&          spm,\n           RandomSource&            rnd,\n           AlnSinkWrap<index_t>&    sink) = 0;\n    \n   \t/**\n     * Align a part of a read without any edits\n\t */\n    size_t partialSearch(\n                         const Ebwt<index_t>&    ebwt,    // BWT index\n                         const Read&             read,    // read to align\n                         const Scoring&          sc,      // scoring scheme\n                         bool                    fw,      // true: align forward read; false: align revcomp read\n                         size_t                  mineMax, // don't care about edit bounds > this\n                         size_t&                 mineFw,  // minimum # edits for forward read\n                         size_t&                 mineRc,  // minimum # edits for revcomp read\n                         ReadBWTHit<index_t>&    hit,     // holds all the seed hits (and exact hit)\n                         RandomSource&           rnd);    // pseudo-random source\n    \nprotected:\n  \n    Read *   _rds[2];\n    bool     _paired;\n    bool     _rightendonly;\n    bool     _nofw[2];\n    bool     _norc[2];\n    TAlScore _minsc[2];\n    TAlScore _maxpen[2];\n    \n    bool     _secondary;  // allow secondary alignments\n    bool     _local;      // perform local alignments\n    \n    ReadBWTHit<index_t> _hits[2][2];\n    \n    EList<index_t, 16>                            
     _offs;\n    SARangeWithOffs<EListSlice<index_t, 16> >          _sas;\n    GroupWalk2S<index_t, EListSlice<index_t, 16>, 16>  _gws;\n    GroupWalkState<index_t>                            _gwstate;\n    \n    EList<local_index_t, 16>                                       _offs_local;\n    SARangeWithOffs<EListSlice<local_index_t, 16> >                _sas_local;\n    GroupWalk2S<local_index_t, EListSlice<local_index_t, 16>, 16>  _gws_local;\n    GroupWalkState<local_index_t>                                  _gwstate_local;\n            \n    // temporary and shared variables used for GenomeHit\n    // this should be defined before _genomeHits and _hits_searched\n    SharedTempVars<index_t> _sharedVars;\n    \n    // temporary and shared variables for AlnRes\n    LinkedEList<EList<Edit> > _rawEdits;\n    \n    // temporary\n    EList<GenomeHit<index_t> >     _genomeHits;\n    EList<bool>                    _genomeHits_done;\n    ELList<Coord>                  _coords;\n    \n    EList<pair<index_t, index_t> >  _concordantPairs;\n    \n    size_t _minK; // log4 of the size of a genome\n    size_t _minK_local; // log4 of the size of a local index (8)\n\n    ELList<GenomeHit<index_t> >     _local_genomeHits;\n    EList<uint8_t>                  _anchors_added;\n    uint64_t max_localindexatts;\n    \n\tuint64_t bwops_;                    // Burrows-Wheeler operations\n\tuint64_t bwedits_;                  // Burrows-Wheeler edits\n    \n    //\n    EList<GenomeHit<index_t> >     _hits_searched[2];\n    \n    uint64_t   _thread_rids_mindist;\n    bool _no_spliced_alignment;\n\n    // For AlnRes::matchesRef\n\tASSERT_ONLY(EList<bool> raw_matches_);\n\tASSERT_ONLY(BTDnaString tmp_rf_);\n\tASSERT_ONLY(BTDnaString tmp_rdseq_);\n\tASSERT_ONLY(BTString tmp_qseq_);\n};\n\n#define HIER_INIT_LOCS(top, bot, tloc, bloc, e) { \\\n\tif(bot - top == 1) { \\\n\t\ttloc.initFromRow(top, (e).eh(), (e).ebwt()); \\\n\t\tbloc.invalidate(); \\\n\t} else { 
\\\n\t\tSideLocus<index_t>::initFromTopBot(top, bot, (e).eh(), (e).ebwt(), tloc, bloc); \\\n\t\tassert(bloc.valid()); \\\n\t} \\\n}\n\n#define HIER_SANITY_CHECK_4TUP(t, b, tp, bp) { \\\n\tASSERT_ONLY(cur_index_t tot = (b[0]-t[0])+(b[1]-t[1])+(b[2]-t[2])+(b[3]-t[3])); \\\n\tASSERT_ONLY(cur_index_t totp = (bp[0]-tp[0])+(bp[1]-tp[1])+(bp[2]-tp[2])+(bp[3]-tp[3])); \\\n\tassert_eq(tot, totp); \\\n}\n\n/**\n * Sweep right-to-left and left-to-right using exact matching.  Remember all\n * the SA ranges encountered along the way.  Report exact matches if there are\n * any.  Calculate a lower bound on the number of edits in an end-to-end\n * alignment.\n */\ntemplate <typename index_t, typename local_index_t>\nsize_t HI_Aligner<index_t, local_index_t>::partialSearch(\n                                                         const Ebwt<index_t>&      ebwt,    // BWT index\n                                                         const Read&               read,    // read to align\n                                                         const Scoring&            sc,      // scoring scheme\n                                                         bool                      fw,\n                                                         size_t                    mineMax, // don't care about edit bounds > this\n                                                         size_t&                   mineFw,  // minimum # edits for forward read\n                                                         size_t&                   mineRc,  // minimum # edits for revcomp read\n                                                         ReadBWTHit<index_t>&      hit,     // holds all the seed hits (and exact hit)\n                                                         RandomSource&             rnd)     // pseudo-random source\n\n{\n\tconst index_t ftabLen = ebwt.eh().ftabChars();\n\tSideLocus<index_t> tloc, bloc;\n\tconst index_t len = (index_t)read.length();\n    const BTDnaString& seq = fw ? 
read.patFw : read.patRc;\n    assert(!seq.empty());\n    \n    size_t nelt = 0;\n    EList<BWTHit<index_t> >& partialHits = hit._partialHits;\n    index_t& cur = hit._cur;\n    assert_lt(cur, hit._len);\n    \n    hit._numPartialSearch++;\n    \n    index_t offset = cur;\n    index_t dep = offset;\n    index_t top = 0, bot = 0;\n    index_t topTemp = 0, botTemp = 0;\n    index_t left = len - dep;\n    assert_gt(left, 0);\n    if(left < ftabLen) {\n        cur = hit._len;\n        partialHits.expand();\n        partialHits.back().init((index_t)OFF_MASK,\n                                (index_t)OFF_MASK,\n                                fw,\n                                (uint32_t)offset,\n                                (uint32_t)(cur - offset));\n        hit.done(true);\n\t\treturn 0;\n    }\n    // Does N interfere with use of Ftab?\n    for(index_t i = 0; i < ftabLen; i++) {\n        int c = seq[len-dep-1-i];\n        if(c > 3) {\n            cur += (i+1);\n            partialHits.expand();\n            partialHits.back().init((index_t)OFF_MASK,\n                                    (index_t)OFF_MASK,\n                                    fw,\n                                    (uint32_t)offset,\n                                    (uint32_t)(cur - offset));\n            if(cur >= hit._len) {\n                hit.done(true);\n            }\n\t\t\treturn 0;\n        }\n    }\n    \n    // Use ftab\n    ebwt.ftabLoHi(seq, len - dep - ftabLen, false, top, bot);\n    dep += ftabLen;\n    if(bot <= top) {\n        cur = dep;\n        partialHits.expand();\n        partialHits.back().init((index_t)OFF_MASK,\n                                (index_t)OFF_MASK,\n                                fw,\n                                (uint32_t)offset,\n                                (uint32_t)(cur - offset));\n        if(cur >= hit._len) {\n            hit.done(true);\n        }\n        return 0;\n    }\n    HIER_INIT_LOCS(top, bot, tloc, bloc, ebwt);\n    // Keep going\n 
   while(dep < len) {\n        int c = seq[len-dep-1];\n        if(c > 3) {\n            topTemp = botTemp = 0;\n        } else {\n            if(bloc.valid()) {\n                bwops_ += 2;\n                topTemp = ebwt.mapLF(tloc, c);\n                botTemp = ebwt.mapLF(bloc, c);\n            } else {\n                bwops_++;\n                topTemp = ebwt.mapLF1(top, tloc, c);\n                if(topTemp == (index_t)OFF_MASK) {\n                    topTemp = botTemp = 0;\n                } else {\n                    botTemp = topTemp + 1;\n                }\n            }\n        }\n        if(botTemp <= topTemp) {\n            break;\n        }\n        top = topTemp;\n        bot = botTemp;\n        dep++;\n        HIER_INIT_LOCS(top, bot, tloc, bloc, ebwt);\n    }\n    \n    // Done\n    if(bot > top) {\n        // This is an exact hit\n        assert_gt(dep, offset);\n        assert_leq(dep, len);\n        partialHits.expand();\n        index_t hit_type = CANDIDATE_HIT;\n        partialHits.back().init(top,\n                                bot,\n                                fw,\n                                (uint32_t)offset,\n                                (uint32_t)(dep - offset),\n                                hit_type);\n        \n        nelt += (bot - top);\n        cur = dep;\n        if(cur >= hit._len) {\n            if(hit_type == CANDIDATE_HIT) hit._numUniqueSearch++;\n            hit.done(true);\n        }\n    }\n    return nelt;\n}\n\n#endif /*HI_ALIGNER_H_*/\n"
  },
  {
    "path": "hier_idx.h",
    "content": "/*\n * Copyright 2013, Daehwan Kim <infphilo@gmail.com>\n *\n * This file is part of Beast.  Beast is based on Bowtie 2.\n *\n * Beast is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Beast is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Beast.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef HIEREBWT_H_\n#define HIEREBWT_H_\n\n#include \"hier_idx_common.h\"\n#include \"bt2_idx.h\"\n#include \"bt2_io.h\"\n#include \"bt2_util.h\"\n\n/**\n * Extended Burrows-Wheeler transform data.\n * LocalEbwt is a specialized Ebwt index that represents ~64K bps\n * and therefore uses two bytes as offsets within 64K bps.\n * This class has only two additional member variables to denote the genomic sequenuce it represents:\n * (1) the contig index and (2) the offset within the contig.\n *\n */\ntemplate <typename index_t = uint16_t, typename full_index_t = uint32_t>\nclass LocalEbwt : public Ebwt<index_t> {\n\ttypedef Ebwt<index_t> PARENT_CLASS;\npublic:\n\t/// Construct an Ebwt from the given input file\n\tLocalEbwt(const string& in,\n\t\t\t  FILE *in5,\n\t\t\t  FILE *in6,\n\t\t\t  char *mmFile5,\n\t\t\t  char *mmFile6,\n\t\t\t  full_index_t& tidx,\n\t\t\t  full_index_t& localOffset,\n\t\t\t  bool switchEndian,\n\t\t\t  size_t& bytesRead,\n\t\t\t  int color,\n\t\t\t  int needEntireReverse,\n\t\t\t  bool fw,\n\t\t\t  int32_t overrideOffRate, // = -1,\n\t\t\t  int32_t offRatePlus, // = -1,\n\t\t\t  uint32_t lineRate,\n\t\t\t  uint32_t offRate,\n\t\t\t  uint32_t ftabChars,\n\t\t\t  bool useMm, 
// = false,\n\t\t\t  bool useShmem, // = false,\n\t\t\t  bool mmSweep, // = false,\n\t\t\t  bool loadNames, // = false,\n\t\t\t  bool loadSASamp, // = true,\n\t\t\t  bool loadFtab, // = true,\n\t\t\t  bool loadRstarts, // = true,\n\t\t\t  bool verbose, // = false,\n\t\t\t  bool startVerbose, // = false,\n\t\t\t  bool passMemExc, // = false,\n\t\t\t  bool sanityCheck) : // = false) :\n\tEbwt<index_t>(in,\n\t\t\t\t  color,\n\t\t\t\t  needEntireReverse,\n\t\t\t\t  fw,\n\t\t\t\t  overrideOffRate,\n\t\t\t\t  offRatePlus,\n\t\t\t\t  useMm,\n\t\t\t\t  useShmem,\n\t\t\t\t  mmSweep,\n\t\t\t\t  loadNames,\n\t\t\t\t  loadSASamp,\n\t\t\t\t  loadFtab,\n\t\t\t\t  loadRstarts,\n\t\t\t\t  verbose,\n\t\t\t\t  startVerbose,\n\t\t\t\t  passMemExc,\n\t\t\t\t  sanityCheck,\n\t\t\t\t  true)\n\t{\n\t\tthis->_in1Str = in + \".5.\" + gEbwt_ext;\n\t\tthis->_in2Str = in + \".6.\" + gEbwt_ext;\n\t\treadIntoMemory(\n\t\t\t\t\t   in5,\n\t\t\t\t\t   in6,\n\t\t\t\t\t   mmFile5,\n\t\t\t\t\t   mmFile6,\n\t\t\t\t\t   tidx,\n\t\t\t\t\t   localOffset,\n\t\t\t\t\t   switchEndian,\n\t\t\t\t\t   bytesRead,\n\t\t\t\t\t   color,\n\t\t\t\t\t   needEntireReverse,\n\t\t\t\t\t   loadSASamp,\n\t\t\t\t\t   loadFtab,\n\t\t\t\t\t   loadRstarts,\n\t\t\t\t\t   false,              //justHeader\n\t\t\t\t\t   lineRate,\n\t\t\t\t\t   offRate,\n\t\t\t\t\t   ftabChars,\n\t\t\t\t\t   mmSweep,\n\t\t\t\t\t   loadNames,\n\t\t\t\t\t   startVerbose);\n\t\t\n\t\t_tidx = tidx;\n\t\t_localOffset = localOffset;\n\t\t\n\t\t// If the offRate has been overridden, reflect that in the\n\t\t// _eh._offRate field\n\t\tif(offRatePlus > 0 && this->_overrideOffRate == -1) {\n\t\t\tthis->_overrideOffRate = this->_eh._offRate + offRatePlus;\n\t\t}\n\t\tif(this->_overrideOffRate > this->_eh._offRate) {\n\t\t\tthis->_eh.setOffRate(this->_overrideOffRate);\n\t\t\tassert_eq(this->_overrideOffRate, this->_eh._offRate);\n\t\t}\n\t\tassert(this->repOk());\n\t}\n\n\n\t/// Construct an Ebwt from the given header parameters and string\n\t/// vector, 
optionally using a blockwise suffix sorter with the\n\t/// given 'bmax' and 'dcv' parameters.  The string vector is\n\t/// ultimately joined and the joined string is passed to buildToDisk().\n\ttemplate<typename TStr>\n\tLocalEbwt(\n\t\t\t  TStr& s,\n\t\t\t  full_index_t tidx,\n\t\t\t  full_index_t local_offset,\n\t\t\t  index_t local_size,\n\t\t\t  bool packed,\n\t\t\t  int color,\n\t\t\t  int needEntireReverse,\n\t\t\t  int32_t lineRate,\n\t\t\t  int32_t offRate,\n\t\t\t  int32_t ftabChars,\n\t\t\t  const string& file,   // base filename for EBWT files\n\t\t\t  bool fw,\n\t\t\t  int dcv,\n\t\t\t  EList<RefRecord>& szs,\n\t\t\t  index_t sztot,\n\t\t\t  const RefReadInParams& refparams,\n\t\t\t  uint32_t seed,\n\t\t\t  ostream& out5,\n\t\t\t  ostream& out6,\n\t\t\t  int32_t overrideOffRate = -1,\n\t\t\t  bool verbose = false,\n\t\t\t  bool passMemExc = false,\n\t\t\t  bool sanityCheck = false) :\n\tEbwt<index_t>(packed,\n\t\t\t\t  color,\n\t\t\t\t  needEntireReverse,\n\t\t\t\t  lineRate,\n\t\t\t\t  offRate,\n\t\t\t\t  ftabChars,\n\t\t\t\t  file,\n\t\t\t\t  fw,\n\t\t\t\t  dcv,\n\t\t\t\t  szs,\n\t\t\t\t  sztot,\n\t\t\t\t  refparams,\n\t\t\t\t  seed,\n\t\t\t\t  overrideOffRate,\n\t\t\t\t  verbose,\n\t\t\t\t  passMemExc,\n\t\t\t\t  sanityCheck)\n\t{\n\t\tconst EbwtParams<index_t>& eh = this->_eh;\n\t\tassert(eh.repOk());\n\t\tuint32_t be = this->toBe();\n\t\tassert(out5.good());\n\t\tassert(out6.good());\n\t\twriteIndex<full_index_t>(out5, tidx, be);\n\t\twriteIndex<full_index_t>(out5, local_offset, be);\n\t\twriteU32(out5, eh._len,      be); // length of string (and bwt and suffix array)\n\t\tif(eh._len > 0) {\n\t\t\tassert_gt(szs.size(), 0);\n\t\t\tassert_gt(sztot, 0);\n\t\t\t// Not every fragment represents a distinct sequence - many\n\t\t\t// fragments may correspond to a single sequence.  
Count the\n\t\t\t// number of sequences here by counting the number of \"first\"\n\t\t\t// fragments.\n\t\t\tthis->_nPat = 0;\n\t\t\tthis->_nFrag = 0;\n\t\t\tfor(size_t i = 0; i < szs.size(); i++) {\n\t\t\t\tif(szs[i].len > 0) this->_nFrag++;\n\t\t\t\tif(szs[i].first && szs[i].len > 0) this->_nPat++;\n\t\t\t}\n\t\t\tassert_eq(this->_nPat, 1);\n\t\t\tassert_geq(this->_nFrag, this->_nPat);\n\t\t\tthis->_rstarts.reset();\n\t\t\twriteIndex(out5, this->_nPat, be);\n\t\t\tassert_eq(this->_nPat, 1);\n\t\t\tthis->_plen.init(new index_t[this->_nPat], this->_nPat);\n\t\t\t// For each pattern, set plen\n\t\t\tint npat = -1;\n\t\t\tfor(size_t i = 0; i < szs.size(); i++) {\n\t\t\t\tif(szs[i].first && szs[i].len > 0) {\n\t\t\t\t\tif(npat >= 0) {\n\t\t\t\t\t\twriteIndex(out5, this->plen()[npat], be);\n\t\t\t\t\t}\n\t\t\t\t\tnpat++;\n\t\t\t\t\tthis->plen()[npat] = (szs[i].len + szs[i].off);\n\t\t\t\t} else {\n\t\t\t\t\tthis->plen()[npat] += (szs[i].len + szs[i].off);\n\t\t\t\t}\n\t\t\t}\n\t\t\tassert_eq((index_t)npat, this->_nPat-1);\n\t\t\twriteIndex(out5, this->plen()[npat], be);\n\t\t\t// Write the number of fragments\n\t\t\twriteIndex(out5, this->_nFrag, be);\n\t\t\t\n\t\t\tif(refparams.reverse == REF_READ_REVERSE) {\n\t\t\t\tEList<RefRecord> tmp(EBWT_CAT);\n                reverseRefRecords(szs, tmp, false, verbose);\n\t\t\t\tthis->szsToDisk(tmp, out5, refparams.reverse);\n\t\t\t} else {\n\t\t\t\tthis->szsToDisk(szs, out5, refparams.reverse);\n\t\t\t}\n\t\t\t\n\t\t\tVMSG_NL(\"Constructing suffix-array element generator\");\n\t\t\tKarkkainenBlockwiseSA<TStr> bsa(s, s.length()+1, dcv, seed, this->_sanity, this->_passMemExc, this->_verbose);\n\t\t\tassert(bsa.suffixItrIsReset());\n\t\t\tassert_eq(bsa.size(), s.length()+1);\n\t\t\tVMSG_NL(\"Converting suffix-array elements to index image\");\n\t\t\tbuildToDisk(bsa, s, out5, out6);\n\t\t}\n\t\t\n\t\tout5.flush(); out6.flush();\n\t\tif(out5.fail() || out6.fail()) {\n\t\t\tcerr << \"An error occurred writing the index to disk.  
Please check if the disk is full.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t}\n\t\n\ttemplate <typename TStr> void buildToDisk(\n\t\t\t\t\t\t\t\t\t\t\t  InorderBlockwiseSA<TStr>& sa,\n\t\t\t\t\t\t\t\t\t\t\t  const TStr& s,\n\t\t\t\t\t\t\t\t\t\t\t  ostream& out1, \n\t\t\t\t\t\t\t\t\t\t\t  ostream& out2);\n\t\n\t// I/O\n\tvoid readIntoMemory(\n\t\t\t\t\t\tFILE *in5,\n\t\t\t\t\t\tFILE *in6,\n\t\t\t\t\t\tchar *mmFile5,\n\t\t\t\t\t\tchar *mmFile6,\n\t\t\t\t\t\tfull_index_t& tidx,\n\t\t\t\t\t\tfull_index_t& localOffset,\n\t\t\t\t\t\tbool switchEndian,\n\t\t\t\t\t\tsize_t bytesRead,\n\t\t\t\t\t\tint color,\n\t\t\t\t\t\tint needEntireRev, \n\t\t\t\t\t\tbool loadSASamp, \n\t\t\t\t\t\tbool loadFtab,\n\t\t\t\t\t\tbool loadRstarts, \n\t\t\t\t\t\tbool justHeader, \n\t\t\t\t\t\tint32_t lineRate,\n\t\t\t\t\t\tint32_t offRate,\n\t\t\t\t\t\tint32_t ftabChars,\n\t\t\t\t\t\tbool mmSweep, \n\t\t\t\t\t\tbool loadNames, \n\t\t\t\t\t\tbool startVerbose);\n\t\n\t/**\n\t * Sanity-check various pieces of the Ebwt\n\t */\n\tvoid sanityCheckAll(int reverse) const {\n\t\tif(this->_eh._len > 0) {\n\t\t\tPARENT_CLASS::sanityCheckAll(reverse);\n\t\t}\n\t}\n    \n    bool empty() const { return this->_eh._len == 0; }\n\t\npublic:\n\tfull_index_t _tidx;\n\tfull_index_t _localOffset;\n};\n\n/**\n * Build an Ebwt from a string 's' and its suffix array 'sa' (which\n * might actually be a suffix array *builder* that builds blocks of the\n * array on demand).  The bulk of the Ebwt, i.e. the ebwt and offs\n * arrays, is written directly to disk.  This is by design: keeping\n * those arrays in memory needlessly increases the footprint of the\n * building process.  Instead, we prefer to build the Ebwt directly\n * \"to disk\" and then read it back into memory later as necessary.\n *\n * It is assumed that the header values and join-related values (nPat,\n * plen) have already been written to 'out1' before this function\n * is called.  
When this function is finished, it will have\n * additionally written ebwt, zOff, fchr, ftab and eftab to the primary\n * file and offs to the secondary file.\n *\n * Assume DNA/RNA/any alphabet with 4 or fewer elements.\n * Assume occ array entries are 32 bits each.\n *\n * @param sa            the suffix array to convert to a Ebwt\n * @param s             the original string\n * @param out\n */\ntemplate <typename index_t, typename full_index_t>\ntemplate <typename TStr>\nvoid LocalEbwt<index_t, full_index_t>::buildToDisk(\n\t\t\t\t\t\t\t\t\t InorderBlockwiseSA<TStr>& sa,\n\t\t\t\t\t\t\t\t\t const TStr& s,\n\t\t\t\t\t\t\t\t\t ostream& out5,\n\t\t\t\t\t\t\t\t\t ostream& out6)\n{\n\tassert_leq(s.length(), std::numeric_limits<index_t>::max());\n\tconst EbwtParams<index_t>& eh = this->_eh;\n\t\n\tassert(eh.repOk());\n\tassert_eq(s.length()+1, sa.size());\n\tassert_eq(s.length(), eh._len);\n\tassert_gt(eh._lineRate, 3);\n\tassert(sa.suffixItrIsReset());\n\t\n\tindex_t len = eh._len;\n\tindex_t ftabLen = eh._ftabLen;\n\tindex_t sideSz = eh._sideSz;\n\tindex_t ebwtTotSz = eh._ebwtTotSz;\n\tindex_t fchr[] = {0, 0, 0, 0, 0};\n\tEList<index_t> ftab(EBWT_CAT);\n\tindex_t zOff = (index_t)OFF_MASK;\n\t\n\t// Save # of occurrences of each character as we walk along the bwt\n\tindex_t occ[4] = {0, 0, 0, 0};\n\tindex_t occSave[4] = {0, 0, 0, 0};\n\t\n\t// Record rows that should \"absorb\" adjacent rows in the ftab.\n\t// The absorbed rows represent suffixes shorter than the ftabChars\n\t// cutoff.\n\tuint8_t absorbCnt = 0;\n\tEList<uint8_t> absorbFtab(EBWT_CAT);\n\ttry {\n\t\tVMSG_NL(\"Allocating ftab, absorbFtab\");\n\t\tftab.resize(ftabLen);\n\t\tftab.fillZero();\n\t\tabsorbFtab.resize(ftabLen);\n\t\tabsorbFtab.fillZero();\n\t} catch(bad_alloc &e) {\n\t\tcerr << \"Out of memory allocating ftab[] or absorbFtab[] \"\n\t\t<< \"in Ebwt::buildToDisk() at \" << __FILE__ << \":\"\n\t\t<< __LINE__ << endl;\n\t\tthrow e;\n\t}\n\t\n\t// Allocate the side buffer; holds a single side as 
it's being\n\t// constructed and then written to disk.  Reused across all sides.\n#ifdef SIXTY4_FORMAT\n\tEList<uint64_t> ebwtSide(EBWT_CAT);\n#else\n\tEList<uint8_t> ebwtSide(EBWT_CAT);\n#endif\n\ttry {\n#ifdef SIXTY4_FORMAT\n\t\tebwtSide.resize(sideSz >> 3);\n#else\n\t\tebwtSide.resize(sideSz);\n#endif\n\t} catch(bad_alloc &e) {\n\t\tcerr << \"Out of memory allocating ebwtSide[] in \"\n\t\t<< \"Ebwt::buildToDisk() at \" << __FILE__ << \":\"\n\t\t<< __LINE__ << endl;\n\t\tthrow e;\n\t}\n\t\n\t// Points to the base offset within ebwt for the side currently\n\t// being written\n\tindex_t side = 0;\n\t\n\t// Whether we're assembling a forward or a reverse bucket\n\tbool fw;\n\tint sideCur = 0;\n\tfw = true;\n\t\n\t// Have we skipped the '$' in the last column yet?\n\tASSERT_ONLY(bool dollarSkipped = false);\n\n\tindex_t si = 0;   // string offset (chars)\n\tASSERT_ONLY(uint32_t lastSufInt = 0);\n\tASSERT_ONLY(bool inSA = true); // true iff saI still points inside suffix\n\t// array (as opposed to the padding at the\n\t// end)\n\t// Iterate over packed bwt bytes\n\tVMSG_NL(\"Entering Ebwt loop\");\n\tASSERT_ONLY(uint32_t beforeEbwtOff = (uint32_t)out5.tellp());\n\twhile(side < ebwtTotSz) {\n\t\t// Sanity-check our cursor into the side buffer\n\t\tassert_geq(sideCur, 0);\n\t\tassert_lt(sideCur, (int)eh._sideBwtSz);\n\t\tassert_eq(0, side % sideSz); // 'side' must be on side boundary\n\t\tebwtSide[sideCur] = 0; // clear\n\t\tassert_lt(side + sideCur, ebwtTotSz);\n\t\t// Iterate over bit-pairs in the si'th character of the BWT\n#ifdef SIXTY4_FORMAT\n\t\tfor(int bpi = 0; bpi < 32; bpi++, si++) {\n#else\n\t\tfor(int bpi = 0; bpi < 4; bpi++, si++) {\n#endif\n\t\t\tint bwtChar;\n\t\t\tbool count = true;\n\t\t\tif(si <= len) {\n\t\t\t\t// Still in the SA; extract the bwtChar\n\t\t\t\tindex_t saElt = (index_t)sa.nextSuffix();\n\t\t\t\t// (that might have triggered sa to calc next suf block)\n\t\t\t\tif(saElt == 0) {\n\t\t\t\t\t// Don't add the '$' in the last column to the 
BWT\n\t\t\t\t\t// transform; we can't encode a $ (only A C T or G)\n\t\t\t\t\t// and counting it as, say, an A, will mess up the\n\t\t\t\t\t// LF mapping\n\t\t\t\t\tbwtChar = 0; count = false;\n\t\t\t\t\tASSERT_ONLY(dollarSkipped = true);\n\t\t\t\t\tzOff = si; // remember the SA row that\n\t\t\t\t\t// corresponds to the 0th suffix\n\t\t\t\t} else {\n\t\t\t\t\tbwtChar = (int)(s[saElt-1]);\n\t\t\t\t\tassert_lt(bwtChar, 4);\n\t\t\t\t\t// Update the fchr\n\t\t\t\t\tfchr[bwtChar]++;\n\t\t\t\t}\n\t\t\t\t// Update ftab\n\t\t\t\tif((len-saElt) >= (index_t)eh._ftabChars) {\n\t\t\t\t\t// Turn the first ftabChars characters of the\n\t\t\t\t\t// suffix into an integer index into ftab.  The\n\t\t\t\t\t// leftmost (lowest index) character of the suffix\n\t\t\t\t\t// goes in the most significant bit pair of the\n\t\t\t\t\t// integer.\n\t\t\t\t\tuint32_t sufInt = 0;\n\t\t\t\t\tfor(int i = 0; i < eh._ftabChars; i++) {\n\t\t\t\t\t\tsufInt <<= 2;\n\t\t\t\t\t\tassert_lt((index_t)i, len-saElt);\n\t\t\t\t\t\tsufInt |= (unsigned char)(s[saElt+i]);\n\t\t\t\t\t}\n\t\t\t\t\t// Assert that this prefix-of-suffix is greater\n\t\t\t\t\t// than or equal to the last one (true b/c the\n\t\t\t\t\t// suffix array is sorted)\n#ifndef NDEBUG\n\t\t\t\t\tif(lastSufInt > 0) assert_geq(sufInt, lastSufInt);\n\t\t\t\t\tlastSufInt = sufInt;\n#endif\n\t\t\t\t\t// Update ftab\n\t\t\t\t\tassert_lt(sufInt+1, ftabLen);\n\t\t\t\t\tftab[sufInt+1]++;\n\t\t\t\t\tif(absorbCnt > 0) {\n\t\t\t\t\t\t// Absorb all short suffixes since the last\n\t\t\t\t\t\t// transition into this transition\n\t\t\t\t\t\tabsorbFtab[sufInt] = absorbCnt;\n\t\t\t\t\t\tabsorbCnt = 0;\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\t// Otherwise if suffix is fewer than ftabChars\n\t\t\t\t\t// characters long, then add it to the 'absorbCnt';\n\t\t\t\t\t// it will be absorbed into the next transition\n\t\t\t\t\tassert_lt(absorbCnt, 255);\n\t\t\t\t\tabsorbCnt++;\n\t\t\t\t}\n\t\t\t\t// Suffix array offset boundary? 
- update offset array\n\t\t\t\tif((si & eh._offMask) == si) {\n\t\t\t\t\tassert_lt((si >> eh._offRate), eh._offsLen);\n\t\t\t\t\t// Write offsets directly to the secondary output\n\t\t\t\t\t// stream, thereby avoiding keeping them in memory\n\t\t\t\t\twriteIndex(out6, saElt, this->toBe());\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// Strayed off the end of the SA, now we're just\n\t\t\t\t// padding out a bucket\n#ifndef NDEBUG\n\t\t\t\tif(inSA) {\n\t\t\t\t\t// Assert that we wrote all the characters in the\n\t\t\t\t\t// string before now\n\t\t\t\t\tassert_eq(si, len+1);\n\t\t\t\t\tinSA = false;\n\t\t\t\t}\n#endif\n\t\t\t\t// 'A' used for padding; important that padding be\n\t\t\t\t// counted in the occ[] array\n\t\t\t\tbwtChar = 0;\n\t\t\t}\n\t\t\tif(count) occ[bwtChar]++;\n\t\t\t// Append BWT char to bwt section of current side\n\t\t\tif(fw) {\n\t\t\t\t// Forward bucket: fill from least to most\n#ifdef SIXTY4_FORMAT\n\t\t\t\tebwtSide[sideCur] |= ((uint64_t)bwtChar << (bpi << 1));\n\t\t\t\tif(bwtChar > 0) assert_gt(ebwtSide[sideCur], 0);\n#else\n\t\t\t\tpack_2b_in_8b(bwtChar, ebwtSide[sideCur], bpi);\n\t\t\t\tassert_eq((ebwtSide[sideCur] >> (bpi*2)) & 3, bwtChar);\n#endif\n\t\t\t} else {\n\t\t\t\t// Backward bucket: fill from most to least\n#ifdef SIXTY4_FORMAT\n\t\t\t\tebwtSide[sideCur] |= ((uint64_t)bwtChar << ((31 - bpi) << 1));\n\t\t\t\tif(bwtChar > 0) assert_gt(ebwtSide[sideCur], 0);\n#else\n\t\t\t\tpack_2b_in_8b(bwtChar, ebwtSide[sideCur], 3-bpi);\n\t\t\t\tassert_eq((ebwtSide[sideCur] >> ((3-bpi)*2)) & 3, bwtChar);\n#endif\n\t\t\t}\n\t\t} // end loop over bit-pairs\n\t\tassert_eq(dollarSkipped ? 
3 : 0, (occ[0] + occ[1] + occ[2] + occ[3]) & 3);\n#ifdef SIXTY4_FORMAT\n\t\tassert_eq(0, si & 31);\n#else\n\t\tassert_eq(0, si & 3);\n#endif\n\t\t\n\t\tsideCur++;\n\t\tif(sideCur == (int)eh._sideBwtSz) {\n\t\t\tsideCur = 0;\n\t\t\tindex_t *uside = reinterpret_cast<index_t*>(ebwtSide.ptr());\n\t\t\t// Write 'A', 'C', 'G' and 'T' tallies\n\t\t\tside += sideSz;\n\t\t\tassert_leq(side, eh._ebwtTotSz);\n\t\t\tuside[(sideSz / sizeof(index_t))-4] = endianizeIndex(occSave[0], this->toBe());\n\t\t\tuside[(sideSz / sizeof(index_t))-3] = endianizeIndex(occSave[1], this->toBe());\n\t\t\tuside[(sideSz / sizeof(index_t))-2] = endianizeIndex(occSave[2], this->toBe());\n\t\t\tuside[(sideSz / sizeof(index_t))-1] = endianizeIndex(occSave[3], this->toBe());\n\t\t\toccSave[0] = occ[0];\n\t\t\toccSave[1] = occ[1];\n\t\t\toccSave[2] = occ[2];\n\t\t\toccSave[3] = occ[3];\n\t\t\t// Write backward side to primary file\n\t\t\tout5.write((const char *)ebwtSide.ptr(), sideSz);\n\t\t}\n\t}\n\tVMSG_NL(\"Exited Ebwt loop\");\n\tassert_neq(zOff, (index_t)OFF_MASK);\n\tif(absorbCnt > 0) {\n\t\t// Absorb any trailing, as-yet-unabsorbed short suffixes into\n\t\t// the last element of ftab\n\t\tabsorbFtab[ftabLen-1] = absorbCnt;\n\t}\n\t// Assert that our loop counter got incremented right to the end\n\tassert_eq(side, eh._ebwtTotSz);\n\t// Assert that we wrote the expected amount to out1\n\tassert_eq(((uint32_t)out5.tellp() - beforeEbwtOff), eh._ebwtTotSz);\n\t// assert that the last thing we did was write a forward bucket\n\t\n\t//\n\t// Write zOff to primary stream\n\t//\n\twriteIndex(out5, zOff, this->toBe());\n\t\n\t//\n\t// Finish building fchr\n\t//\n\t// Exclusive prefix sum on fchr\n\tfor(int i = 1; i < 4; i++) {\n\t\tfchr[i] += fchr[i-1];\n\t}\n\tassert_eq(fchr[3], len);\n\t// Shift everybody up by one\n\tfor(int i = 4; i >= 1; i--) {\n\t\tfchr[i] = fchr[i-1];\n\t}\n\tfchr[0] = 0;\n\tif(this->_verbose) {\n\t\tfor(int i = 0; i < 5; i++)\n\t\t\tcout << \"fchr[\" << \"ACGT$\"[i] << \"]: \" << 
fchr[i] << endl;\n\t}\n\t// Write fchr to primary file\n\tfor(int i = 0; i < 5; i++) {\n\t\twriteIndex(out5, fchr[i], this->toBe());\n\t}\n\t\n\t//\n\t// Finish building ftab and build eftab\n\t//\n\t// Prefix sum on ftable\n\tindex_t eftabLen = 0;\n\tassert_eq(0, absorbFtab[0]);\n\tfor(index_t i = 1; i < ftabLen; i++) {\n\t\tif(absorbFtab[i] > 0) eftabLen += 2;\n\t}\n\tassert_leq(eftabLen, (index_t)eh._ftabChars*2);\n\teftabLen = eh._ftabChars*2;\n\tEList<index_t> eftab(EBWT_CAT);\n\ttry {\n\t\teftab.resize(eftabLen);\n\t\teftab.fillZero();\n\t} catch(bad_alloc &e) {\n\t\tcerr << \"Out of memory allocating eftab[] \"\n\t\t<< \"in Ebwt::buildToDisk() at \" << __FILE__ << \":\"\n\t\t<< __LINE__ << endl;\n\t\tthrow e;\n\t}\n\tindex_t eftabCur = 0;\n\tfor(index_t i = 1; i < ftabLen; i++) {\n\t\tindex_t lo = ftab[i] + Ebwt<index_t>::ftabHi(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, i-1);\n\t\tif(absorbFtab[i] > 0) {\n\t\t\t// Skip a number of short pattern indicated by absorbFtab[i]\n\t\t\tindex_t hi = lo + absorbFtab[i];\n\t\t\tassert_lt(eftabCur*2+1, eftabLen);\n\t\t\teftab[eftabCur*2] = lo;\n\t\t\teftab[eftabCur*2+1] = hi;\n\t\t\tftab[i] = (eftabCur++) ^ (index_t)OFF_MASK; // insert pointer into eftab\n\t\t\tassert_eq(lo, Ebwt<index_t>::ftabLo(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, i));\n\t\t\tassert_eq(hi, Ebwt<index_t>::ftabHi(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, i));\n\t\t} else {\n\t\t\tftab[i] = lo;\n\t\t}\n\t}\n\tassert_eq(Ebwt<index_t>::ftabHi(ftab.ptr(), eftab.ptr(), len, ftabLen, eftabLen, ftabLen-1), len+1);\n\t// Write ftab to primary file\n\tfor(index_t i = 0; i < ftabLen; i++) {\n\t\twriteIndex(out5, ftab[i], this->toBe());\n\t}\n\t// Write eftab to primary file\n\tfor(index_t i = 0; i < eftabLen; i++) {\n\t\twriteIndex(out5, eftab[i], this->toBe());\n\t}\n\t\n\t// Note: if you'd like to sanity-check the Ebwt, you'll have to\n\t// read it back into memory first!\n\tassert(!this->isInMemory());\n\tVMSG_NL(\"Exiting 
Ebwt::buildToDisk()\");\n}\n\n/**\n * Read an Ebwt from file with given filename.\n */\ntemplate <typename index_t, typename full_index_t>\nvoid LocalEbwt<index_t, full_index_t>::readIntoMemory(\n\t\t\t\t\t\t\t\t\t\tFILE *in5,\n\t\t\t\t\t\t\t\t\t\tFILE *in6,\n\t\t\t\t\t\t\t\t\t\tchar *mmFile5,\n\t\t\t\t\t\t\t\t\t\tchar *mmFile6,\n\t\t\t\t\t\t\t\t\t\tfull_index_t& tidx,\n\t\t\t\t\t\t\t\t\t\tfull_index_t& localOffset,\n\t\t\t\t\t\t\t\t\t\tbool switchEndian,\n\t\t\t\t\t\t\t\t\t\tsize_t bytesRead,\n\t\t\t\t\t\t\t\t\t\tint color,\n\t\t\t\t\t\t\t\t\t\tint entireRev,\n\t\t\t\t\t\t\t\t\t\tbool loadSASamp,\n\t\t\t\t\t\t\t\t\t\tbool loadFtab,\n\t\t\t\t\t\t\t\t\t\tbool loadRstarts,\n\t\t\t\t\t\t\t\t\t\tbool justHeader,\n\t\t\t\t\t\t\t\t\t\tint32_t lineRate,\n\t\t\t\t\t\t\t\t\t\tint32_t offRate,\n\t\t\t\t\t\t\t\t\t\tint32_t ftabChars,\n\t\t\t\t\t\t\t\t\t\tbool mmSweep,\n\t\t\t\t\t\t\t\t\t\tbool loadNames,\n\t\t\t\t\t\t\t\t\t\tbool startVerbose)\n{\n#ifdef BOWTIE_MM\n\tchar *mmFile[] = { mmFile5, mmFile6 };\n#endif\n\t\n\t// Reads header entries one by one from primary stream\n\ttidx = readIndex<full_index_t>(in5, switchEndian); bytesRead += sizeof(full_index_t);\n\tlocalOffset = readIndex<full_index_t>(in5, switchEndian); bytesRead += sizeof(full_index_t);\n\tuint32_t len = readU32(in5, switchEndian); bytesRead += 4;\n\t\n\t// Create a new EbwtParams from the entries read from primary stream\n\tthis->_eh.init(len, lineRate, offRate, ftabChars, color, entireRev);\n\t\n\tif(len <= 0) {\n\t\treturn;\n\t}\n\t\n\t// Set up overridden suffix-array-sample parameters\n\tuint32_t offsLen = this->_eh._offsLen;\n\tuint32_t offRateDiff = 0;\n\tuint32_t offsLenSampled = offsLen;\n\tif(this->_overrideOffRate > offRate) {\n\t\toffRateDiff = this->_overrideOffRate - offRate;\n\t}\n\tif(offRateDiff > 0) {\n\t\toffsLenSampled >>= offRateDiff;\n\t\tif((offsLen & ~((index_t)OFF_MASK << offRateDiff)) != 0) {\n\t\t\toffsLenSampled++;\n\t\t}\n\t}\n\t\n\t// Can't override the offrate or isarate and 
use memory-mapped\n\t// files; ultimately, all processes need to copy the sparser sample\n\t// into their own memory spaces.\n\tif(this->_useMm && (offRateDiff)) {\n\t\tcerr << \"Error: Can't use memory-mapped files when the offrate is overridden\" << endl;\n\t\tthrow 1;\n\t}\n\t\n\t// Read nPat from primary stream\n\tthis->_nPat = readIndex<index_t>(in5, switchEndian);\n\tassert_eq(this->_nPat, 1);\n\tbytesRead += sizeof(index_t);\n\tthis->_plen.reset();\n\t\n\t// Read plen from primary stream\n\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\tthis->_plen.init((index_t*)(mmFile[0] + bytesRead), this->_nPat, false);\n\t\tbytesRead += this->_nPat*sizeof(index_t);\n\t\tfseek(in5, this->_nPat*sizeof(index_t), SEEK_CUR);\n#endif\n\t} else {\n\t\ttry {\n\t\t\tif(this->_verbose || startVerbose) {\n\t\t\t\tcerr << \"Reading plen (\" << this->_nPat << \"): \";\n\t\t\t\tlogTime(cerr);\n\t\t\t}\n\t\t\tthis->_plen.init(new index_t[this->_nPat], this->_nPat, true);\n\t\t\tif(switchEndian) {\n\t\t\t\tfor(index_t i = 0; i < this->_nPat; i++) {\n\t\t\t\t\tthis->plen()[i] = readIndex<index_t>(in5, switchEndian);\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tsize_t r = MM_READ(in5, (void*)(this->plen()), this->_nPat*sizeof(index_t));\n\t\t\t\tif(r != (size_t)(this->_nPat*sizeof(index_t))) {\n\t\t\t\t\tcerr << \"Error reading _plen[] array: \" << r << \", \" << this->_nPat*sizeof(index_t) << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t}\n\t\t} catch(bad_alloc& e) {\n\t\t\tcerr << \"Out of memory allocating plen[] in Ebwt::read()\"\n\t\t\t<< \" at \" << __FILE__ << \":\" << __LINE__ << endl;\n\t\t\tthrow e;\n\t\t}\n\t}\n\n\tbool shmemLeader;\n\t\n\t// TODO: I'm not consistent on what \"header\" means.  Here I'm using\n\t// \"header\" to mean everything that would exist in memory if we\n\t// started to build the Ebwt but stopped short of the build*() step\n\t// (i.e. 
everything up to and including join()).\n\tif(justHeader) return;\n\t\n\tthis->_nFrag = readIndex<index_t>(in5, switchEndian);\n\tbytesRead += sizeof(index_t);\n\tif(this->_verbose || startVerbose) {\n\t\tcerr << \"Reading rstarts (\" << this->_nFrag*3 << \"): \";\n\t\tlogTime(cerr);\n\t}\n\tassert_geq(this->_nFrag, this->_nPat);\n\tthis->_rstarts.reset();\n\tif(loadRstarts) {\n\t\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\t\tthis->_rstarts.init((index_t*)(mmFile[0] + bytesRead), this->_nFrag*3, false);\n\t\t\tbytesRead += this->_nFrag*sizeof(index_t)*3;\n\t\t\tfseek(in5, this->_nFrag*sizeof(index_t)*3, SEEK_CUR);\n#endif\n\t\t} else {\n\t\t\tthis->_rstarts.init(new index_t[this->_nFrag*3], this->_nFrag*3, true);\n\t\t\tif(switchEndian) {\n\t\t\t\tfor(index_t i = 0; i < this->_nFrag*3; i += 3) {\n\t\t\t\t\t// fragment starting position in joined reference\n\t\t\t\t\t// string, text id, and fragment offset within text\n\t\t\t\t\tthis->rstarts()[i]   = readIndex<index_t>(in5, switchEndian);\n\t\t\t\t\tthis->rstarts()[i+1] = readIndex<index_t>(in5, switchEndian);\n\t\t\t\t\tthis->rstarts()[i+2] = readIndex<index_t>(in5, switchEndian);\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tsize_t r = MM_READ(in5, (void *)this->rstarts(), this->_nFrag*sizeof(index_t)*3);\n\t\t\t\tif(r != (size_t)(this->_nFrag*sizeof(index_t)*3)) {\n\t\t\t\t\tcerr << \"Error reading _rstarts[] array: \" << r << \", \" << (this->_nFrag*sizeof(index_t)*3) << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t} else {\n\t\t// Skip em\n\t\tassert(this->rstarts() == NULL);\n\t\tbytesRead += this->_nFrag*sizeof(index_t)*3;\n\t\tfseek(in5, this->_nFrag*sizeof(index_t)*3, SEEK_CUR);\n\t}\n\t\n\tthis->_ebwt.reset();\n\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\tthis->_ebwt.init((uint8_t*)(mmFile[0] + bytesRead), this->_eh._ebwtTotLen, false);\n\t\tbytesRead += this->_eh._ebwtTotLen;\n\t\tfseek(in5, this->_eh._ebwtTotLen, SEEK_CUR);\n#endif\n\t} else {\n\t\t// Allocate ebwt (big allocation)\n\t\tif(this->_verbose 
|| startVerbose) {\n\t\t\tcerr << \"Reading ebwt (\" << this->_eh._ebwtTotLen << \"): \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\tbool shmemLeader = true;\n\t\tif(this->useShmem_) {\n\t\t\tuint8_t *tmp = NULL;\n\t\t\tshmemLeader = ALLOC_SHARED_U8(\n\t\t\t\t\t\t\t\t\t\t  (this->_in1Str + \"[ebwt]\"), this->_eh._ebwtTotLen, &tmp,\n\t\t\t\t\t\t\t\t\t\t  \"ebwt[]\", (this->_verbose || startVerbose));\n\t\t\tassert(tmp != NULL);\n\t\t\tthis->_ebwt.init(tmp, this->_eh._ebwtTotLen, false);\n\t\t\tif(this->_verbose || startVerbose) {\n\t\t\t\tcerr << \"  shared-mem \" << (shmemLeader ? \"leader\" : \"follower\") << endl;\n\t\t\t}\n\t\t} else {\n\t\t\ttry {\n\t\t\t\tthis->_ebwt.init(new uint8_t[this->_eh._ebwtTotLen], this->_eh._ebwtTotLen, true);\n\t\t\t} catch(bad_alloc& e) {\n\t\t\t\tcerr << \"Out of memory allocating the ebwt[] array for the Bowtie index.  Please try\" << endl\n\t\t\t\t<< \"again on a computer with more memory.\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t}\n\t\tif(shmemLeader) {\n\t\t\t// Read ebwt from primary stream\n\t\t\tuint64_t bytesLeft = this->_eh._ebwtTotLen;\n\t\t\tchar *pebwt = (char*)this->ebwt();\n            \n\t\t\twhile (bytesLeft>0){\n\t\t\t\tsize_t r = MM_READ(in5, (void *)pebwt, bytesLeft);\n\t\t\t\tif(MM_IS_IO_ERR(in5, r, bytesLeft)) {\n\t\t\t\t\tcerr << \"Error reading _ebwt[] array: \" << r << \", \"\n                    << bytesLeft << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tpebwt += r;\n\t\t\t\tbytesLeft -= r;\n\t\t\t}\n\t\t\tif(switchEndian) {\n\t\t\t\tuint8_t *side = this->ebwt();\n\t\t\t\tfor(size_t i = 0; i < this->_eh._numSides; i++) {\n\t\t\t\t\tindex_t *cums = reinterpret_cast<index_t*>(side + this->_eh._sideSz - sizeof(index_t)*2);\n\t\t\t\t\tcums[0] = endianSwapIndex(cums[0]);\n\t\t\t\t\tcums[1] = endianSwapIndex(cums[1]);\n\t\t\t\t\tside += this->_eh._sideSz;\n\t\t\t\t}\n\t\t\t}\n#ifdef BOWTIE_SHARED_MEM\n\t\t\tif(useShmem_) NOTIFY_SHARED(this->ebwt(), this->_eh._ebwtTotLen);\n#endif\n\t\t} else {\n\t\t\t// Seek past the 
data and wait until master is finished\n\t\t\tfseek(in5, this->_eh._ebwtTotLen, SEEK_CUR);\n#ifdef BOWTIE_SHARED_MEM\n\t\t\tif(useShmem_) WAIT_SHARED(this->ebwt(), this->_eh._ebwtTotLen);\n#endif\n\t\t}\n\t}\n\t\n\t// Read zOff from primary stream\n\tthis->_zOff = readIndex<index_t>(in5, switchEndian);\n\tbytesRead += sizeof(index_t);\n\tassert_lt(this->_zOff, len);\n\t\n\ttry {\n\t\t// Read fchr from primary stream\n\t\tif(this->_verbose || startVerbose) cerr << \"Reading fchr (5)\" << endl;\n\t\tthis->_fchr.reset();\n\t\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\t\tthis->_fchr.init((index_t*)(mmFile[0] + bytesRead), 5, false);\n\t\t\tbytesRead += 5*sizeof(index_t);\n\t\t\tfseek(in5, 5*sizeof(index_t), SEEK_CUR);\n#endif\n\t\t} else {\n\t\t\tthis->_fchr.init(new index_t[5], 5, true);\n\t\t\tfor(index_t i = 0; i < 5; i++) {\n\t\t\t\tthis->fchr()[i] = readIndex<index_t>(in5, switchEndian);\n\t\t\t\tassert_leq(this->fchr()[i], len);\n\t\t\t\tassert(i <= 0 || this->fchr()[i] >= this->fchr()[i-1]);\n\t\t\t}\n\t\t}\n\t\tassert_gt(this->fchr()[4], this->fchr()[0]);\n\t\t// Read ftab from primary stream\n\t\tif(this->_verbose || startVerbose) {\n\t\t\tif(loadFtab) {\n\t\t\t\tcerr << \"Reading ftab (\" << this->_eh._ftabLen << \"): \";\n\t\t\t\tlogTime(cerr);\n\t\t\t} else {\n\t\t\t\tcerr << \"Skipping ftab (\" << this->_eh._ftabLen << \"): \";\n\t\t\t}\n\t\t}\n\t\tthis->_ftab.reset();\n\t\tif(loadFtab) {\n\t\t\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t\tthis->_ftab.init((index_t*)(mmFile[0] + bytesRead), this->_eh._ftabLen, false);\n\t\t\t\tbytesRead += this->_eh._ftabLen*sizeof(index_t);\n\t\t\t\tfseek(in5, this->_eh._ftabLen*sizeof(index_t), SEEK_CUR);\n#endif\n\t\t\t} else {\n\t\t\t\tthis->_ftab.init(new index_t[this->_eh._ftabLen], this->_eh._ftabLen, true);\n\t\t\t\tif(switchEndian) {\n\t\t\t\t\tfor(uint32_t i = 0; i < this->_eh._ftabLen; i++)\n\t\t\t\t\t\tthis->ftab()[i] = readIndex<index_t>(in5, switchEndian);\n\t\t\t\t} else {\n\t\t\t\t\tsize_t r = MM_READ(in5, 
(void *)this->ftab(), this->_eh._ftabLen*sizeof(index_t));\n\t\t\t\t\tif(r != (size_t)(this->_eh._ftabLen*sizeof(index_t))) {\n\t\t\t\t\t\tcerr << \"Error reading _ftab[] array: \" << r << \", \" << (this->_eh._ftabLen*sizeof(index_t)) << endl;\n\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Read eftab from primary stream\n\t\t\tif(this->_verbose || startVerbose) {\n\t\t\t\tif(loadFtab) {\n\t\t\t\t\tcerr << \"Reading eftab (\" << this->_eh._eftabLen << \"): \";\n\t\t\t\t\tlogTime(cerr);\n\t\t\t\t} else {\n\t\t\t\t\tcerr << \"Skipping eftab (\" << this->_eh._eftabLen << \"): \";\n\t\t\t\t}\n\t\t\t\t\n\t\t\t}\n\t\t\tthis->_eftab.reset();\n\t\t\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t\tthis->_eftab.init((index_t*)(mmFile[0] + bytesRead), this->_eh._eftabLen, false);\n\t\t\t\tbytesRead += this->_eh._eftabLen*sizeof(index_t);\n\t\t\t\tfseek(in5, this->_eh._eftabLen*sizeof(index_t), SEEK_CUR);\n#endif\n\t\t\t} else {\n\t\t\t\tthis->_eftab.init(new index_t[this->_eh._eftabLen], this->_eh._eftabLen, true);\n\t\t\t\tif(switchEndian) {\n\t\t\t\t\tfor(uint32_t i = 0; i < this->_eh._eftabLen; i++)\n\t\t\t\t\t\tthis->eftab()[i] = readIndex<index_t>(in5, switchEndian);\n\t\t\t\t} else {\n\t\t\t\t\tsize_t r = MM_READ(in5, (void *)this->eftab(), this->_eh._eftabLen*sizeof(index_t));\n\t\t\t\t\tif(r != (size_t)(this->_eh._eftabLen*sizeof(index_t))) {\n\t\t\t\t\t\tcerr << \"Error reading _eftab[] array: \" << r << \", \" << (this->_eh._eftabLen*sizeof(index_t)) << endl;\n\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tfor(uint32_t i = 0; i < this->_eh._eftabLen; i++) {\n\t\t\t\tif(i > 0 && this->eftab()[i] > 0) {\n\t\t\t\t\tassert_geq(this->eftab()[i], this->eftab()[i-1]);\n\t\t\t\t} else if(i > 0 && this->eftab()[i-1] == 0) {\n\t\t\t\t\tassert_eq(0, this->eftab()[i]);\n\t\t\t\t}\n\t\t\t}\n\t\t} else {\n\t\t\tassert(this->ftab() == NULL);\n\t\t\tassert(this->eftab() == NULL);\n\t\t\t// Skip ftab\n\t\t\tbytesRead += 
this->_eh._ftabLen*sizeof(index_t);\n\t\t\tfseek(in5, this->_eh._ftabLen*sizeof(index_t), SEEK_CUR);\n\t\t\t// Skip eftab\n\t\t\tbytesRead += this->_eh._eftabLen*sizeof(index_t);\n\t\t\tfseek(in5, this->_eh._eftabLen*sizeof(index_t), SEEK_CUR);\n\t\t}\n\t} catch(bad_alloc& e) {\n\t\tcerr << \"Out of memory allocating fchr[], ftab[] or eftab[] arrays for the Bowtie index.\" << endl\n\t\t<< \"Please try again on a computer with more memory.\" << endl;\n\t\tthrow 1;\n\t}\n\t\n\tthis->_offs.reset();\n\tif(loadSASamp) {\n\t\tbytesRead = 4; // reset for secondary index file (already read 1-sentinel)\t\t\n\t\tshmemLeader = true;\n\t\tif(this->_verbose || startVerbose) {\n\t\t\tcerr << \"Reading offs (\" << offsLenSampled << \" \" << std::setw(2) << sizeof(index_t)*8 << \"-bit words): \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\t\n\t\tif(!this->_useMm) {\n\t\t\tif(!this->useShmem_) {\n\t\t\t\t// Allocate offs_\n\t\t\t\ttry {\n\t\t\t\t\tthis->_offs.init(new index_t[offsLenSampled], offsLenSampled, true);\n\t\t\t\t} catch(bad_alloc& e) {\n\t\t\t\t\tcerr << \"Out of memory allocating the offs[] array  for the Bowtie index.\" << endl\n\t\t\t\t\t<< \"Please try again on a computer with more memory.\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tindex_t *tmp = NULL;\n\t\t\t\tshmemLeader = ALLOC_SHARED_U32(\n\t\t\t\t\t\t\t\t\t\t\t   (this->_in2Str + \"[offs]\"), offsLenSampled*2, &tmp,\n\t\t\t\t\t\t\t\t\t\t\t   \"offs\", (this->_verbose || startVerbose));\n\t\t\t\tthis->_offs.init((index_t*)tmp, offsLenSampled, false);\n\t\t\t}\n\t\t}\n\t\t\n\t\tif(this->_overrideOffRate < 32) {\n\t\t\tif(shmemLeader) {\n\t\t\t\t// Allocate offs (big allocation)\n\t\t\t\tif(switchEndian || offRateDiff > 0) {\n\t\t\t\t\tassert(!this->_useMm);\n\t\t\t\t\tconst uint32_t blockMaxSz = (2 * 1024 * 1024); // 2 MB block size\n\t\t\t\t\tconst uint32_t blockMaxSzUIndex = (blockMaxSz / sizeof(index_t)); // # UIndexs per block\n\t\t\t\t\tchar *buf;\n\t\t\t\t\ttry {\n\t\t\t\t\t\tbuf = new 
char[blockMaxSz];\n\t\t\t\t\t} catch(std::bad_alloc& e) {\n\t\t\t\t\t\tcerr << \"Error: Out of memory allocating part of _offs array: '\" << e.what() << \"'\" << endl;\n\t\t\t\t\t\tthrow e;\n\t\t\t\t\t}\n\t\t\t\t\tfor(index_t i = 0; i < offsLen; i += blockMaxSzUIndex) {\n\t\t\t\t\t  index_t block = min<index_t>((index_t)blockMaxSzUIndex, (index_t)(offsLen - i));\n\t\t\t\t\t\tsize_t r = MM_READ(in6, (void *)buf, block * sizeof(index_t));\n\t\t\t\t\t\tif(r != (size_t)(block * sizeof(index_t))) {\n\t\t\t\t\t\t\tcerr << \"Error reading block of _offs[] array: \" << r << \", \" << (block * sizeof(index_t)) << endl;\n\t\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tindex_t idx = i >> offRateDiff;\n\t\t\t\t\t\tfor(index_t j = 0; j < block; j += (1 << offRateDiff)) {\n\t\t\t\t\t\t\tassert_lt(idx, offsLenSampled);\n\t\t\t\t\t\t\tthis->offs()[idx] = ((index_t*)buf)[j];\n\t\t\t\t\t\t\tif(switchEndian) {\n\t\t\t\t\t\t\t\tthis->offs()[idx] = endianSwapIndex(this->offs()[idx]);\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tidx++;\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tdelete[] buf;\n\t\t\t\t} else {\n\t\t\t\t\tif(this->_useMm) {\n#ifdef BOWTIE_MM\n\t\t\t\t\t\tthis->_offs.init((index_t*)(mmFile[1] + bytesRead), offsLen, false);\n\t\t\t\t\t\tbytesRead += (offsLen * sizeof(index_t));\n\t\t\t\t\t\tfseek(in6, (offsLen * sizeof(index_t)), SEEK_CUR);\n#endif\n\t\t\t\t\t} else {\n\t\t\t\t\t\t// If any of the high two bits are set\n\t\t\t\t\t\tif((offsLen & 0xc0000000) != 0) {\n\t\t\t\t\t\t\tif(sizeof(char *) <= 4) {\n\t\t\t\t\t\t\t\tcerr << \"Sanity error: sizeof(char *) <= 4 but offsLen is \" << hex << offsLen << endl;\n\t\t\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t// offsLen << 2 overflows, so do it in four reads\n\t\t\t\t\t\t\tchar *offs = (char *)this->offs();\n\t\t\t\t\t\t\tfor(size_t i = 0; i < sizeof(index_t); i++) {\n\t\t\t\t\t\t\t\tsize_t r = MM_READ(in6, (void*)offs, offsLen);\n\t\t\t\t\t\t\t\tif(r != (size_t)(offsLen)) {\n\t\t\t\t\t\t\t\t\tcerr << \"Error reading block of 
_offs[] array: \" << r << \", \" << offsLen << endl;\n\t\t\t\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\toffs += offsLen;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// Do it all in one read\n\t\t\t\t\t\t\tsize_t r = MM_READ(in6, (void*)this->offs(), offsLen * sizeof(index_t));\n\t\t\t\t\t\t\tif(r != (size_t)(offsLen * sizeof(index_t))) {\n\t\t\t\t\t\t\t\tcerr << \"Error reading _offs[] array: \" << r << \", \" << (offsLen * sizeof(index_t)) << endl;\n\t\t\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n#ifdef BOWTIE_SHARED_MEM\t\t\t\t\n\t\t\t\tif(this->useShmem_) NOTIFY_SHARED(this->offs(), offsLenSampled*sizeof(index_t));\n#endif\n\t\t\t} else {\n\t\t\t\t// Not the shmem leader\n\t\t\t\tfseek(in6, offsLenSampled*sizeof(index_t), SEEK_CUR);\n#ifdef BOWTIE_SHARED_MEM\t\t\t\t\n\t\t\t\tif(this->useShmem_) WAIT_SHARED(this->offs(), offsLenSampled*sizeof(index_t));\n#endif\n\t\t\t}\n\t\t}\n\t}\n\t\n\tthis->postReadInit(this->_eh); // Initialize fields of Ebwt not read from file\n\tif(this->_verbose || startVerbose) this->print(cerr, this->_eh);\n}\n\n/**\n * Extended Burrows-Wheeler transform data.\n * HierEbwt is a specialized Ebwt index that represents one global index and a large set of local indexes.\n *\n */\ntemplate <typename index_t = uint32_t, typename local_index_t = uint16_t>\nclass HierEbwt : public Ebwt<index_t> {\n\ttypedef Ebwt<index_t> PARENT_CLASS;\npublic:\n\t/// Construct an Ebwt from the given input file\n\tHierEbwt(const string& in,\n\t\t\t int color,\n\t\t\t int needEntireReverse,\n\t\t\t bool fw,\n\t\t\t int32_t overrideOffRate, // = -1,\n\t\t\t int32_t offRatePlus, // = -1,\n\t\t\t bool useMm, // = false,\n\t\t\t bool useShmem, // = false,\n\t\t\t bool mmSweep, // = false,\n\t\t\t bool loadNames, // = false,\n\t\t\t bool loadSASamp, // = true,\n\t\t\t bool loadFtab, // = true,\n\t\t\t bool loadRstarts, // = true,\n\t\t\t bool verbose, // = false,\n\t\t\t bool startVerbose, // = false,\n\t\t\t 
bool passMemExc, // = false,\n\t\t\t bool sanityCheck, // = false\n             bool skipLoading = false) :\n\t         Ebwt<index_t>(in,\n\t\t\t\t\t\t   color,\n\t\t\t\t\t\t   needEntireReverse,\n\t\t\t\t\t\t   fw,\n\t\t\t\t\t\t   overrideOffRate,\n\t\t\t\t\t\t   offRatePlus,\n\t\t\t\t\t\t   useMm,\n\t\t\t\t\t\t   useShmem,\n\t\t\t\t\t\t   mmSweep,\n\t\t\t\t\t\t   loadNames,\n\t\t\t\t\t\t   loadSASamp,\n\t\t\t\t\t\t   loadFtab,\n\t\t\t\t\t\t   loadRstarts,\n\t\t\t\t\t\t   verbose,\n\t\t\t\t\t\t   startVerbose,\n\t\t\t\t\t\t   passMemExc,\n\t\t\t\t\t\t   sanityCheck,\n\t\t\t\t\t\t   skipLoading),\n\t         _in5(NULL),\n\t         _in6(NULL)\n\t{\n\t\t_in5Str = in + \".5.\" + gEbwt_ext;\n\t\t_in6Str = in + \".6.\" + gEbwt_ext;\n        \n        if(!skipLoading && false) {\n            readIntoMemory(\n                           color,       // expect index to be colorspace?\n                           fw ? -1 : needEntireReverse, // need REF_READ_REVERSE\n                           loadSASamp,  // load the SA sample portion?\n                           loadFtab,    // load the ftab & eftab?\n                           loadRstarts, // load the rstarts array?\n                           true,        // stop after loading the header portion?\n                           &(this->_eh),\n                           mmSweep,     // mmSweep\n                           loadNames,   // loadNames\n                           startVerbose); // startVerbose\n            // If the offRate has been overridden, reflect that in the\n            // _eh._offRate field\n            if(offRatePlus > 0 && this->_overrideOffRate == -1) {\n                this->_overrideOffRate = this->_eh._offRate + offRatePlus;\n            }\n            if(this->_overrideOffRate > this->_eh._offRate) {\n                this->_eh.setOffRate(this->_overrideOffRate);\n                assert_eq(this->_overrideOffRate, this->_eh._offRate);\n            }\n            assert(this->repOk());\n        
}\n\t}\n\t\n\t/// Construct an Ebwt from the given header parameters and string\n\t/// vector, optionally using a blockwise suffix sorter with the\n\t/// given 'bmax' and 'dcv' parameters.  The string vector is\n\t/// ultimately joined and the joined string is passed to buildToDisk().\n\ttemplate<typename TStr>\n\tHierEbwt(\n\t\t\t TStr& s,\n\t\t\t bool packed,\n\t\t\t int color,\n\t\t\t int needEntireReverse,\n\t\t\t int32_t lineRate,\n\t\t\t int32_t offRate,\n\t\t\t int32_t ftabChars,\n             int32_t localOffRate,\n             int32_t localFtabChars,\n\t\t\t const string& file,   // base filename for EBWT files\n\t\t\t bool fw,\n\t\t\t bool useBlockwise,\n\t\t\t TIndexOffU bmax,\n\t\t\t TIndexOffU bmaxSqrtMult,\n\t\t\t TIndexOffU bmaxDivN,\n\t\t\t int dcv,\n\t\t\t EList<FileBuf*>& is,\n\t\t\t EList<RefRecord>& szs,\n\t\t\t index_t sztot,\n\t\t\t const RefReadInParams& refparams,\n\t\t\t uint32_t seed,\n\t\t\t int32_t overrideOffRate = -1,\n\t\t\t bool verbose = false,\n\t\t\t bool passMemExc = false,\n\t\t\t bool sanityCheck = false);\n\t        \t\n\t~HierEbwt() {\n\t\tclearLocalEbwts();\n\t}\n    \n    /**\n\t * Load this Ebwt into memory by reading it in from the _in1 and\n\t * _in2 streams.\n\t */\n\tvoid loadIntoMemory(\n                        int color,\n                        int needEntireReverse,\n                        bool loadSASamp,\n                        bool loadFtab,\n                        bool loadRstarts,\n                        bool loadNames,\n                        bool verbose)\n\t{\n\t\treadIntoMemory(\n                       color,       // expect index to be colorspace?\n                       needEntireReverse, // require reverse index to be concatenated reference reversed\n                       loadSASamp,  // load the SA sample portion?\n                       loadFtab,    // load the ftab (_ftab[] and _eftab[])?\n                       loadRstarts, // load the r-starts (_rstarts[])?\n                       false,      
 // stop after loading the header portion?\n                       NULL,        // params\n                       false,       // mmSweep\n                       loadNames,   // loadNames\n                       verbose);    // startVerbose\n\t}\n\t\n\t// I/O\n\tvoid readIntoMemory(\n                        int color,\n                        int needEntireRev,\n                        bool loadSASamp,\n                        bool loadFtab,\n                        bool loadRstarts,\n                        bool justHeader,\n                        EbwtParams<index_t> *params,\n                        bool mmSweep,\n                        bool loadNames,\n                        bool startVerbose);\n\t\n\t/**\n\t * Frees memory associated with the Ebwt.\n\t */\n\tvoid evictFromMemory() {\n\t\tassert(PARENT_CLASS::isInMemory());\n\t\tclearLocalEbwts();\n\t\tPARENT_CLASS::evictFromMemory();\t\t\n\t}\n\t\n\t/**\n\t * Sanity-check various pieces of the Ebwt\n\t */\n\tvoid sanityCheckAll(int reverse) const {\n\t\tPARENT_CLASS::sanityCheckAll(reverse);\n\t\tfor(size_t tidx = 0; tidx < _localEbwts.size(); tidx++) {\n\t\t\tfor(size_t local_idx = 0; local_idx < _localEbwts[tidx].size(); local_idx++) {\n\t\t\t\tassert(_localEbwts[tidx][local_idx] != NULL);\n\t\t\t\t_localEbwts[tidx][local_idx]->sanityCheckAll(reverse);\n\t\t\t}\n\t\t}\n\t}\n    \n    const LocalEbwt<local_index_t, index_t>* getLocalEbwt(index_t tidx, index_t offset) const {\n        assert_lt(tidx, _localEbwts.size());\n        const EList<LocalEbwt<local_index_t, index_t>*>& localEbwts = _localEbwts[tidx];\n        index_t offsetidx = offset / local_index_interval;\n        if(offsetidx >= localEbwts.size()) {\n            return NULL;\n        } else {\n            return localEbwts[offsetidx];\n        }\n    }\n    \n    const LocalEbwt<local_index_t, index_t>* prevLocalEbwt(const LocalEbwt<local_index_t, index_t>* currLocalEbwt) const {\n        assert(currLocalEbwt != NULL);\n        index_t tidx = 
currLocalEbwt->_tidx;\n        index_t offset = currLocalEbwt->_localOffset;\n        if(offset < local_index_interval) {\n            return NULL;\n        } else {\n            return getLocalEbwt(tidx, offset - local_index_interval);\n        }\n    }\n    \n    const LocalEbwt<local_index_t, index_t>* nextLocalEbwt(const LocalEbwt<local_index_t, index_t>* currLocalEbwt) const {\n        assert(currLocalEbwt != NULL);\n        index_t tidx = currLocalEbwt->_tidx;\n        index_t offset = currLocalEbwt->_localOffset;\n        return getLocalEbwt(tidx, offset + local_index_interval);\n    }\n\t\n\tvoid clearLocalEbwts() {\n\t\tfor(size_t tidx = 0; tidx < _localEbwts.size(); tidx++) {\n\t\t\tfor(size_t local_idx = 0; local_idx < _localEbwts[tidx].size(); local_idx++) {\n\t\t\t\tassert(_localEbwts[tidx][local_idx] != NULL);\n\t\t\t\tdelete _localEbwts[tidx][local_idx];\n\t\t\t}\n\t\t\t\n\t\t\t_localEbwts[tidx].clear();\n\t\t}\n\t\t\n\t\t_localEbwts.clear();\n\t}\n\t\n\npublic:\n\tindex_t                                  _nrefs;      /// the number of reference sequences\n\tEList<index_t>                           _refLens;    /// approx lens of ref seqs (excludes trailing ambig chars)\n\t\n\tEList<EList<LocalEbwt<local_index_t, index_t>*> > _localEbwts;\n\tindex_t                                  _nlocalEbwts;\n\t\n\tFILE                                     *_in5;    // input fd for primary index file\n\tFILE                                     *_in6;    // input fd for secondary index file\n\tstring                                   _in5Str;\n\tstring                                   _in6Str;\n\t\n\tchar                                     *mmFile5_;\n\tchar                                     *mmFile6_;\n};\n    \n/// Construct an Ebwt from the given header parameters and string\n/// vector, optionally using a blockwise suffix sorter with the\n/// given 'bmax' and 'dcv' parameters.  
The string vector is\n/// ultimately joined and the joined string is passed to buildToDisk().\ntemplate <typename index_t, typename local_index_t>\ntemplate <typename TStr>\nHierEbwt<index_t, local_index_t>::HierEbwt(\n                                           TStr& s,\n                                           bool packed,\n                                           int color,\n                                           int needEntireReverse,\n                                           int32_t lineRate,\n                                           int32_t offRate,\n                                           int32_t ftabChars,\n                                           int32_t localOffRate,\n                                           int32_t localFtabChars,\n                                           const string& file,   // base filename for EBWT files\n                                           bool fw,\n                                           bool useBlockwise,\n                                           TIndexOffU bmax,\n                                           TIndexOffU bmaxSqrtMult,\n                                           TIndexOffU bmaxDivN,\n                                           int dcv,\n                                           EList<FileBuf*>& is,\n                                           EList<RefRecord>& szs,\n                                           index_t sztot,\n                                           const RefReadInParams& refparams,\n                                           uint32_t seed,\n                                           int32_t overrideOffRate,\n                                           bool verbose,\n                                           bool passMemExc,\n                                           bool sanityCheck) :\n    Ebwt<index_t>(s,\n                  packed,\n                  color,\n                  needEntireReverse,\n                  lineRate,\n                  offRate,\n                 
 ftabChars,\n                  file,\n                  fw,\n                  useBlockwise,\n                  bmax,\n                  bmaxSqrtMult,\n                  bmaxDivN,\n                  dcv,\n                  is,\n                  szs,\n                  sztot,\n                  refparams,\n                  seed,\n                  overrideOffRate,\n                  verbose,\n                  passMemExc,\n                  sanityCheck),\n    _in5(NULL),\n    _in6(NULL)\n{\n    _in5Str = file + \".5.\" + gEbwt_ext;\n    _in6Str = file + \".6.\" + gEbwt_ext;\n    \n    // Open output files\n    ofstream fout5(_in5Str.c_str(), ios::binary);\n    if(!fout5.good()) {\n        cerr << \"Could not open index file for writing: \\\"\" << _in5Str.c_str() << \"\\\"\" << endl\n        << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n        << \"Bowtie.\" << endl;\n        throw 1;\n    }\n    ofstream fout6(_in6Str.c_str(), ios::binary);\n    if(!fout6.good()) {\n        cerr << \"Could not open index file for writing: \\\"\" << _in6Str.c_str() << \"\\\"\" << endl\n        << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n        << \"Bowtie.\" << endl;\n        throw 1;\n    }\n    \n    // split the whole genome into a set of local indexes\n    _nrefs = 0;\n    _nlocalEbwts = 0;\n    \n    index_t cumlen = 0;\n    typedef EList<RefRecord, 1> EList_RefRecord;\n    EList<EList<EList_RefRecord> > all_local_recs;\n    // For each unambiguous stretch...\n    for(index_t i = 0; i < szs.size(); i++) {\n        const RefRecord& rec = szs[i];\n        if(rec.first) {\n            if(_nrefs > 0) {\n                // refLens_ links each reference sequence with the total number\n                // of ambiguous and unambiguous characters in it.\n                _refLens.push_back(cumlen);\n            }\n            cumlen = 0;\n            _nrefs++;\n            
all_local_recs.expand();\n            assert_eq(_nrefs, all_local_recs.size());\n        } else if(i == 0) {\n            cerr << \"First record in reference index file was not marked as \"\n            << \"'first'\" << endl;\n            throw 1;\n        }\n        \n        assert_gt(_nrefs, 0);\n        assert_eq(_nrefs, all_local_recs.size());\n        EList<EList_RefRecord>& ref_local_recs = all_local_recs[_nrefs-1];\n        index_t next_cumlen = cumlen + rec.off + rec.len;\n        index_t local_off = (cumlen / local_index_interval) * local_index_interval;\n        if(local_off >= local_index_interval) {\n            local_off -= local_index_interval;\n        }\n        for(;local_off < next_cumlen; local_off += local_index_interval) {\n            if(local_off + local_index_size < cumlen) {\n                continue;\n            }\n            index_t local_idx = local_off / local_index_interval;\n            \n            if(local_idx >= ref_local_recs.size()) {\n                assert_eq(local_idx, ref_local_recs.size());\n                ref_local_recs.expand();\n                _nlocalEbwts++;\n            }\n            assert_lt(local_idx, ref_local_recs.size());\n            EList_RefRecord& local_recs = ref_local_recs[local_idx];\n            assert_gt(local_off + local_index_size, cumlen);\n            local_recs.expand();\n            if(local_off + local_index_size <= cumlen + rec.off) {\n                local_recs.back().off = local_off + local_index_size - std::max(local_off, cumlen);\n                local_recs.back().len = 0;\n            } else {\n                if(local_off < cumlen + rec.off) {\n                    local_recs.back().off = rec.off - (local_off > cumlen ? 
local_off - cumlen : 0);\n                } else {\n                    local_recs.back().off = 0;\n                }\n                local_recs.back().len = std::min(next_cumlen, local_off + local_index_size) - std::max(local_off, cumlen + rec.off);\n            }\n            local_recs.back().first = (local_recs.size() == 1);\n        }\n        cumlen = next_cumlen;\n    }\n    \n    // Store a cap entry for the end of the last reference seq\n    _refLens.push_back(cumlen);\n    \n#ifndef NDEBUG\n    EList<RefRecord> temp_szs;\n    index_t temp_sztot = 0;\n    index_t temp_nlocalEbwts = 0;\n    for(size_t tidx = 0; tidx < all_local_recs.size(); tidx++) {\n        assert_lt(tidx, _refLens.size());\n        EList<EList_RefRecord>& ref_local_recs = all_local_recs[tidx];\n        assert_eq((_refLens[tidx] + local_index_interval - 1) / local_index_interval, ref_local_recs.size());\n        temp_szs.expand();\n        temp_szs.back().off = 0;\n        temp_szs.back().len = 0;\n        temp_szs.back().first = true;\n        index_t temp_ref_len = 0;\n        index_t temp_ref_sztot = 0;\n        temp_nlocalEbwts += ref_local_recs.size();\n        for(size_t i = 0; i < ref_local_recs.size(); i++) {\n            EList_RefRecord& local_recs = ref_local_recs[i];\n            index_t local_len = 0;\n            for(size_t j = 0; j < local_recs.size(); j++) {\n                assert(local_recs[j].off != 0 || local_recs[j].len != 0);\n                assert(j != 0 || local_recs[j].first);\n                RefRecord local_rec = local_recs[j];\n                if(local_len < local_index_interval && local_recs[j].off > 0){\n                    if(local_len + local_recs[j].off > local_index_interval) {\n                        temp_ref_len += (local_index_interval - local_len);\n                        local_rec.off = local_index_interval - local_len;\n                    } else {\n                        temp_ref_len += local_recs[j].off;\n                    }\n                
} else {\n                    local_rec.off = 0;\n                }\n                local_len += local_recs[j].off;\n                if(local_len < local_index_interval && local_recs[j].len > 0) {\n                    if(local_len + local_recs[j].len > local_index_interval) {\n                        temp_ref_len += (local_index_interval - local_len);\n                        temp_ref_sztot += (local_index_interval - local_len);\n                        local_rec.len = local_index_interval - local_len;\n                    } else {\n                        temp_ref_len += local_recs[j].len;\n                        temp_ref_sztot += local_recs[j].len;\n                    }\n                } else {\n                    local_rec.len = 0;\n                }\n                local_len += local_recs[j].len;\n                if(local_rec.off > 0) {\n                    if(temp_szs.back().len > 0) {\n                        temp_szs.expand();\n                        temp_szs.back().off = local_rec.off;\n                        temp_szs.back().len = local_rec.len;\n                        temp_szs.back().first = false;\n                    } else {\n                        temp_szs.back().off += local_rec.off;\n                        temp_szs.back().len = local_rec.len;\n                    }\n                } else if(local_rec.len > 0) {\n                    temp_szs.back().len += local_rec.len;\n                }\n            }\n            if(i + 1 < ref_local_recs.size()) {\n                assert_eq(local_len, local_index_size);\n                assert_eq(temp_ref_len % local_index_interval, 0);\n            } else {\n                assert_eq(local_len, _refLens[tidx] % local_index_interval);\n            }\n        }\n        assert_eq(temp_ref_len, _refLens[tidx]);\n        temp_sztot += temp_ref_sztot;\n    }\n    assert_eq(temp_sztot, sztot);\n    for(size_t i = 0; i < temp_szs.size(); i++) {\n        assert_lt(i, szs.size());\n        
assert_eq(temp_szs[i].off, szs[i].off);\n        assert_eq(temp_szs[i].len, szs[i].len);\n        assert_eq(temp_szs[i].first, szs[i].first);\n    }\n    assert_eq(temp_szs.size(), szs.size());\n    assert_eq(_nlocalEbwts, temp_nlocalEbwts);\n#endif\n    \n    uint32_t be = this->toBe();\n    assert(fout5.good());\n    assert(fout6.good());\n    \n    // When building an Ebwt, these header parameters are known\n    // \"up-front\", i.e., they can be written to disk immediately,\n    // before we join() or buildToDisk()\n    writeI32(fout5, 1, be); // endian hint for primary stream\n    writeI32(fout6, 1, be); // endian hint for secondary stream\n    writeIndex<index_t>(fout5, _nlocalEbwts, be); // number of local Ebwts\n    writeI32(fout5, local_lineRate,  be); // 2^lineRate = size in bytes of 1 line\n    writeI32(fout5, 2, be); // not used\n    writeI32(fout5, (int32_t)localOffRate,   be); // every 2^offRate chars is \"marked\"\n    writeI32(fout5, (int32_t)localFtabChars, be); // number of 2-bit chars used to address ftab\n    int32_t flags = 1;\n    if(this->_eh._color) flags |= EBWT_COLOR;\n    if(this->_eh._entireReverse) flags |= EBWT_ENTIRE_REV;\n    writeI32(fout5, -flags, be); // BTL: chunkRate is now deprecated\n    \n    // build local FM indexes\n    index_t curr_sztot = 0;\n    bool firstIndex = true;\n    for(size_t tidx = 0; tidx < _refLens.size(); tidx++) {\n        index_t refLen = _refLens[tidx];\n        index_t local_offset = 0;\n        _localEbwts.expand();\n        assert_lt(tidx, _localEbwts.size());\n        while(local_offset < refLen) {\n            index_t index_size = std::min<index_t>(refLen - local_offset, local_index_size);\n            assert_lt(tidx, all_local_recs.size());\n            assert_lt(local_offset / local_index_interval, all_local_recs[tidx].size());\n            EList_RefRecord& local_szs = all_local_recs[tidx][local_offset / local_index_interval];\n            \n            EList<RefRecord> conv_local_szs;\n           
 index_t local_len = 0, local_sztot = 0, local_sztot_interval = 0;\n            for(size_t i = 0; i < local_szs.size(); i++) {\n                assert(local_szs[i].off != 0 || local_szs[i].len != 0);\n                assert(i != 0 || local_szs[i].first);\n                conv_local_szs.push_back(local_szs[i]);\n                local_len += local_szs[i].off;\n                if(local_len < local_index_interval && local_szs[i].len > 0) {\n                    if(local_len + local_szs[i].len > local_index_interval) {\n                        local_sztot_interval += (local_index_interval - local_len);\n                    } else {\n                        local_sztot_interval += local_szs[i].len;\n                    }\n                }\n                local_sztot += local_szs[i].len;\n                local_len += local_szs[i].len;\n            }\n            TStr local_s;\n            local_s.resize(local_sztot);\n            if(refparams.reverse == REF_READ_REVERSE) {\n                local_s.install(s.buf() + s.length() - curr_sztot - local_sztot, local_sztot);\n            } else {\n                local_s.install(s.buf() + curr_sztot, local_sztot);\n            }\n            LocalEbwt<local_index_t, index_t>* localEbwt = new LocalEbwt<local_index_t, index_t>(\n                                                                                                 local_s,\n                                                                                                 tidx,\n                                                                                                 local_offset,\n                                                                                                 index_size,\n                                                                                                 packed,\n                                                                                                 color,\n                                                                        
                         needEntireReverse,\n                                                                                                 local_lineRate,\n                                                                                                 localOffRate,      // suffix-array sampling rate\n                                                                                                 localFtabChars,    // number of chars in initial arrow-pair calc\n                                                                                                 file,               // basename for .?.ebwt files\n                                                                                                 fw,                 // fw\n                                                                                                 dcv,                // difference-cover period\n                                                                                                 conv_local_szs,     // list of reference sizes\n                                                                                                 local_sztot,        // total size of all unambiguous ref chars\n                                                                                                 refparams,          // reference read-in parameters\n                                                                                                 seed,               // pseudo-random number generator seed\n                                                                                                 fout5,\n                                                                                                 fout6,\n                                                                                                 -1,                 // override offRate\n                                                                                                 false,              // be silent\n                      
                                                                           passMemExc,         // pass exceptions up to the toplevel so that we can adjust memory settings automatically\n                                                                                                 sanityCheck);       // verify results and internal consistency\n            firstIndex = false;\n            _localEbwts[tidx].push_back(localEbwt);\n            curr_sztot += local_sztot_interval;\n            local_offset += local_index_interval;\n        }\n    }\n    assert_eq(curr_sztot, sztot);\n    \n    \n    fout5 << '\\0';\n    fout5.flush(); fout6.flush();\n    if(fout5.fail() || fout6.fail()) {\n        cerr << \"An error occurred writing the index to disk.  Please check if the disk is full.\" << endl;\n        throw 1;\n    }\n    VMSG_NL(\"Returning from initFromVector\");\n    \n    // Close output files\n    fout5.flush();\n    int64_t tellpSz5 = (int64_t)fout5.tellp();\n    VMSG_NL(\"Wrote \" << fout5.tellp() << \" bytes to primary EBWT file: \" << _in5Str.c_str());\n    fout5.close();\n    bool err = false;\n    if(tellpSz5 > fileSize(_in5Str.c_str())) {\n        err = true;\n        cerr << \"Index is corrupt: File size for \" << _in5Str.c_str() << \" should have been \" << tellpSz5\n        << \" but is actually \" << fileSize(_in5Str.c_str()) << \".\" << endl;\n    }\n    fout6.flush();\n    int64_t tellpSz6 = (int64_t)fout6.tellp();\n    VMSG_NL(\"Wrote \" << fout6.tellp() << \" bytes to secondary EBWT file: \" << _in6Str.c_str());\n    fout6.close();\n    if(tellpSz6 > fileSize(_in6Str.c_str())) {\n        err = true;\n        cerr << \"Index is corrupt: File size for \" << _in6Str.c_str() << \" should have been \" << tellpSz6\n        << \" but is actually \" << fileSize(_in6Str.c_str()) << \".\" << endl;\n    }\n    if(err) {\n        cerr << \"Please check if there is a problem with the disk or if disk is full.\" << endl;\n        throw 1;\n    }\n    // Reopen 
as input streams\n    VMSG_NL(\"Re-opening _in5 and _in6 as input streams\");\n    if(this->_sanity) {\n        VMSG_NL(\"Sanity-checking Bt2\");\n        assert(!this->isInMemory());\n        readIntoMemory(\n                       color,                       // colorspace?\n                       fw ? -1 : needEntireReverse, // 1 -> need the reverse to be reverse-of-concat\n                       true,                        // load SA sample (_offs[])?\n                       true,                        // load ftab (_ftab[] & _eftab[])?\n                       true,                        // load r-starts (_rstarts[])?\n                       false,                       // just load header?\n                       NULL,                        // Params object to fill\n                       false,                       // mm sweep?\n                       true,                        // load names?\n                       false);                      // verbose startup?\n        sanityCheckAll(refparams.reverse);\n        evictFromMemory();\n        assert(!this->isInMemory());\n    }\n    VMSG_NL(\"Returning from HierEbwt constructor\");\n}\n\n    \n/**\n * Read an Ebwt from file with given filename.\n */\ntemplate <typename index_t, typename local_index_t>\nvoid HierEbwt<index_t, local_index_t>::readIntoMemory(\n\t\t\t\t\t\t\t\t\t\t\t\t\t  int color,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  int needEntireRev,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool loadSASamp,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool loadFtab,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool loadRstarts,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool justHeader,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  EbwtParams<index_t> *params,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool mmSweep,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool loadNames,\n\t\t\t\t\t\t\t\t\t\t\t\t\t  bool startVerbose)\n{\n    PARENT_CLASS::readIntoMemory(color,\n                                 needEntireRev,\n                                 loadSASamp,\n                                 loadFtab,\n         
                        loadRstarts,\n                                 justHeader || needEntireRev == 1,\n                                 params,\n                                 mmSweep,\n                                 loadNames,\n                                 startVerbose);\n    \n    return;\n\n\tbool switchEndian; // dummy; caller doesn't care\n#ifdef BOWTIE_MM\n\tchar *mmFile[] = { NULL, NULL };\n#endif\n\tif(_in5Str.length() > 0) {\n\t\tif(this->_verbose || startVerbose) {\n\t\t\tcerr << \"  About to open input files: \";\n\t\t\tlogTime(cerr);\n\t\t}\n        // Initialize our primary and secondary input-stream fields\n\t\tif(_in5 != NULL) fclose(_in5);\n\t\tif(this->_verbose || startVerbose) cerr << \"Opening \\\"\" << _in5Str.c_str() << \"\\\"\" << endl;\n\t\tif((_in5 = fopen(_in5Str.c_str(), \"rb\")) == NULL) {\n\t\t\tcerr << \"Could not open index file \" << _in5Str.c_str() << endl;\n\t\t}\n\t\tif(loadSASamp) {\n\t\t\tif(_in6 != NULL) fclose(_in6);\n\t\t\tif(this->_verbose || startVerbose) cerr << \"Opening \\\"\" << _in6Str.c_str() << \"\\\"\" << endl;\n\t\t\tif((_in6 = fopen(_in6Str.c_str(), \"rb\")) == NULL) {\n\t\t\t\tcerr << \"Could not open index file \" << _in6Str.c_str() << endl;\n\t\t\t}\n\t\t}\n\t\tif(this->_verbose || startVerbose) {\n\t\t\tcerr << \"  Finished opening input files: \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\t\n#ifdef BOWTIE_MM\n\t\tif(this->_useMm /*&& !justHeader*/) {\n\t\t\tconst char *names[] = {_in5Str.c_str(), _in6Str.c_str()};\n            int fds[] = { fileno(_in5), fileno(_in6) };\n\t\t\tfor(int i = 0; i < (loadSASamp ? 
2 : 1); i++) {\n\t\t\t\tif(this->_verbose || startVerbose) {\n\t\t\t\t\tcerr << \"  ¯ \" << (i+1) << \": \";\n\t\t\t\t\tlogTime(cerr);\n\t\t\t\t}\n\t\t\t\tstruct stat sbuf;\n\t\t\t\tif (stat(names[i], &sbuf) == -1) {\n\t\t\t\t\tperror(\"stat\");\n\t\t\t\t\tcerr << \"Error: Could not stat index file \" << names[i] << \" prior to memory-mapping\" << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n                mmFile[i] = (char*)mmap((void *)0, (size_t)sbuf.st_size,\n\t\t\t\t\t\t\t\t\t\tPROT_READ, MAP_SHARED, fds[(size_t)i], 0);\n\t\t\t\tif(mmFile[i] == (void *)(-1)) {\n\t\t\t\t\tperror(\"mmap\");\n\t\t\t\t\tcerr << \"Error: Could not memory-map the index file \" << names[i] << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tif(mmSweep) {\n\t\t\t\t\tint sum = 0;\n\t\t\t\t\tfor(off_t j = 0; j < sbuf.st_size; j += 1024) {\n\t\t\t\t\t\tsum += (int) mmFile[i][j];\n\t\t\t\t\t}\n\t\t\t\t\tif(startVerbose) {\n\t\t\t\t\t\tcerr << \"  Swept the memory-mapped ebwt index file 1; checksum: \" << sum << \": \";\n\t\t\t\t\t\tlogTime(cerr);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tmmFile5_ = mmFile[0];\n\t\t\tmmFile6_ = loadSASamp ? 
mmFile[1] : NULL;\n\t\t}\n#endif\n\t}\n#ifdef BOWTIE_MM\n\telse if(this->_useMm && !justHeader) {\n\t\tmmFile[0] = mmFile5_;\n\t\tmmFile[1] = mmFile6_;\n\t}\n\tif(this->_useMm && !justHeader) {\n\t\tassert(mmFile[0] == mmFile5_);\n\t\tassert(mmFile[1] == mmFile6_);\n\t}\n#endif\n\t\n\tif(this->_verbose || startVerbose) {\n\t\tcerr << \"  Reading header: \";\n\t\tlogTime(cerr);\n\t}\n\t\n\t// Read endianness hints from both streams\n\tsize_t bytesRead = 0;\n\tswitchEndian = false;\n\tuint32_t one = readU32(_in5, switchEndian); // 1st word of primary stream\n\tbytesRead += 4;\n\tif(loadSASamp) {\n#ifndef NDEBUG\n\t\tassert_eq(one, readU32(_in6, switchEndian)); // should match!\n#else\n\t\treadU32(_in6, switchEndian);\n#endif\n\t}\n\tif(one != 1) {\n\t\tassert_eq((1u<<24), one);\n\t\tassert_eq(1, endianSwapU32(one));\n\t\tswitchEndian = true;\n\t}\n\t\n\t// Can't switch endianness and use memory-mapped files; in order to\n\t// support this, someone has to modify the file to switch\n\t// endiannesses appropriately, and we can't do this inside Bowtie\n\t// or we might be setting up a race condition with other processes.\n\tif(switchEndian && this->_useMm) {\n\t\tcerr << \"Error: Can't use memory-mapped files when the index is the opposite endianness\" << endl;\n\t\tthrow 1;\n\t}\t\n\t\n\t_nlocalEbwts      = readIndex<index_t>(_in5, switchEndian); bytesRead += sizeof(index_t);\n\tint32_t lineRate  = readI32(_in5, switchEndian); bytesRead += 4;\n\treadI32(_in5, switchEndian); bytesRead += 4;\n\tint32_t offRate   = readI32(_in5, switchEndian); bytesRead += 4;\n\t// TODO: add isaRate to the actual file format (right now, the\n\t// user has to tell us whether there's an ISA sample and what the\n\t// sampling rate is.\n\tint32_t ftabChars = readI32(_in5, switchEndian); bytesRead += 4;\n\t/*int32_t flag  =*/ readI32(_in5, switchEndian); bytesRead += 4;\n    \n    if(this->_verbose || startVerbose) {\n        cerr << \"    number of local indexes: \" << _nlocalEbwts << endl\n   
          << \"    local offRate: \" << offRate << endl\n             << \"    local ftabLen: \" << (1 << (2 * ftabChars)) << endl\n             << \"    local ftabSz: \"  << (2 << (2 * ftabChars)) << endl\n        ;\n    }\n\t\n\tclearLocalEbwts();\n\t\n\tindex_t tidx = 0, localOffset = 0;\n\tstring base = \"\";\n\tfor(size_t i = 0; i < _nlocalEbwts; i++) {\n\t\tLocalEbwt<local_index_t, index_t> *localEbwt = new LocalEbwt<local_index_t, index_t>(base,\n                                                                                             _in5,\n                                                                                             _in6,\n                                                                                             mmFile5_,\n                                                                                             mmFile6_,\n                                                                                             tidx,\n                                                                                             localOffset,\n                                                                                             switchEndian,\n                                                                                             bytesRead,\n                                                                                             color,\n                                                                                             needEntireRev,\n                                                                                             this->fw_,\n                                                                                             -1, // overrideOffRate\n                                                                                             -1, // offRatePlus\n                                                                                             (uint32_t)lineRate,\n                                                       
                                      (uint32_t)offRate,\n                                                                                             (uint32_t)ftabChars,\n                                                                                             this->_useMm,\n                                                                                             this->useShmem_,\n                                                                                             mmSweep,\n                                                                                             loadNames,\n                                                                                             loadSASamp,\n                                                                                             loadFtab,\n                                                                                             loadRstarts,\n                                                                                             false,  // _verbose\n                                                                                             false,\n                                                                                             this->_passMemExc,\n                                                                                             this->_sanity);\n\t\t\n\t\tif(tidx >= _localEbwts.size()) {\n\t\t\tassert_eq(tidx, _localEbwts.size());\n\t\t\t_localEbwts.expand();\n\t\t}\n\t\tassert_eq(tidx + 1, _localEbwts.size());\n\t\t_localEbwts.back().push_back(localEbwt);\n\t}\t\n\t\t\n#ifdef BOWTIE_MM\n    fseek(_in5, 0, SEEK_SET);\n\tfseek(_in6, 0, SEEK_SET);\n#else\n\trewind(_in5); rewind(_in6);\n#endif\n}\n\n#endif /*HIEREBWT_H_*/\n"
  },
  {
    "path": "hier_idx_common.h",
    "content": "/*\n * Copyright 2013, Daehwan Kim <infphilo@gmail.com>\n *\n * This file is part of Beast.  Beast is based on Bowtie 2.\n *\n * Beast is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Beast is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Beast.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef HIEREBWT_COMMON_H_\n#define HIEREBWT_COMMON_H_\n\n#include <stdint.h>\n\n// maximum size of a sequence represented by a local index\nstatic const uint32_t local_index_size     = (1 << 16) - (1 << 8);  // 1 << 5 is necessary for eftab index\n\n// size of the overlap between the sequences represented by two consecutive local indexes\nstatic const uint32_t local_index_overlap  = 1024;\n\n// interval between the starts of two consecutive local indexes\nstatic const uint32_t local_index_interval = local_index_size - local_index_overlap;\n\n// line rate in local indexes\nstatic const int32_t local_lineRate = 6;\n\n// how many rows are marked in a local index; every 2^<int>th row is marked\nstatic const int32_t  local_offRate        = 3;\n\n// the lookup table (ftab) in a local index has 4^<int> entries\nstatic const int32_t  local_ftabChars      = 6;\n\n#endif /*HIEREBWT_COMMON_H_*/\n"
  },
  {
    "path": "hyperloglogbias.h",
    "content": "/*\n * hyperloglogbias.h\n *\n *  Created on: Apr 25, 2015\n *      Author: fbreitwieser\n */\n\n#ifndef HYPERLOGLOGBIAS_H_\n#define HYPERLOGLOGBIAS_H_\n\nconst double rawEstimateData_precision4[] = {\n    11, 11.717, 12.207, 12.7896, 13.2882, 13.8204, 14.3772, 14.9342, 15.5202, 16.161, 16.7722, 17.4636, 18.0396, 18.6766, 19.3566, 20.0454, 20.7936, 21.4856, 22.2666, 22.9946, 23.766, 24.4692, 25.3638, 26.0764, 26.7864, 27.7602, 28.4814, 29.433, 30.2926, 31.0664, 31.9996, 32.7956, 33.5366, 34.5894, 35.5738, 36.2698, 37.3682, 38.0544, 39.2342, 40.0108, 40.7966, 41.9298, 42.8704, 43.6358, 44.5194, 45.773, 46.6772, 47.6174, 48.4888, 49.3304, 50.2506, 51.4996, 52.3824, 53.3078, 54.3984, 55.5838, 56.6618, 57.2174, 58.3514, 59.0802, 60.1482, 61.0376, 62.3598, 62.8078, 63.9744, 64.914, 65.781, 67.1806, 68.0594, 68.8446, 69.7928, 70.8248, 71.8324, 72.8598, 73.6246, 74.7014, 75.393, 76.6708, 77.2394\n};\n\nconst double rawEstimateData_precision5[] = {\n    23, 23.1194, 23.8208, 24.2318, 24.77, 25.2436, 25.7774, 26.2848, 26.8224, 27.3742, 27.9336, 28.503, 29.0494, 29.6292, 30.2124, 30.798, 31.367, 31.9728, 32.5944, 33.217, 33.8438, 34.3696, 35.0956, 35.7044, 36.324, 37.0668, 37.6698, 38.3644, 39.049, 39.6918, 40.4146, 41.082, 41.687, 42.5398, 43.2462, 43.857, 44.6606, 45.4168, 46.1248, 46.9222, 47.6804, 48.447, 49.3454, 49.9594, 50.7636, 51.5776, 52.331, 53.19, 53.9676, 54.7564, 55.5314, 56.4442, 57.3708, 57.9774, 58.9624, 59.8796, 60.755, 61.472, 62.2076, 63.1024, 63.8908, 64.7338, 65.7728, 66.629, 67.413, 68.3266, 69.1524, 70.2642, 71.1806, 72.0566, 72.9192, 73.7598, 74.3516, 75.5802, 76.4386, 77.4916, 78.1524, 79.1892, 79.8414, 80.8798, 81.8376, 82.4698, 83.7656, 84.331, 85.5914, 86.6012, 87.7016, 88.5582, 89.3394, 90.3544, 91.4912, 92.308, 93.3552, 93.9746, 95.2052, 95.727, 97.1322, 98.3944, 98.7588, 100.242, 101.1914, 102.2538, 102.8776, 103.6292, 105.1932, 105.9152, 107.0868, 107.6728, 108.7144, 110.3114, 110.8716, 111.245, 112.7908, 113.7064, 114.636, 
115.7464, 116.1788, 117.7464, 118.4896, 119.6166, 120.5082, 121.7798, 122.9028, 123.4426, 124.8854, 125.705, 126.4652, 128.3464, 128.3462, 130.0398, 131.0342, 131.0042, 132.4766, 133.511, 134.7252, 135.425, 136.5172, 138.0572, 138.6694, 139.3712, 140.8598, 141.4594, 142.554, 143.4006, 144.7374, 146.1634, 146.8994, 147.605, 147.9304, 149.1636, 150.2468, 151.5876, 152.2096, 153.7032, 154.7146, 155.807, 156.9228, 157.0372, 158.5852\n};\n\nconst double rawEstimateData_precision6[] = {\n    46, 46.1902, 47.271, 47.8358, 48.8142, 49.2854, 50.317, 51.354, 51.8924, 52.9436, 53.4596, 54.5262, 55.6248, 56.1574, 57.2822, 57.837, 58.9636, 60.074, 60.7042, 61.7976, 62.4772, 63.6564, 64.7942, 65.5004, 66.686, 67.291, 68.5672, 69.8556, 70.4982, 71.8204, 72.4252, 73.7744, 75.0786, 75.8344, 77.0294, 77.8098, 79.0794, 80.5732, 81.1878, 82.5648, 83.2902, 84.6784, 85.3352, 86.8946, 88.3712, 89.0852, 90.499, 91.2686, 92.6844, 94.2234, 94.9732, 96.3356, 97.2286, 98.7262, 100.3284, 101.1048, 102.5962, 103.3562, 105.1272, 106.4184, 107.4974, 109.0822, 109.856, 111.48, 113.2834, 114.0208, 115.637, 116.5174, 118.0576, 119.7476, 120.427, 122.1326, 123.2372, 125.2788, 126.6776, 127.7926, 129.1952, 129.9564, 131.6454, 133.87, 134.5428, 136.2, 137.0294, 138.6278, 139.6782, 141.792, 143.3516, 144.2832, 146.0394, 147.0748, 148.4912, 150.849, 151.696, 153.5404, 154.073, 156.3714, 157.7216, 158.7328, 160.4208, 161.4184, 163.9424, 165.2772, 166.411, 168.1308, 168.769, 170.9258, 172.6828, 173.7502, 175.706, 176.3886, 179.0186, 180.4518, 181.927, 183.4172, 184.4114, 186.033, 188.5124, 189.5564, 191.6008, 192.4172, 193.8044, 194.997, 197.4548, 198.8948, 200.2346, 202.3086, 203.1548, 204.8842, 206.6508, 206.6772, 209.7254, 210.4752, 212.7228, 214.6614, 215.1676, 217.793, 218.0006, 219.9052, 221.66, 223.5588, 225.1636, 225.6882, 227.7126, 229.4502, 231.1978, 232.9756, 233.1654, 236.727, 238.1974, 237.7474, 241.1346, 242.3048, 244.1948, 245.3134, 246.879, 249.1204, 249.853, 252.6792, 253.857, 254.4486, 
257.2362, 257.9534, 260.0286, 260.5632, 262.663, 264.723, 265.7566, 267.2566, 267.1624, 270.62, 272.8216, 273.2166, 275.2056, 276.2202, 278.3726, 280.3344, 281.9284, 283.9728, 284.1924, 286.4872, 287.587, 289.807, 291.1206, 292.769, 294.8708, 296.665, 297.1182, 299.4012, 300.6352, 302.1354, 304.1756, 306.1606, 307.3462, 308.5214, 309.4134, 310.8352, 313.9684, 315.837, 316.7796, 318.9858\n};\n\nconst double rawEstimateData_precision7[] = {\n    92, 93.4934, 94.9758, 96.4574, 97.9718, 99.4954, 101.5302, 103.0756, 104.6374, 106.1782, 107.7888, 109.9522, 111.592, 113.2532, 114.9086, 116.5938, 118.9474, 120.6796, 122.4394, 124.2176, 125.9768, 128.4214, 130.2528, 132.0102, 133.8658, 135.7278, 138.3044, 140.1316, 142.093, 144.0032, 145.9092, 148.6306, 150.5294, 152.5756, 154.6508, 156.662, 159.552, 161.3724, 163.617, 165.5754, 167.7872, 169.8444, 172.7988, 174.8606, 177.2118, 179.3566, 181.4476, 184.5882, 186.6816, 189.0824, 191.0258, 193.6048, 196.4436, 198.7274, 200.957, 203.147, 205.4364, 208.7592, 211.3386, 213.781, 215.8028, 218.656, 221.6544, 223.996, 226.4718, 229.1544, 231.6098, 234.5956, 237.0616, 239.5758, 242.4878, 244.5244, 248.2146, 250.724, 252.8722, 255.5198, 258.0414, 261.941, 264.9048, 266.87, 269.4304, 272.028, 274.4708, 278.37, 281.0624, 283.4668, 286.5532, 289.4352, 293.2564, 295.2744, 298.2118, 300.7472, 304.1456, 307.2928, 309.7504, 312.5528, 315.979, 318.2102, 322.1834, 324.3494, 327.325, 330.6614, 332.903, 337.2544, 339.9042, 343.215, 345.2864, 348.0814, 352.6764, 355.301, 357.139, 360.658, 363.1732, 366.5902, 369.9538, 373.0828, 375.922, 378.9902, 382.7328, 386.4538, 388.1136, 391.2234, 394.0878, 396.708, 401.1556, 404.1852, 406.6372, 409.6822, 412.7796, 416.6078, 418.4916, 422.131, 424.5376, 428.1988, 432.211, 434.4502, 438.5282, 440.912, 444.0448, 447.7432, 450.8524, 453.7988, 456.7858, 458.8868, 463.9886, 466.5064, 468.9124, 472.6616, 475.4682, 478.582, 481.304, 485.2738, 488.6894, 490.329, 496.106, 497.6908, 501.1374, 504.5322, 506.8848, 
510.3324, 513.4512, 516.179, 520.4412, 522.6066, 526.167, 528.7794, 533.379, 536.067, 538.46, 542.9116, 545.692, 547.9546, 552.493, 555.2722, 557.335, 562.449, 564.2014, 569.0738, 571.0974, 574.8564, 578.2996, 581.409, 583.9704, 585.8098, 589.6528, 594.5998, 595.958, 600.068, 603.3278, 608.2016, 609.9632, 612.864, 615.43, 620.7794, 621.272, 625.8644, 629.206, 633.219, 634.5154, 638.6102\n};\n\nconst double rawEstimateData_precision8[] = {\n    184.2152, 187.2454, 190.2096, 193.6652, 196.6312, 199.6822, 203.249, 206.3296, 210.0038, 213.2074, 216.4612, 220.27, 223.5178, 227.4412, 230.8032, 234.1634, 238.1688, 241.6074, 245.6946, 249.2664, 252.8228, 257.0432, 260.6824, 264.9464, 268.6268, 272.2626, 276.8376, 280.4034, 284.8956, 288.8522, 292.7638, 297.3552, 301.3556, 305.7526, 309.9292, 313.8954, 318.8198, 322.7668, 327.298, 331.6688, 335.9466, 340.9746, 345.1672, 349.3474, 354.3028, 358.8912, 364.114, 368.4646, 372.9744, 378.4092, 382.6022, 387.843, 392.5684, 397.1652, 402.5426, 407.4152, 412.5388, 417.3592, 422.1366, 427.486, 432.3918, 437.5076, 442.509, 447.3834, 453.3498, 458.0668, 463.7346, 469.1228, 473.4528, 479.7, 484.644, 491.0518, 495.5774, 500.9068, 506.432, 512.1666, 517.434, 522.6644, 527.4894, 533.6312, 538.3804, 544.292, 550.5496, 556.0234, 562.8206, 566.6146, 572.4188, 579.117, 583.6762, 590.6576, 595.7864, 601.509, 607.5334, 612.9204, 619.772, 624.2924, 630.8654, 636.1836, 642.745, 649.1316, 655.0386, 660.0136, 666.6342, 671.6196, 678.1866, 684.4282, 689.3324, 695.4794, 702.5038, 708.129, 713.528, 720.3204, 726.463, 732.7928, 739.123, 744.7418, 751.2192, 756.5102, 762.6066, 769.0184, 775.2224, 781.4014, 787.7618, 794.1436, 798.6506, 805.6378, 811.766, 819.7514, 824.5776, 828.7322, 837.8048, 843.6302, 849.9336, 854.4798, 861.3388, 867.9894, 873.8196, 880.3136, 886.2308, 892.4588, 899.0816, 905.4076, 912.0064, 917.3878, 923.619, 929.998, 937.3482, 943.9506, 947.991, 955.1144, 962.203, 968.8222, 975.7324, 981.7826, 988.7666, 994.2648, 1000.3128, 
1007.4082, 1013.7536, 1020.3376, 1026.7156, 1031.7478, 1037.4292, 1045.393, 1051.2278, 1058.3434, 1062.8726, 1071.884, 1076.806, 1082.9176, 1089.1678, 1095.5032, 1102.525, 1107.2264, 1115.315, 1120.93, 1127.252, 1134.1496, 1139.0408, 1147.5448, 1153.3296, 1158.1974, 1166.5262, 1174.3328, 1175.657, 1184.4222, 1190.9172, 1197.1292, 1204.4606, 1210.4578, 1218.8728, 1225.3336, 1226.6592, 1236.5768, 1241.363, 1249.4074, 1254.6566, 1260.8014, 1266.5454, 1274.5192\n};\n\nconst double rawEstimateData_precision9[] = {\n    369, 374.8294, 381.2452, 387.6698, 394.1464, 400.2024, 406.8782, 413.6598, 420.462, 427.2826, 433.7102, 440.7416, 447.9366, 455.1046, 462.285, 469.0668, 476.306, 483.8448, 491.301, 498.9886, 506.2422, 513.8138, 521.7074, 529.7428, 537.8402, 545.1664, 553.3534, 561.594, 569.6886, 577.7876, 585.65, 594.228, 602.8036, 611.1666, 620.0818, 628.0824, 637.2574, 646.302, 655.1644, 664.0056, 672.3802, 681.7192, 690.5234, 700.2084, 708.831, 718.485, 728.1112, 737.4764, 746.76, 756.3368, 766.5538, 775.5058, 785.2646, 795.5902, 804.3818, 814.8998, 824.9532, 835.2062, 845.2798, 854.4728, 864.9582, 875.3292, 886.171, 896.781, 906.5716, 916.7048, 927.5322, 937.875, 949.3972, 958.3464, 969.7274, 980.2834, 992.1444, 1003.4264, 1013.0166, 1024.018, 1035.0438, 1046.34, 1057.6856, 1068.9836, 1079.0312, 1091.677, 1102.3188, 1113.4846, 1124.4424, 1135.739, 1147.1488, 1158.9202, 1169.406, 1181.5342, 1193.2834, 1203.8954, 1216.3286, 1226.2146, 1239.6684, 1251.9946, 1262.123, 1275.4338, 1285.7378, 1296.076, 1308.9692, 1320.4964, 1333.0998, 1343.9864, 1357.7754, 1368.3208, 1380.4838, 1392.7388, 1406.0758, 1416.9098, 1428.9728, 1440.9228, 1453.9292, 1462.617, 1476.05, 1490.2996, 1500.6128, 1513.7392, 1524.5174, 1536.6322, 1548.2584, 1562.3766, 1572.423, 1587.1232, 1596.5164, 1610.5938, 1622.5972, 1633.1222, 1647.7674, 1658.5044, 1671.57, 1683.7044, 1695.4142, 1708.7102, 1720.6094, 1732.6522, 1747.841, 1756.4072, 1769.9786, 1782.3276, 1797.5216, 1808.3186, 1819.0694, 1834.354, 
1844.575, 1856.2808, 1871.1288, 1880.7852, 1893.9622, 1906.3418, 1920.6548, 1932.9302, 1945.8584, 1955.473, 1968.8248, 1980.6446, 1995.9598, 2008.349, 2019.8556, 2033.0334, 2044.0206, 2059.3956, 2069.9174, 2082.6084, 2093.7036, 2106.6108, 2118.9124, 2132.301, 2144.7628, 2159.8422, 2171.0212, 2183.101, 2193.5112, 2208.052, 2221.3194, 2233.3282, 2247.295, 2257.7222, 2273.342, 2286.5638, 2299.6786, 2310.8114, 2322.3312, 2335.516, 2349.874, 2363.5968, 2373.865, 2387.1918, 2401.8328, 2414.8496, 2424.544, 2436.7592, 2447.1682, 2464.1958, 2474.3438, 2489.0006, 2497.4526, 2513.6586, 2527.19, 2540.7028, 2553.768\n};\n\nconst double rawEstimateData_precision10[] = {\n    738.1256, 750.4234, 763.1064, 775.4732, 788.4636, 801.0644, 814.488, 827.9654, 841.0832, 854.7864, 868.1992, 882.2176, 896.5228, 910.1716, 924.7752, 938.899, 953.6126, 968.6492, 982.9474, 998.5214, 1013.1064, 1028.6364, 1044.2468, 1059.4588, 1075.3832, 1091.0584, 1106.8606, 1123.3868, 1139.5062, 1156.1862, 1172.463, 1189.339, 1206.1936, 1223.1292, 1240.1854, 1257.2908, 1275.3324, 1292.8518, 1310.5204, 1328.4854, 1345.9318, 1364.552, 1381.4658, 1400.4256, 1419.849, 1438.152, 1456.8956, 1474.8792, 1494.118, 1513.62, 1532.5132, 1551.9322, 1570.7726, 1590.6086, 1610.5332, 1630.5918, 1650.4294, 1669.7662, 1690.4106, 1710.7338, 1730.9012, 1750.4486, 1770.1556, 1791.6338, 1812.7312, 1833.6264, 1853.9526, 1874.8742, 1896.8326, 1918.1966, 1939.5594, 1961.07, 1983.037, 2003.1804, 2026.071, 2047.4884, 2070.0848, 2091.2944, 2114.333, 2135.9626, 2158.2902, 2181.0814, 2202.0334, 2224.4832, 2246.39, 2269.7202, 2292.1714, 2314.2358, 2338.9346, 2360.891, 2384.0264, 2408.3834, 2430.1544, 2454.8684, 2476.9896, 2501.4368, 2522.8702, 2548.0408, 2570.6738, 2593.5208, 2617.0158, 2640.2302, 2664.0962, 2687.4986, 2714.2588, 2735.3914, 2759.6244, 2781.8378, 2808.0072, 2830.6516, 2856.2454, 2877.2136, 2903.4546, 2926.785, 2951.2294, 2976.468, 3000.867, 3023.6508, 3049.91, 3073.5984, 3098.162, 3121.5564, 3146.2328, 3170.9484, 
3195.5902, 3221.3346, 3242.7032, 3271.6112, 3296.5546, 3317.7376, 3345.072, 3369.9518, 3394.326, 3418.1818, 3444.6926, 3469.086, 3494.2754, 3517.8698, 3544.248, 3565.3768, 3588.7234, 3616.979, 3643.7504, 3668.6812, 3695.72, 3719.7392, 3742.6224, 3770.4456, 3795.6602, 3819.9058, 3844.002, 3869.517, 3895.6824, 3920.8622, 3947.1364, 3973.985, 3995.4772, 4021.62, 4046.628, 4074.65, 4096.2256, 4121.831, 4146.6406, 4173.276, 4195.0744, 4223.9696, 4251.3708, 4272.9966, 4300.8046, 4326.302, 4353.1248, 4374.312, 4403.0322, 4426.819, 4450.0598, 4478.5206, 4504.8116, 4528.8928, 4553.9584, 4578.8712, 4603.8384, 4632.3872, 4655.5128, 4675.821, 4704.6222, 4731.9862, 4755.4174, 4781.2628, 4804.332, 4832.3048, 4862.8752, 4883.4148, 4906.9544, 4935.3516, 4954.3532, 4984.0248, 5011.217, 5035.3258, 5057.3672, 5084.1828\n};\n\nconst double rawEstimateData_precision11[] = {\n    1477, 1501.6014, 1526.5802, 1551.7942, 1577.3042, 1603.2062, 1629.8402, 1656.2292, 1682.9462, 1709.9926, 1737.3026, 1765.4252, 1793.0578, 1821.6092, 1849.626, 1878.5568, 1908.527, 1937.5154, 1967.1874, 1997.3878, 2027.37, 2058.1972, 2089.5728, 2120.1012, 2151.9668, 2183.292, 2216.0772, 2247.8578, 2280.6562, 2313.041, 2345.714, 2380.3112, 2414.1806, 2447.9854, 2481.656, 2516.346, 2551.5154, 2586.8378, 2621.7448, 2656.6722, 2693.5722, 2729.1462, 2765.4124, 2802.8728, 2838.898, 2876.408, 2913.4926, 2951.4938, 2989.6776, 3026.282, 3065.7704, 3104.1012, 3143.7388, 3181.6876, 3221.1872, 3261.5048, 3300.0214, 3339.806, 3381.409, 3421.4144, 3461.4294, 3502.2286, 3544.651, 3586.6156, 3627.337, 3670.083, 3711.1538, 3753.5094, 3797.01, 3838.6686, 3882.1678, 3922.8116, 3967.9978, 4009.9204, 4054.3286, 4097.5706, 4140.6014, 4185.544, 4229.5976, 4274.583, 4316.9438, 4361.672, 4406.2786, 4451.8628, 4496.1834, 4543.505, 4589.1816, 4632.5188, 4678.2294, 4724.8908, 4769.0194, 4817.052, 4861.4588, 4910.1596, 4956.4344, 5002.5238, 5048.13, 5093.6374, 5142.8162, 5187.7894, 5237.3984, 5285.6078, 5331.0858, 5379.1036, 5428.6258, 
5474.6018, 5522.7618, 5571.5822, 5618.59, 5667.9992, 5714.88, 5763.454, 5808.6982, 5860.3644, 5910.2914, 5953.571, 6005.9232, 6055.1914, 6104.5882, 6154.5702, 6199.7036, 6251.1764, 6298.7596, 6350.0302, 6398.061, 6448.4694, 6495.933, 6548.0474, 6597.7166, 6646.9416, 6695.9208, 6742.6328, 6793.5276, 6842.1934, 6894.2372, 6945.3864, 6996.9228, 7044.2372, 7094.1374, 7142.2272, 7192.2942, 7238.8338, 7288.9006, 7344.0908, 7394.8544, 7443.5176, 7490.4148, 7542.9314, 7595.6738, 7641.9878, 7694.3688, 7743.0448, 7797.522, 7845.53, 7899.594, 7950.3132, 7996.455, 8050.9442, 8092.9114, 8153.1374, 8197.4472, 8252.8278, 8301.8728, 8348.6776, 8401.4698, 8453.551, 8504.6598, 8553.8944, 8604.1276, 8657.6514, 8710.3062, 8758.908, 8807.8706, 8862.1702, 8910.4668, 8960.77, 9007.2766, 9063.164, 9121.0534, 9164.1354, 9218.1594, 9267.767, 9319.0594, 9372.155, 9419.7126, 9474.3722, 9520.1338, 9572.368, 9622.7702, 9675.8448, 9726.5396, 9778.7378, 9827.6554, 9878.1922, 9928.7782, 9978.3984, 10026.578, 10076.5626, 10137.1618, 10177.5244, 10229.9176\n};\n\nconst double rawEstimateData_precision12[] = {\n    2954, 3003.4782, 3053.3568, 3104.3666, 3155.324, 3206.9598, 3259.648, 3312.539, 3366.1474, 3420.2576, 3474.8376, 3530.6076, 3586.451, 3643.38, 3700.4104, 3757.5638, 3815.9676, 3875.193, 3934.838, 3994.8548, 4055.018, 4117.1742, 4178.4482, 4241.1294, 4304.4776, 4367.4044, 4431.8724, 4496.3732, 4561.4304, 4627.5326, 4693.949, 4761.5532, 4828.7256, 4897.6182, 4965.5186, 5034.4528, 5104.865, 5174.7164, 5244.6828, 5316.6708, 5387.8312, 5459.9036, 5532.476, 5604.8652, 5679.6718, 5753.757, 5830.2072, 5905.2828, 5980.0434, 6056.6264, 6134.3192, 6211.5746, 6290.0816, 6367.1176, 6447.9796, 6526.5576, 6606.1858, 6686.9144, 6766.1142, 6847.0818, 6927.9664, 7010.9096, 7091.0816, 7175.3962, 7260.3454, 7344.018, 7426.4214, 7511.3106, 7596.0686, 7679.8094, 7765.818, 7852.4248, 7936.834, 8022.363, 8109.5066, 8200.4554, 8288.5832, 8373.366, 8463.4808, 8549.7682, 8642.0522, 8728.3288, 8820.9528, 8907.727, 
9001.0794, 9091.2522, 9179.988, 9269.852, 9362.6394, 9453.642, 9546.9024, 9640.6616, 9732.6622, 9824.3254, 9917.7484, 10007.9392, 10106.7508, 10196.2152, 10289.8114, 10383.5494, 10482.3064, 10576.8734, 10668.7872, 10764.7156, 10862.0196, 10952.793, 11049.9748, 11146.0702, 11241.4492, 11339.2772, 11434.2336, 11530.741, 11627.6136, 11726.311, 11821.5964, 11918.837, 12015.3724, 12113.0162, 12213.0424, 12306.9804, 12408.4518, 12504.8968, 12604.586, 12700.9332, 12798.705, 12898.5142, 12997.0488, 13094.788, 13198.475, 13292.7764, 13392.9698, 13486.8574, 13590.1616, 13686.5838, 13783.6264, 13887.2638, 13992.0978, 14081.0844, 14189.9956, 14280.0912, 14382.4956, 14486.4384, 14588.1082, 14686.2392, 14782.276, 14888.0284, 14985.1864, 15088.8596, 15187.0998, 15285.027, 15383.6694, 15495.8266, 15591.3736, 15694.2008, 15790.3246, 15898.4116, 15997.4522, 16095.5014, 16198.8514, 16291.7492, 16402.6424, 16499.1266, 16606.2436, 16697.7186, 16796.3946, 16902.3376, 17005.7672, 17100.814, 17206.8282, 17305.8262, 17416.0744, 17508.4092, 17617.0178, 17715.4554, 17816.758, 17920.1748, 18012.9236, 18119.7984, 18223.2248, 18324.2482, 18426.6276, 18525.0932, 18629.8976, 18733.2588, 18831.0466, 18940.1366, 19032.2696, 19131.729, 19243.4864, 19349.6932, 19442.866, 19547.9448, 19653.2798, 19754.4034, 19854.0692, 19965.1224, 20065.1774, 20158.2212, 20253.353, 20366.3264, 20463.22\n};\n\nconst double rawEstimateData_precision13[] = {\n    5908.5052, 6007.2672, 6107.347, 6208.5794, 6311.2622, 6414.5514, 6519.3376, 6625.6952, 6732.5988, 6841.3552, 6950.5972, 7061.3082, 7173.5646, 7287.109, 7401.8216, 7516.4344, 7633.3802, 7751.2962, 7870.3784, 7990.292, 8110.79, 8233.4574, 8356.6036, 8482.2712, 8607.7708, 8735.099, 8863.1858, 8993.4746, 9123.8496, 9255.6794, 9388.5448, 9522.7516, 9657.3106, 9792.6094, 9930.5642, 10068.794, 10206.7256, 10347.81, 10490.3196, 10632.0778, 10775.9916, 10920.4662, 11066.124, 11213.073, 11358.0362, 11508.1006, 11659.1716, 11808.7514, 11959.4884, 12112.1314, 12265.037, 
12420.3756, 12578.933, 12734.311, 12890.0006, 13047.2144, 13207.3096, 13368.5144, 13528.024, 13689.847, 13852.7528, 14018.3168, 14180.5372, 14346.9668, 14513.5074, 14677.867, 14846.2186, 15017.4186, 15184.9716, 15356.339, 15529.2972, 15697.3578, 15871.8686, 16042.187, 16216.4094, 16389.4188, 16565.9126, 16742.3272, 16919.0042, 17094.7592, 17273.965, 17451.8342, 17634.4254, 17810.5984, 17988.9242, 18171.051, 18354.7938, 18539.466, 18721.0408, 18904.9972, 19081.867, 19271.9118, 19451.8694, 19637.9816, 19821.2922, 20013.1292, 20199.3858, 20387.8726, 20572.9514, 20770.7764, 20955.1714, 21144.751, 21329.9952, 21520.709, 21712.7016, 21906.3868, 22096.2626, 22286.0524, 22475.051, 22665.5098, 22862.8492, 23055.5294, 23249.6138, 23437.848, 23636.273, 23826.093, 24020.3296, 24213.3896, 24411.7392, 24602.9614, 24805.7952, 24998.1552, 25193.9588, 25389.0166, 25585.8392, 25780.6976, 25981.2728, 26175.977, 26376.5252, 26570.1964, 26773.387, 26962.9812, 27163.0586, 27368.164, 27565.0534, 27758.7428, 27961.1276, 28163.2324, 28362.3816, 28565.7668, 28758.644, 28956.9768, 29163.4722, 29354.7026, 29561.1186, 29767.9948, 29959.9986, 30164.0492, 30366.9818, 30562.5338, 30762.9928, 30976.1592, 31166.274, 31376.722, 31570.3734, 31770.809, 31974.8934, 32179.5286, 32387.5442, 32582.3504, 32794.076, 32989.9528, 33191.842, 33392.4684, 33595.659, 33801.8672, 34000.3414, 34200.0922, 34402.6792, 34610.0638, 34804.0084, 35011.13, 35218.669, 35418.6634, 35619.0792, 35830.6534, 36028.4966, 36229.7902, 36438.6422, 36630.7764, 36833.3102, 37048.6728, 37247.3916, 37453.5904, 37669.3614, 37854.5526, 38059.305, 38268.0936, 38470.2516, 38674.7064, 38876.167, 39068.3794, 39281.9144, 39492.8566, 39684.8628, 39898.4108, 40093.1836, 40297.6858, 40489.7086, 40717.2424\n};\n\nconst double rawEstimateData_precision14[] = {\n    11817.475, 12015.0046, 12215.3792, 12417.7504, 12623.1814, 12830.0086, 13040.0072, 13252.503, 13466.178, 13683.2738, 13902.0344, 14123.9798, 14347.394, 14573.7784, 14802.6894, 
15033.6824, 15266.9134, 15502.8624, 15741.4944, 15980.7956, 16223.8916, 16468.6316, 16715.733, 16965.5726, 17217.204, 17470.666, 17727.8516, 17986.7886, 18247.6902, 18510.9632, 18775.304, 19044.7486, 19314.4408, 19587.202, 19862.2576, 20135.924, 20417.0324, 20697.9788, 20979.6112, 21265.0274, 21550.723, 21841.6906, 22132.162, 22428.1406, 22722.127, 23020.5606, 23319.7394, 23620.4014, 23925.2728, 24226.9224, 24535.581, 24845.505, 25155.9618, 25470.3828, 25785.9702, 26103.7764, 26420.4132, 26742.0186, 27062.8852, 27388.415, 27714.6024, 28042.296, 28365.4494, 28701.1526, 29031.8008, 29364.2156, 29704.497, 30037.1458, 30380.111, 30723.8168, 31059.5114, 31404.9498, 31751.6752, 32095.2686, 32444.7792, 32794.767, 33145.204, 33498.4226, 33847.6502, 34209.006, 34560.849, 34919.4838, 35274.9778, 35635.1322, 35996.3266, 36359.1394, 36722.8266, 37082.8516, 37447.7354, 37815.9606, 38191.0692, 38559.4106, 38924.8112, 39294.6726, 39663.973, 40042.261, 40416.2036, 40779.2036, 41161.6436, 41540.9014, 41921.1998, 42294.7698, 42678.5264, 43061.3464, 43432.375, 43818.432, 44198.6598, 44583.0138, 44970.4794, 45353.924, 45729.858, 46118.2224, 46511.5724, 46900.7386, 47280.6964, 47668.1472, 48055.6796, 48446.9436, 48838.7146, 49217.7296, 49613.7796, 50010.7508, 50410.0208, 50793.7886, 51190.2456, 51583.1882, 51971.0796, 52376.5338, 52763.319, 53165.5534, 53556.5594, 53948.2702, 54346.352, 54748.7914, 55138.577, 55543.4824, 55941.1748, 56333.7746, 56745.1552, 57142.7944, 57545.2236, 57935.9956, 58348.5268, 58737.5474, 59158.5962, 59542.6896, 59958.8004, 60349.3788, 60755.0212, 61147.6144, 61548.194, 61946.0696, 62348.6042, 62763.603, 63162.781, 63560.635, 63974.3482, 64366.4908, 64771.5876, 65176.7346, 65597.3916, 65995.915, 66394.0384, 66822.9396, 67203.6336, 67612.2032, 68019.0078, 68420.0388, 68821.22, 69235.8388, 69640.0724, 70055.155, 70466.357, 70863.4266, 71276.2482, 71677.0306, 72080.2006, 72493.0214, 72893.5952, 73314.5856, 73714.9852, 74125.3022, 74521.2122, 74933.6814, 
75341.5904, 75743.0244, 76166.0278, 76572.1322, 76973.1028, 77381.6284, 77800.6092, 78189.328, 78607.0962, 79012.2508, 79407.8358, 79825.725, 80238.701, 80646.891, 81035.6436, 81460.0448, 81876.3884\n};\n\nconst double rawEstimateData_precision15[] = {\n    23635.0036, 24030.8034, 24431.4744, 24837.1524, 25246.7928, 25661.326, 26081.3532, 26505.2806, 26933.9892, 27367.7098, 27805.318, 28248.799, 28696.4382, 29148.8244, 29605.5138, 30066.8668, 30534.2344, 31006.32, 31480.778, 31962.2418, 32447.3324, 32938.0232, 33432.731, 33930.728, 34433.9896, 34944.1402, 35457.5588, 35974.5958, 36497.3296, 37021.9096, 37554.326, 38088.0826, 38628.8816, 39171.3192, 39723.2326, 40274.5554, 40832.3142, 41390.613, 41959.5908, 42532.5466, 43102.0344, 43683.5072, 44266.694, 44851.2822, 45440.7862, 46038.0586, 46640.3164, 47241.064, 47846.155, 48454.7396, 49076.9168, 49692.542, 50317.4778, 50939.65, 51572.5596, 52210.2906, 52843.7396, 53481.3996, 54127.236, 54770.406, 55422.6598, 56078.7958, 56736.7174, 57397.6784, 58064.5784, 58730.308, 59404.9784, 60077.0864, 60751.9158, 61444.1386, 62115.817, 62808.7742, 63501.4774, 64187.5454, 64883.6622, 65582.7468, 66274.5318, 66976.9276, 67688.7764, 68402.138, 69109.6274, 69822.9706, 70543.6108, 71265.5202, 71983.3848, 72708.4656, 73433.384, 74158.4664, 74896.4868, 75620.9564, 76362.1434, 77098.3204, 77835.7662, 78582.6114, 79323.9902, 80067.8658, 80814.9246, 81567.0136, 82310.8536, 83061.9952, 83821.4096, 84580.8608, 85335.547, 86092.5802, 86851.6506, 87612.311, 88381.2016, 89146.3296, 89907.8974, 90676.846, 91451.4152, 92224.5518, 92995.8686, 93763.5066, 94551.2796, 95315.1944, 96096.1806, 96881.0918, 97665.679, 98442.68, 99229.3002, 100011.0994, 100790.6386, 101580.1564, 102377.7484, 103152.1392, 103944.2712, 104730.216, 105528.6336, 106324.9398, 107117.6706, 107890.3988, 108695.2266, 109485.238, 110294.7876, 111075.0958, 111878.0496, 112695.2864, 113464.5486, 114270.0474, 115068.608, 115884.3626, 116673.2588, 117483.3716, 118275.097, 
119085.4092, 119879.2808, 120687.5868, 121499.9944, 122284.916, 123095.9254, 123912.5038, 124709.0454, 125503.7182, 126323.259, 127138.9412, 127943.8294, 128755.646, 129556.5354, 130375.3298, 131161.4734, 131971.1962, 132787.5458, 133588.1056, 134431.351, 135220.2906, 136023.398, 136846.6558, 137667.0004, 138463.663, 139283.7154, 140074.6146, 140901.3072, 141721.8548, 142543.2322, 143356.1096, 144173.7412, 144973.0948, 145794.3162, 146609.5714, 147420.003, 148237.9784, 149050.5696, 149854.761, 150663.1966, 151494.0754, 152313.1416, 153112.6902, 153935.7206, 154746.9262, 155559.547, 156401.9746, 157228.7036, 158008.7254, 158820.75, 159646.9184, 160470.4458, 161279.5348, 162093.3114, 162918.542, 163729.2842\n};\n\nconst double rawEstimateData_precision16[] = {\n    47271, 48062.3584, 48862.7074, 49673.152, 50492.8416, 51322.9514, 52161.03, 53009.407, 53867.6348, 54734.206, 55610.5144, 56496.2096, 57390.795, 58297.268, 59210.6448, 60134.665, 61068.0248, 62010.4472, 62962.5204, 63923.5742, 64895.0194, 65876.4182, 66862.6136, 67862.6968, 68868.8908, 69882.8544, 70911.271, 71944.0924, 72990.0326, 74040.692, 75100.6336, 76174.7826, 77252.5998, 78340.2974, 79438.2572, 80545.4976, 81657.2796, 82784.6336, 83915.515, 85059.7362, 86205.9368, 87364.4424, 88530.3358, 89707.3744, 90885.9638, 92080.197, 93275.5738, 94479.391, 95695.918, 96919.2236, 98148.4602, 99382.3474, 100625.6974, 101878.0284, 103141.6278, 104409.4588, 105686.2882, 106967.5402, 108261.6032, 109548.1578, 110852.0728, 112162.231, 113479.0072, 114806.2626, 116137.9072, 117469.5048, 118813.5186, 120165.4876, 121516.2556, 122875.766, 124250.5444, 125621.2222, 127003.2352, 128387.848, 129775.2644, 131181.7776, 132577.3086, 133979.9458, 135394.1132, 136800.9078, 138233.217, 139668.5308, 141085.212, 142535.2122, 143969.0684, 145420.2872, 146878.1542, 148332.7572, 149800.3202, 151269.66, 152743.6104, 154213.0948, 155690.288, 157169.4246, 158672.1756, 160160.059, 161650.6854, 163145.7772, 164645.6726, 166159.1952, 
167682.1578, 169177.3328, 170700.0118, 172228.8964, 173732.6664, 175265.5556, 176787.799, 178317.111, 179856.6914, 181400.865, 182943.4612, 184486.742, 186033.4698, 187583.7886, 189148.1868, 190688.4526, 192250.1926, 193810.9042, 195354.2972, 196938.7682, 198493.5898, 200079.2824, 201618.912, 203205.5492, 204765.5798, 206356.1124, 207929.3064, 209498.7196, 211086.229, 212675.1324, 214256.7892, 215826.2392, 217412.8474, 218995.6724, 220618.6038, 222207.1166, 223781.0364, 225387.4332, 227005.7928, 228590.4336, 230217.8738, 231805.1054, 233408.9, 234995.3432, 236601.4956, 238190.7904, 239817.2548, 241411.2832, 243002.4066, 244640.1884, 246255.3128, 247849.3508, 249479.9734, 251106.8822, 252705.027, 254332.9242, 255935.129, 257526.9014, 259154.772, 260777.625, 262390.253, 264004.4906, 265643.59, 267255.4076, 268873.426, 270470.7252, 272106.4804, 273722.4456, 275337.794, 276945.7038, 278592.9154, 280204.3726, 281841.1606, 283489.171, 285130.1716, 286735.3362, 288364.7164, 289961.1814, 291595.5524, 293285.683, 294899.6668, 296499.3434, 298128.0462, 299761.8946, 301394.2424, 302997.6748, 304615.1478, 306269.7724, 307886.114, 309543.1028, 311153.2862, 312782.8546, 314421.2008, 316033.2438, 317692.9636, 319305.2648, 320948.7406, 322566.3364, 324228.4224, 325847.1542\n};\n\nconst double rawEstimateData_precision17[] = {\n    94542, 96125.811, 97728.019, 99348.558, 100987.9705, 102646.7565, 104324.5125, 106021.7435, 107736.7865, 109469.272, 111223.9465, 112995.219, 114787.432, 116593.152, 118422.71, 120267.2345, 122134.6765, 124020.937, 125927.2705, 127851.255, 129788.9485, 131751.016, 133726.8225, 135722.592, 137736.789, 139770.568, 141821.518, 143891.343, 145982.1415, 148095.387, 150207.526, 152355.649, 154515.6415, 156696.05, 158887.7575, 161098.159, 163329.852, 165569.053, 167837.4005, 170121.6165, 172420.4595, 174732.6265, 177062.77, 179412.502, 181774.035, 184151.939, 186551.6895, 188965.691, 191402.8095, 193857.949, 196305.0775, 198774.6715, 201271.2585, 203764.78, 
206299.3695, 208818.1365, 211373.115, 213946.7465, 216532.076, 219105.541, 221714.5375, 224337.5135, 226977.5125, 229613.0655, 232270.2685, 234952.2065, 237645.3555, 240331.1925, 243034.517, 245756.0725, 248517.6865, 251232.737, 254011.3955, 256785.995, 259556.44, 262368.335, 265156.911, 267965.266, 270785.583, 273616.0495, 276487.4835, 279346.639, 282202.509, 285074.3885, 287942.2855, 290856.018, 293774.0345, 296678.5145, 299603.6355, 302552.6575, 305492.9785, 308466.8605, 311392.581, 314347.538, 317319.4295, 320285.9785, 323301.7325, 326298.3235, 329301.3105, 332301.987, 335309.791, 338370.762, 341382.923, 344431.1265, 347464.1545, 350507.28, 353619.2345, 356631.2005, 359685.203, 362776.7845, 365886.488, 368958.2255, 372060.6825, 375165.4335, 378237.935, 381328.311, 384430.5225, 387576.425, 390683.242, 393839.648, 396977.8425, 400101.9805, 403271.296, 406409.8425, 409529.5485, 412678.7, 415847.423, 419020.8035, 422157.081, 425337.749, 428479.6165, 431700.902, 434893.1915, 438049.582, 441210.5415, 444379.2545, 447577.356, 450741.931, 453959.548, 457137.0935, 460329.846, 463537.4815, 466732.3345, 469960.5615, 473164.681, 476347.6345, 479496.173, 482813.1645, 486025.6995, 489249.4885, 492460.1945, 495675.8805, 498908.0075, 502131.802, 505374.3855, 508550.9915, 511806.7305, 515026.776, 518217.0005, 521523.9855, 524705.9855, 527950.997, 531210.0265, 534472.497, 537750.7315, 540926.922, 544207.094, 547429.4345, 550666.3745, 553975.3475, 557150.7185, 560399.6165, 563662.697, 566916.7395, 570146.1215, 573447.425, 576689.6245, 579874.5745, 583202.337, 586503.0255, 589715.635, 592910.161, 596214.3885, 599488.035, 602740.92, 605983.0685, 609248.67, 612491.3605, 615787.912, 619107.5245, 622307.9555, 625577.333, 628840.4385, 632085.2155, 635317.6135, 638691.7195, 641887.467, 645139.9405, 648441.546, 651666.252, 654941.845\n};\n\nconst double rawEstimateData_precision18[] = {\n    189084, 192250.913, 195456.774, 198696.946, 201977.762, 205294.444, 208651.754, 212042.099, 
215472.269, 218941.91, 222443.912, 225996.845, 229568.199, 233193.568, 236844.457, 240543.233, 244279.475, 248044.27, 251854.588, 255693.2, 259583.619, 263494.621, 267445.385, 271454.061, 275468.769, 279549.456, 283646.446, 287788.198, 291966.099, 296181.164, 300431.469, 304718.618, 309024.004, 313393.508, 317760.803, 322209.731, 326675.061, 331160.627, 335654.47, 340241.442, 344841.833, 349467.132, 354130.629, 358819.432, 363574.626, 368296.587, 373118.482, 377914.93, 382782.301, 387680.669, 392601.981, 397544.323, 402529.115, 407546.018, 412593.658, 417638.657, 422762.865, 427886.169, 433017.167, 438213.273, 443441.254, 448692.421, 453937.533, 459239.049, 464529.569, 469910.083, 475274.03, 480684.473, 486070.26, 491515.237, 496995.651, 502476.617, 507973.609, 513497.19, 519083.233, 524726.509, 530305.505, 535945.728, 541584.404, 547274.055, 552967.236, 558667.862, 564360.216, 570128.148, 575965.08, 581701.952, 587532.523, 593361.144, 599246.128, 605033.418, 610958.779, 616837.117, 622772.818, 628672.04, 634675.369, 640574.831, 646585.739, 652574.547, 658611.217, 664642.684, 670713.914, 676737.681, 682797.313, 688837.897, 694917.874, 701009.882, 707173.648, 713257.254, 719415.392, 725636.761, 731710.697, 737906.209, 744103.074, 750313.39, 756504.185, 762712.579, 768876.985, 775167.859, 781359, 787615.959, 793863.597, 800245.477, 806464.582, 812785.294, 819005.925, 825403.057, 831676.197, 837936.284, 844266.968, 850642.711, 856959.756, 863322.774, 869699.931, 876102.478, 882355.787, 888694.463, 895159.952, 901536.143, 907872.631, 914293.672, 920615.14, 927130.974, 933409.404, 939922.178, 946331.47, 952745.93, 959209.264, 965590.224, 972077.284, 978501.961, 984953.19, 991413.271, 997817.479, 1004222.658, 1010725.676, 1017177.138, 1023612.529, 1030098.236, 1036493.719, 1043112.207, 1049537.036, 1056008.096, 1062476.184, 1068942.337, 1075524.95, 1081932.864, 1088426.025, 1094776.005, 1101327.448, 1107901.673, 1114423.639, 1120884.602, 1127324.923, 1133794.24, 
1140328.886, 1146849.376, 1153346.682, 1159836.502, 1166478.703, 1172953.304, 1179391.502, 1185950.982, 1192544.052, 1198913.41, 1205430.994, 1212015.525, 1218674.042, 1225121.683, 1231551.101, 1238126.379, 1244673.795, 1251260.649, 1257697.86, 1264320.983, 1270736.319, 1277274.694, 1283804.95, 1290211.514, 1296858.568, 1303455.691\n};\n\n\nconst double biasData_precision4[] = {\n    10, 9.717, 9.207, 8.7896, 8.2882, 7.8204, 7.3772, 6.9342, 6.5202, 6.161, 5.7722, 5.4636, 5.0396, 4.6766, 4.3566, 4.0454, 3.7936, 3.4856, 3.2666, 2.9946, 2.766, 2.4692, 2.3638, 2.0764, 1.7864, 1.7602, 1.4814, 1.433, 1.2926, 1.0664, 0.999600000000001, 0.7956, 0.5366, 0.589399999999998, 0.573799999999999, 0.269799999999996, 0.368200000000002, 0.0544000000000011, 0.234200000000001, 0.0108000000000033, -0.203400000000002, -0.0701999999999998, -0.129600000000003, -0.364199999999997, -0.480600000000003, -0.226999999999997, -0.322800000000001, -0.382599999999996, -0.511200000000002, -0.669600000000003, -0.749400000000001, -0.500399999999999, -0.617600000000003, -0.6922, -0.601599999999998, -0.416200000000003, -0.338200000000001, -0.782600000000002, -0.648600000000002, -0.919800000000002, -0.851799999999997, -0.962400000000002, -0.6402, -1.1922, -1.0256, -1.086, -1.21899999999999, -0.819400000000002, -0.940600000000003, -1.1554, -1.2072, -1.1752, -1.16759999999999, -1.14019999999999, -1.3754, -1.29859999999999, -1.607, -1.3292, -1.7606\n};\n\nconst double biasData_precision5[] = {\n    22, 21.1194, 20.8208, 20.2318, 19.77, 19.2436, 18.7774, 18.2848, 17.8224, 17.3742, 16.9336, 16.503, 16.0494, 15.6292, 15.2124, 14.798, 14.367, 13.9728, 13.5944, 13.217, 12.8438, 12.3696, 12.0956, 11.7044, 11.324, 11.0668, 10.6698, 10.3644, 10.049, 9.6918, 9.4146, 9.082, 8.687, 8.5398, 8.2462, 7.857, 7.6606, 7.4168, 7.1248, 6.9222, 6.6804, 6.447, 6.3454, 5.9594, 5.7636, 5.5776, 5.331, 5.19, 4.9676, 4.7564, 4.5314, 4.4442, 4.3708, 3.9774, 3.9624, 3.8796, 3.755, 3.472, 3.2076, 3.1024, 2.8908, 2.7338, 2.7728, 2.629, 
2.413, 2.3266, 2.1524, 2.2642, 2.1806, 2.0566, 1.9192, 1.7598, 1.3516, 1.5802, 1.43859999999999, 1.49160000000001, 1.1524, 1.1892, 0.841399999999993, 0.879800000000003, 0.837599999999995, 0.469800000000006, 0.765600000000006, 0.331000000000003, 0.591399999999993, 0.601200000000006, 0.701599999999999, 0.558199999999999, 0.339399999999998, 0.354399999999998, 0.491200000000006, 0.308000000000007, 0.355199999999996, -0.0254000000000048, 0.205200000000005, -0.272999999999996, 0.132199999999997, 0.394400000000005, -0.241200000000006, 0.242000000000004, 0.191400000000002, 0.253799999999998, -0.122399999999999, -0.370800000000003, 0.193200000000004, -0.0848000000000013, 0.0867999999999967, -0.327200000000005, -0.285600000000002, 0.311400000000006, -0.128399999999999, -0.754999999999995, -0.209199999999996, -0.293599999999998, -0.364000000000004, -0.253600000000006, -0.821200000000005, -0.253600000000006, -0.510400000000004, -0.383399999999995, -0.491799999999998, -0.220200000000006, -0.0972000000000008, -0.557400000000001, -0.114599999999996, -0.295000000000002, -0.534800000000004, 0.346399999999988, -0.65379999999999, 0.0398000000000138, 0.0341999999999985, -0.995800000000003, -0.523400000000009, -0.489000000000004, -0.274799999999999, -0.574999999999989, -0.482799999999997, 0.0571999999999946, -0.330600000000004, -0.628800000000012, -0.140199999999993, -0.540600000000012, -0.445999999999998, -0.599400000000003, -0.262599999999992, 0.163399999999996, -0.100599999999986, -0.39500000000001, -1.06960000000001, -0.836399999999998, -0.753199999999993, -0.412399999999991, -0.790400000000005, -0.29679999999999, -0.28540000000001, -0.193000000000012, -0.0772000000000048, -0.962799999999987, -0.414800000000014\n};\n\nconst double biasData_precision6[] = {\n    45, 44.1902, 43.271, 42.8358, 41.8142, 41.2854, 40.317, 39.354, 38.8924, 37.9436, 37.4596, 36.5262, 35.6248, 35.1574, 34.2822, 33.837, 32.9636, 32.074, 31.7042, 30.7976, 30.4772, 29.6564, 28.7942, 28.5004, 27.686, 27.291, 
26.5672, 25.8556, 25.4982, 24.8204, 24.4252, 23.7744, 23.0786, 22.8344, 22.0294, 21.8098, 21.0794, 20.5732, 20.1878, 19.5648, 19.2902, 18.6784, 18.3352, 17.8946, 17.3712, 17.0852, 16.499, 16.2686, 15.6844, 15.2234, 14.9732, 14.3356, 14.2286, 13.7262, 13.3284, 13.1048, 12.5962, 12.3562, 12.1272, 11.4184, 11.4974, 11.0822, 10.856, 10.48, 10.2834, 10.0208, 9.637, 9.51739999999999, 9.05759999999999, 8.74760000000001, 8.42700000000001, 8.1326, 8.2372, 8.2788, 7.6776, 7.79259999999999, 7.1952, 6.9564, 6.6454, 6.87, 6.5428, 6.19999999999999, 6.02940000000001, 5.62780000000001, 5.6782, 5.792, 5.35159999999999, 5.28319999999999, 5.0394, 5.07480000000001, 4.49119999999999, 4.84899999999999, 4.696, 4.54040000000001, 4.07300000000001, 4.37139999999999, 3.7216, 3.7328, 3.42080000000001, 3.41839999999999, 3.94239999999999, 3.27719999999999, 3.411, 3.13079999999999, 2.76900000000001, 2.92580000000001, 2.68279999999999, 2.75020000000001, 2.70599999999999, 2.3886, 3.01859999999999, 2.45179999999999, 2.92699999999999, 2.41720000000001, 2.41139999999999, 2.03299999999999, 2.51240000000001, 2.5564, 2.60079999999999, 2.41720000000001, 1.80439999999999, 1.99700000000001, 2.45480000000001, 1.8948, 2.2346, 2.30860000000001, 2.15479999999999, 1.88419999999999, 1.6508, 0.677199999999999, 1.72540000000001, 1.4752, 1.72280000000001, 1.66139999999999, 1.16759999999999, 1.79300000000001, 1.00059999999999, 0.905200000000008, 0.659999999999997, 1.55879999999999, 1.1636, 0.688199999999995, 0.712600000000009, 0.450199999999995, 1.1978, 0.975599999999986, 0.165400000000005, 1.727, 1.19739999999999, -0.252600000000001, 1.13460000000001, 1.3048, 1.19479999999999, 0.313400000000001, 0.878999999999991, 1.12039999999999, 0.853000000000009, 1.67920000000001, 0.856999999999999, 0.448599999999999, 1.2362, 0.953399999999988, 1.02859999999998, 0.563199999999995, 0.663000000000011, 0.723000000000013, 0.756599999999992, 0.256599999999992, -0.837600000000009, 0.620000000000005, 0.821599999999989, 
0.216600000000028, 0.205600000000004, 0.220199999999977, 0.372599999999977, 0.334400000000016, 0.928400000000011, 0.972800000000007, 0.192400000000021, 0.487199999999973, -0.413000000000011, 0.807000000000016, 0.120600000000024, 0.769000000000005, 0.870799999999974, 0.66500000000002, 0.118200000000002, 0.401200000000017, 0.635199999999998, 0.135400000000004, 0.175599999999974, 1.16059999999999, 0.34620000000001, 0.521400000000028, -0.586599999999976, -1.16480000000001, 0.968399999999974, 0.836999999999989, 0.779600000000016, 0.985799999999983\n};\n\nconst double biasData_precision7[] = {\n    91, 89.4934, 87.9758, 86.4574, 84.9718, 83.4954, 81.5302, 80.0756, 78.6374, 77.1782, 75.7888, 73.9522, 72.592, 71.2532, 69.9086, 68.5938, 66.9474, 65.6796, 64.4394, 63.2176, 61.9768, 60.4214, 59.2528, 58.0102, 56.8658, 55.7278, 54.3044, 53.1316, 52.093, 51.0032, 49.9092, 48.6306, 47.5294, 46.5756, 45.6508, 44.662, 43.552, 42.3724, 41.617, 40.5754, 39.7872, 38.8444, 37.7988, 36.8606, 36.2118, 35.3566, 34.4476, 33.5882, 32.6816, 32.0824, 31.0258, 30.6048, 29.4436, 28.7274, 27.957, 27.147, 26.4364, 25.7592, 25.3386, 24.781, 23.8028, 23.656, 22.6544, 21.996, 21.4718, 21.1544, 20.6098, 19.5956, 19.0616, 18.5758, 18.4878, 17.5244, 17.2146, 16.724, 15.8722, 15.5198, 15.0414, 14.941, 14.9048, 13.87, 13.4304, 13.028, 12.4708, 12.37, 12.0624, 11.4668, 11.5532, 11.4352, 11.2564, 10.2744, 10.2118, 9.74720000000002, 10.1456, 9.2928, 8.75040000000001, 8.55279999999999, 8.97899999999998, 8.21019999999999, 8.18340000000001, 7.3494, 7.32499999999999, 7.66140000000001, 6.90300000000002, 7.25439999999998, 6.9042, 7.21499999999997, 6.28640000000001, 6.08139999999997, 6.6764, 6.30099999999999, 5.13900000000001, 5.65800000000002, 5.17320000000001, 4.59019999999998, 4.9538, 5.08280000000002, 4.92200000000003, 4.99020000000002, 4.7328, 5.4538, 4.11360000000002, 4.22340000000003, 4.08780000000002, 3.70800000000003, 4.15559999999999, 4.18520000000001, 3.63720000000001, 3.68220000000002, 
3.77960000000002, 3.6078, 2.49160000000001, 3.13099999999997, 2.5376, 3.19880000000001, 3.21100000000001, 2.4502, 3.52820000000003, 2.91199999999998, 3.04480000000001, 2.7432, 2.85239999999999, 2.79880000000003, 2.78579999999999, 1.88679999999999, 2.98860000000002, 2.50639999999999, 1.91239999999999, 2.66160000000002, 2.46820000000002, 1.58199999999999, 1.30399999999997, 2.27379999999999, 2.68939999999998, 1.32900000000001, 3.10599999999999, 1.69080000000002, 2.13740000000001, 2.53219999999999, 1.88479999999998, 1.33240000000001, 1.45119999999997, 1.17899999999997, 2.44119999999998, 1.60659999999996, 2.16700000000003, 0.77940000000001, 2.37900000000002, 2.06700000000001, 1.46000000000004, 2.91160000000002, 1.69200000000001, 0.954600000000028, 2.49300000000005, 2.2722, 1.33500000000004, 2.44899999999996, 1.20140000000004, 3.07380000000001, 2.09739999999999, 2.85640000000001, 2.29960000000005, 2.40899999999999, 1.97040000000004, 0.809799999999996, 1.65279999999996, 2.59979999999996, 0.95799999999997, 2.06799999999998, 2.32780000000002, 4.20159999999998, 1.96320000000003, 1.86400000000003, 1.42999999999995, 3.77940000000001, 1.27200000000005, 1.86440000000005, 2.20600000000002, 3.21900000000005, 1.5154, 2.61019999999996\n};\n\nconst double biasData_precision8[] = {\n    183.2152, 180.2454, 177.2096, 173.6652, 170.6312, 167.6822, 164.249, 161.3296, 158.0038, 155.2074, 152.4612, 149.27, 146.5178, 143.4412, 140.8032, 138.1634, 135.1688, 132.6074, 129.6946, 127.2664, 124.8228, 122.0432, 119.6824, 116.9464, 114.6268, 112.2626, 109.8376, 107.4034, 104.8956, 102.8522, 100.7638, 98.3552, 96.3556, 93.7526, 91.9292, 89.8954, 87.8198, 85.7668, 83.298, 81.6688, 79.9466, 77.9746, 76.1672, 74.3474, 72.3028, 70.8912, 69.114, 67.4646, 65.9744, 64.4092, 62.6022, 60.843, 59.5684, 58.1652, 56.5426, 55.4152, 53.5388, 52.3592, 51.1366, 49.486, 48.3918, 46.5076, 45.509, 44.3834, 43.3498, 42.0668, 40.7346, 40.1228, 38.4528, 37.7, 36.644, 36.0518, 34.5774, 33.9068, 32.432, 32.1666, 30.434, 
29.6644, 28.4894, 27.6312, 26.3804, 26.292, 25.5496000000001, 25.0234, 24.8206, 22.6146, 22.4188, 22.117, 20.6762, 20.6576, 19.7864, 19.509, 18.5334, 17.9204, 17.772, 16.2924, 16.8654, 15.1836, 15.745, 15.1316, 15.0386, 14.0136, 13.6342, 12.6196, 12.1866, 12.4281999999999, 11.3324, 10.4794000000001, 11.5038, 10.129, 9.52800000000002, 10.3203999999999, 9.46299999999997, 9.79280000000006, 9.12300000000005, 8.74180000000001, 9.2192, 7.51020000000005, 7.60659999999996, 7.01840000000004, 7.22239999999999, 7.40139999999997, 6.76179999999999, 7.14359999999999, 5.65060000000005, 5.63779999999997, 5.76599999999996, 6.75139999999999, 5.57759999999996, 3.73220000000003, 5.8048, 5.63019999999995, 4.93359999999996, 3.47979999999995, 4.33879999999999, 3.98940000000005, 3.81960000000004, 3.31359999999995, 3.23080000000004, 3.4588, 3.08159999999998, 3.4076, 3.00639999999999, 2.38779999999997, 2.61900000000003, 1.99800000000005, 3.34820000000002, 2.95060000000001, 0.990999999999985, 2.11440000000005, 2.20299999999997, 2.82219999999995, 2.73239999999998, 2.7826, 3.76660000000004, 2.26480000000004, 2.31280000000004, 2.40819999999997, 2.75360000000001, 3.33759999999995, 2.71559999999999, 1.7478000000001, 1.42920000000004, 2.39300000000003, 2.22779999999989, 2.34339999999997, 0.87259999999992, 3.88400000000001, 1.80600000000004, 1.91759999999999, 1.16779999999994, 1.50320000000011, 2.52500000000009, 0.226400000000012, 2.31500000000005, 0.930000000000064, 1.25199999999995, 2.14959999999996, 0.0407999999999902, 2.5447999999999, 1.32960000000003, 0.197400000000016, 2.52620000000002, 3.33279999999991, -1.34300000000007, 0.422199999999975, 0.917200000000093, 1.12920000000008, 1.46060000000011, 1.45779999999991, 2.8728000000001, 3.33359999999993, -1.34079999999994, 1.57680000000005, 0.363000000000056, 1.40740000000005, 0.656600000000026, 0.801400000000058, -0.454600000000028, 1.51919999999996\n};\n\nconst double biasData_precision9[] = {\n    368, 361.8294, 355.2452, 348.6698, 342.1464, 
336.2024, 329.8782, 323.6598, 317.462, 311.2826, 305.7102, 299.7416, 293.9366, 288.1046, 282.285, 277.0668, 271.306, 265.8448, 260.301, 254.9886, 250.2422, 244.8138, 239.7074, 234.7428, 229.8402, 225.1664, 220.3534, 215.594, 210.6886, 205.7876, 201.65, 197.228, 192.8036, 188.1666, 184.0818, 180.0824, 176.2574, 172.302, 168.1644, 164.0056, 160.3802, 156.7192, 152.5234, 149.2084, 145.831, 142.485, 139.1112, 135.4764, 131.76, 129.3368, 126.5538, 122.5058, 119.2646, 116.5902, 113.3818, 110.8998, 107.9532, 105.2062, 102.2798, 99.4728, 96.9582, 94.3292, 92.171, 89.7809999999999, 87.5716, 84.7048, 82.5322, 79.875, 78.3972, 75.3464, 73.7274, 71.2834, 70.1444, 68.4263999999999, 66.0166, 64.018, 62.0437999999999, 60.3399999999999, 58.6856, 57.9836, 55.0311999999999, 54.6769999999999, 52.3188, 51.4846, 49.4423999999999, 47.739, 46.1487999999999, 44.9202, 43.4059999999999, 42.5342000000001, 41.2834, 38.8954000000001, 38.3286000000001, 36.2146, 36.6684, 35.9946, 33.123, 33.4338, 31.7378000000001, 29.076, 28.9692, 27.4964, 27.0998, 25.9864, 26.7754, 24.3208, 23.4838, 22.7388000000001, 24.0758000000001, 21.9097999999999, 20.9728, 19.9228000000001, 19.9292, 16.617, 17.05, 18.2996000000001, 15.6128000000001, 15.7392, 14.5174, 13.6322, 12.2583999999999, 13.3766000000001, 11.423, 13.1232, 9.51639999999998, 10.5938000000001, 9.59719999999993, 8.12220000000002, 9.76739999999995, 7.50440000000003, 7.56999999999994, 6.70440000000008, 6.41419999999994, 6.71019999999999, 5.60940000000005, 4.65219999999999, 6.84099999999989, 3.4072000000001, 3.97859999999991, 3.32760000000007, 5.52160000000003, 3.31860000000006, 2.06940000000009, 4.35400000000004, 1.57500000000005, 0.280799999999999, 2.12879999999996, -0.214799999999968, -0.0378000000000611, -0.658200000000079, 0.654800000000023, -0.0697999999999865, 0.858400000000074, -2.52700000000004, -2.1751999999999, -3.35539999999992, -1.04019999999991, -0.651000000000067, -2.14439999999991, -1.96659999999997, -3.97939999999994, -0.604400000000169, 
-3.08260000000018, -3.39159999999993, -5.29640000000018, -5.38920000000007, -5.08759999999984, -4.69900000000007, -5.23720000000003, -3.15779999999995, -4.97879999999986, -4.89899999999989, -7.48880000000008, -5.94799999999987, -5.68060000000014, -6.67180000000008, -4.70499999999993, -7.27779999999984, -4.6579999999999, -4.4362000000001, -4.32139999999981, -5.18859999999995, -6.66879999999992, -6.48399999999992, -5.1260000000002, -4.4032000000002, -6.13500000000022, -5.80819999999994, -4.16719999999987, -4.15039999999999, -7.45600000000013, -7.24080000000004, -9.83179999999993, -5.80420000000004, -8.6561999999999, -6.99940000000015, -10.5473999999999, -7.34139999999979, -6.80999999999995, -6.29719999999998, -6.23199999999997\n};\n\nconst double biasData_precision10[] = {\n    737.1256, 724.4234, 711.1064, 698.4732, 685.4636, 673.0644, 660.488, 647.9654, 636.0832, 623.7864, 612.1992, 600.2176, 588.5228, 577.1716, 565.7752, 554.899, 543.6126, 532.6492, 521.9474, 511.5214, 501.1064, 490.6364, 480.2468, 470.4588, 460.3832, 451.0584, 440.8606, 431.3868, 422.5062, 413.1862, 404.463, 395.339, 386.1936, 378.1292, 369.1854, 361.2908, 353.3324, 344.8518, 337.5204, 329.4854, 321.9318, 314.552, 306.4658, 299.4256, 292.849, 286.152, 278.8956, 271.8792, 265.118, 258.62, 252.5132, 245.9322, 239.7726, 233.6086, 227.5332, 222.5918, 216.4294, 210.7662, 205.4106, 199.7338, 194.9012, 188.4486, 183.1556, 178.6338, 173.7312, 169.6264, 163.9526, 159.8742, 155.8326, 151.1966, 147.5594, 143.07, 140.037, 134.1804, 131.071, 127.4884, 124.0848, 120.2944, 117.333, 112.9626, 110.2902, 107.0814, 103.0334, 99.4832000000001, 96.3899999999999, 93.7202000000002, 90.1714000000002, 87.2357999999999, 85.9346, 82.8910000000001, 80.0264000000002, 78.3834000000002, 75.1543999999999, 73.8683999999998, 70.9895999999999, 69.4367999999999, 64.8701999999998, 65.0408000000002, 61.6738, 59.5207999999998, 57.0158000000001, 54.2302, 53.0962, 50.4985999999999, 52.2588000000001, 47.3914, 45.6244000000002, 
42.8377999999998, 43.0072, 40.6516000000001, 40.2453999999998, 35.2136, 36.4546, 33.7849999999999, 33.2294000000002, 32.4679999999998, 30.8670000000002, 28.6507999999999, 28.9099999999999, 27.5983999999999, 26.1619999999998, 24.5563999999999, 23.2328000000002, 21.9484000000002, 21.5902000000001, 21.3346000000001, 17.7031999999999, 20.6111999999998, 19.5545999999999, 15.7375999999999, 17.0720000000001, 16.9517999999998, 15.326, 13.1817999999998, 14.6925999999999, 13.0859999999998, 13.2754, 10.8697999999999, 11.248, 7.3768, 4.72339999999986, 7.97899999999981, 8.7503999999999, 7.68119999999999, 9.7199999999998, 7.73919999999998, 5.6224000000002, 7.44560000000001, 6.6601999999998, 5.9058, 4.00199999999995, 4.51699999999983, 4.68240000000014, 3.86220000000003, 5.13639999999987, 5.98500000000013, 2.47719999999981, 2.61999999999989, 1.62800000000016, 4.65000000000009, 0.225599999999758, 0.831000000000131, -0.359400000000278, 1.27599999999984, -2.92559999999958, -0.0303999999996449, 2.37079999999969, -2.0033999999996, 0.804600000000391, 0.30199999999968, 1.1247999999996, -2.6880000000001, 0.0321999999996478, -1.18099999999959, -3.9402, -1.47940000000017, -0.188400000000001, -2.10720000000038, -2.04159999999956, -3.12880000000041, -4.16160000000036, -0.612799999999879, -3.48719999999958, -8.17900000000009, -5.37780000000021, -4.01379999999972, -5.58259999999973, -5.73719999999958, -7.66799999999967, -5.69520000000011, -1.1247999999996, -5.58520000000044, -8.04560000000038, -4.64840000000004, -11.6468000000004, -7.97519999999986, -5.78300000000036, -7.67420000000038, -10.6328000000003, -9.81720000000041\n};\n\nconst double biasData_precision11[] = {\n    1476, 1449.6014, 1423.5802, 1397.7942, 1372.3042, 1347.2062, 1321.8402, 1297.2292, 1272.9462, 1248.9926, 1225.3026, 1201.4252, 1178.0578, 1155.6092, 1132.626, 1110.5568, 1088.527, 1066.5154, 1045.1874, 1024.3878, 1003.37, 982.1972, 962.5728, 942.1012, 922.9668, 903.292, 884.0772, 864.8578, 846.6562, 828.041, 809.714, 
792.3112, 775.1806, 757.9854, 740.656, 724.346, 707.5154, 691.8378, 675.7448, 659.6722, 645.5722, 630.1462, 614.4124, 600.8728, 585.898, 572.408, 558.4926, 544.4938, 531.6776, 517.282, 505.7704, 493.1012, 480.7388, 467.6876, 456.1872, 445.5048, 433.0214, 420.806, 411.409, 400.4144, 389.4294, 379.2286, 369.651, 360.6156, 350.337, 342.083, 332.1538, 322.5094, 315.01, 305.6686, 298.1678, 287.8116, 280.9978, 271.9204, 265.3286, 257.5706, 249.6014, 242.544, 235.5976, 229.583, 220.9438, 214.672, 208.2786, 201.8628, 195.1834, 191.505, 186.1816, 178.5188, 172.2294, 167.8908, 161.0194, 158.052, 151.4588, 148.1596, 143.4344, 138.5238, 133.13, 127.6374, 124.8162, 118.7894, 117.3984, 114.6078, 109.0858, 105.1036, 103.6258, 98.6018000000004, 95.7618000000002, 93.5821999999998, 88.5900000000001, 86.9992000000002, 82.8800000000001, 80.4539999999997, 74.6981999999998, 74.3644000000004, 73.2914000000001, 65.5709999999999, 66.9232000000002, 65.1913999999997, 62.5882000000001, 61.5702000000001, 55.7035999999998, 56.1764000000003, 52.7596000000003, 53.0302000000001, 49.0609999999997, 48.4694, 44.933, 46.0474000000004, 44.7165999999997, 41.9416000000001, 39.9207999999999, 35.6328000000003, 35.5276000000003, 33.1934000000001, 33.2371999999996, 33.3864000000003, 33.9228000000003, 30.2371999999996, 29.1373999999996, 25.2272000000003, 24.2942000000003, 19.8338000000003, 18.9005999999999, 23.0907999999999, 21.8544000000002, 19.5176000000001, 15.4147999999996, 16.9314000000004, 18.6737999999996, 12.9877999999999, 14.3688000000002, 12.0447999999997, 15.5219999999999, 12.5299999999997, 14.5940000000001, 14.3131999999996, 9.45499999999993, 12.9441999999999, 3.91139999999996, 13.1373999999996, 5.44720000000052, 9.82779999999912, 7.87279999999919, 3.67760000000089, 5.46980000000076, 5.55099999999948, 5.65979999999945, 3.89439999999922, 3.1275999999998, 5.65140000000065, 6.3062000000009, 3.90799999999945, 1.87060000000019, 5.17020000000048, 2.46680000000015, 0.770000000000437, -3.72340000000077, 
1.16400000000067, 8.05340000000069, 0.135399999999208, 2.15940000000046, 0.766999999999825, 1.0594000000001, 3.15500000000065, -0.287399999999252, 2.37219999999979, -2.86620000000039, -1.63199999999961, -2.22979999999916, -0.15519999999924, -1.46039999999994, -0.262199999999211, -2.34460000000036, -2.8078000000005, -3.22179999999935, -5.60159999999996, -8.42200000000048, -9.43740000000071, 0.161799999999857, -10.4755999999998, -10.0823999999993\n};\n\nconst double biasData_precision12[] = {\n    2953, 2900.4782, 2848.3568, 2796.3666, 2745.324, 2694.9598, 2644.648, 2595.539, 2546.1474, 2498.2576, 2450.8376, 2403.6076, 2357.451, 2311.38, 2266.4104, 2221.5638, 2176.9676, 2134.193, 2090.838, 2048.8548, 2007.018, 1966.1742, 1925.4482, 1885.1294, 1846.4776, 1807.4044, 1768.8724, 1731.3732, 1693.4304, 1657.5326, 1621.949, 1586.5532, 1551.7256, 1517.6182, 1483.5186, 1450.4528, 1417.865, 1385.7164, 1352.6828, 1322.6708, 1291.8312, 1260.9036, 1231.476, 1201.8652, 1173.6718, 1145.757, 1119.2072, 1092.2828, 1065.0434, 1038.6264, 1014.3192, 988.5746, 965.0816, 940.1176, 917.9796, 894.5576, 871.1858, 849.9144, 827.1142, 805.0818, 783.9664, 763.9096, 742.0816, 724.3962, 706.3454, 688.018, 667.4214, 650.3106, 633.0686, 613.8094, 597.818, 581.4248, 563.834, 547.363, 531.5066, 520.455400000001, 505.583199999999, 488.366, 476.480799999999, 459.7682, 450.0522, 434.328799999999, 423.952799999999, 408.727000000001, 399.079400000001, 387.252200000001, 373.987999999999, 360.852000000001, 351.6394, 339.642, 330.902400000001, 322.661599999999, 311.662200000001, 301.3254, 291.7484, 279.939200000001, 276.7508, 263.215200000001, 254.811400000001, 245.5494, 242.306399999999, 234.8734, 223.787200000001, 217.7156, 212.0196, 200.793, 195.9748, 189.0702, 182.449199999999, 177.2772, 170.2336, 164.741, 158.613600000001, 155.311, 147.5964, 142.837, 137.3724, 132.0162, 130.0424, 121.9804, 120.451800000001, 114.8968, 111.585999999999, 105.933199999999, 101.705, 98.5141999999996, 95.0488000000005, 
89.7880000000005, 91.4750000000004, 83.7764000000006, 80.9698000000008, 72.8574000000008, 73.1615999999995, 67.5838000000003, 62.6263999999992, 63.2638000000006, 66.0977999999996, 52.0843999999997, 58.9956000000002, 47.0912000000008, 46.4956000000002, 48.4383999999991, 47.1082000000006, 43.2392, 37.2759999999998, 40.0283999999992, 35.1864000000005, 35.8595999999998, 32.0998, 28.027, 23.6694000000007, 33.8266000000003, 26.3736000000008, 27.2008000000005, 21.3245999999999, 26.4115999999995, 23.4521999999997, 19.5013999999992, 19.8513999999996, 10.7492000000002, 18.6424000000006, 13.1265999999996, 18.2436000000016, 6.71860000000015, 3.39459999999963, 6.33759999999893, 7.76719999999841, 0.813999999998487, 3.82819999999992, 0.826199999999517, 8.07440000000133, -1.59080000000176, 5.01780000000144, 0.455399999998917, -0.24199999999837, 0.174800000000687, -9.07640000000174, -4.20160000000033, -3.77520000000004, -4.75179999999818, -5.3724000000002, -8.90680000000066, -6.10239999999976, -5.74120000000039, -9.95339999999851, -3.86339999999836, -13.7304000000004, -16.2710000000006, -7.51359999999841, -3.30679999999847, -13.1339999999982, -10.0551999999989, -6.72019999999975, -8.59660000000076, -10.9307999999983, -1.8775999999998, -4.82259999999951, -13.7788, -21.6470000000008, -10.6735999999983, -15.7799999999988\n};\n\nconst double biasData_precision13[] = {\n    5907.5052, 5802.2672, 5697.347, 5593.5794, 5491.2622, 5390.5514, 5290.3376, 5191.6952, 5093.5988, 4997.3552, 4902.5972, 4808.3082, 4715.5646, 4624.109, 4533.8216, 4444.4344, 4356.3802, 4269.2962, 4183.3784, 4098.292, 4014.79, 3932.4574, 3850.6036, 3771.2712, 3691.7708, 3615.099, 3538.1858, 3463.4746, 3388.8496, 3315.6794, 3244.5448, 3173.7516, 3103.3106, 3033.6094, 2966.5642, 2900.794, 2833.7256, 2769.81, 2707.3196, 2644.0778, 2583.9916, 2523.4662, 2464.124, 2406.073, 2347.0362, 2292.1006, 2238.1716, 2182.7514, 2128.4884, 2077.1314, 2025.037, 1975.3756, 1928.933, 1879.311, 1831.0006, 1783.2144, 1738.3096, 1694.5144, 
1649.024, 1606.847, 1564.7528, 1525.3168, 1482.5372, 1443.9668, 1406.5074, 1365.867, 1329.2186, 1295.4186, 1257.9716, 1225.339, 1193.2972, 1156.3578, 1125.8686, 1091.187, 1061.4094, 1029.4188, 1000.9126, 972.3272, 944.004199999999, 915.7592, 889.965, 862.834200000001, 840.4254, 812.598399999999, 785.924200000001, 763.050999999999, 741.793799999999, 721.466, 699.040799999999, 677.997200000002, 649.866999999998, 634.911800000002, 609.8694, 591.981599999999, 570.2922, 557.129199999999, 538.3858, 521.872599999999, 502.951400000002, 495.776399999999, 475.171399999999, 459.751, 439.995200000001, 426.708999999999, 413.7016, 402.3868, 387.262599999998, 372.0524, 357.050999999999, 342.5098, 334.849200000001, 322.529399999999, 311.613799999999, 295.848000000002, 289.273000000001, 274.093000000001, 263.329600000001, 251.389599999999, 245.7392, 231.9614, 229.7952, 217.155200000001, 208.9588, 199.016599999999, 190.839199999999, 180.6976, 176.272799999999, 166.976999999999, 162.5252, 151.196400000001, 149.386999999999, 133.981199999998, 130.0586, 130.164000000001, 122.053400000001, 110.7428, 108.1276, 106.232400000001, 100.381600000001, 98.7668000000012, 86.6440000000002, 79.9768000000004, 82.4722000000002, 68.7026000000005, 70.1186000000016, 71.9948000000004, 58.998599999999, 59.0492000000013, 56.9818000000014, 47.5338000000011, 42.9928, 51.1591999999982, 37.2740000000013, 42.7220000000016, 31.3734000000004, 26.8090000000011, 25.8934000000008, 26.5286000000015, 29.5442000000003, 19.3503999999994, 26.0760000000009, 17.9527999999991, 14.8419999999969, 10.4683999999979, 8.65899999999965, 9.86720000000059, 4.34139999999752, -0.907800000000861, -3.32080000000133, -0.936199999996461, -11.9916000000012, -8.87000000000262, -6.33099999999831, -11.3366000000024, -15.9207999999999, -9.34659999999712, -15.5034000000014, -19.2097999999969, -15.357799999998, -28.2235999999975, -30.6898000000001, -19.3271999999997, -25.6083999999973, -24.409599999999, -13.6385999999984, -33.4473999999973, 
-32.6949999999997, -28.9063999999998, -31.7483999999968, -32.2935999999972, -35.8329999999987, -47.620600000002, -39.0855999999985, -33.1434000000008, -46.1371999999974, -37.5892000000022, -46.8164000000033, -47.3142000000007, -60.2914000000019, -37.7575999999972\n};\n\nconst double biasData_precision14[] = {\n    11816.475, 11605.0046, 11395.3792, 11188.7504, 10984.1814, 10782.0086, 10582.0072, 10384.503, 10189.178, 9996.2738, 9806.0344, 9617.9798, 9431.394, 9248.7784, 9067.6894, 8889.6824, 8712.9134, 8538.8624, 8368.4944, 8197.7956, 8031.8916, 7866.6316, 7703.733, 7544.5726, 7386.204, 7230.666, 7077.8516, 6926.7886, 6778.6902, 6631.9632, 6487.304, 6346.7486, 6206.4408, 6070.202, 5935.2576, 5799.924, 5671.0324, 5541.9788, 5414.6112, 5290.0274, 5166.723, 5047.6906, 4929.162, 4815.1406, 4699.127, 4588.5606, 4477.7394, 4369.4014, 4264.2728, 4155.9224, 4055.581, 3955.505, 3856.9618, 3761.3828, 3666.9702, 3575.7764, 3482.4132, 3395.0186, 3305.8852, 3221.415, 3138.6024, 3056.296, 2970.4494, 2896.1526, 2816.8008, 2740.2156, 2670.497, 2594.1458, 2527.111, 2460.8168, 2387.5114, 2322.9498, 2260.6752, 2194.2686, 2133.7792, 2074.767, 2015.204, 1959.4226, 1898.6502, 1850.006, 1792.849, 1741.4838, 1687.9778, 1638.1322, 1589.3266, 1543.1394, 1496.8266, 1447.8516, 1402.7354, 1361.9606, 1327.0692, 1285.4106, 1241.8112, 1201.6726, 1161.973, 1130.261, 1094.2036, 1048.2036, 1020.6436, 990.901400000002, 961.199800000002, 924.769800000002, 899.526400000002, 872.346400000002, 834.375, 810.432000000001, 780.659800000001, 756.013800000001, 733.479399999997, 707.923999999999, 673.858, 652.222399999999, 636.572399999997, 615.738599999997, 586.696400000001, 564.147199999999, 541.679600000003, 523.943599999999, 505.714599999999, 475.729599999999, 461.779600000002, 449.750800000002, 439.020799999998, 412.7886, 400.245600000002, 383.188199999997, 362.079599999997, 357.533799999997, 334.319000000003, 327.553399999997, 308.559399999998, 291.270199999999, 279.351999999999, 271.791400000002, 
252.576999999997, 247.482400000001, 236.174800000001, 218.774599999997, 220.155200000001, 208.794399999999, 201.223599999998, 182.995600000002, 185.5268, 164.547400000003, 176.5962, 150.689599999998, 157.8004, 138.378799999999, 134.021200000003, 117.614399999999, 108.194000000003, 97.0696000000025, 89.6042000000016, 95.6030000000028, 84.7810000000027, 72.635000000002, 77.3482000000004, 59.4907999999996, 55.5875999999989, 50.7346000000034, 61.3916000000027, 50.9149999999936, 39.0384000000049, 58.9395999999979, 29.633600000001, 28.2032000000036, 26.0078000000067, 17.0387999999948, 9.22000000000116, 13.8387999999977, 8.07240000000456, 14.1549999999988, 15.3570000000036, 3.42660000000615, 6.24820000000182, -2.96940000000177, -8.79940000000352, -5.97860000000219, -14.4048000000039, -3.4143999999942, -13.0148000000045, -11.6977999999945, -25.7878000000055, -22.3185999999987, -24.409599999999, -31.9756000000052, -18.9722000000038, -22.8678000000073, -30.8972000000067, -32.3715999999986, -22.3907999999938, -43.6720000000059, -35.9038, -39.7492000000057, -54.1641999999993, -45.2749999999942, -42.2989999999991, -44.1089999999967, -64.3564000000042, -49.9551999999967, -42.6116000000038\n};\n\nconst double biasData_precision15[] = {\n    23634.0036, 23210.8034, 22792.4744, 22379.1524, 21969.7928, 21565.326, 21165.3532, 20770.2806, 20379.9892, 19994.7098, 19613.318, 19236.799, 18865.4382, 18498.8244, 18136.5138, 17778.8668, 17426.2344, 17079.32, 16734.778, 16397.2418, 16063.3324, 15734.0232, 15409.731, 15088.728, 14772.9896, 14464.1402, 14157.5588, 13855.5958, 13559.3296, 13264.9096, 12978.326, 12692.0826, 12413.8816, 12137.3192, 11870.2326, 11602.5554, 11340.3142, 11079.613, 10829.5908, 10583.5466, 10334.0344, 10095.5072, 9859.694, 9625.2822, 9395.7862, 9174.0586, 8957.3164, 8738.064, 8524.155, 8313.7396, 8116.9168, 7913.542, 7718.4778, 7521.65, 7335.5596, 7154.2906, 6968.7396, 6786.3996, 6613.236, 6437.406, 6270.6598, 6107.7958, 5945.7174, 5787.6784, 5635.5784, 5482.308, 
5337.9784, 5190.0864, 5045.9158, 4919.1386, 4771.817, 4645.7742, 4518.4774, 4385.5454, 4262.6622, 4142.74679999999, 4015.5318, 3897.9276, 3790.7764, 3685.13800000001, 3573.6274, 3467.9706, 3368.61079999999, 3271.5202, 3170.3848, 3076.4656, 2982.38400000001, 2888.4664, 2806.4868, 2711.9564, 2634.1434, 2551.3204, 2469.7662, 2396.61139999999, 2318.9902, 2243.8658, 2171.9246, 2105.01360000001, 2028.8536, 1960.9952, 1901.4096, 1841.86079999999, 1777.54700000001, 1714.5802, 1654.65059999999, 1596.311, 1546.2016, 1492.3296, 1433.8974, 1383.84600000001, 1339.4152, 1293.5518, 1245.8686, 1193.50659999999, 1162.27959999999, 1107.19439999999, 1069.18060000001, 1035.09179999999, 999.679000000004, 957.679999999993, 925.300199999998, 888.099400000006, 848.638600000006, 818.156400000007, 796.748399999997, 752.139200000005, 725.271200000003, 692.216, 671.633600000001, 647.939799999993, 621.670599999998, 575.398799999995, 561.226599999995, 532.237999999998, 521.787599999996, 483.095799999996, 467.049599999998, 465.286399999997, 415.548599999995, 401.047399999996, 380.607999999993, 377.362599999993, 347.258799999996, 338.371599999999, 310.096999999994, 301.409199999995, 276.280799999993, 265.586800000005, 258.994399999996, 223.915999999997, 215.925399999993, 213.503800000006, 191.045400000003, 166.718200000003, 166.259000000005, 162.941200000001, 148.829400000002, 141.645999999993, 123.535399999993, 122.329800000007, 89.473399999988, 80.1962000000058, 77.5457999999926, 59.1056000000099, 83.3509999999951, 52.2906000000075, 36.3979999999865, 40.6558000000077, 42.0003999999899, 19.6630000000005, 19.7153999999864, -8.38539999999921, -0.692799999989802, 0.854800000000978, 3.23219999999856, -3.89040000000386, -5.25880000001052, -24.9052000000083, -22.6837999999989, -26.4286000000138, -34.997000000003, -37.0216000000073, -43.430400000012, -58.2390000000014, -68.8034000000043, -56.9245999999985, -57.8583999999973, -77.3097999999882, -73.2793999999994, -81.0738000000129, -87.4530000000086, 
-65.0254000000132, -57.296399999992, -96.2746000000043, -103.25, -96.081600000005, -91.5542000000132, -102.465200000006, -107.688599999994, -101.458000000013, -109.715800000005\n};\n\nconst double biasData_precision16[] = {\n    47270, 46423.3584, 45585.7074, 44757.152, 43938.8416, 43130.9514, 42330.03, 41540.407, 40759.6348, 39988.206, 39226.5144, 38473.2096, 37729.795, 36997.268, 36272.6448, 35558.665, 34853.0248, 34157.4472, 33470.5204, 32793.5742, 32127.0194, 31469.4182, 30817.6136, 30178.6968, 29546.8908, 28922.8544, 28312.271, 27707.0924, 27114.0326, 26526.692, 25948.6336, 25383.7826, 24823.5998, 24272.2974, 23732.2572, 23201.4976, 22674.2796, 22163.6336, 21656.515, 21161.7362, 20669.9368, 20189.4424, 19717.3358, 19256.3744, 18795.9638, 18352.197, 17908.5738, 17474.391, 17052.918, 16637.2236, 16228.4602, 15823.3474, 15428.6974, 15043.0284, 14667.6278, 14297.4588, 13935.2882, 13578.5402, 13234.6032, 12882.1578, 12548.0728, 12219.231, 11898.0072, 11587.2626, 11279.9072, 10973.5048, 10678.5186, 10392.4876, 10105.2556, 9825.766, 9562.5444, 9294.2222, 9038.2352, 8784.848, 8533.2644, 8301.7776, 8058.30859999999, 7822.94579999999, 7599.11319999999, 7366.90779999999, 7161.217, 6957.53080000001, 6736.212, 6548.21220000001, 6343.06839999999, 6156.28719999999, 5975.15419999999, 5791.75719999999, 5621.32019999999, 5451.66, 5287.61040000001, 5118.09479999999, 4957.288, 4798.4246, 4662.17559999999, 4512.05900000001, 4364.68539999999, 4220.77720000001, 4082.67259999999, 3957.19519999999, 3842.15779999999, 3699.3328, 3583.01180000001, 3473.8964, 3338.66639999999, 3233.55559999999, 3117.799, 3008.111, 2909.69140000001, 2814.86499999999, 2719.46119999999, 2624.742, 2532.46979999999, 2444.7886, 2370.1868, 2272.45259999999, 2196.19260000001, 2117.90419999999, 2023.2972, 1969.76819999999, 1885.58979999999, 1833.2824, 1733.91200000001, 1682.54920000001, 1604.57980000001, 1556.11240000001, 1491.3064, 1421.71960000001, 1371.22899999999, 1322.1324, 1264.7892, 1196.23920000001, 
1143.8474, 1088.67240000001, 1073.60380000001, 1023.11660000001, 959.036400000012, 927.433199999999, 906.792799999996, 853.433599999989, 841.873800000001, 791.1054, 756.899999999994, 704.343200000003, 672.495599999995, 622.790399999998, 611.254799999995, 567.283200000005, 519.406599999988, 519.188400000014, 495.312800000014, 451.350799999986, 443.973399999988, 431.882199999993, 392.027000000002, 380.924200000009, 345.128999999986, 298.901400000002, 287.771999999997, 272.625, 247.253000000026, 222.490600000019, 223.590000000026, 196.407599999977, 176.425999999978, 134.725199999986, 132.4804, 110.445599999977, 86.7939999999944, 56.7038000000175, 64.915399999998, 38.3726000000024, 37.1606000000029, 46.170999999973, 49.1716000000015, 15.3362000000197, 6.71639999997569, -34.8185999999987, -39.4476000000141, 12.6830000000191, -12.3331999999937, -50.6565999999875, -59.9538000000175, -65.1054000000004, -70.7576000000117, -106.325200000021, -126.852200000023, -110.227599999984, -132.885999999999, -113.897200000007, -142.713800000027, -151.145399999979, -150.799200000009, -177.756200000003, -156.036399999983, -182.735199999996, -177.259399999981, -198.663600000029, -174.577600000019, -193.84580000001\n};\n\nconst double biasData_precision17[] = {\n    94541, 92848.811, 91174.019, 89517.558, 87879.9705, 86262.7565, 84663.5125, 83083.7435, 81521.7865, 79977.272, 78455.9465, 76950.219, 75465.432, 73994.152, 72546.71, 71115.2345, 69705.6765, 68314.937, 66944.2705, 65591.255, 64252.9485, 62938.016, 61636.8225, 60355.592, 59092.789, 57850.568, 56624.518, 55417.343, 54231.1415, 53067.387, 51903.526, 50774.649, 49657.6415, 48561.05, 47475.7575, 46410.159, 45364.852, 44327.053, 43318.4005, 42325.6165, 41348.4595, 40383.6265, 39436.77, 38509.502, 37594.035, 36695.939, 35818.6895, 34955.691, 34115.8095, 33293.949, 32465.0775, 31657.6715, 30877.2585, 30093.78, 29351.3695, 28594.1365, 27872.115, 27168.7465, 26477.076, 25774.541, 25106.5375, 24452.5135, 23815.5125, 23174.0655, 22555.2685, 
21960.2065, 21376.3555, 20785.1925, 20211.517, 19657.0725, 19141.6865, 18579.737, 18081.3955, 17578.995, 17073.44, 16608.335, 16119.911, 15651.266, 15194.583, 14749.0495, 14343.4835, 13925.639, 13504.509, 13099.3885, 12691.2855, 12328.018, 11969.0345, 11596.5145, 11245.6355, 10917.6575, 10580.9785, 10277.8605, 9926.58100000001, 9605.538, 9300.42950000003, 8989.97850000003, 8728.73249999998, 8448.3235, 8175.31050000002, 7898.98700000002, 7629.79100000003, 7413.76199999999, 7149.92300000001, 6921.12650000001, 6677.1545, 6443.28000000003, 6278.23450000002, 6014.20049999998, 5791.20299999998, 5605.78450000001, 5438.48800000001, 5234.2255, 5059.6825, 4887.43349999998, 4682.935, 4496.31099999999, 4322.52250000002, 4191.42499999999, 4021.24200000003, 3900.64799999999, 3762.84250000003, 3609.98050000001, 3502.29599999997, 3363.84250000003, 3206.54849999998, 3079.70000000001, 2971.42300000001, 2867.80349999998, 2727.08100000001, 2630.74900000001, 2496.6165, 2440.902, 2356.19150000002, 2235.58199999999, 2120.54149999999, 2012.25449999998, 1933.35600000003, 1820.93099999998, 1761.54800000001, 1663.09350000002, 1578.84600000002, 1509.48149999999, 1427.3345, 1379.56150000001, 1306.68099999998, 1212.63449999999, 1084.17300000001, 1124.16450000001, 1060.69949999999, 1007.48849999998, 941.194499999983, 879.880500000028, 836.007500000007, 782.802000000025, 748.385499999975, 647.991500000004, 626.730500000005, 570.776000000013, 484.000500000024, 513.98550000001, 418.985499999952, 386.996999999974, 370.026500000036, 355.496999999974, 356.731499999994, 255.92200000002, 259.094000000041, 205.434499999974, 165.374500000034, 197.347500000033, 95.718499999959, 67.6165000000037, 54.6970000000438, 31.7395000000251, -15.8784999999916, 8.42500000004657, -26.3754999999655, -118.425500000012, -66.6629999999423, -42.9745000000112, -107.364999999991, -189.839000000036, -162.611499999999, -164.964999999967, -189.079999999958, -223.931499999948, -235.329999999958, -269.639500000048, 
-249.087999999989, -206.475499999942, -283.04449999996, -290.667000000016, -304.561499999953, -336.784499999951, -380.386500000022, -283.280499999993, -364.533000000054, -389.059499999974, -364.454000000027, -415.748000000021, -417.155000000028\n};\n\nconst double biasData_precision18[] = {\n    189083, 185696.913, 182348.774, 179035.946, 175762.762, 172526.444, 169329.754, 166166.099, 163043.269, 159958.91, 156907.912, 153906.845, 150924.199, 147996.568, 145093.457, 142239.233, 139421.475, 136632.27, 133889.588, 131174.2, 128511.619, 125868.621, 123265.385, 120721.061, 118181.769, 115709.456, 113252.446, 110840.198, 108465.099, 106126.164, 103823.469, 101556.618, 99308.004, 97124.508, 94937.803, 92833.731, 90745.061, 88677.627, 86617.47, 84650.442, 82697.833, 80769.132, 78879.629, 77014.432, 75215.626, 73384.587, 71652.482, 69895.93, 68209.301, 66553.669, 64921.981, 63310.323, 61742.115, 60205.018, 58698.658, 57190.657, 55760.865, 54331.169, 52908.167, 51550.273, 50225.254, 48922.421, 47614.533, 46362.049, 45098.569, 43926.083, 42736.03, 41593.473, 40425.26, 39316.237, 38243.651, 37170.617, 36114.609, 35084.19, 34117.233, 33206.509, 32231.505, 31318.728, 30403.404, 29540.0550000001, 28679.236, 27825.862, 26965.216, 26179.148, 25462.08, 24645.952, 23922.523, 23198.144, 22529.128, 21762.4179999999, 21134.779, 20459.117, 19840.818, 19187.04, 18636.3689999999, 17982.831, 17439.7389999999, 16874.547, 16358.2169999999, 15835.684, 15352.914, 14823.681, 14329.313, 13816.897, 13342.874, 12880.882, 12491.648, 12021.254, 11625.392, 11293.7610000001, 10813.697, 10456.209, 10099.074, 9755.39000000001, 9393.18500000006, 9047.57900000003, 8657.98499999999, 8395.85900000005, 8033, 7736.95900000003, 7430.59699999995, 7258.47699999996, 6924.58200000005, 6691.29399999999, 6357.92500000005, 6202.05700000003, 5921.19700000004, 5628.28399999999, 5404.96799999999, 5226.71100000001, 4990.75600000005, 4799.77399999998, 4622.93099999998, 4472.478, 4171.78700000001, 3957.46299999999, 
3868.95200000005, 3691.14300000004, 3474.63100000005, 3341.67200000002, 3109.14000000001, 3071.97400000005, 2796.40399999998, 2756.17799999996, 2611.46999999997, 2471.93000000005, 2382.26399999997, 2209.22400000005, 2142.28399999999, 2013.96100000001, 1911.18999999994, 1818.27099999995, 1668.47900000005, 1519.65800000005, 1469.67599999998, 1367.13800000004, 1248.52899999998, 1181.23600000003, 1022.71900000004, 1088.20700000005, 959.03600000008, 876.095999999903, 791.183999999892, 703.337000000058, 731.949999999953, 586.86400000006, 526.024999999907, 323.004999999888, 320.448000000091, 340.672999999952, 309.638999999966, 216.601999999955, 102.922999999952, 19.2399999999907, -0.114000000059605, -32.6240000000689, -89.3179999999702, -153.497999999905, -64.2970000000205, -143.695999999996, -259.497999999905, -253.017999999924, -213.948000000091, -397.590000000084, -434.006000000052, -403.475000000093, -297.958000000101, -404.317000000039, -528.898999999976, -506.621000000043, -513.205000000075, -479.351000000024, -596.139999999898, -527.016999999993, -664.681000000099, -680.306000000099, -704.050000000047, -850.486000000034, -757.43200000003, -713.308999999892\n};\n\n\n#endif /* HYPERLOGLOGBIAS_H_ */\n"
  },
  {
    "path": "hyperloglogplus.h",
    "content": "/*\n * hyperloglogplus.h\n *\n * Implementation of HyperLogLog++ algorithm described by Stefan Heule et al.\n *\n *  Created on: Apr 25, 2015\n *      Author: fbreitwieser\n */\n\n#ifndef HYPERLOGLOGPLUS_H_\n#define HYPERLOGLOGPLUS_H_\n\n#include<set>\n#include<vector>\n#include<stdexcept>\n#include<iostream>\n#include<fstream>\n#include<math.h>    //log\n#include<algorithm> //vector.count\n#include<bitset>\n\n#include \"hyperloglogbias.h\"\n#include \"third_party/MurmurHash3.cpp\"\n#include \"assert_helpers.h\"\n\nusing namespace std;\n\n//#define HLL_DEBUG\n//#define NDEBUG\n//#define NDEBUG2\n#define arr_len(a) (a + sizeof a / sizeof a[0])\n\n// experimentally determined threshold values for  p - 4\nstatic const uint32_t threshold[] = {10, 20, 40, 80, 220, 400, 900, 1800, 3100,\n\t\t\t\t\t\t\t  6500, 11500, 20000, 50000, 120000, 350000};\n\n\n///////////////////////\n\n//\n/**\n * gives the estimated cardinality for m bins, v of which are non-zero\n * @param m number of bins in the matrix\n * @param v number of non-zero bins\n * @return\n */\ndouble linearCounting(uint32_t m, uint32_t v) {\n\tif (v > m) {\n\t    throw std::invalid_argument(\"number of v should not be greater than m\");\n\t}\n\tdouble fm = double(m);\n\treturn fm * log(fm/double(v));\n}\n\n/**\n  * from Numerical Recipes, 3rd Edition, p 352\n  * Returns hash of u as a 64-bit integer.\n  *\n*/\ninline uint64_t ranhash (uint64_t u) {\n  uint64_t v = u * 3935559000370003845 + 2691343689449507681;\n\n  v ^= v >> 21; v ^= v << 37; v ^= v >>  4;\n\n  v *= 4768777513237032717;\n\n  v ^= v << 20; v ^= v >> 41; v ^= v <<  5;\n\n  return v;\n}\n\ninline uint64_t murmurhash3_finalizer (uint64_t key)  {\n\tkey += 1; // murmurhash returns a hash value of 0 for the key 0 - avoid that.\n\tkey ^= key >> 33;\n\tkey *= 0xff51afd7ed558ccd;\n\tkey ^= key >> 33;\n\tkey *= 0xc4ceb9fe1a85ec53;\n\tkey ^= key >> 33;\n\treturn key;\n}\n\n/**\n * Bias correction factors for specific m's\n * @param m\n * 
@return\n */\ndouble alpha(uint32_t m)  {\n\tswitch (m) {\n\tcase 16: return 0.673;\n\tcase 32: return 0.697;\n\tcase 64: return 0.709;\n\t}\n\n\t// m >= 128\n\treturn 0.7213 / (1 + 1.079/double(m));\n}\n\n/**\n * calculate the raw estimate as harmonic mean of the ranks in the register\n * @param array\n * @return\n */\ndouble calculateEstimate(vector<uint8_t> array) {\n\tdouble inverseSum = 0.0;\n\tfor (size_t i = 0; i < array.size(); ++i) {\n\t\t// TODO: pre-calculate the power calculation\n\t\tinverseSum += pow(2,-array[i]);\n\t}\n\treturn alpha(array.size()) * double(array.size() * array.size()) * 1 / inverseSum;\n}\n\nuint32_t countZeros(vector<uint8_t> s) {\n\treturn (uint32_t)count(s.begin(), s.end(), 0);\n}\n\n/**\n * Extract bits (from uint32_t or uint64_t) using LSB 0 numbering from hi to lo, including lo\n * @param bits\n * @param hi\n * @param lo\n * @return\n */\ntemplate<typename T>\nT extractBits(T value, uint8_t hi, uint8_t lo, bool shift_left = false) {\n\n    // create a bitmask:\n    //            (T(1) << (hi - lo)                 a 1 at the position (hi - lo)\n    //           ((T(1) << (hi - lo) - 1)              1's from position 0 to position (hi-lo-1)\n    //          (((T(1) << (hi - lo)) - 1) << lo)      1's from position lo to position hi\n\n\t// The T(1) is required to not cause overflow on 32bit machines\n\t// TODO: consider creating a bitmask only once in the beginning\n\tT bitmask = (((T(1) << (hi - lo)) - 1) << lo);\n    T result = value & bitmask;\n\n    if (!shift_left) {\n        // shift resulting bits to the right\n        result = result >> lo;\n    } else {\n        // shift resulting bits to the left\n        result = result << (sizeof(T)*8 - hi);\n    }\n    return result;\t\n}\n\ntemplate<typename T>\nT extractBits(T bits, uint8_t hi) {\n    // create a bitmask for first hi bits (LSB 0 numbering)\n\tT bitmask = T(-1) << (sizeof(T)*8 - hi);\n\n\treturn (bits & bitmask);\n}\n\n// functions for counting the number of leading 
0-bits (clz)\n//           and counting the number of trailing 0-bits (ctz)\n//#ifdef __GNUC__\n\n// TODO: switch between builtin clz and 64_clz based on architecture\n//#define clz(x) __builtin_clz(x)\n#if 0\nstatic int clz_manual(uint64_t x)\n{\n  // This uses a binary search (counting down) algorithm from Hacker's Delight.\n   uint64_t y;\n   int n = 64;\n   y = x >>32;  if (y != 0) {n -= 32;  x = y;}\n   y = x >>16;  if (y != 0) {n -= 16;  x = y;}\n   y = x >> 8;  if (y != 0) {n -=  8;  x = y;}\n   y = x >> 4;  if (y != 0) {n -=  4;  x = y;}\n   y = x >> 2;  if (y != 0) {n -=  2;  x = y;}\n   y = x >> 1;  if (y != 0) return n - 2;\n   return n - x;\n}\n#endif\n\ninline uint32_t clz(const uint32_t x) {\n\treturn __builtin_clz(x);\n}\n\ninline uint32_t clz(const uint64_t x) {\n    uint32_t u32 = (x >> 32);\n    uint32_t result = u32 ? __builtin_clz(u32) : 32;\n    if (result == 32) {\n        u32 = x & 0xFFFFFFFFUL;\n        result += (u32 ? __builtin_clz(u32) : 32);\n    }\n    return result;\n}\n//#else\n\nuint32_t clz_log2(const uint64_t w) {\n\treturn 63 - floor(log2(w));\n}\n//#endif\n\n\n// TODO: the sparse list may be encoded with variable length encoding\n//   see Heule et al., section 5.3.2\n// Also, using sets might give a larger overhead as each insertion costs more\n//  consider using vector and sort/unique when merging.\ntypedef set<uint32_t> SparseListType;\ntypedef uint64_t HashSize;\n\n/**\n * HyperLogLogPlusMinus class\n * typename T corresponds to the hash size - usually either uint32_t or uint64_t (implemented for uint64_t)\n */\n\ntypedef uint64_t T_KEY;\ntemplate <typename T_KEY>\nclass HyperLogLogPlusMinus {\n\nprivate:\n\n\tvector<uint8_t> M;  // registers (M) of size m\n\tuint8_t p;            // precision\n\tuint32_t m;           // number of registers\n\tbool sparse;          // sparse representation of the data?\n\tSparseListType sparseList; // TODO: use a compressed list instead\n\n\t// vectors containing data for bias 
correction\n\tvector<vector<double> > rawEstimateData; // TODO: make this static\n\tvector<vector<double> > biasData;\n\n\t// sparse versions of p and m\n\tstatic const uint8_t  pPrime = 25; // precision when using a sparse representation\n\t                                   // fixed to 25, because 25 + 6 bits for rank + 1 flag bit = 32\n\tstatic const uint32_t mPrime = 1 << (pPrime -1); // 2^(pPrime-1), i.e. 2^24\n\n\npublic:\n\n\t~HyperLogLogPlusMinus() {};\n\n\t/**\n\t * Create new HyperLogLogPlusMinus counter\n\t * @param precision\n\t * @param sparse\n\t */\n\tHyperLogLogPlusMinus(uint8_t precision=10, bool sparse=true):p(precision),sparse(sparse) {\n\t\tif (precision > 18 || precision < 4) {\n\t        throw std::invalid_argument(\"precision (number of registers = 2^precision) must be between 4 and 18\");\n\t\t}\n\n\t\tthis->m = 1 << precision;\n\n\t\tif (sparse) {\n\t\t\tthis->sparseList = SparseListType(); // TODO: if SparseListType is changed, initialize with appropriate size\n\t\t} else {\n\t\t\tthis->M = vector<uint8_t>(m);\n\t\t}\n\t}\n\n\t/**\n\t * Add a new item to the counter.\n\t * @param item\n\t */\n\tvoid add(T_KEY item) {\n\t\tadd(item, sizeof(T_KEY));\n\t}\n\n\t/**\n\t * Add a new item to the counter.\n\t * @param item\n\t * @param size  size of item\n\t */\n\tvoid add(T_KEY item, size_t size) {\n\n\t\t// compute hash for item\n\t\tHashSize hash_value = murmurhash3_finalizer(item);\n\n#ifdef HLL_DEBUG\n\t\tcerr << \"Value: \" << item << \"; hash(value): \" << hash_value << endl;\n\t\tcerr << bitset<64>(hash_value) << endl;\n#endif\n\n\t\tif (sparse) {\n\t\t\t// sparse mode: put the encoded hash into sparse list\n\t\t\tuint32_t encoded_hash_value = encodeHashIn32Bit(hash_value);\n\t\t\tthis->sparseList.insert(encoded_hash_value);\n\n#ifdef HLL_DEBUG\n\t\t\tidx_n_rank ir = getIndexAndRankFromEncodedHash(encoded_hash_value);\n\t\t\tassert_eq(ir.idx,get_index(hash_value, p));\n\t\t\tassert_eq(ir.rank, get_rank(hash_value, p));\n#endif\n\n\t\t\t// if the 
sparseList is too large, switch to normal (register) representation\n\t\t\tif (this->sparseList.size() > this->m) { // TODO: is the size of m correct?\n\t\t\t\tswitchToNormalRepresentation();\n\t\t\t}\n\t\t} else {\n\t\t\t// normal mode\n\t\t\t// take first p bits as index  {x63,...,x64-p}\n\t\t\tuint32_t idx = get_index(hash_value, p);\n\t\t\t// shift those p values off, and count leading zeros of the remaining string {x63-p,...,x0}\n\t\t\tuint8_t rank = get_rank(hash_value, p);\n\n\t\t\t// update the register if current rank is bigger\n\t\t\tif (rank > this->M[idx]) {\n\t\t\t\tthis->M[idx] = rank;\n\t\t\t}\n\t\t}\n\t}\n\n\tvoid add(vector<T_KEY> words) {\n\t\tfor(size_t i = 0; i < words.size(); ++i) {\n\t\t\tthis->add(words[i]);\n\t\t}\n\t}\n\n\t/**\n\t * Reset to its initial state.\n\t */\n\tvoid reset() {\n\t\tthis->sparse = true;\n\t\tthis->sparseList.clear();  // \n\t\tthis->M.clear();\n\t}\n\n\t/**\n\t * Convert from sparse representation (using sparseList) to normal (using registers M)\n\t */\n\tvoid switchToNormalRepresentation() {\n#ifdef HLL_DEBUG\n\t\tcerr << \"switching to normal representation\" << endl;\n\t\tcerr << \" est before: \" << cardinality(true) << endl;\n#endif\n\t\tthis->sparse = false;\n\t\tthis->M = vector<uint8_t>(this->m);\n\t\tif (sparseList.size() > 0) { // TODO: do I need to check this here?\n\t\t\taddToRegisters(this->sparseList);\n\t\t\tthis->sparseList.clear();\n\t\t}\n#ifdef HLL_DEBUG\n\t\tcerr << \" est after: \" << cardinality(true) << endl;\n#endif\n\t}\n\n\t/**\n\t * add sparseList to the registers of M\n\t */\n\tvoid addToRegisters(const SparseListType &sparseList) {\n\t\tif (sparseList.size() == 0) {\n\t\t\treturn;\n\t\t}\n\t\tfor (SparseListType::const_iterator encoded_hash_value_ptr = sparseList.begin(); encoded_hash_value_ptr != sparseList.end(); ++encoded_hash_value_ptr) {\n\n\t\t\tidx_n_rank ir = getIndexAndRankFromEncodedHash(*encoded_hash_value_ptr);\n\n\t\t\tassert_lt(ir.idx,M.size());\n\t\t\tif (ir.rank > 
this->M[ir.idx]) {\n\t\t\t\tthis->M[ir.idx] = ir.rank;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Merge another HyperLogLogPlusMinus into this. Converts to normal representation\n\t * @param other\n\t */\n\tvoid merge(const HyperLogLogPlusMinus* other) {\n\t\tif (this->p != other->p) {\n\t\t\tthrow std::invalid_argument(\"precisions must be equal\");\n\t\t}\n\n\t\tif (this->sparse && other->sparse) {\n\t\t\tif (this->sparseList.size()+other->sparseList.size() > this->m) {\n\t\t\t\tswitchToNormalRepresentation();\n\t\t\t\taddToRegisters(other->sparseList);\n\t\t\t} else {\n\t\t\t\tthis->sparseList.insert(other->sparseList.begin(),other->sparseList.end());\n\t\t\t}\n\t\t} else if (other->sparse) {\n\t\t\t// other is sparse, but this is not\n\t\t\taddToRegisters(other->sparseList);\n\t\t} else {\n\t\t\tif (this->sparse) {\n\t\t\t\tswitchToNormalRepresentation();\n\t\t\t}\n\n\t\t\t// merge registers\n\t\t\tfor (size_t i = 0; i < other->M.size(); ++i) {\n\t\t\t\tif (other->M[i] > this->M[i]) {\n\t\t\t\t\tthis->M[i] = other->M[i];\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t *\n\t * @return cardinality estimate\n\t */\n\tuint64_t cardinality(bool verbose=true) {\n\t\tif (sparse) {\n\t\t\t// if we are still 'sparse', then use linear counting, which is more\n\t\t\t//  accurate for low cardinalities, and use increased precision pPrime\n\t\t\treturn uint64_t(linearCounting(mPrime, mPrime-uint32_t(sparseList.size())));\n\t\t}\n\n\t\t// initialize bias correction data\n\t\tif (rawEstimateData.empty()) { initRawEstimateData(); }\n\t\tif (biasData.empty())        { initBiasData(); }\n\n\t\t// calculate raw estimate on registers\n\t\t//double est = alpha(m) * harmonicMean(M, m);\n\t\tdouble est = calculateEstimate(M);\n\n\t\t// correct for biases if estimate is smaller than 5m\n\t\tif (est <= double(m)*5.0) {\n\t\t\test -= getEstimateBias(est);\n\t\t}\n\n\t\tuint32_t v = countZeros(M);\n\t\tif (v > 2) {\n\t\t\t// calculate linear counting (lc) estimate if there are more than 2 zeros in 
the matrix\n\t\t\tdouble lc_estimate = linearCounting(m, v);\n\n\t\t\t// check if the lc estimate is below the threshold\n\t\t\tif (lc_estimate <= double(threshold[p-4])) {\n\t\t\t\tif (lc_estimate < 0) { throw; }\n\t\t\t\t// return lc estimate of cardinality\n\t\t\t\treturn lc_estimate;\n\t\t\t}\n\t\t\treturn lc_estimate; // always use lc_estimate when available\n\t\t}\n\n\t\t// return bias-corrected hyperloglog estimate of cardinality\n\t\treturn uint64_t(est);\n\t}\n\nprivate:\n\n    uint8_t rank(HashSize x, uint8_t b) {\n        uint8_t v = 1;\n        while (v <= b && !(x & 0x80000000)) {\n            v++;\n            x <<= 1;\n        }\n        return v;\n    }\n\n    template<typename T> inline uint32_t get_index(const T hash_value, const uint8_t p, const uint8_t size) const {\n    \t// take first p bits as index  {x63,...,x64-p}\n    \tassert_lt(p,size);\n    \tuint32_t idx = hash_value >> (size - p);\n    \treturn idx;\n    }\n\n    inline uint32_t get_index(const uint64_t hash_value, const uint8_t p) const {\n        return get_index(hash_value, p, 64);\n    }\n\n    inline uint32_t get_index(const uint32_t hash_value, const uint8_t p) const {\n    \treturn get_index(hash_value, p, 32);\n    }\n\n    template<typename T> inline\n\tT get_trailing_ones(const uint8_t p) const {\n    \treturn (T(1) << p ) - 1;\n    }\n\n    template<typename T> inline\n    uint8_t get_rank(const T hash_value, const uint8_t p) const {\n    \t// shift p values off, and count leading zeros of the remaining string {x63-p,...,x0}\n    \tT_KEY rank_bits = (hash_value << p | get_trailing_ones<T>(p));\n#ifdef HLL_DEBUG\n    \tcerr << \"rank bits: \" << bitset<32>(rank_bits) << endl;\n#endif\n\n    \tuint8_t rank_val = (uint8_t) (clz(rank_bits)) + 1;\n    \tassert_leq(rank_val,64-p+1);\n    \treturn rank_val;\n    }\n\n\tvoid initRawEstimateData() {\n\t    rawEstimateData = vector<vector<double> >();\n\n\t    
rawEstimateData.push_back(vector<double>(rawEstimateData_precision4,arr_len(rawEstimateData_precision4)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision5,arr_len(rawEstimateData_precision5)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision6,arr_len(rawEstimateData_precision6)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision7,arr_len(rawEstimateData_precision7)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision8,arr_len(rawEstimateData_precision8)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision9,arr_len(rawEstimateData_precision9)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision10,arr_len(rawEstimateData_precision10)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision11,arr_len(rawEstimateData_precision11)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision12,arr_len(rawEstimateData_precision12)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision13,arr_len(rawEstimateData_precision13)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision14,arr_len(rawEstimateData_precision14)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision15,arr_len(rawEstimateData_precision15)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision16,arr_len(rawEstimateData_precision16)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision17,arr_len(rawEstimateData_precision17)));\n\t    rawEstimateData.push_back(vector<double>(rawEstimateData_precision18,arr_len(rawEstimateData_precision18)));\n\n\t}\n\n\tvoid initBiasData() {\n\t\tbiasData = vector<vector<double> 
>();\n\n\t\tbiasData.push_back(vector<double>(biasData_precision4,arr_len(biasData_precision4)));\n\t\tbiasData.push_back(vector<double>(biasData_precision5,arr_len(biasData_precision5)));\n\t\tbiasData.push_back(vector<double>(biasData_precision6,arr_len(biasData_precision6)));\n\t\tbiasData.push_back(vector<double>(biasData_precision7,arr_len(biasData_precision7)));\n\t\tbiasData.push_back(vector<double>(biasData_precision8,arr_len(biasData_precision8)));\n\t\tbiasData.push_back(vector<double>(biasData_precision9,arr_len(biasData_precision9)));\n\t\tbiasData.push_back(vector<double>(biasData_precision10,arr_len(biasData_precision10)));\n\t\tbiasData.push_back(vector<double>(biasData_precision11,arr_len(biasData_precision11)));\n\t\tbiasData.push_back(vector<double>(biasData_precision12,arr_len(biasData_precision12)));\n\t\tbiasData.push_back(vector<double>(biasData_precision13,arr_len(biasData_precision13)));\n\t\tbiasData.push_back(vector<double>(biasData_precision14,arr_len(biasData_precision14)));\n\t\tbiasData.push_back(vector<double>(biasData_precision15,arr_len(biasData_precision15)));\n\t\tbiasData.push_back(vector<double>(biasData_precision16,arr_len(biasData_precision16)));\n\t\tbiasData.push_back(vector<double>(biasData_precision17,arr_len(biasData_precision17)));\n\t\tbiasData.push_back(vector<double>(biasData_precision18,arr_len(biasData_precision18)));\n\t}\n\n\t/**\n\t * Estimate the bias using empirically determined values.\n\t * Uses weighted average of the two cells between which the estimate falls.\n\t * TODO: Check if nearest neighbor average gives better values, as proposed in the paper\n\t * @param est\n\t * @return correction value for\n\t */\n\tdouble getEstimateBias(double estimate) {\n\t\tvector<double> rawEstimateTable = rawEstimateData[p-4];\n\t\tvector<double> biasTable = biasData[p-4];\n\t\n\t\t// check if estimate is lower than first entry, or larger than last\n\t\tif (rawEstimateTable.front() >= estimate) { return 
rawEstimateTable.front() - biasTable.front(); }\n\t\tif (rawEstimateTable.back()  <= estimate) { return rawEstimateTable.back() - biasTable.back(); }\n\t\n\t\t// get iterator to first element that is not smaller than estimate\n\t\tvector<double>::const_iterator it = lower_bound(rawEstimateTable.begin(),rawEstimateTable.end(),estimate);\n\t\tsize_t pos = it - rawEstimateTable.begin();\n\n\t\tdouble e1 = rawEstimateTable[pos-1];\n\t\tdouble e2 = rawEstimateTable[pos];\n\t\n\t\tdouble c = (estimate - e1) / (e2 - e1);\n\n\t\treturn biasTable[pos-1]*(1-c) + biasTable[pos]*c;\n\t}\n\t\n\n\t/**\n\t * Encode the 64-bit hash value as a 32-bit integer, to be used in the sparse representation.\n\t *\n\t * Difference from the algorithm described in the paper:\n\t * The index is always in the p most significant bits\n\t *\n\t * see section 5.3 in Heule et al.\n\t * @param hash_value the hash bits\n\t * @return encoded hash value\n\t */\n\tuint32_t encodeHashIn32Bit(uint64_t hash_value) {\n\t\t// extract first pPrime bits, and shift them onto a 32-bit integer\n\t\tuint32_t idx = (uint32_t)(extractBits(hash_value,pPrime) >> 32);\n\n#ifdef HLL_DEBUG\n\t\tcerr << \"value:  \" << bitset<64>(hash_value) << endl;\n        cerr << \"index: \" << std::bitset<32>(idx) << \" ( bits from 64 to \" << 64-pPrime << \"; \" << idx << \")\" << endl;\n#endif\n\n\t\t// are the bits {63-p, ..., 63-p'} all 0?\n\t\tif (extractBits(hash_value, 64-this->p, 64-pPrime) == 0) {\n\t\t\t// compute the additional rank (minimum rank is already p'-p)\n\t\t\t// the maximal size will be below 2^6=64. 
We thus combine the 25 bits of the index with 6 bits for the rank, and one bit as flag\n\t\t\tuint8_t additional_rank = get_rank(hash_value, pPrime); // this is rank - (p'-p), as we know that positions p'...p are 0\n\t\t\treturn idx | uint32_t(additional_rank<<1) | 1;\n\t\t} else {\n\t\t\t// else, return the idx, only - it has enough length to calculate the rank (left-shifted, last bit = 0)\n\t\t\tassert_eq((idx & 1),0);\n\t\t\treturn idx;\n\t\t}\n\t}\n\n\n\t/**\n\t * struct holding the index and rank/rho of an entry\n\t */\n\tstruct idx_n_rank {\n\t\tuint32_t idx;\n\t\tuint8_t rank;\n\t\tidx_n_rank(uint32_t _idx, uint8_t _rank) : idx(_idx), rank(_rank) {}\n\t};\n\n\t//\n\t//\n\t/**\n\t * Decode a hash from the sparse representation.\n\t * Returns the index and number of leading zeros (nlz) with precision p stored in k\n\t * @param k the hash bits\n\t * @return index and rank in non-sparse format\n\t */\n\tidx_n_rank getIndexAndRankFromEncodedHash(const uint32_t encoded_hash_value) const  {\n\n\t\t// difference to paper: Index can be recovered in the same way for pPrime and normally encoded hashes\n\t\tuint32_t idx = get_index(encoded_hash_value, p);\n\t\tuint8_t rank_val;\n\n\t\t// check if the last bit is 1\n\t\tif ( (encoded_hash_value & 1) == 1) {\n\t\t\t// if yes: the hash was stored with higher precision, bits p to pPrime were 0\n\t\t\tuint8_t additional_rank = pPrime - p;\n\t\t\trank_val = additional_rank + extractBits(encoded_hash_value, 7, 1);\n\t\t} else {\n\t\t\trank_val = get_rank(encoded_hash_value,p);\n\n\t\t\t// clz counts 64 bit only, it seems\n\t\t\tif (rank_val > 32)\n\t\t\t\trank_val -= 32;\n\t\t}\n\n\t\treturn(idx_n_rank(idx,rank_val));\n\t}\n\n};\n\n\n\n\n#endif /* HYPERLOGLOGPLUS_H_ */\n"
  },
  {
    "path": "indices/Makefile",
    "content": "#\n# Makefile\n# fbreitwieser, 2016-01-29 13:00\n#\n\nSHELL := /bin/bash\n\nTHREADS?=1\nKEEP_FILES?=0\n\nget_ref_file_names = $(addprefix $(REFERENCE_SEQUENCES_DIR)/, $(addsuffix $(1), \\\n\t$(addprefix all-,$(COMPLETE_GENOMES)) \\\n\t$(addprefix all-,$(addsuffix -chromosome_level,$(CHROMOSOME_LEVEL_GENOMES))) \\\n\t$(addprefix all-,$(addsuffix -any_level,$(ANY_LEVEL_GENOMES))) \\\n\t$(addprefix mammalian-reference-,$(MAMMALIAN_TAXIDS)) \\\n\t$(addprefix all-compressed-,$(COMPLETE_GENOMES_COMPRESSED)) \\\n\t$(if $(INCLUDE_CONTAMINANTS),contaminants)))\n\nDL_DIR=downloaded-seq\nTMP_DIR?=tmp_$(IDX_NAME)\nTAXID_SUFFIX:=.map\nREFERENCE_SEQUENCES_DIR:=reference-sequences\n\n.PHONY: index index-name index-size .path-ok .dustmasker-ok\n\ndefine USAGE\n\nMakefile to create common indices to use with Centrifuge.\n\n  make [OPTIONS] TARGET\n\nOPTIONS:\n    THREADS=n          Number of threads for downloading, compression and\n                       index building\n\nSTANDARD TARGETS:\n\n    p_compressed        Download all bacteria genomes from RefSeq,\n                        and compresses them at the species level\n\n    p_compressed+h+v    p_compressed + human genome and transcripts,\n                        contaminant sequences from UniVec and EmVec,\n                        and all viral genomes\n\n    p+h+v               As above, but with uncompressed bacterial genomes\n\n\tp+v\n\n\tv\n\nAlternatively, a IDX_NAME and one or more genomes may be specified as\noptions to build a custom database.\n\nEXTENDED OPTIONS:\n\tCOMPLETE_GENOMES=s\n\tCOMPLETE_GENOMES_COMPRESSED=s\n\tMAMMALIAN_TAXIDS=i\n\tINCLUDE_CONAMINANTS=1\n\tDONT_DUSTMASK=1\n\tIDX_NAME=s\n\nEXAMPLES:\n\t# Make an index with all complete bacterial and archaeal genomes, and compress\n\t# the bacterial genomes to the species level\n\tmake p_compressed\n\n\t# same as:\n\tmake COMPLETE_GENOMES=archaea COMPLETE_GENOMES_COMPRESSED=bacteria IDX_NAME=p_compressed\n\n\t# Make an index with just the 
human genome\n\tmake IDX_NAME=h MAMMALIAN_TAXIDS=9606\n\n\t# All archaeal genomes and contaminant sequences from UniVec and EmVec\n\tmake IDX_NAME=a COMPLETE_GENOMES=archaea  INCLUDE_CONTAMINANTS=1\n\nendef\nexport USAGE\n\n###################################################################################################\nifndef IDX_NAME\n\nall:\n\t@echo \"$$USAGE\"\n\nIDX_NAME?=$(shell basename $(shell dirname $(abspath $(lastword $(MAKEFILE_LIST)))))\n\nINDICES=p+h+v p+v v p p_compressed p_compressed+h+v refseq_microbial refseq_full nt\n\np+h+v: export ANY_LEVEL_GENOMES:=viral\np+h+v: export COMPLETE_GENOMES:=archaea bacteria\np+h+v: export MAMMALIAN_TAXIDS:=9606\np+h+v: export INCLUDE_CONTAMINANTS:=1\np+h+v: export IDX_NAME:=p+h+v\n\np+v: export ANY_LEVEL_GENOMES:=viral\np+v: export COMPLETE_GENOMES:=archaea bacteria\np+v: export INCLUDE_CONTAMINANTS:=1\np+v: export IDX_NAME:=p+v\n\nv: export ANY_LEVEL_GENOMES:=viral\nv: export IDX_NAME:=v\n\np: export COMPLETE_GENOMES:=archaea bacteria\np: export IDX_NAME:=p\n\np_compressed: export COMPLETE_GENOMES_COMPRESSED:=archaea bacteria\np_compressed: export IDX_NAME:=p_compressed\n\np_compressed+h+v: export ANY_LEVEL_GENOMES:=viral\np_compressed+h+v: export COMPLETE_GENOMES_COMPRESSED:=archaea bacteria\np_compressed+h+v: export MAMMALIAN_TAXIDS:=9606\np_compressed+h+v: export INCLUDE_CONTAMINANTS:=1\np_compressed+h+v: export IDX_NAME:=p_compressed+h+v\n\nrefseq_microbial: export COMPLETE_GENOMES:=archaea bacteria fungi protozoa\nrefseq_microbial: export CHROMOSOME_LEVEL_GENOMES:=$(COMPLETE_GENOMES)\nrefseq_microbial: export ANY_LEVEL_GENOMES:=viral\n##refseq_microbial: export SMALL_GENOMES:=mitochondrion plasmid plastid # TODO\nrefseq_microbial: export MAMMALIAN_TAXIDS:=9606 10090\nrefseq_microbial: export INCLUDE_CONTAMINANTS:=1\nrefseq_microbial: export IDX_NAME:=refseq_microbial\nrefseq_microbial: export CF_BUILD_OPTS+=--ftabchars 14\n\nrefseq_full: export COMPLETE_GENOMES:=archaea bacteria fungi invertebrate plant 
protozoa vertebrate_mammalian vertebrate_other viral\nrefseq_full: export CHROMOSOME_LEVEL_GENOMES:=$(COMPLETE_GENOMES)\nrefseq_full: export ANY_LEVEL_GENOMES:=viral\nrefseq_full: export SMALL_GENOMES:=mitochondrion plasmid plastid\nrefseq_full: export MAMMALIAN_TAXIDS:=9606 10090\nrefseq_full: export INCLUDE_CONTAMINANTS:=1\nrefseq_full: export IDX_NAME:=refseq_full\n\n\nnt: export IDX_NAME:=nt\n\n$(INDICES):\n\t@echo Making: $@: $(IDX_NAME)\n\t$(MAKE) -f $(THIS_FILE) IDX_NAME=$(IDX_NAME)\n\n####################################################################################################\nelse ## IDX_NAME is defined\n\nDONT_DUSTMASK=\nTAXONOMY_DOWNLOAD_OPTS?=\nREFERENCE_SEQUENCES=$(call get_ref_file_names,.fna)\nTAXID_MAPS=$(call get_ref_file_names,$(TAXID_SUFFIX))\nCF_BUILD_OPTS?=\n\nifeq (nt,$(IDX_NAME))\nifeq ($(strip $(DONT_DUSTMASK)),)\nREFERENCE_SEQUENCES+=nt-dusted.fna\nelse\nREFERENCE_SEQUENCES+=nt-sorted.fna\nendif\nTAXID_MAPS+=nt.map\nCF_BUILD_OPTS+=--ftabchars=14\nendif\n\n\nifeq ($(strip $(REFERENCE_SEQUENCES)),)\n$(error REFERENCE_SEQUENCES is not set - specify at least one of COMPLETE_GENOMES, \\\nCOMPLETE_GENOMES_COMPRESSED, or MAMMALIAN_TAXIDS with the IDX_NAME ($(IDX_NAME)))\nendif\n\nSIZE_TABLES=$(addprefix $(REFERENCE_SEQUENCES_DIR)/all-compressed-,$(addsuffix .size,$(COMPLETE_GENOMES_COMPRESSED)))\nifneq ($(strip $(COMPLETE_GENOMES_COMPRESSED)),)\nCF_BUILD_OPTS+=--size-table <(cat $(SIZE_TABLES))\nendif\n\nCF_DOWNLOAD_OPTS?=\nCF_COMPRESS_OPTS?=\nifeq ($(strip $(DONT_DUSTMASK)),)\nCF_DOWNLOAD_OPTS+=-m\nelse\nCF_COMPRESS_OPTS+=--noDustmasker\nendif\n\nall: $(IDX_NAME).1.cf\n\n# vim:ft=make\nendif ## ifndef IDX_NAME\n\n$(REFERENCE_SEQUENCES_DIR):\n\tmkdir -p $(REFERENCE_SEQUENCES_DIR)\n\n#$(TAXID_MAPS): | $(REFERENCE_SEQUENCES_DIR)\n#\trm $(patsubst %$(TAXID_SUFFIX),%.fna, $@)\n#\t$(MAKE) -f $(THIS_FILE) $(patsubst %$(TAXID_SUFFIX),%.fna, $@)\n\nnt.gz:\n\tcurl -o nt.gz ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz\n\nnt.fna: nt.gz\n\tgunzip -c 
nt.gz > nt.fna\n\naccession2taxid/nucl_gb.accession2taxid.gz:\n\tmkdir -p accession2taxid\n\tcurl ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz > accession2taxid/nucl_gb.accession2taxid.gz\n\naccession2taxid/nucl_wgs.accession2taxid.gz:\n\tmkdir -p accession2taxid\n\tcurl ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz > accession2taxid/nucl_wgs.accession2taxid.gz\n\nnt.map: nt-sorted.fna\n\nnt-sorted.fna: nt.fna accession2taxid/nucl_gb.accession2taxid.gz accession2taxid/nucl_wgs.accession2taxid.gz\n\tcentrifuge-sort-nt.pl -m nt.map -a nt-acs-wo-mapping.txt \\\n\t\tnt.fna accession2taxid/nucl_gb.accession2taxid.gz accession2taxid/nucl_wgs.accession2taxid.gz \\\n\t\t> nt-sorted.fna\n\nnt-dusted.fna: nt-sorted.fna | .dustmasker-ok\n\t dustmasker -infmt fasta -in nt-sorted.fna -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt-dusted.fna\n\n$(REFERENCE_SEQUENCES_DIR)/mammalian-reference-%.fna: | $(REFERENCE_SEQUENCES_DIR)\n\t@[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\tcentrifuge-download -o $(TMP_DIR) -d \"vertebrate_mammalian\" -a \"Chromosome\" -t $* -c 'reference genome' -P $(THREADS) refseq > \\\n\t\t$(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX), $(notdir $@))\n\tfind $(TMP_DIR)/vertebrate_mammalian -name \"*.fna\" | xargs cat > $@.tmp && mv $@.tmp $@\n\tmv $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@)) $(patsubst %.fna,%$(TAXID_SUFFIX),$@)\nifeq (1,$(KEEP_FILES))\n\t[[ -d $(DL_DIR)/vertebrate_mammalian ]] || mkdir -p $(DL_DIR)/vertebrate_mammalian\n\tmv $(TMP_DIR)/vertebrate_mammalian/* $(DL_DIR)/vertebrate_mammalian\nelse\n\trm -rf $(TMP_DIR)\nendif\n\n$(REFERENCE_SEQUENCES_DIR)/all-compressed-%.fna: | $(REFERENCE_SEQUENCES_DIR) taxonomy/nodes.dmp taxonomy/names.dmp .dustmasker-ok\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\tcentrifuge-download -o $(TMP_DIR) -d \"$*\" -P $(THREADS) refseq > $(TMP_DIR)/all-$*.map\n\ttime 
centrifuge-compress.pl $(TMP_DIR)/$* taxonomy $(CF_COMPRESS_OPTS) -map $(TMP_DIR)/all-$*.map \\\n\t\t-o $@.tmp -t $(THREADS) -maxG 50000000 2>&1 | tee centrifuge-compress-$(IDX_NAME).log && \\\n\tmv $@.tmp.fa $@ && mv $@.tmp.size $(patsubst %.fna,%.size,$@) && \\\n\tmv $@.tmp.map $(patsubst %.fna,%$(TAXID_SUFFIX),$@)\nifeq (1,$(KEEP_FILES))\n\t[[ -d $(DL_DIR)/$* ]] || mkdir -p $(DL_DIR)/$*\n\tmv $(TMP_DIR)/$*/* $(DL_DIR)/$*\nelse\n\trm -rf $(TMP_DIR)\nendif\n\n$(REFERENCE_SEQUENCES_DIR)/all-%-chromosome_level.fna: | $(REFERENCE_SEQUENCES_DIR) .dustmasker-ok\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\t@echo Downloading and dust-masking $*\n\tcentrifuge-download -o $(TMP_DIR) $(CF_DOWNLOAD_OPTS) -a \"Chromosome\" -d \"$*\" -P $(THREADS) refseq > \\\n\t\t$(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@))\n\tfind $(TMP_DIR)/$* -name \"*.fna\" | xargs cat > $@.tmp && mv $@.tmp $@\n\tmv $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@)) $(patsubst %.fna,%$(TAXID_SUFFIX),$@)\nifeq (1,$(KEEP_FILES))\n\t[[ -d $(DL_DIR)/$* ]] || mkdir -p $(DL_DIR)/$*\n\tmv $(TMP_DIR)/$*/* $(DL_DIR)/$*\nelse\n\trm -rf $(TMP_DIR)\nendif\n\n$(REFERENCE_SEQUENCES_DIR)/all-%-any_level.fna: | $(REFERENCE_SEQUENCES_DIR) .dustmasker-ok\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\t@echo Downloading and dust-masking $*\n\tcentrifuge-download -o $(TMP_DIR) $(CF_DOWNLOAD_OPTS) -a \"Any\" -d \"$*\" -P $(THREADS) refseq > \\\n\t\t$(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@))\n\tfind $(TMP_DIR)/$* -name \"*.fna\" | xargs cat > $@.tmp && mv $@.tmp $@\n\tmv $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@)) $(patsubst %.fna,%$(TAXID_SUFFIX),$@)\nifeq (1,$(KEEP_FILES))\n\t[[ -d $(DL_DIR)/$* ]] || mkdir -p $(DL_DIR)/$*\n\tmv $(TMP_DIR)/$*/* $(DL_DIR)/$*\nelse\n\trm -rf $(TMP_DIR)\nendif\n\n$(REFERENCE_SEQUENCES_DIR)/all-%.fna: | $(REFERENCE_SEQUENCES_DIR) .dustmasker-ok\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p 
$(TMP_DIR)\n\t@echo Downloading and dust-masking $*\n\tcentrifuge-download -o $(TMP_DIR) $(CF_DOWNLOAD_OPTS) -d \"$*\" -P $(THREADS) refseq > \\\n\t\t$(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@))\n\tfind $(TMP_DIR)/$* -name \"*.fna\" | xargs cat > $@.tmp && mv $@.tmp $@\n\tmv $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@)) $(patsubst %.fna,%$(TAXID_SUFFIX),$@)\nifeq (1,$(KEEP_FILES))\n\t[[ -d $(DL_DIR)/$* ]] || mkdir -p $(DL_DIR)/$*\n\tmv $(TMP_DIR)/$*/* $(DL_DIR)/$*\nelse\n\trm -rf $(TMP_DIR)\nendif\n\n$(REFERENCE_SEQUENCES_DIR)/contaminants.fna: | $(REFERENCE_SEQUENCES_DIR)\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\tcentrifuge-download -o $(TMP_DIR) contaminants > $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@))\n\tfind $(TMP_DIR)/contaminants -name \"*.fna\" | xargs cat > $@.tmp && mv $@.tmp $@\n\tmv $(TMP_DIR)/$(patsubst %.fna,%$(TAXID_SUFFIX),$(notdir $@)) $(patsubst %.fna,%$(TAXID_SUFFIX),$@)\nifeq (1,$(KEEP_FILES))\n\t[[ -d $(DL_DIR)/contaminants ]] || mkdir -p $(DL_DIR)/contaminants\n\tmv $(TMP_DIR)/contaminants/* $(DL_DIR)/contaminants\nelse\n\trm -rf $(TMP_DIR)\nendif\n\nDUSTMASKER_EXISTS := $(shell command -v dustmasker)\n.dustmasker-ok:\nifndef DUSTMASKER_EXISTS\nifeq ($(strip $(DONT_DUSTMASK)),)\n\t$(error dustmasker program does not exist. 
Install NCBI blast+, or set option DONT_DUSTMASK=1)\nendif\nendif\n\n\ntaxonomy/names.dmp: taxonomy/nodes.dmp\ntaxonomy/nodes.dmp: | .path-ok\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\tcentrifuge-download $(TAXONOMY_DOWNLOAD_OPTS) -o $(TMP_DIR)/taxonomy taxonomy\n\tmkdir -p taxonomy\n\tmv $(TMP_DIR)/taxonomy/* taxonomy && rmdir $(TMP_DIR)/taxonomy && rmdir $(TMP_DIR)\n\n$(IDX_NAME).1.cf: $(REFERENCE_SEQUENCES) $(SIZE_TABLES) $(TAXID_MAPS) taxonomy/nodes.dmp taxonomy/names.dmp | .path-ok\n\t@echo Index building prerequisites: $^\n\t[[ -d $(TMP_DIR) ]] && rm -rf $(TMP_DIR); mkdir -p $(TMP_DIR)\n\ttime centrifuge-build -p $(THREADS) $(CF_BUILD_OPTS) \\\n\t\t--conversion-table <(cat $(TAXID_MAPS)) \\\n\t\t--taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \\\n\t\t$(call join_w_comma,$(REFERENCE_SEQUENCES)) $(TMP_DIR)/$(IDX_NAME) 2>&1 | tee centrifuge-build-$(IDX_NAME).log\n\tmv $(TMP_DIR)/$(IDX_NAME).*.cf . && rmdir $(TMP_DIR)\n\n\nclean:\n\t# Removing input sequences (all required information is in the index)\n\trm -rf taxonomy\n\trm -rf $(DL_DIR)\n\trm -rf $(TMP_DIR)\n\trm -rf tmp_*\n\trm -rf reference-sequences\n\trm -f *.map\n\trm -f *.log\n\n# Join a list with commas\nCOMMA:=,\nEMPTY:=\nSPACE:= $(EMPTY) $(EMPTY)\njoin_w_comma = $(subst $(SPACE),$(COMMA),$(strip $1))\n\n\nTHIS_FILE := $(lastword $(MAKEFILE_LIST))\nPATH_OK  := $(shell command -v centrifuge-build 2> /dev/null && command -v centrifuge-download 2> /dev/null )\nCF_BASE_DIR := $(shell dirname $(shell dirname $(THIS_FILE)))\n\nerror_msg := centrifuge-download and centrifuge-build are not available - please make sure they are in the path.\ndefine n\n\n\nendef\n\nTEST_PROGRAMS=centrifuge-build centrifuge-download\n\nifneq (\"$(wildcard $(CF_BASE_DIR)/centrifuge-build)\",\"\")\nerror_msg := $(error_msg)$n$nThe following command may solve this problem:$n  export PATH=$$PATH:\"$(CF_BASE_DIR)\"$n\nendif\n\n.path-ok:\nifndef PATH_OK\n    $(error 
$n$(error_msg))\nelse\n\t@echo Found centrifuge-download and centrifuge-build.\nendif\n\nindex-name:\n\techo $(IDX_NAME)\n\nindex-size:\n\tdu -csh $(IDX_NAME).[123].cf\n\n"
  },
  {
    "path": "limit.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <limits>\n#include \"limit.h\"\n\nuint8_t  MIN_U8  = std::numeric_limits<uint8_t>::min();\nuint8_t  MAX_U8  = std::numeric_limits<uint8_t>::max();\nuint16_t MIN_U16 = std::numeric_limits<uint16_t>::min();\nuint16_t MAX_U16 = std::numeric_limits<uint16_t>::max();\nuint32_t MIN_U32 = std::numeric_limits<uint32_t>::min();\nuint32_t MAX_U32 = std::numeric_limits<uint32_t>::max();\nuint64_t MIN_U64 = std::numeric_limits<uint64_t>::min();\nuint64_t MAX_U64 = std::numeric_limits<uint64_t>::max();\nsize_t   MIN_SIZE_T = std::numeric_limits<size_t>::min();\nsize_t   MAX_SIZE_T = std::numeric_limits<size_t>::max();\n\nint      MIN_I   = std::numeric_limits<int>::min();\nint      MAX_I   = std::numeric_limits<int>::max();\nint8_t   MIN_I8  = std::numeric_limits<int8_t>::min();\nint8_t   MAX_I8  = std::numeric_limits<int8_t>::max();\nint16_t  MIN_I16 = std::numeric_limits<int16_t>::min();\nint16_t  MAX_I16 = std::numeric_limits<int16_t>::max();\nint32_t  MIN_I32 = std::numeric_limits<int32_t>::min();\nint32_t  MAX_I32 = std::numeric_limits<int32_t>::max();\nint64_t  MIN_I64 = std::numeric_limits<int64_t>::min();\nint64_t  MAX_I64 = std::numeric_limits<int64_t>::max();\n"
  },
  {
    "path": "limit.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef LIMIT_H_\n#define LIMIT_H_\n\n#include <stdint.h>\n#include <cstring>\n\nextern uint8_t  MIN_U8;\nextern uint8_t  MAX_U8;\nextern uint16_t MIN_U16;\nextern uint16_t MAX_U16;\nextern uint32_t MIN_U32;\nextern uint32_t MAX_U32;\nextern uint64_t MIN_U64;\nextern uint64_t MAX_U64;\nextern size_t   MIN_SIZE_T;\nextern size_t   MAX_SIZE_T;\n\nextern int     MIN_I;\nextern int     MAX_I;\nextern int8_t  MIN_I8;\nextern int8_t  MAX_I8;\nextern int16_t MIN_I16;\nextern int16_t MAX_I16;\nextern int32_t MIN_I32;\nextern int32_t MAX_I32;\nextern int64_t MIN_I64;\nextern int64_t MAX_I64;\n\n#endif\n"
  },
  {
    "path": "ls.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifdef MAIN_LS\n\n#include <string.h>\n#include <iostream>\n#include \"sstring.h\"\n#include \"ls.h\"\n#include \"ds.h\"\n\nusing namespace std;\n\nint main(void) {\n\tcerr << \"Test LarssonSadakana for int...\";\n\t{\n\t\ttypedef int T;\n\t\tconst char *t = \"banana\";\n\t\tEList<T> sa;\n\t\tEList<T> isa;\n\t\tfor(size_t i = 0; i < strlen(t); i++) {\n\t\t\tisa.push_back(t[i]);\n\t\t}\n\t\tisa.push_back(0); // disregarded\n\t\tsa.resize(isa.size());\n\t\tLarssonSadakane<T> ls;\n\t\tls.suffixsort(isa.ptr(), sa.ptr(), (T)sa.size()-1, 'z', 0);\n\t\tassert_eq((T)'a', t[sa[1]]); assert_eq(5, sa[1]);\n\t\tassert_eq((T)'a', t[sa[2]]); assert_eq(3, sa[2]);\n\t\tassert_eq((T)'a', t[sa[3]]); assert_eq(1, sa[3]);\n\t\tassert_eq((T)'b', t[sa[4]]); assert_eq(0, sa[4]);\n\t\tassert_eq((T)'n', t[sa[5]]); assert_eq(4, sa[5]);\n\t\tassert_eq((T)'n', t[sa[6]]); assert_eq(2, sa[6]);\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Test LarssonSadakana for uint32_t...\";\n\t{\n\t\ttypedef uint32_t T;\n\t\tconst char *t = \"banana\";\n\t\tEList<T> sa;\n\t\tEList<T> isa;\n\t\tfor(size_t i = 0; i < strlen(t); i++) {\n\t\t\tisa.push_back(t[i]);\n\t\t}\n\t\tisa.push_back(0); // 
disregarded\n\t\tsa.resize(isa.size());\n\t\tLarssonSadakane<int> ls;\n\t\tls.suffixsort(\n\t\t\t(int*)isa.ptr(),\n\t\t\t(int*)sa.ptr(),\n\t\t\t(int)sa.size()-1,\n\t\t\t'z',\n\t\t\t0);\n\t\tassert_eq((T)'a', t[sa[1]]); assert_eq(5, sa[1]);\n\t\tassert_eq((T)'a', t[sa[2]]); assert_eq(3, sa[2]);\n\t\tassert_eq((T)'a', t[sa[3]]); assert_eq(1, sa[3]);\n\t\tassert_eq((T)'b', t[sa[4]]); assert_eq(0, sa[4]);\n\t\tassert_eq((T)'n', t[sa[5]]); assert_eq(4, sa[5]);\n\t\tassert_eq((T)'n', t[sa[6]]); assert_eq(2, sa[6]);\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Last elt is < or > others ...\";\n\t{\n\t\t{\n\t\ttypedef int T;\n\t\tconst char *t = \"aaa\";\n\t\tEList<T> sa;\n\t\tEList<T> isa;\n\t\tfor(size_t i = 0; i < strlen(t); i++) {\n\t\t\tisa.push_back(t[i]);\n\t\t}\n\t\tisa.push_back(0); // disregarded\n\t\tsa.resize(isa.size());\n\t\tLarssonSadakane<T> ls;\n\t\tls.suffixsort(isa.ptr(), sa.ptr(), (T)sa.size()-1, 'z', 0);\n\t\tassert_eq(3, sa[0]);\n\t\tassert_eq(2, sa[1]);\n\t\tassert_eq(1, sa[2]);\n\t\tassert_eq(0, sa[3]);\n\t\t}\n\n\t\t{\n\t\ttypedef int T;\n\t\tconst char *t = \"aaa\";\n\t\tEList<T> sa;\n\t\tEList<T> isa;\n\t\tfor(size_t i = 0; i < strlen(t); i++) {\n\t\t\tisa.push_back(t[i]);\n\t\t}\n\t\tisa.push_back('y'); // doesn't matter if this is > others\n\t\tsa.resize(isa.size());\n\t\tLarssonSadakane<T> ls;\n\t\tls.suffixsort(isa.ptr(), sa.ptr(), (T)sa.size()-1, 'z', 0);\n\t\tassert_eq(3, sa[0]);\n\t\tassert_eq(2, sa[1]);\n\t\tassert_eq(1, sa[2]);\n\t\tassert_eq(0, sa[3]);\n\t\t}\n\t\t\n\t\t{\n\t\ttypedef int T;\n\t\tconst char *t = \"aaa\";\n\t\tEList<T> sa;\n\t\tEList<T> isa;\n\t\tfor(size_t i = 0; i < strlen(t); i++) {\n\t\t\tisa.push_back(t[i]);\n\t\t}\n\t\tisa.push_back('y'); // breaks ties\n\t\tisa.push_back(0);   // disregarded\n\t\tsa.resize(isa.size());\n\t\tLarssonSadakane<T> ls;\n\t\tls.suffixsort(isa.ptr(), sa.ptr(), (T)sa.size()-1, 'z', 0);\n\t\tassert_eq(4, sa[0]);\n\t\tassert_eq(0, sa[1]);\n\t\tassert_eq(1, sa[2]);\n\t\tassert_eq(2, 
sa[3]);\n\t\tassert_eq(3, sa[4]);\n\t\t}\n\t\t\n\t}\n\tcerr << \"PASSED\" << endl;\n}\n\n#endif\n"
  },
  {
    "path": "ls.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/* Code in this file is ultimately based on:\n\n   qsufsort.c\n   Copyright 1999, N. Jesper Larsson, all rights reserved.\n\n   This file contains an implementation of the algorithm presented in \"Faster\n   Suffix Sorting\" by N. Jesper Larsson (jesper@cs.lth.se) and Kunihiko\n   Sadakane (sada@is.s.u-tokyo.ac.jp).\n\n   This software may be used freely for any purpose. However, when distributed,\n   the original source must be clearly stated, and, when the source code is\n   distributed, the copyright notice must be retained and any alterations in\n   the code must be clearly marked. No warranty is given regarding the quality\n   of this software.*/\n\n#ifndef LS_H_\n#define LS_H_\n\n#include <iostream>\n#include <limits>\n#include <stdint.h>\n\ntemplate<typename T>\nclass LarssonSadakane {\n\tT *I, /* group array, ultimately suffix array.*/\n\t*V,   /* inverse array, ultimately inverse of I.*/\n\tr,    /* number of symbols aggregated by transform.*/\n\th;    /* length of already-sorted prefixes.*/\n\n\t#define LS_KEY(p)          (V[*(p)+(h)])\n\t#define LS_SWAP(p, q)      (tmp=*(p), *(p)=*(q), *(q)=tmp)\n\t#define LS_SMED3(a, b, c)  (LS_KEY(a)<LS_KEY(b) ?                        
\\\n\t\t\t  (LS_KEY(b)<LS_KEY(c) ? (b) : LS_KEY(a)<LS_KEY(c) ? (c) : (a))  \\\n\t\t\t: (LS_KEY(b)>LS_KEY(c) ? (b) : LS_KEY(a)>LS_KEY(c) ? (c) : (a)))\n\n\t/* Subroutine for select_sort_split and sort_split. Sets group numbers for a\n\t   group whose lowest position in I is pl and highest position is pm.*/\n\n\tinline void update_group(T *pl, T *pm) {\n\t   T g;\n\t   g=(T)(pm-I);                 /* group number.*/\n\t   V[*pl]=g;                    /* update group number of first position.*/\n\t   if (pl==pm)\n\t\t  *pl=-1;                   /* one element, sorted group.*/\n\t   else\n\t\t  do                        /* more than one element, unsorted group.*/\n\t\t\t V[*++pl]=g;            /* update group numbers.*/\n\t\t  while (pl<pm);\n\t}\n\n\t/* Quadratic sorting method to use for small subarrays. To be able to update\n\t   group numbers consistently, a variant of selection sorting is used.*/\n\n\tinline void select_sort_split(T *p, T n) {\n\t   T *pa, *pb, *pi, *pn;\n\t   T f, v, tmp;\n\n\t   pa=p;                        /* pa is start of group being picked out.*/\n\t   pn=p+n-1;                    /* pn is last position of subarray.*/\n\t   while (pa<pn) {\n\t\t  for (pi=pb=pa+1, f=LS_KEY(pa); pi<=pn; ++pi)\n\t\t\t if ((v=LS_KEY(pi))<f) {\n\t\t\t\tf=v;                /* f is smallest key found.*/\n\t\t\t\tLS_SWAP(pi, pa);       /* place smallest element at beginning.*/\n\t\t\t\tpb=pa+1;            /* pb is position for elements equal to f.*/\n\t\t\t } else if (v==f) {     /* if equal to smallest key.*/\n\t\t\t\tLS_SWAP(pi, pb);       /* place next to other smallest elements.*/\n\t\t\t\t++pb;\n\t\t\t }\n\t\t  update_group(pa, pb-1);   /* update group values for new group.*/\n\t\t  pa=pb;                    /* continue sorting rest of the subarray.*/\n\t   }\n\t   if (pa==pn) {                /* check if last part is single element.*/\n\t\t  V[*pa]=(T)(pa-I);\n\t\t  *pa=-1;                   /* sorted group.*/\n\t   }\n\t}\n\n\t/* Subroutine for sort_split, 
algorithm by Bentley & McIlroy.*/\n\n\tinline T choose_pivot(T *p, T n) {\n\t   T *pl, *pm, *pn;\n\t   T s;\n\n\t   pm=p+(n>>1);                 /* small arrays, middle element.*/\n\t   if (n>7) {\n\t\t  pl=p;\n\t\t  pn=p+n-1;\n\t\t  if (n>40) {               /* big arrays, pseudomedian of 9.*/\n\t\t\t s=n>>3;\n\t\t\t pl=LS_SMED3(pl, pl+s, pl+s+s);\n\t\t\t pm=LS_SMED3(pm-s, pm, pm+s);\n\t\t\t pn=LS_SMED3(pn-s-s, pn-s, pn);\n\t\t  }\n\t\t  pm=LS_SMED3(pl, pm, pn);      /* midsize arrays, median of 3.*/\n\t   }\n\t   return LS_KEY(pm);\n\t}\n\n\t/* Sorting routine called for each unsorted group. Sorts the array of integers\n\t   (suffix numbers) of length n starting at p. The algorithm is a ternary-split\n\t   quicksort taken from Bentley & McIlroy, \"Engineering a Sort Function\",\n\t   Software -- Practice and Experience 23(11), 1249-1265 (November 1993). This\n\t   function is based on Program 7.*/\n\n\tinline void sort_split(T *p, T n)\n\t{\n\t   T *pa, *pb, *pc, *pd, *pl, *pm, *pn;\n\t   T f, v, s, t, tmp;\n\n\t   if (n<7) {                   /* multi-selection sort smallest arrays.*/\n\t\t  select_sort_split(p, n);\n\t\t  return;\n\t   }\n\n\t   v=choose_pivot(p, n);\n\t   pa=pb=p;\n\t   pc=pd=p+n-1;\n\t   while (1) {                  /* split-end partition.*/\n\t\t  while (pb<=pc && (f=LS_KEY(pb))<=v) {\n\t\t\t if (f==v) {\n\t\t\t\tLS_SWAP(pa, pb);\n\t\t\t\t++pa;\n\t\t\t }\n\t\t\t ++pb;\n\t\t  }\n\t\t  while (pc>=pb && (f=LS_KEY(pc))>=v) {\n\t\t\t if (f==v) {\n\t\t\t\tLS_SWAP(pc, pd);\n\t\t\t\t--pd;\n\t\t\t }\n\t\t\t --pc;\n\t\t  }\n\t\t  if (pb>pc)\n\t\t\t break;\n\t\t  LS_SWAP(pb, pc);\n\t\t  ++pb;\n\t\t  --pc;\n\t   }\n\t   pn=p+n;\n\t   if ((s=(T)(pa-p))>(t=(T)(pb-pa)))\n\t\t  s=t;\n\t   for (pl=p, pm=pb-s; s; --s, ++pl, ++pm)\n\t\t  LS_SWAP(pl, pm);\n\t   if ((s=(T)(pd-pc))>(t=(T)(pn-pd-1)))\n\t\t  s=t;\n\t   for (pl=pb, pm=pn-s; s; --s, ++pl, ++pm)\n\t\t  LS_SWAP(pl, pm);\n\n\t   s=(T)(pb-pa);\n\t   t=(T)(pd-pc);\n\t   if (s>0)\n\t\t  sort_split(p, 
s);\n\t   update_group(p+s, p+n-t-1);\n\t   if (t>0)\n\t\t  sort_split(p+n-t, t);\n\t}\n\n\t/* Bucketsort for first iteration.\n\n\t   Input: x[0...n-1] holds integers in the range 1...k-1, all of which appear\n\t   at least once. x[n] is 0. (This is the corresponding output of transform.) k\n\t   must be at most n+1. p is array of size n+1 whose contents are disregarded.\n\n\t   Output: x is V and p is I after the initial sorting stage of the refined\n\t   suffix sorting algorithm.*/\n\n\tinline void bucketsort(T *x, T *p, T n, T k)\n\t{\n\t   T *pi, i, c, d, g;\n\n\t   for (pi=p; pi<p+k; ++pi)\n\t\t  *pi=-1;                   /* mark linked lists empty.*/\n\t   for (i=0; i<=n; ++i) {\n\t\t  x[i]=p[c=x[i]];           /* insert in linked list.*/\n\t\t  p[c]=i;\n\t   }\n\t   for (pi=p+k-1, i=n; pi>=p; --pi) {\n\t\t  d=x[c=*pi];               /* c is position, d is next in list.*/\n\t\t  x[c]=g=i;                 /* last position equals group number.*/\n\t\t  if (d == 0 || d > 0) {    /* if more than one element in group.*/\n\t\t\t p[i--]=c;              /* p is permutation for the sorted x.*/\n\t\t\t do {\n\t\t\t\td=x[c=d];           /* next in linked list.*/\n\t\t\t\tx[c]=g;             /* group number in x.*/\n\t\t\t\tp[i--]=c;           /* permutation in p.*/\n\t\t\t } while (d == 0 || d > 0);\n\t\t  } else\n\t\t\t p[i--]=-1;             /* one element, sorted group.*/\n\t   }\n\t}\n\n\t/* Transforms the alphabet of x by attempting to aggregate several symbols into\n\t   one, while preserving the suffix order of x. The alphabet may also be\n\t   compacted, so that x on output comprises all integers of the new alphabet\n\t   with no skipped numbers.\n\n\t   Input: x is an array of size n+1 whose first n elements are positive\n\t   integers in the range l...k-1. p is array of size n+1, used for temporary\n\t   storage. 
q controls aggregation and compaction by defining the maximum value\n\t   for any symbol during transformation: q must be at least k-l; if q<=n,\n\t   compaction is guaranteed; if k-l>n, compaction is never done; if q is\n\t   INT_MAX, the maximum number of symbols are aggregated into one.\n\n\t   Output: Returns an integer j in the range 1...q representing the size of the\n\t   new alphabet. If j<=n+1, the alphabet is compacted. The global variable r is\n\t   set to the number of old symbols grouped into one. Only x[n] is 0.*/\n\n\tinline T transform(T *x, T *p, T n, T k, T l, T q)\n\t{\n\t   T b, c, d, e, i, j, m, s;\n\t   T *pi, *pj;\n\n\t   for (s=0, i=k-l; i; i>>=1)\n\t\t  ++s;                      /* s is number of bits in old symbol.*/\n\t   e=std::numeric_limits<T>::max()>>s; /* e is for overflow checking.*/\n\t   for (b=d=r=0; r<n && d<=e && (c=d<<s|(k-l))<=q; ++r) {\n\t\t  b=b<<s|(x[r]-l+1);        /* b is start of x in chunk alphabet.*/\n\t\t  d=c;                      /* d is max symbol in chunk alphabet.*/\n\t   }\n\t   m=(((T)1)<<(r-1)*s)-1;            /* m masks off top old symbol from chunk.*/\n\t   x[n]=l-1;                    /* emulate zero terminator.*/\n\t   if (d<=n) {                  /* if bucketing possible, compact alphabet.*/\n\t\t  for (pi=p; pi<=p+d; ++pi)\n\t\t\t *pi=0;                 /* zero transformation table.*/\n\t\t  for (pi=x+r, c=b; pi<=x+n; ++pi) {\n\t\t\t p[c]=1;                /* mark used chunk symbol.*/\n\t\t\t c=(c&m)<<s|(*pi-l+1);  /* shift in next old symbol in chunk.*/\n\t\t  }\n\t\t  for (i=1; i<r; ++i) {     /* handle last r-1 positions.*/\n\t\t\t p[c]=1;                /* mark used chunk symbol.*/\n\t\t\t c=(c&m)<<s;            /* shift in next old symbol in chunk.*/\n\t\t  }\n\t\t  for (pi=p, j=1; pi<=p+d; ++pi)\n\t\t\t if (*pi)\n\t\t\t\t*pi=j++;            /* j is new alphabet size.*/\n\t\t  for (pi=x, pj=x+r, c=b; pj<=x+n; ++pi, ++pj) {\n\t\t\t *pi=p[c];              /* transform to new alphabet.*/\n\t\t\t 
c=(c&m)<<s|(*pj-l+1);  /* shift in next old symbol in chunk.*/\n\t\t  }\n\t\t  while (pi<x+n) {          /* handle last r-1 positions.*/\n\t\t\t *pi++=p[c];            /* transform to new alphabet.*/\n\t\t\t c=(c&m)<<s;            /* shift right-end zero in chunk.*/\n\t\t  }\n\t   } else {                     /* bucketing not possible, don't compact.*/\n\t\t  for (pi=x, pj=x+r, c=b; pj<=x+n; ++pi, ++pj) {\n\t\t\t *pi=c;                 /* transform to new alphabet.*/\n\t\t\t c=(c&m)<<s|(*pj-l+1);  /* shift in next old symbol in chunk.*/\n\t\t  }\n\t\t  while (pi<x+n) {          /* handle last r-1 positions.*/\n\t\t\t *pi++=c;               /* transform to new alphabet.*/\n\t\t\t c=(c&m)<<s;            /* shift right-end zero in chunk.*/\n\t\t  }\n\t\t  j=d+1;                    /* new alphabet size.*/\n\t   }\n\t   x[n]=0;                      /* end-of-string symbol is zero.*/\n\t   return j;                    /* return new alphabet size.*/\n\t}\n\t\n\tpublic:\n\n\t/* Makes suffix array p of x. x becomes inverse of p. p and x are both of size\n\t   n+1. Contents of x[0...n-1] are integers in the range l...k-1. 
Original\n\t   contents of x[n] is disregarded, the n-th symbol being regarded as\n\t   end-of-string smaller than all other symbols.*/\n\n\tvoid suffixsort(T *x, T *p, T n, T k, T l)\n\t{\n\t   T *pi, *pk;\n\t   T i, j, s, sl;\n\n\t   V=x;                         /* set global values.*/\n\t   I=p;\n\n\t   if (n>=k-l) {                /* if bucketing possible,*/\n\t\t  j=transform(V, I, n, k, l, n);\n\t\t  bucketsort(V, I, n, j);   /* bucketsort on first r positions.*/\n\t   } else {\n\t\t  transform(V, I, n, k, l, std::numeric_limits<T>::max());\n\t\t  for (i=0; i<=n; ++i)\n\t\t\t I[i]=i;                /* initialize I with suffix numbers.*/\n\t\t  h=0;\n\t\t  sort_split(I, n+1);       /* quicksort on first r positions.*/\n\t   }\n\t   h=r;                         /* number of symbols aggregated by transform.*/\n\n\t   while (*I>=-n) {\n\t\t  pi=I;                     /* pi is first position of group.*/\n\t\t  sl=0;                     /* sl is negated length of sorted groups.*/\n\t\t  do {\n\t\t\t if ((s=*pi) <= 0 && (s=*pi) != 0) {\n\t\t\t\tpi-=s;              /* skip over sorted group.*/\n\t\t\t\tsl+=s;              /* add negated length to sl.*/\n\t\t\t } else {\n\t\t\t\tif (sl) {\n\t\t\t\t   *(pi+sl)=sl;     /* combine sorted groups before pi.*/\n\t\t\t\t   sl=0;\n\t\t\t\t}\n\t\t\t\tpk=I+V[s]+1;        /* pk-1 is last position of unsorted group.*/\n\t\t\t\tsort_split(pi, (T)(pk-pi));\n\t\t\t\tpi=pk;              /* next group.*/\n\t\t\t }\n\t\t  } while (pi<=I+n);\n\t\t  if (sl)                   /* if the array ends with a sorted group.*/\n\t\t\t *(pi+sl)=sl;           /* combine sorted groups at end of I.*/\n\t\t  h=2*h;                    /* double sorted-depth.*/\n\t   }\n\n\t   for (i=0; i<=n; ++i)         /* reconstruct suffix array from inverse.*/\n\t\t  I[V[i]]=i;\n\t}\n};\n\n#endif /*def LS_H_*/\n"
  },
  {
    "path": "mask.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"mask.h\"\n\n// 5-bit pop count\nint alts5[32] = {\n\t 0, 1, 1, 2, 1, 2, 2, 3,\n\t 1, 2, 2, 3, 2, 3, 3, 4,\n\t 1, 2, 2, 3, 2, 3, 3, 4,\n\t 2, 3, 3, 4, 3, 4, 4, 5\n};\n\n// Index of lowest set bit\nint firsts5[32] = {\n\t-1, 0, 1, 0, 2, 0, 1, 0,\n\t 3, 0, 1, 0, 2, 0, 1, 0,\n\t 4, 0, 1, 0, 2, 0, 1, 0,\n\t 3, 0, 1, 0, 2, 0, 1, 0\n};\n"
  },
  {
    "path": "mask.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef MASK_H_\n#define MASK_H_\n\n#include <iostream>\n#include \"random_source.h\"\n\n// 5-bit pop count\nextern int alts5[32];\n\n// Index of lowest set bit\nextern int firsts5[32];\n\n/**\n * Return 1 if a 2-bit-encoded base ('i') matches any bit in the mask ('j') and\n * the mask < 16.  Returns -1 if either the reference or the read character was\n * ambiguous.  Returns 0 if the characters unambiguously mismatch.\n */\nstatic inline int matchesEx(int i, int j) {\n\tif(j >= 16 || i > 3) {\n\t\t// read and/or ref was ambiguous\n\t\treturn -1;\n\t}\n\treturn (((1 << i) & j) != 0) ? 
1 : 0;\n}\n\n/**\n * Return true if a 2-bit-encoded base ('i') matches any bit in the mask\n * ('j').\n */\nstatic inline bool matches(int i, int j) {\n\treturn ((1 << i) & j) != 0;\n}\n\n/**\n * Given a mask with up to 5 bits, return an index corresponding to a\n * set bit in the mask, randomly chosen from among all set bits.\n */\nstatic inline int randFromMask(RandomSource& rnd, int mask) {\n\t// check the mask's range before using it to index the lookup tables\n\tassert_gt(mask, 0);\n\tassert_lt(mask, 32);\n\tif(alts5[mask] == 1) {\n\t\t// only one to pick from, pick it via lookup table\n\t\treturn firsts5[mask];\n\t}\n\tint r = rnd.nextU32() % alts5[mask];\n\tassert_geq(r, 0);\n\tassert_lt(r, alts5[mask]);\n\t// could do the following via lookup table too\n\tfor(int i = 0; i < 5; i++) {\n\t\tif((mask & (1 << i)) != 0) {\n\t\t\tif(r == 0) return i;\n\t\t\tr--;\n\t\t}\n\t}\n\tstd::cerr << \"Shouldn't get here\" << std::endl;\n\tthrow 1;\n\treturn -1;\n}\n\n#endif /*ndef MASK_H_*/\n"
  },
  {
    "path": "mem_ids.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n// For holding index data\n#define EBWT_CAT  ((int) 1)\n// For holding index-building data\n#define EBWTB_CAT ((int) 2)\n// For holding cache data\n#define CA_CAT    ((int) 3)\n// For holding group-walk-left bookkeeping data\n#define GW_CAT    ((int) 4)\n// For holding alignment bookkeeping data\n#define AL_CAT    ((int) 5)\n// For holding dynamic programming bookkeeping data\n#define DP_CAT    ((int) 6)\n// For holding alignment results and other hit objects\n#define RES_CAT   ((int) 7)\n#define MISC_CAT  ((int) 9)\n#define DEBUG_CAT ((int)10)\n"
  },
  {
    "path": "mm.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef MM_H_\n#define MM_H_\n\n/**\n * mm.h:\n *\n * Defines that make it easier to handle files in the two different MM\n * contexts: i.e. on Linux and Mac where MM is supported and POSIX I/O\n * functions work as expected, and on Windows where MM is not supported\n * and where there isn't POSIX I/O,\n */\n#if 0\n#ifdef BOWTIE_MM\n#define MM_FILE_CLOSE(x) if(x > 3) { close(x); }\n#define MM_READ_RET ssize_t\n// #define MM_READ read\n#define MM_SEEK lseek\n#define MM_FILE int\n#define MM_FILE_INIT -1\n#else\n#define MM_FILE_CLOSE(x) if(x != NULL) { fclose(x); }\n#define MM_READ_RET size_t\n#define MM_SEEK fseek\n#define MM_FILE FILE*\n#define MM_FILE_INIT NULL\n#endif\n#endif\n\n#define MM_READ(file, dest, sz) fread(dest, 1, sz, file)\n#define MM_IS_IO_ERR(file_hd, ret, count) is_fread_err(file_hd, ret, count)\n\n#endif /* MM_H_ */\n"
  },
  {
    "path": "multikey_qsort.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef MULTIKEY_QSORT_H_\n#define MULTIKEY_QSORT_H_\n\n#include <iostream>\n#include \"sequence_io.h\"\n#include \"alphabet.h\"\n#include \"assert_helpers.h\"\n#include \"diff_sample.h\"\n#include \"sstring.h\"\n#include \"btypes.h\"\n\nusing namespace std;\n\n/**\n * Swap elements a and b in s\n */\ntemplate <typename TStr, typename TPos>\nstatic inline void swap(TStr& s, size_t slen, TPos a, TPos b) {\n\tassert_lt(a, slen);\n\tassert_lt(b, slen);\n\tswap(s[a], s[b]);\n}\n\n/**\n * Swap elements a and b in array s\n */\ntemplate <typename TVal, typename TPos>\nstatic inline void swap(TVal* s, size_t slen, TPos a, TPos b) {\n\tassert_lt(a, slen);\n\tassert_lt(b, slen);\n\tswap(s[a], s[b]);\n}\n\n/**\n * Helper macro for swapping elements a and b in s.  Does some additional\n * sainty checking w/r/t begin and end (which are parameters to the sorting\n * routines below).\n */\n#define SWAP(s, a, b) { \\\n\tassert_geq(a, begin); \\\n\tassert_geq(b, begin); \\\n\tassert_lt(a, end); \\\n\tassert_lt(b, end); \\\n\tswap(s, slen, a, b); \\\n}\n\n/**\n * Helper macro for swapping the same pair of elements a and b in two different\n * strings s and s2.  
This is a helpful variant if, for example, the caller\n * would like to see how their input was permuted by the sort routine (in that\n * case, the caller would let s2 be an array s2[] where s2 is the same length\n * as s and s2[i] = i).\n */\n#define SWAP2(s, s2, a, b) { \\\n\tSWAP(s, a, b); \\\n\tswap(s2, slen, a, b); \\\n}\n\n#define SWAP1(s, s2, a, b) { \\\n\tSWAP(s, a, b); \\\n}\n\n/**\n * Helper macro that swaps a range of elements [i, i+n) with another\n * range [j, j+n) in s.\n */\n#define VECSWAP(s, i, j, n) { \\\n\tif(n > 0) { vecswap(s, slen, i, j, n, begin, end); } \\\n}\n\n/**\n * Helper macro that swaps a range of elements [i, i+n) with another\n * range [j, j+n) both in s and s2.\n */\n#define VECSWAP2(s, s2, i, j, n) { \\\n\tif(n > 0) { vecswap2(s, slen, s2, i, j, n, begin, end); } \\\n}\n\n/**\n * Helper function that swaps a range of elements [i, i+n) with another\n * range [j, j+n) in s.  begin and end represent the current range under\n * consideration by the caller (one of the recursive multikey_quicksort\n * routines below).\n */\ntemplate <typename TStr, typename TPos>\nstatic inline void vecswap(TStr& s, size_t slen, TPos i, TPos j, TPos n, TPos begin, TPos end) {\n\tassert_geq(i, begin);\n\tassert_geq(j, begin);\n\tassert_lt(i, end);\n\tassert_lt(j, end);\n\twhile(n-- > 0) {\n\t\tassert_geq(n, 0);\n\t\tTPos a = i+n;\n\t\tTPos b = j+n;\n\t\tassert_geq(a, begin);\n\t\tassert_geq(b, begin);\n\t\tassert_lt(a, end);\n\t\tassert_lt(b, end);\n\t\tswap(s, slen, a, b);\n\t}\n}\n\ntemplate <typename TVal, typename TPos>\nstatic inline void vecswap(TVal *s, size_t slen, TPos i, TPos j, TPos n, TPos begin, TPos end) {\n\tassert_geq(i, begin);\n\tassert_geq(j, begin);\n\tassert_lt(i, end);\n\tassert_lt(j, end);\n\twhile(n-- > 0) {\n\t\tassert_geq(n, 0);\n\t\tTPos a = i+n;\n\t\tTPos b = j+n;\n\t\tassert_geq(a, begin);\n\t\tassert_geq(b, begin);\n\t\tassert_lt(a, end);\n\t\tassert_lt(b, end);\n\t\tswap(s, slen, a, b);\n\t}\n}\n\n/**\n * Helper function 
that swaps a range of elements [i, i+n) with another range\n * [j, j+n) both in s and s2.  begin and end represent the current range under\n * consideration by the caller (one of the recursive multikey_quicksort\n * routines below).\n */\ntemplate <typename TStr, typename TPos>\nstatic inline void vecswap2(\n\tTStr& s,\n\tsize_t slen,\n\tTStr& s2,\n\tTPos i,\n\tTPos j,\n\tTPos n,\n\tTPos begin,\n\tTPos end)\n{\n\tassert_geq(i, begin);\n\tassert_geq(j, begin);\n\tassert_lt(i, end);\n\tassert_lt(j, end);\n\twhile(n-- > 0) {\n\t\tassert_geq(n, 0);\n\t\tTPos a = i+n;\n\t\tTPos b = j+n;\n\t\tassert_geq(a, begin);\n\t\tassert_geq(b, begin);\n\t\tassert_lt(a, end);\n\t\tassert_lt(b, end);\n\t\tswap(s, slen, a, b);\n\t\tswap(s2, slen, a, b);\n\t}\n}\n\ntemplate <typename TVal, typename TPos>\nstatic inline void vecswap2(TVal* s, size_t slen, TVal* s2, TPos i, TPos j, TPos n, TPos begin, TPos end) {\n\tassert_geq(i, begin);\n\tassert_geq(j, begin);\n\tassert_lt(i, end);\n\tassert_lt(j, end);\n\twhile(n-- > 0) {\n\t\tassert_geq(n, 0);\n\t\tTPos a = i+n;\n\t\tTPos b = j+n;\n\t\tassert_geq(a, begin);\n\t\tassert_geq(b, begin);\n\t\tassert_lt(a, end);\n\t\tassert_lt(b, end);\n\t\tswap(s, slen, a, b);\n\t\tswap(s2, slen, a, b);\n\t}\n}\n\n/// Retrieve an int-ized version of the ath character of string s, or,\n/// if a goes off the end of s, return a (user-specified) int greater\n/// than any TAlphabet character - 'hi'.\n#define CHAR_AT(ss, aa) ((length(s[ss]) > aa) ? (int)(s[ss][aa]) : hi)\n\n/// Retrieve an int-ized version of the ath character of string s, or,\n/// if a goes off the end of s, return a (user-specified) int greater\n/// than any TAlphabet character - 'hi'.\n#define CHAR_AT_SUF(si, off) \\\n\t(((off + s[si]) < hlen) ? 
((int)(host[off + s[si]])) : (hi))\n\n/// Retrieve an int-ized version of the character at offset 'off' within\n/// the suffix of the host string that starts at s[si], or, if that\n/// position is off the end of the host, return a (user-specified) int\n/// greater than any TAlphabet character - 'hi'.\n\n#define CHAR_AT_SUF_U8(si, off) char_at_suf_u8(host, hlen, s, si, off, hi)\n\n// Note that CHOOSE_AND_SWAP_RANDOM_PIVOT is unused\n#define CHOOSE_AND_SWAP_RANDOM_PIVOT(sw, ch) {                            \\\n\t/* Note: rand() didn't really cut it here; it seemed to run out of */ \\\n\t/* randomness and, after a time, returned the same thing over and */  \\\n\t/* over again */                                                      \\\n\ta = (rand() % n) + begin; /* choose pivot between begin and end */  \\\n\tassert_lt(a, end); assert_geq(a, begin);                              \\\n\tsw(s, s2, begin, a); /* move pivot to beginning */                    \\\n}\n\n/**\n * Ad-hoc DNA-centric way of choosing a pretty good pivot without using\n * the pseudo-random number generator.  We try to get a 1 or 2 if\n * possible, since they'll split things more evenly than a 0 or 4.  
We\n * also avoid swapping in the event that we choose the first element.\n */\n#define CHOOSE_AND_SWAP_SMART_PIVOT(sw, ch) {                                    \\\n\ta = begin; /* choose first elt */                                            \\\n\t/* now try to find a better elt */                                           \\\n\tif(n >= 5) { /* n is the difference between begin and end */                 \\\n\t\tif     (ch(begin+1, depth) == 1 || ch(begin+1, depth) == 2) a = begin+1; \\\n\t\telse if(ch(begin+2, depth) == 1 || ch(begin+2, depth) == 2) a = begin+2; \\\n\t\telse if(ch(begin+3, depth) == 1 || ch(begin+3, depth) == 2) a = begin+3; \\\n\t\telse if(ch(begin+4, depth) == 1 || ch(begin+4, depth) == 2) a = begin+4; \\\n\t\tif(a != begin) sw(s, s2, begin, a); /* move pivot to beginning */        \\\n\t}                                                                            \\\n\t/* the element at [begin] now holds the pivot value */                       \\\n}\n\n#define CHOOSE_AND_SWAP_PIVOT CHOOSE_AND_SWAP_SMART_PIVOT\n\n#ifndef NDEBUG\n\n/**\n * Assert that the range of chars at depth 'depth' in strings 'begin'\n * to 'end' in string-of-suffix-offsets s is parititioned properly\n * according to the ternary paritioning strategy of Bentley and McIlroy\n * (*prior to* swapping the = regions to the center)\n */\ntemplate<typename THost>\nbool assertPartitionedSuf(\n\tconst THost& host,\n\tTIndexOffU *s,\n\tsize_t slen,\n\tint hi,\n\tint pivot,\n\tsize_t begin,\n\tsize_t end,\n\tsize_t depth)\n{\n\tsize_t hlen = host.length();\n\tint state = 0; // 0 -> 1st = section, 1 -> < section, 2 -> > section, 3 -> 2nd = section\n\tfor(size_t i = begin; i < end; i++) {\n\t\tswitch(state) {\n\t\tcase 0:\n\t\t\tif       (CHAR_AT_SUF(i, depth) < pivot)  { state = 1; break; }\n\t\t\telse if  (CHAR_AT_SUF(i, depth) > pivot)  { state = 2; break; }\n\t\t\tassert_eq(CHAR_AT_SUF(i, depth), pivot);  break;\n\t\tcase 1:\n\t\t\tif       (CHAR_AT_SUF(i, depth) > pivot)  { state = 
2; break; }\n\t\t\telse if  (CHAR_AT_SUF(i, depth) == pivot) { state = 3; break; }\n\t\t\tassert_lt(CHAR_AT_SUF(i, depth), pivot);  break;\n\t\tcase 2:\n\t\t\tif       (CHAR_AT_SUF(i, depth) == pivot) { state = 3; break; }\n\t\t\tassert_gt(CHAR_AT_SUF(i, depth), pivot);\t break;\n\t\tcase 3:\n\t\t\tassert_eq(CHAR_AT_SUF(i, depth), pivot);\t break;\n\t\t}\n\t}\n\treturn true;\n}\n\n/**\n * Assert that the range of chars at depth 'depth' in strings 'begin'\n * to 'end' in string-of-suffix-offsets s is partitioned properly\n * according to the ternary partitioning strategy of Bentley and McIlroy\n * (*after* swapping the = regions to the center)\n */\ntemplate<typename THost>\nbool assertPartitionedSuf2(\n\tconst THost& host,\n\tTIndexOffU *s,\n\tsize_t slen,\n\tint hi,\n\tint pivot,\n\tsize_t begin,\n\tsize_t end,\n\tsize_t depth)\n{\n\tsize_t hlen = host.length();\n\tint state = 0; // 0 -> < section, 1 -> = section, 2 -> > section\n\tfor(size_t i = begin; i < end; i++) {\n\t\tswitch(state) {\n\t\tcase 0:\n\t\t\tif       (CHAR_AT_SUF(i, depth) == pivot) { state = 1; break; }\n\t\t\telse if  (CHAR_AT_SUF(i, depth) > pivot)  { state = 2; break; }\n\t\t\tassert_lt(CHAR_AT_SUF(i, depth), pivot);  break;\n\t\tcase 1:\n\t\t\tif       (CHAR_AT_SUF(i, depth) > pivot)  { state = 2; break; }\n\t\t\tassert_eq(CHAR_AT_SUF(i, depth), pivot);  break;\n\t\tcase 2:\n\t\t\tassert_gt(CHAR_AT_SUF(i, depth), pivot);  break;\n\t\t}\n\t}\n\treturn true;\n}\n#endif\n\n/**\n * Assert that string s of suffix offsets into string 'host' is a seemingly\n * legitimate suffix-offset list (at this time, we just check that it doesn't\n * list any suffix twice).\n */\nstatic inline void sanityCheckInputSufs(TIndexOffU *s, size_t slen) {\n\tassert_gt(slen, 0);\n\tfor(size_t i = 0; i < slen; i++) {\n\t\t// Actually, it's convenient to allow the caller to provide\n\t\t// suffix offsets that are off the end of the host string.\n\t\t// See, e.g., build() in diff_sample.cpp.\n\t\t//assert_lt(s[i], 
length(host));\n\t\tfor(size_t j = i+1; j < slen; j++) {\n\t\t\tassert_neq(s[i], s[j]);\n\t\t}\n\t}\n}\n\n/**\n * Assert that the suffix offsets in s (offsets into 'host') really are in\n * lexicographical order up to depth 'upto'.\n */\ntemplate <typename T>\nvoid sanityCheckOrderedSufs(\n\tconst T& host,\n\tsize_t hlen,\n\tTIndexOffU *s,\n\tsize_t slen,\n\tsize_t upto,\n\tsize_t lower = 0,\n\tsize_t upper = OFF_MASK)\n{\n\tassert_lt(s[0], hlen);\n\tupper = min<size_t>(upper, slen-1);\n\tfor(size_t i = lower; i < upper; i++) {\n\t\t// Allow s[i+1] to point off the end of the string; this is\n\t\t// convenient for some callers\n\t\tif(s[i+1] >= hlen) continue;\n#ifndef NDEBUG\n\t\tif(upto == OFF_MASK) {\n\t\t\tassert(sstr_suf_lt(host, s[i], hlen, host, s[i+1], hlen, false));\n\t\t} else {\n\t\t\tif(sstr_suf_upto_lt(host, s[i], host, s[i+1], upto, false)) {\n\t\t\t\t// operator > treats shorter strings as\n\t\t\t\t// lexicographically smaller, but we want the opposite\n\t\t\t\t//assert(isPrefix(suffix(host, s[i+1]), suffix(host, s[i])));\n\t\t\t}\n\t\t}\n#endif\n\t}\n}\n\n/**\n * Main multikey quicksort function for suffixes.  Based on Bentley &\n * Sedgewick's algorithm on p.5 of their paper \"Fast Algorithms for\n * Sorting and Searching Strings\".  That algorithm has been extended in\n * three ways:\n *\n *  1. Deal with keys of different lengths by checking bounds and\n *     considering off-the-end values to be 'hi' (b/c our goal is the\n *     BWT transform, we're biased toward considering prefixes as\n *     lexicographically *greater* than their extensions).\n *  2. The multikey_qsort_suffixes version takes a single host string\n *     and a list of suffix offsets as input.  This reduces memory\n *     footprint compared to an approach that treats its input\n *     generically as a set of strings (not necessarily suffixes), thus\n *     requiring that we store at least two integers' worth of\n *     information for each string.\n *  3. 
Sorting functions take an extra \"upto\" parameter that upper-\n *     bounds the depth to which the function sorts.\n *\n * TODO: Consult a tie-breaker (like a difference cover sample) if two\n * keys share a long prefix.\n */\ntemplate<typename T>\nvoid mkeyQSortSuf(\n\tconst T& host,\n\tsize_t hlen,\n\tTIndexOffU *s,\n\tsize_t slen,\n\tint hi,\n\tsize_t begin,\n\tsize_t end,\n\tsize_t depth,\n\tsize_t upto = OFF_MASK)\n{\n\t// Helper for making the recursive call; sanity-checks arguments to\n\t// make sure that the problem actually got smaller.\n\t#define MQS_RECURSE_SUF(nbegin, nend, ndepth) { \\\n\t\tassert(nbegin > begin || nend < end || ndepth > depth); \\\n\t\tif(ndepth < upto) { /* don't exceed depth of 'upto' */ \\\n\t\t\tmkeyQSortSuf(host, hlen, s, slen, hi, nbegin, nend, ndepth, upto); \\\n\t\t} \\\n\t}\n\tassert_leq(begin, slen);\n\tassert_leq(end, slen);\n\tsize_t a, b, c, d, /*e,*/ r;\n\tsize_t n = end - begin;\n\tif(n <= 1) return;                 // 1-element list already sorted\n\tCHOOSE_AND_SWAP_PIVOT(SWAP1, CHAR_AT_SUF); // pick pivot, swap it into [begin]\n\tint v = CHAR_AT_SUF(begin, depth); // v <- randomly-selected pivot value\n\t#ifndef NDEBUG\n\t{\n\t\tbool stillInBounds = false;\n\t\tfor(size_t i = begin; i < end; i++) {\n\t\t\tif(depth < (hlen-s[i])) {\n\t\t\t\tstillInBounds = true;\n\t\t\t\tbreak;\n\t\t\t} else { /* already fell off this suffix */ }\n\t\t}\n\t\tassert(stillInBounds); // >=1 suffix must still be in bounds\n\t}\n\t#endif\n\ta = b = begin;\n\tc = d = end-1;\n\twhile(true) {\n\t\t// Invariant: everything before a is = pivot, everything\n\t\t// between a and b is <\n\t\tint bc = 0; // shouldn't have to init but gcc on Mac complains\n\t\twhile(b <= c && v >= (bc = CHAR_AT_SUF(b, depth))) {\n\t\t\tif(v == bc) {\n\t\t\t\tSWAP(s, a, b); a++;\n\t\t\t}\n\t\t\tb++;\n\t\t}\n\t\t// Invariant: everything after d is = pivot, everything\n\t\t// between c and d is >\n\t\tint cc = 0; // shouldn't have to init but gcc on Mac 
complains\n\t\twhile(b <= c && v <= (cc = CHAR_AT_SUF(c, depth))) {\n\t\t\tif(v == cc) {\n\t\t\t\tSWAP(s, c, d); d--;\n\t\t\t}\n\t\t\tc--;\n\t\t}\n\t\tif(b > c) break;\n\t\tSWAP(s, b, c);\n\t\tb++;\n\t\tc--;\n\t}\n\tassert(a > begin || c < end-1);                      // there was at least one =s\n\tassert_lt(d-c, n); // they can't all have been > pivot\n\tassert_lt(b-a, n); // they can't all have been < pivot\n\tassert(assertPartitionedSuf(host, s, slen, hi, v, begin, end, depth));  // check pre-=-swap invariant\n\tr = min(a-begin, b-a); VECSWAP(s, begin, b-r,   r);  // swap left = to center\n\tr = min(d-c, end-d-1); VECSWAP(s, b,     end-r, r);  // swap right = to center\n\tassert(assertPartitionedSuf2(host, s, slen, hi, v, begin, end, depth)); // check post-=-swap invariant\n\tr = b-a; // r <- # of <'s\n\tif(r > 0) {\n\t\tMQS_RECURSE_SUF(begin, begin + r, depth); // recurse on <'s\n\t}\n\t// Do not recurse on ='s if the pivot was the off-the-end value;\n\t// they're already fully sorted\n\tif(v != hi) {\n\t\tMQS_RECURSE_SUF(begin + r, begin + r + (a-begin) + (end-d-1), depth+1); // recurse on ='s\n\t}\n\tr = d-c; // r <- # of >'s excluding those exhausted\n\tif(r > 0 && v < hi-1) {\n\t\tMQS_RECURSE_SUF(end-r, end, depth); // recurse on >'s\n\t}\n}\n\n/**\n * Toplevel function for multikey quicksort over suffixes.\n */\ntemplate<typename T>\nvoid mkeyQSortSuf(\n\tconst T& host,\n\tTIndexOffU *s,\n\tsize_t slen,\n\tint hi,\n\tbool verbose = false,\n\tbool sanityCheck = false,\n\tsize_t upto = OFF_MASK)\n{\n\tsize_t hlen = host.length();\n\tassert_gt(slen, 0);\n\tif(sanityCheck) sanityCheckInputSufs(s, slen);\n\tmkeyQSortSuf(host, hlen, s, slen, hi, (size_t)0, slen, (size_t)0, upto);\n\tif(sanityCheck) sanityCheckOrderedSufs(host, hlen, s, slen, upto);\n}\n\n/**\n * Just like mkeyQSortSuf but all swaps are applied to s2 as well as s.\n * This is a helpful variant if, for example, the caller would like to\n * see how their input was permuted by the sort routine (in 
that case,\n * the caller would let s2 be an array s2[] where s2 is the same length\n * as s and s2[i] = i).\n */\nstruct QSortRange {\n    size_t begin;\n    size_t end;\n    size_t depth;\n};\ntemplate<typename T>\nvoid mkeyQSortSuf2(\n                   const T& host,\n                   size_t hlen,\n                   TIndexOffU *s,\n                   size_t slen,\n                   TIndexOffU *s2,\n                   int hi,\n                   size_t _begin,\n                   size_t _end,\n                   size_t _depth,\n                   size_t upto = OFF_MASK,\n                   EList<size_t>* boundaries = NULL)\n{\n    ELList<QSortRange, 3, 1024> block_list;\n    while(true) {\n        size_t begin = 0, end = 0, depth = 0;\n        if(block_list.size() == 0) {\n            begin = _begin;\n            end = _end;\n            depth = _depth;\n        } else {\n            if(block_list.back().size() > 0) {\n                begin = block_list.back()[0].begin;\n                end = block_list.back()[0].end;\n                depth = block_list.back()[0].depth;\n                block_list.back().erase(0);\n            } else {\n                block_list.resize(block_list.size() - 1);\n                if(block_list.size() == 0) {\n                    break;\n                }\n            }\n        }\n        if(depth == upto) {\n            if(boundaries != NULL) {\n                (*boundaries).push_back(end);\n            }\n            continue;\n        }\n        assert_leq(begin, slen);\n        assert_leq(end, slen);\n        size_t a, b, c, d, /*e,*/ r;\n        size_t n = end - begin;\n        if(n <= 1) { // 1-element list already sorted\n            if(n == 1 && boundaries != NULL) {\n                boundaries->push_back(end);\n            }\n            continue;\n        }\n        CHOOSE_AND_SWAP_PIVOT(SWAP2, CHAR_AT_SUF); // pick pivot, swap it into [begin]\n        int v = CHAR_AT_SUF(begin, depth); // v <- randomly-selected pivot 
value\n#ifndef NDEBUG\n        {\n            bool stillInBounds = false;\n            for(size_t i = begin; i < end; i++) {\n                if(depth < (hlen-s[i])) {\n                    stillInBounds = true;\n                    break;\n                } else { /* already fell off this suffix */ }\n            }\n            assert(stillInBounds); // >=1 suffix must still be in bounds\n        }\n#endif\n        a = b = begin;\n        c = d = /*e =*/ end-1;\n        while(true) {\n            // Invariant: everything before a is = pivot, everything\n            // between a and b is <\n            int bc = 0; // shouldn't have to init but gcc on Mac complains\n            while(b <= c && v >= (bc = CHAR_AT_SUF(b, depth))) {\n                if(v == bc) {\n                    SWAP2(s, s2, a, b); a++;\n                }\n                b++;\n            }\n            // Invariant: everything after d is = pivot, everything\n            // between c and d is >\n            int cc = 0; // shouldn't have to init but gcc on Mac complains\n            while(b <= c && v <= (cc = CHAR_AT_SUF(c, depth))) {\n                if(v == cc) {\n                    SWAP2(s, s2, c, d); d--; /*e--;*/\n                }\n                //else if(c == e && v == hi) e--;\n                c--;\n            }\n            if(b > c) break;\n            SWAP2(s, s2, b, c);\n            b++;\n            c--;\n        }\n        assert(a > begin || c < end-1);                      // there was at least one =s\n        assert_lt(/*e*/d-c, n); // they can't all have been > pivot\n        assert_lt(b-a, n); // they can't all have been < pivot\n        assert(assertPartitionedSuf(host, s, slen, hi, v, begin, end, depth));  // check pre-=-swap invariant\n        r = min(a-begin, b-a); VECSWAP2(s, s2, begin, b-r,   r);  // swap left = to center\n        r = min(d-c, end-d-1); VECSWAP2(s, s2, b,     end-r, r);  // swap right = to center\n        assert(assertPartitionedSuf2(host, s, slen, hi, 
v, begin, end, depth)); // check post-=-swap invariant\n        r = b-a; // r <- # of <'s\n        block_list.expand();\n        block_list.back().clear();\n        if(r > 0) { // recurse on <'s\n            block_list.back().expand();\n            block_list.back().back().begin = begin;\n            block_list.back().back().end = begin + r;\n            block_list.back().back().depth = depth;\n        }\n        // Do not recurse on ='s if the pivot was the off-the-end value;\n        // they're already fully sorted\n        if(v != hi) { // recurse on ='s\n            block_list.back().expand();\n            block_list.back().back().begin = begin + r;\n            block_list.back().back().end = begin + r + (a-begin) + (end-d-1);\n            block_list.back().back().depth = depth + 1;\n        }\n        r = d-c;   // r <- # of >'s excluding those exhausted\n        if(r > 0 && v < hi-1) { // recurse on >'s\n            block_list.back().expand();\n            block_list.back().back().begin = end - r;\n            block_list.back().back().end = end;\n            block_list.back().back().depth = depth;\n        }\n    }\n}\n\n/**\n * Toplevel function for multikey quicksort over suffixes with double\n * swapping.\n */\ntemplate<typename T>\nvoid mkeyQSortSuf2(\n                   const T& host,\n                   TIndexOffU *s,\n                   size_t slen,\n                   TIndexOffU *s2,\n                   int hi,\n                   bool verbose = false,\n                   bool sanityCheck = false,\n                   size_t upto = OFF_MASK,\n                   EList<size_t>* boundaries = NULL)\n{\n\tsize_t hlen = host.length();\n\tif(sanityCheck) sanityCheckInputSufs(s, slen);\n\tTIndexOffU *sOrig = NULL;\n\tif(sanityCheck) {\n\t\tsOrig = new TIndexOffU[slen];\n\t\tmemcpy(sOrig, s, OFF_SIZE * slen);\n\t}\n\tmkeyQSortSuf2(host, hlen, s, slen, s2, hi, (size_t)0, slen, (size_t)0, upto, boundaries);\n\tif(sanityCheck) {\n\t\tsanityCheckOrderedSufs(host, 
hlen, s, slen, upto);\n\t\tfor(size_t i = 0; i < slen; i++) {\n\t\t\tassert_eq(s[i], sOrig[s2[i]]);\n\t\t}\n\t\tdelete[] sOrig;\n\t}\n}\n\n// Ugly but necessary; otherwise the compiler chokes dramatically on\n// the DifferenceCoverSample<> template args to the next few functions\ntemplate <typename T>\nclass DifferenceCoverSample;\n\n/**\n * Constant time\n */\ntemplate<typename T1, typename T2> inline\nbool sufDcLt(\n\tconst T1& host,\n\tconst T2& s1,\n\tconst T2& s2,\n\tconst DifferenceCoverSample<T1>& dc,\n\tbool sanityCheck = false)\n{\n\tsize_t diff = dc.tieBreakOff(s1, s2);\n\tASSERT_ONLY(size_t hlen = host.length());\n\tassert_lt(diff, dc.v());\n\tassert_lt(diff, hlen-s1);\n\tassert_lt(diff, hlen-s2);\n\tif(sanityCheck) {\n\t\tfor(size_t i = 0; i < diff; i++) {\n\t\t\tassert_eq(host[s1+i], host[s2+i]);\n\t\t}\n\t}\n\tbool ret = dc.breakTie(s1+diff, s2+diff) < 0;\n#ifndef NDEBUG\n\tif(sanityCheck && ret != sstr_suf_lt(host, s1, hlen, host, s2, hlen, false)) {\n\t\tassert(false);\n\t}\n#endif\n\treturn ret;\n}\n\n/**\n * k log(k)\n */\ntemplate<typename T> inline\nvoid qsortSufDc(\n\tconst T& host,\n\tsize_t hlen,\n\tTIndexOffU* s,\n\tsize_t slen,\n\tconst DifferenceCoverSample<T>& dc,\n\tsize_t begin,\n\tsize_t end,\n\tbool sanityCheck = false)\n{\n\tassert_leq(end, slen);\n\tassert_lt(begin, slen);\n\tassert_gt(end, begin);\n\tsize_t n = end - begin;\n\tif(n <= 1) return;                 // 1-element list already sorted\n\t// Note: rand() didn't really cut it here; it seemed to run out of\n\t// randomness and, after a time, returned the same thing over and\n\t// over again\n\tsize_t a = (rand() % n) + begin; // choose pivot between begin and end\n\tassert_lt(a, end);\n\tassert_geq(a, begin);\n\tSWAP(s, end-1, a); // move pivot to end\n\tsize_t cur = 0;\n\tfor(size_t i = begin; i < end-1; i++) {\n\t\tif(sufDcLt(host, s[i], s[end-1], dc, sanityCheck)) {\n\t\t\tif(sanityCheck)\n\t\t\t\tassert(dollarLt(suffix(host, s[i]), suffix(host, 
s[end-1])));\n\t\t\tassert_lt(begin + cur, end-1);\n\t\t\tSWAP(s, i, begin + cur);\n\t\t\tcur++;\n\t\t}\n\t}\n\t// Put pivot into place\n\tassert_lt(cur, end-begin);\n\tSWAP(s, end-1, begin+cur);\n\tif(begin+cur > begin) qsortSufDc(host, hlen, s, slen, dc, begin, begin+cur);\n\tif(end > begin+cur+1) qsortSufDc(host, hlen, s, slen, dc, begin+cur+1, end);\n}\n\n/**\n * Toplevel function for multikey quicksort over suffixes.\n */\ntemplate<typename T1, typename T2>\nvoid mkeyQSortSufDcU8(\n\tconst T1& host1,\n\tconst T2& host,\n\tsize_t hlen,\n\tTIndexOffU* s,\n\tsize_t slen,\n\tconst DifferenceCoverSample<T1>& dc,\n\tint hi,\n\tbool verbose = false,\n\tbool sanityCheck = false)\n{\n\tif(sanityCheck) sanityCheckInputSufs(s, slen);\n\tmkeyQSortSufDcU8(host1, host, hlen, s, slen, dc, hi, 0, slen, 0, sanityCheck);\n\tif(sanityCheck) sanityCheckOrderedSufs(host1, hlen, s, slen, OFF_MASK);\n}\n\n/**\n * Return a boolean indicating whether s1 < s2 using the difference\n * cover to break the tie.\n */\ntemplate<typename T1, typename T2> inline\nbool sufDcLtU8(\n\tconst T1& host1,\n\tconst T2& host,\n\tsize_t hlen,\n\tsize_t s1,\n\tsize_t s2,\n\tconst DifferenceCoverSample<T1>& dc,\n\tbool sanityCheck = false)\n{\n\thlen += 0;\n\tsize_t diff = dc.tieBreakOff((TIndexOffU)s1, (TIndexOffU)s2);\n\tassert_lt(diff, dc.v());\n\tassert_lt(diff, hlen-s1);\n\tassert_lt(diff, hlen-s2);\n\tif(sanityCheck) {\n\t\tfor(size_t i = 0; i < diff; i++) {\n\t\t\tassert_eq(host[s1+i], host1[s2+i]);\n\t\t}\n\t}\n\tbool ret = dc.breakTie((TIndexOffU)(s1+diff), (TIndexOffU)(s2+diff)) < 0;\n\t// Sanity-check return value using dollarLt\n#ifndef NDEBUG\n\tbool ret2 = sstr_suf_lt(host1, s1, hlen, host, s2, hlen, false);\n\tassert(!sanityCheck || ret == ret2);\n#endif\n\treturn ret;\n}\n\n/**\n * k log(k)\n */\ntemplate<typename T1, typename T2> inline\nvoid qsortSufDcU8(\n\tconst T1& host1,\n\tconst T2& host,\n\tsize_t hlen,\n\tTIndexOffU* s,\n\tsize_t slen,\n\tconst DifferenceCoverSample<T1>& 
dc,\n\tsize_t begin,\n\tsize_t end,\n\tbool sanityCheck = false)\n{\n\tassert_leq(end, slen);\n\tassert_lt(begin, slen);\n\tassert_gt(end, begin);\n\tsize_t n = end - begin;\n\tif(n <= 1) return;                 // 1-element list already sorted\n\t// Note: rand() didn't really cut it here; it seemed to run out of\n\t// randomness and, after a time, returned the same thing over and\n\t// over again\n\tsize_t a = (rand() % n) + begin; // choose pivot between begin and end\n\tassert_lt(a, end);\n\tassert_geq(a, begin);\n\tSWAP(s, end-1, a); // move pivot to end\n\tsize_t cur = 0;\n\tfor(size_t i = begin; i < end-1; i++) {\n\t\tif(sufDcLtU8(host1, host, hlen, s[i], s[end-1], dc, sanityCheck)) {\n#ifndef NDEBUG\n\t\t\tif(sanityCheck) {\n\t\t\t\tassert(sstr_suf_lt(host1, s[i], hlen, host1, s[end-1], hlen, false));\n\t\t\t}\n\t\t\tassert_lt(begin + cur, end-1);\n#endif\n\t\t\tSWAP(s, i, begin + cur);\n\t\t\tcur++;\n\t\t}\n\t}\n\t// Put pivot into place\n\tassert_lt(cur, end-begin);\n\tSWAP(s, end-1, begin+cur);\n\tif(begin+cur > begin) qsortSufDcU8(host1, host, hlen, s, slen, dc, begin, begin+cur);\n\tif(end > begin+cur+1) qsortSufDcU8(host1, host, hlen, s, slen, dc, begin+cur+1, end);\n}\n\n#define BUCKET_SORT_CUTOFF (4 * 1024 * 1024)\n#define SELECTION_SORT_CUTOFF 6\n\n/**\n * Straightforwardly obtain a uint8_t-ized version of t[off].  This\n * works fine as long as TStr is not packed.\n */\ntemplate<typename TStr>\ninline uint8_t get_uint8(const TStr& t, size_t off) {\n\treturn t[off];\n}\n\n/**\n * For incomprehensible generic-programming reasons, getting a uint8_t\n * version of a character in a packed String<> requires casting first\n * to Dna then to uint8_t.\n */\ntemplate<>\ninline uint8_t get_uint8<S2bDnaString>(const S2bDnaString& t, size_t off) {\n\treturn (uint8_t)t[off];\n}\n\n/**\n * Return character at offset 'off' from the 'si'th suffix in the array\n * 's' of suffixes.  
If the character is out-of-bounds, return hi.\n */\ntemplate<typename TStr>\nstatic inline int char_at_suf_u8(\n\tconst TStr& host,\n\tsize_t hlen,\n\tTIndexOffU* s,\n\tsize_t si,\n\tsize_t off,\n\tuint8_t hi)\n{\n\treturn ((off+s[si]) < hlen) ? get_uint8(host, off+s[si]) : (hi);\n}\n\ntemplate<typename T1, typename T2>\nstatic void selectionSortSufDcU8(\n\t\tconst T1& host1,\n\t\tconst T2& host,\n        size_t hlen,\n        TIndexOffU* s,\n        size_t slen,\n        const DifferenceCoverSample<T1>& dc,\n        uint8_t hi,\n        size_t begin,\n        size_t end,\n        size_t depth,\n        bool sanityCheck = false)\n{\n#define ASSERT_SUF_LT(l, r) \\\n\tif(sanityCheck && \\\n\t   !sstr_suf_lt(host1, s[l], hlen, host1, s[r], hlen, false)) { \\\n\t\tassert(false); \\\n\t}\n\n\tassert_gt(end, begin+1);\n\tassert_leq(end-begin, SELECTION_SORT_CUTOFF);\n\tassert_eq(hi, 4);\n\tsize_t v = dc.v();\n\tif(end == begin+2) {\n\t\tsize_t off = dc.tieBreakOff(s[begin], s[begin+1]);\n\t\tif(off + s[begin] >= hlen ||\n\t\t   off + s[begin+1] >= hlen)\n\t\t{\n\t\t\toff = OFF_MASK;\n\t\t}\n\t\tif(off != OFF_MASK) {\n\t\t\tif(off < depth) {\n\t\t\t\tqsortSufDcU8<T1,T2>(host1, host, hlen, s, slen, dc,\n\t\t\t\t                    begin, end, sanityCheck);\n\t\t\t\t// It's helpful for debugging if we call this here\n\t\t\t\tif(sanityCheck) {\n\t\t\t\t\tsanityCheckOrderedSufs(host1, hlen, s, slen,\n\t\t\t\t\t                       OFF_MASK, begin, end);\n\t\t\t\t}\n\t\t\t\treturn;\n\t\t\t}\n\t\t\tv = off - depth + 1;\n\t\t}\n\t}\n\tassert_leq(v, dc.v());\n\tsize_t lim = v;\n\tassert_geq(lim, 0);\n\tfor(size_t i = begin; i < end-1; i++) {\n\t\tsize_t targ = i;\n\t\tsize_t targoff = depth + s[i];\n\t\tfor(size_t j = i+1; j < end; j++) {\n\t\t\tassert_neq(j, targ);\n\t\t\tsize_t joff = depth + s[j];\n\t\t\tsize_t k;\n\t\t\tfor(k = 0; k <= lim; k++) {\n\t\t\t\tassert_neq(j, targ);\n\t\t\t\tuint8_t jc = (k + joff < hlen)    ? 
get_uint8(host, k + joff)    : hi;\n\t\t\t\tuint8_t tc = (k + targoff < hlen) ? get_uint8(host, k + targoff) : hi;\n\t\t\t\tassert(jc != hi || tc != hi);\n\t\t\t\tif(jc > tc) {\n\t\t\t\t\t// the jth suffix is greater than the current\n\t\t\t\t\t// smallest suffix\n\t\t\t\t\tASSERT_SUF_LT(targ, j);\n\t\t\t\t\tbreak;\n\t\t\t\t} else if(jc < tc) {\n\t\t\t\t\t// the jth suffix is less than the current smallest\n\t\t\t\t\t// suffix, so update smallest to be j\n\t\t\t\t\tASSERT_SUF_LT(j, targ);\n\t\t\t\t\ttarg = j;\n\t\t\t\t\ttargoff = joff;\n\t\t\t\t\tbreak;\n\t\t\t\t} else if(k == lim) {\n\t\t\t\t\t// Check whether either string ends immediately\n\t\t\t\t\t// after this character\n\t\t\t\t\tassert_leq(k + joff + 1, hlen);\n\t\t\t\t\tassert_leq(k + targoff + 1, hlen);\n\t\t\t\t\tif(k + joff + 1 == hlen) {\n\t\t\t\t\t\t// targ < j\n\t\t\t\t\t\tassert_neq(k + targoff + 1, hlen);\n\t\t\t\t\t\tASSERT_SUF_LT(targ, j);\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t} else if(k + targoff + 1 == hlen) {\n\t\t\t\t\t\t// j < targ\n\t\t\t\t\t\tASSERT_SUF_LT(j, targ);\n\t\t\t\t\t\ttarg = j;\n\t\t\t\t\t\ttargoff = joff;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\t// They're equal so far, keep going\n\t\t\t\t}\n\t\t\t}\n\t\t\t// The jth suffix was equal to the current smallest suffix\n\t\t\t// up to the difference-cover period, so disambiguate with\n\t\t\t// difference cover\n\t\t\tif(k == lim+1) {\n\t\t\t\tassert_neq(j, targ);\n\t\t\t\tif(sufDcLtU8(host1, host, hlen, s[j], s[targ], dc, sanityCheck)) {\n\t\t\t\t\t// j < targ\n\t\t\t\t\tassert(!sufDcLtU8(host1, host, hlen, s[targ], s[j], dc, sanityCheck));\n\t\t\t\t\tASSERT_SUF_LT(j, targ);\n\t\t\t\t\ttarg = j;\n\t\t\t\t\ttargoff = joff;\n\t\t\t\t} else {\n\t\t\t\t\tassert(sufDcLtU8(host1, host, hlen, s[targ], s[j], dc, sanityCheck));\n\t\t\t\t\tASSERT_SUF_LT(targ, j); // !\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(i != targ) {\n\t\t\tASSERT_SUF_LT(targ, i);\n\t\t\t// swap i and targ\n\t\t\tTIndexOffU tmp = s[i];\n\t\t\ts[i] = 
s[targ];\n\t\t\ts[targ] = tmp;\n\t\t}\n\t\tfor(size_t j = i+1; j < end; j++) {\n\t\t\tASSERT_SUF_LT(i, j);\n\t\t}\n\t}\n\tif(sanityCheck) {\n\t\tsanityCheckOrderedSufs(host1, hlen, s, slen, OFF_MASK, begin, end);\n\t}\n}\n\ntemplate<typename T1, typename T2>\nstatic void bucketSortSufDcU8(\n\t\tconst T1& host1,\n\t\tconst T2& host,\n        size_t hlen,\n        TIndexOffU* s,\n        size_t slen,\n        const DifferenceCoverSample<T1>& dc,\n        uint8_t hi,\n        size_t _begin,\n        size_t _end,\n        size_t _depth,\n        bool sanityCheck = false)\n{\n    // 4 scratch buckets (4M entries each) for bucket-sorting C, G, T, $;\n    // A's are compacted in place at the front of s[begin..end)\n    TIndexOffU* bkts[4];\n    for(size_t i = 0; i < 4; i++) {\n        bkts[i] = new TIndexOffU[4 * 1024 * 1024];\n    }\n    ELList<size_t, 5, 1024> block_list;\n    bool first = true;\n    while(true) {\n        size_t begin = 0, end = 0;\n\n\tif ( first )\n\t{\n\t\tbegin = _begin ;\n\t\tend = _end ;\n\t\tfirst = false ;\n\t}\n\telse\n\t{\n\t\tif(block_list.size() == 0) {\n\t\t\t//begin = _begin;\n\t\t\t//end = _end;\n\t\t\tbreak ;\n\t\t} else {\n\t\t\tif(block_list.back().size() > 1) {\n\t\t\t\tend = block_list.back().back(); block_list.back().pop_back();\n\t\t\t\tbegin = block_list.back().back();\n\t\t\t} else {\n\t\t\t\tblock_list.resize(block_list.size() - 1);\n\t\t\t\tif(block_list.size() == 0) {\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n        size_t depth = block_list.size() + _depth;\n        assert_leq(end-begin, BUCKET_SORT_CUTOFF);\n        assert_eq(hi, 4);\n        if(end <= begin + 1) { // 1-element list already sorted\n            continue;\n        }\n        if(depth > dc.v()) {\n            // Quicksort the remaining suffixes using difference cover\n            // for constant-time comparisons; this is O(k*log(k)) where\n            // k=(end-begin)\n            qsortSufDcU8<T1,T2>(host1, host, hlen, s, slen, dc, begin, end, sanityCheck);\n            continue;\n        }\n        if(end-begin <= 
SELECTION_SORT_CUTOFF) {\n            // Selection sort remaining items\n            selectionSortSufDcU8(host1, host, hlen, s, slen, dc, hi,\n                                 begin, end, depth, sanityCheck);\n            if(sanityCheck) {\n                sanityCheckOrderedSufs(host1, hlen, s, slen,\n                                       OFF_MASK, begin, end);\n            }\n            continue;\n        }\n        size_t cnts[] = { 0, 0, 0, 0, 0 };\n        for(size_t i = begin; i < end; i++) {\n            size_t off = depth + s[i];\n            uint8_t c = (off < hlen) ? get_uint8(host, off) : hi;\n            assert_leq(c, 4);\n            if(c == 0) {\n                s[begin + cnts[0]++] = s[i];\n            } else {\n                bkts[c-1][cnts[c]++] = s[i];\n            }\n        }\n        assert_eq(cnts[0] + cnts[1] + cnts[2] + cnts[3] + cnts[4], end - begin);\n        size_t cur = begin + cnts[0];\n        if(cnts[1] > 0) { memcpy(&s[cur], bkts[0], cnts[1] << (OFF_SIZE/4 + 1)); cur += cnts[1]; }\n        if(cnts[2] > 0) { memcpy(&s[cur], bkts[1], cnts[2] << (OFF_SIZE/4 + 1)); cur += cnts[2]; }\n        if(cnts[3] > 0) { memcpy(&s[cur], bkts[2], cnts[3] << (OFF_SIZE/4 + 1)); cur += cnts[3]; }\n        if(cnts[4] > 0) { memcpy(&s[cur], bkts[3], cnts[4] << (OFF_SIZE/4 + 1)); }\n        // This frame is now totally finished with bkts[][], so recursive\n        // callees can safely clobber it; we're not done with cnts[], but\n        // that's local to the stack frame.\n        block_list.expand();\n        block_list.back().clear();\n        block_list.back().push_back(begin);\n        for(size_t i = 0; i < 4; i++) {\n            if(cnts[i] > 0) {\n                block_list.back().push_back(block_list.back().back() + cnts[i]);\n            }\n        }\n    }\n    // Done\n    \n    for(size_t i = 0; i < 4; i++) {\n        delete [] bkts[i];\n    }\n}\n\n/**\n * Main multikey quicksort function for suffixes.  
Based on Bentley &\n * Sedgewick's algorithm on p.5 of their paper \"Fast Algorithms for\n * Sorting and Searching Strings\".  That algorithm has been extended in\n * three ways:\n *\n *  1. Deal with keys of different lengths by checking bounds and\n *     considering off-the-end values to be 'hi' (b/c our goal is the\n *     BWT transform, we're biased toward considering prefixes as\n *     lexicographically *greater* than their extensions).\n *  2. The multikey_qsort_suffixes version takes a single host string\n *     and a list of suffix offsets as input.  This reduces memory\n *     footprint compared to an approach that treats its input\n *     generically as a set of strings (not necessarily suffixes), thus\n *     requiring that we store at least two integers worth of\n *     information for each string.\n *  3. Sorting functions take an extra \"upto\" parameter that upper-\n *     bounds the depth to which the function sorts.\n */\ntemplate<typename T1, typename T2>\nvoid mkeyQSortSufDcU8(\n\tconst T1& host1,\n\tconst T2& host,\n\tsize_t hlen,\n\tTIndexOffU* s,\n\tsize_t slen,\n\tconst DifferenceCoverSample<T1>& dc,\n\tint hi,\n\tsize_t begin,\n\tsize_t end,\n\tsize_t depth,\n\tbool sanityCheck = false)\n{\n\t// Helper for making the recursive call; sanity-checks arguments to\n\t// make sure that the problem actually got smaller.\n\t#define MQS_RECURSE_SUF_DC_U8(nbegin, nend, ndepth) { \\\n\t\tassert(nbegin > begin || nend < end || ndepth > depth); \\\n\t\tmkeyQSortSufDcU8(host1, host, hlen, s, slen, dc, hi, nbegin, nend, ndepth, sanityCheck); \\\n\t}\n\tassert_leq(begin, slen);\n\tassert_leq(end, slen);\n\tsize_t n = end - begin;\n\tif(n <= 1) return; // 1-element list already sorted\n\tif(depth > dc.v()) {\n\t\t// Quicksort the remaining suffixes using difference cover\n\t\t// for constant-time comparisons; this is O(k*log(k)) where\n\t\t// k=(end-begin)\n\t\tqsortSufDcU8<T1,T2>(host1, host, hlen, s, slen, dc, begin, end, 
sanityCheck);\n\t\tif(sanityCheck) {\n\t\t\tsanityCheckOrderedSufs(host1, hlen, s, slen, OFF_MASK, begin, end);\n\t\t}\n\t\treturn;\n\t}\n\tif(n <= BUCKET_SORT_CUTOFF) {\n\t\t// Bucket sort remaining items\n\t\tbucketSortSufDcU8(host1, host, hlen, s, slen, dc,\n\t\t                  (uint8_t)hi, begin, end, depth, sanityCheck);\n\t\tif(sanityCheck) {\n\t\t\tsanityCheckOrderedSufs(host1, hlen, s, slen, OFF_MASK, begin, end);\n\t\t}\n\t\treturn;\n\t}\n\tsize_t a, b, c, d, r;\n\tCHOOSE_AND_SWAP_PIVOT(SWAP1, CHAR_AT_SUF_U8); // choose pivot, swap to begin\n\tint v = CHAR_AT_SUF_U8(begin, depth); // v <- pivot value\n\t#ifndef NDEBUG\n\t{\n\t\tbool stillInBounds = false;\n\t\tfor(size_t i = begin; i < end; i++) {\n\t\t\tif(depth < (hlen-s[i])) {\n\t\t\t\tstillInBounds = true;\n\t\t\t\tbreak;\n\t\t\t} else { /* already fell off this suffix */ }\n\t\t}\n\t\tassert(stillInBounds); // >=1 suffix must still be in bounds\n\t}\n\t#endif\n\ta = b = begin;\n\tc = d = end-1;\n\twhile(true) {\n\t\t// Invariant: everything before a is = pivot, everything\n\t\t// between a and b is <\n\t\tint bc = 0; // shouldn't have to init but gcc on Mac complains\n\t\twhile(b <= c && v >= (bc = CHAR_AT_SUF_U8(b, depth))) {\n\t\t\tif(v == bc) {\n\t\t\t\tSWAP(s, a, b); a++;\n\t\t\t}\n\t\t\tb++;\n\t\t}\n\t\t// Invariant: everything after d is = pivot, everything\n\t\t// between c and d is >\n\t\tint cc = 0; // shouldn't have to init but gcc on Mac complains\n\t\t//bool hiLatch = true;\n\t\twhile(b <= c && v <= (cc = CHAR_AT_SUF_U8(c, depth))) {\n\t\t\tif(v == cc) {\n\t\t\t\tSWAP(s, c, d); d--;\n\t\t\t}\n\t\t\t//else if(hiLatch && cc == hi) { }\n\t\t\tc--;\n\t\t}\n\t\tif(b > c) break;\n\t\tSWAP(s, b, c);\n\t\tb++;\n\t\tc--;\n\t}\n\tassert(a > begin || c < end-1);                      // there was at least one =s\n\tassert_lt(d-c, n); // they can't all have been > pivot\n\tassert_lt(b-a, n); // they can't all have been < pivot\n\tr = min(a-begin, b-a); VECSWAP(s, begin, b-r,   r);  // swap left = to 
center\n\tr = min(d-c, end-d-1); VECSWAP(s, b,     end-r, r);  // swap right = to center\n\tr = b-a; // r <- # of <'s\n\tif(r > 0) {\n\t\tMQS_RECURSE_SUF_DC_U8(begin, begin + r, depth); // recurse on <'s\n\t}\n\t// Do not recurse on ='s if the pivot was the off-the-end value;\n\t// they're already fully sorted\n\tif(v != hi) {\n\t\tMQS_RECURSE_SUF_DC_U8(begin + r, begin + r + (a-begin) + (end-d-1), depth+1); // recurse on ='s\n\t}\n\tr = d-c; // r <- # of >'s excluding those exhausted\n\tif(r > 0 && v < hi-1) {\n\t\tMQS_RECURSE_SUF_DC_U8(end-r, end, depth); // recurse on >'s\n\t}\n}\n\n\n#endif /*MULTIKEY_QSORT_H_*/\n"
  },
  {
    "path": "opts.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef OPTS_H_\n#define OPTS_H_\n\nenum {\n\tARG_ORIG = 256,             // --orig\n\tARG_SEED,                   // --seed\n\tARG_SOLEXA_QUALS,           // --solexa-quals\n\tARG_VERBOSE,                // --verbose\n\tARG_STARTVERBOSE,           // --startverbose\n\tARG_QUIET,                  // --quiet\n\tARG_METRIC_IVAL,            // --met\n\tARG_METRIC_FILE,            // --met-file\n\tARG_METRIC_STDERR,          // --met-stderr\n\tARG_METRIC_PER_READ,        // --met-per-read\n\tARG_REFIDX,                 // --refidx\n\tARG_SANITY,                 // --sanity\n\tARG_PARTITION,              // --partition\n\tARG_INTEGER_QUALS,          // --int-quals\n\tARG_FILEPAR,                // --filepar\n\tARG_SHMEM,                  // --shmem\n\tARG_MM,                     // --mm\n\tARG_MMSWEEP,                // --mmsweep\n\tARG_FF,                     // --ff\n\tARG_FR,                     // --fr\n\tARG_RF,                     // --rf\n\tARG_NO_MIXED,               // --no-mixed\n\tARG_NO_DISCORDANT,          // --no-discordant\n\tARG_CACHE_LIM,              // --\n\tARG_CACHE_SZ,               // --\n\tARG_NO_FW,                  // --nofw\n\tARG_NO_RC,                  // --norc\n\tARG_SKIP,             
      // --skip\n\tARG_ONETWO,                 // --12\n\tARG_PHRED64,                // --phred64\n\tARG_PHRED33,                // --phred33\n\tARG_HADOOPOUT,              // --hadoopout\n\tARG_FUZZY,                  // --fuzzy\n\tARG_FULLREF,                // --fullref\n\tARG_USAGE,                  // --usage\n\tARG_SNPPHRED,               // --snpphred\n\tARG_SNPFRAC,                // --snpfrac\n\tARG_SAM_NO_QNAME_TRUNC,     // --sam-no-qname-trunc\n\tARG_SAM_OMIT_SEC_SEQ,       // --sam-omit-sec-seq\n\tARG_SAM_NOHEAD,             // --sam-noHD/--sam-nohead\n\tARG_SAM_NOSQ,               // --sam-nosq/--sam-noSQ\n\tARG_SAM_RG,                 // --sam-rg\n\tARG_SAM_RGID,               // --sam-rg-id\n\tARG_GAP_BAR,                // --gbar\n\tARG_QUALS1,                 // --Q1\n\tARG_QUALS2,                 // --Q2\n\tARG_QSEQ,                   // --qseq\n\tARG_SEED_SUMM,              // --seed-summary\n\tARG_OVERHANG,               // --overhang\n\tARG_NO_CACHE,               // --no-cache\n\tARG_USE_CACHE,              // --cache\n\tARG_NOISY_HPOLY,            // --454/--ion-torrent\n\tARG_LOCAL,                  // --local\n\tARG_END_TO_END,             // --end-to-end\n\tARG_SCAN_NARROWED,          // --scan-narrowed\n\tARG_QC_FILTER,              // --qc-filter\n\tARG_BWA_SW_LIKE,            // --bwa-sw-like\n\tARG_MULTISEED_IVAL,         // --multiseed\n\tARG_SCORE_MIN,              // --score-min\n\tARG_SCORE_MA,               // --ma\n\tARG_SCORE_MMP,              // --mp\n\tARG_SCORE_NP,               // --np\n\tARG_SCORE_RDG,              // --rdg\n\tARG_SCORE_RFG,              // --rfg\n\tARG_N_CEIL,                 // --n-ceil\n\tARG_DPAD,                   // --dpad\n\tARG_SAM_PRINT_YI,           // --mapq-print-inputs\n\tARG_ALIGN_POLICY,           // --policy\n\tARG_PRESET_VERY_FAST,       // --very-fast\n\tARG_PRESET_FAST,            // --fast\n\tARG_PRESET_SENSITIVE,       // --sensitive\n\tARG_PRESET_VERY_SENSITIVE,  // 
--very-sensitive\n\tARG_PRESET_VERY_FAST_LOCAL,      // --very-fast-local\n\tARG_PRESET_FAST_LOCAL,           // --fast-local\n\tARG_PRESET_SENSITIVE_LOCAL,      // --sensitive-local\n\tARG_PRESET_VERY_SENSITIVE_LOCAL, // --very-sensitive-local\n\tARG_NO_SCORE_PRIORITY,      // --no-score-priority\n\tARG_IGNORE_QUALS,           // --ignore-quals\n\tARG_DESC,                   // --arg-desc\n\tARG_TAB5,                   // --tab5\n\tARG_TAB6,                   // --tab6\n\tARG_WRAPPER,                // --wrapper\n\tARG_DOVETAIL,               // --dovetail\n\tARG_NO_DOVETAIL,            // --no-dovetail\n\tARG_CONTAIN,                // --contain\n\tARG_NO_CONTAIN,             // --no-contain\n\tARG_OVERLAP,                // --overlap\n\tARG_NO_OVERLAP,             // --no-overlap\n\tARG_MAPQ_V,                 // --mapq-v\n\tARG_SSE8,                   // --sse8\n\tARG_SSE8_NO,                // --no-sse8\n\tARG_UNGAPPED,               // --ungapped\n\tARG_UNGAPPED_NO,            // --no-ungapped\n\tARG_TIGHTEN,                // --tighten\n\tARG_UNGAP_THRESH,           // --ungap-thresh\n\tARG_EXACT_UPFRONT,          // --exact-upfront\n\tARG_1MM_UPFRONT,            // --1mm-upfront\n\tARG_EXACT_UPFRONT_NO,       // --no-exact-upfront\n\tARG_1MM_UPFRONT_NO,         // --no-1mm-upfront\n\tARG_1MM_MINLEN,             // --1mm-minlen\n\tARG_VERSION,                // --version\n\tARG_SEED_OFF,               // --seed-off\n\tARG_SEED_BOOST_THRESH,      // --seed-boost\n\tARG_READ_TIMES,             // --read-times\n\tARG_EXTEND_ITERS,           // --extends\n\tARG_DP_MATE_STREAK_THRESH,  // --db-mate-streak\n\tARG_DP_FAIL_STREAK_THRESH,  // --dp-fail-streak\n\tARG_UG_FAIL_STREAK_THRESH,  // --ug-fail-streak\n\tARG_EE_FAIL_STREAK_THRESH,  // --ee-fail-streak\n\tARG_DP_FAIL_THRESH,         // --dp-fails\n\tARG_UG_FAIL_THRESH,         // --ug-fails\n\tARG_MAPQ_EX,                // --mapq-extra\n\tARG_NO_EXTEND,              // --no-extend\n\tARG_REORDER,              
  // --reorder\n\tARG_SHOW_RAND_SEED,         // --show-rand-seed\n\tARG_READ_PASSTHRU,          // --passthrough\n\tARG_SAMPLE,                 // --sample\n\tARG_CP_MIN,                 // --cp-min\n\tARG_CP_IVAL,                // --cp-ival\n\tARG_TRI,                    // --tri\n\tARG_LOCAL_SEED_CACHE_SZ,    // --local-seed-cache-sz\n\tARG_CURRENT_SEED_CACHE_SZ,  // --seed-cache-sz\n\tARG_SAM_NO_UNAL,            // --no-unal\n\tARG_NON_DETERMINISTIC,      // --non-deterministic\n\tARG_TEST_25,                // --test-25\n\tARG_DESC_KB,                // --desc-kb\n\tARG_DESC_LANDING,           // --desc-landing\n\tARG_DESC_EXP,               // --desc-exp\n\tARG_DESC_FMOPS,             // --desc-fmops\n    ARG_NO_TEMPSPLICESITE,\n    ARG_PEN_CANSPLICE,\n    ARG_PEN_NONCANSPLICE,\n    ARG_PEN_CONFLICTSPLICE,\n    ARG_PEN_INTRONLEN,\n    ARG_KNOWN_SPLICESITE_INFILE,\n    ARG_NOVEL_SPLICESITE_INFILE,\n    ARG_NOVEL_SPLICESITE_OUTFILE,\n    ARG_SECONDARY,\n    ARG_NO_SPLICED_ALIGNMENT,\n    ARG_RNA_STRANDNESS,\n    ARG_SPLICESITE_DB_ONLY,\n    ARG_MIN_HITLEN,              // --min-hitlen\n    ARG_MIN_TOTALLEN,            // --min-totallen\n    ARG_HOST_TAXIDS,             // --host-taxids\n\tARG_REPORT_FILE,             // --report\n    ARG_NO_ABUNDANCE,            // --no-abundance\n    ARG_NO_TRAVERSE,             // --no-traverse\n    ARG_CLASSIFICATION_RANK,\n    ARG_EXCLUDE_TAXIDS,\n    ARG_OUT_FMT,\n    ARG_TAB_FMT_COLS,\n#ifdef USE_SRA\n    ARG_SRA_ACC,\n#endif\n    ARG_SEPARATOR,\n};\n\n#endif\n\n"
  },
  {
    "path": "outq.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"outq.h\"\n\n/**\n * Caller is telling us that they're about to write output record(s) for\n * the read with the given id.\n */\nvoid OutputQueue::beginRead(TReadId rdid, size_t threadId) {\n\tThreadSafe t(&mutex_m, threadSafe_);\n\tnstarted_++;\n\tif(reorder_) {\n\t\tassert_geq(rdid, cur_);\n\t\tassert_eq(lines_.size(), finished_.size());\n\t\tassert_eq(lines_.size(), started_.size());\n\t\tif(rdid - cur_ >= lines_.size()) {\n\t\t\t// Make sure there's enough room in lines_, started_ and finished_\n\t\t\tsize_t oldsz = lines_.size();\n\t\t\tlines_.resize(rdid - cur_ + 1);\n\t\t\tstarted_.resize(rdid - cur_ + 1);\n\t\t\tfinished_.resize(rdid - cur_ + 1);\n\t\t\tfor(size_t i = oldsz; i < lines_.size(); i++) {\n\t\t\t\tstarted_[i] = finished_[i] = false;\n\t\t\t}\n\t\t}\n\t\tstarted_[rdid - cur_] = true;\n\t\tfinished_[rdid - cur_] = false;\n\t}\n}\n\n/**\n * Writer is finished writing to \n */\nvoid OutputQueue::finishRead(const BTString& rec, TReadId rdid, size_t threadId) {\n\tThreadSafe t(&mutex_m, threadSafe_);\n\tif(reorder_) {\n\t\tassert_geq(rdid, cur_);\n\t\tassert_eq(lines_.size(), finished_.size());\n\t\tassert_eq(lines_.size(), started_.size());\n\t\tassert_lt(rdid - cur_, 
lines_.size());\n\t\tassert(started_[rdid - cur_]);\n\t\tassert(!finished_[rdid - cur_]);\n\t\tlines_[rdid - cur_] = rec;\n\t\tnfinished_++;\n\t\tfinished_[rdid - cur_] = true;\n\t\tflush(false, false); // don't force; already have lock\n\t} else {\n\t\t// obuf_ is the OutFileBuf for the output file\n\t\tobuf_.writeString(rec);\n\t\tnfinished_++;\n\t\tnflushed_++;\n\t}\n}\n\n/**\n * Write already-finished lines starting from cur_.\n */\nvoid OutputQueue::flush(bool force, bool getLock) {\n\tif(!reorder_) {\n\t\treturn;\n\t}\n\tThreadSafe t(&mutex_m, getLock && threadSafe_);\n\tsize_t nflush = 0;\n\twhile(nflush < finished_.size() && finished_[nflush]) {\n\t\tassert(started_[nflush]);\n\t\tnflush++;\n\t}\n\t// Waiting until we have several in a row to flush cuts down on copies\n\t// (but requires more buffering)\n\tif(force || nflush >= NFLUSH_THRESH) {\n\t\tfor(size_t i = 0; i < nflush; i++) {\n\t\t\tassert(started_[i]);\n\t\t\tassert(finished_[i]);\n\t\t\tobuf_.writeString(lines_[i]);\n\t\t}\n\t\tlines_.erase(0, nflush);\n\t\tstarted_.erase(0, nflush);\n\t\tfinished_.erase(0, nflush);\n\t\tcur_ += nflush;\n\t\tnflushed_ += nflush;\n\t}\n}\n\n#ifdef OUTQ_MAIN\n\n#include <iostream>\n\nusing namespace std;\n\nint main(void) {\n\tcerr << \"Case 1 (one thread) ... 
\";\n\t{\n\t\tOutFileBuf ofb;\n\t\tOutputQueue oq(ofb, false);\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(0, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.beginRead(1);\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(1, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.beginRead(3);\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(2, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.beginRead(2);\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(3, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.flush();\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(3, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.beginRead(0);\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.flush();\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\toq.finishRead(0);\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(1, oq.numFinished());\n\t\toq.flush();\n\t\tassert_eq(0, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(1, oq.numFinished());\n\t\toq.flush(true);\n\t\tassert_eq(1, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(1, oq.numFinished());\n\t\toq.finishRead(2);\n\t\tassert_eq(1, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(2, oq.numFinished());\n\t\toq.flush(true);\n\t\tassert_eq(1, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(2, oq.numFinished());\n\t\toq.finishRead(1);\n\t\tassert_eq(1, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(3, oq.numFinished());\n\t\toq.flush(true);\n\t\tassert_eq(3, oq.numFlushed());\n\t\tassert_eq(4, oq.numStarted());\n\t\tassert_eq(3, oq.numFinished());\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Case 2 (one thread) ... 
\";\n\t{\n\t\tOutFileBuf ofb;\n\t\tOutputQueue oq(ofb, false);\n\t\tBTString& buf1 = oq.beginRead(0);\n\t\tBTString& buf2 = oq.beginRead(1);\n\t\tBTString& buf3 = oq.beginRead(2);\n\t\tBTString& buf4 = oq.beginRead(3);\n\t\tBTString& buf5 = oq.beginRead(4);\n\t\tassert_eq(5, oq.numStarted());\n\t\tassert_eq(0, oq.numFinished());\n\t\tbuf1.install(\"A\\n\");\n\t\tbuf2.install(\"B\\n\");\n\t\tbuf3.install(\"C\\n\");\n\t\tbuf4.install(\"D\\n\");\n\t\tbuf5.install(\"E\\n\");\n\t\toq.finishRead(4);\n\t\toq.finishRead(1);\n\t\toq.finishRead(0);\n\t\toq.finishRead(2);\n\t\toq.finishRead(3);\n\t\toq.flush(true);\n\t\tassert_eq(5, oq.numFlushed());\n\t\tassert_eq(5, oq.numStarted());\n\t\tassert_eq(5, oq.numFinished());\n\t\tofb.flush();\n\t}\n\tcerr << \"PASSED\" << endl;\n\treturn 0;\n}\n\n#endif /*def OUTQ_MAIN*/\n"
  },
  {
    "path": "outq.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef OUTQ_H_\n#define OUTQ_H_\n\n#include \"assert_helpers.h\"\n#include \"ds.h\"\n#include \"sstring.h\"\n#include \"read.h\"\n#include \"threading.h\"\n#include \"mem_ids.h\"\n\n/**\n * Encapsulates a list of lines of output.  If the earliest as-yet-unreported\n * read has id N and Bowtie 2 wants to write a record for read with id N+1, we\n * resize the lines_ and committed_ lists to have at least 2 elements (1 for N,\n * 1 for N+1) and return the BTString * associated with the 2nd element.  
When\n * finishRead() is called for the read with id N, the record is stored and\n * flush() writes out the run of finished records starting at cur_ once it is\n * forced or the run reaches NFLUSH_THRESH records.\n */\nclass OutputQueue {\n\n\tstatic const size_t NFLUSH_THRESH = 8;\n\npublic:\n\n\tOutputQueue(\n\t\tOutFileBuf& obuf,\n\t\tbool reorder,\n\t\tsize_t nthreads,\n\t\tbool threadSafe,\n\t\tTReadId rdid = 0) :\n\t\tobuf_(obuf),\n\t\tcur_(rdid),\n\t\tnstarted_(0),\n\t\tnfinished_(0),\n\t\tnflushed_(0),\n\t\tlines_(RES_CAT),\n\t\tstarted_(RES_CAT),\n\t\tfinished_(RES_CAT),\n\t\treorder_(reorder),\n\t\tthreadSafe_(threadSafe),\n        mutex_m()\n\t{\n\t\tassert(nthreads <= 1 || threadSafe);\n\t}\n\n\t/**\n\t * Caller is telling us that they're about to write output record(s) for\n\t * the read with the given id.\n\t */\n\tvoid beginRead(TReadId rdid, size_t threadId);\n\t\n\t/**\n\t * Writer is finished with the record for the read with the given id;\n\t * store it (when reordering) or write it to the output buffer right away.\n\t */\n\tvoid finishRead(const BTString& rec, TReadId rdid, size_t threadId);\n\t\n\t/**\n\t * Return the number of records currently being buffered.\n\t */\n\tsize_t size() const {\n\t\treturn lines_.size();\n\t}\n\t\n\t/**\n\t * Return the number of records that have been flushed so far.\n\t */\n\tTReadId numFlushed() const {\n\t\treturn nflushed_;\n\t}\n\n\t/**\n\t * Return the number of records that have been started so far.\n\t */\n\tTReadId numStarted() const {\n\t\treturn nstarted_;\n\t}\n\n\t/**\n\t * Return the number of records that have been finished so far.\n\t */\n\tTReadId numFinished() const {\n\t\treturn nfinished_;\n\t}\n\n\t/**\n\t * Write already-finished lines starting from cur_.\n\t */\n\tvoid flush(bool force = false, bool getLock = true);\n\nprotected:\n\n\tOutFileBuf&     obuf_;\n\tTReadId         cur_;\n\tTReadId         nstarted_;\n\tTReadId         nfinished_;\n\tTReadId         nflushed_;\n\tEList<BTString> lines_;\n\tEList<bool>     started_;\n\tEList<bool>     finished_;\n\tbool            reorder_;\n\tbool            threadSafe_;\n\tMUTEX_T         mutex_m;\n};\n\nclass OutputQueueMark {\npublic:\n\tOutputQueueMark(\n\t\tOutputQueue& q,\n\t\tconst BTString& 
rec,\n\t\tTReadId rdid,\n\t\tsize_t threadId) :\n\t\tq_(q),\n\t\trec_(rec),\n\t\trdid_(rdid),\n\t\tthreadId_(threadId)\n\t{\n\t\tq_.beginRead(rdid, threadId);\n\t}\n\t\n\t~OutputQueueMark() {\n\t\tq_.finishRead(rec_, rdid_, threadId_);\n\t}\n\t\nprotected:\n\tOutputQueue& q_;\n\tconst BTString& rec_;\n\tTReadId rdid_;\n\tsize_t threadId_;\n};\n\n#endif\n"
  },
  {
    "path": "pat.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <cmath>\n#include <iostream>\n#include <string>\n#include <stdexcept>\n#include \"sstring.h\"\n\n#include \"pat.h\"\n#include \"filebuf.h\"\n#include \"formats.h\"\n\n#ifdef USE_SRA\n\n#include \"tinythread.h\"\n#include <ncbi-vdb/NGS.hpp>\n#include <ngs/ErrorMsg.hpp>\n#include <ngs/ReadCollection.hpp>\n#include <ngs/ReadIterator.hpp>\n#include <ngs/Read.hpp>\n\n#endif\n\nusing namespace std;\n\n/**\n * Return a new dynamically allocated PatternSource for the given\n * format, using the given list of strings as the filenames to read\n * from or as the sequences themselves (i.e. 
if -c was used).\n */\nPatternSource* PatternSource::patsrcFromStrings(\n                                                const PatternParams& p,\n                                                const EList<string>& qs,\n                                                int nthreads)\n{\n\tswitch(p.format) {\n\t\tcase FASTA:       return new FastaPatternSource(qs, p);\n\t\tcase FASTA_CONT:  return new FastaContinuousPatternSource(qs, p);\n\t\tcase RAW:         return new RawPatternSource(qs, p);\n\t\tcase FASTQ:       return new FastqPatternSource(qs, p);\n\t\tcase TAB_MATE5:   return new TabbedPatternSource(qs, p, false);\n\t\tcase TAB_MATE6:   return new TabbedPatternSource(qs, p, true);\n\t\tcase CMDLINE:     return new VectorPatternSource(qs, p);\n\t\tcase QSEQ:        return new QseqPatternSource(qs, p);\n#ifdef USE_SRA\n        case SRA_FASTA:\n        case SRA_FASTQ: return new SRAPatternSource(qs, p, nthreads);\n#endif\n\t\tdefault: {\n\t\t\tcerr << \"Internal error; bad patsrc format: \" << p.format << endl;\n\t\t\tthrow 1;\n\t\t}\n\t}\n}\n\n/**\n * The main member function for dispensing patterns.\n *\n * Returns true iff a pair was parsed successfully.\n */\nbool PatternSource::nextReadPair(\n\tRead& ra,\n\tRead& rb,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done,\n\tbool& paired,\n\tbool fixName)\n{\n\t// nextReadPairImpl does the reading from the ultimate source;\n\t// it is implemented in concrete subclasses\n\tsuccess = done = paired = false;\n\tnextReadPairImpl(ra, rb, rdid, endid, success, done, paired);\n\tif(success) {\n\t\t// Construct reversed versions of fw and rc seqs/quals\n\t\tra.finalize();\n\t\tif(!rb.empty()) {\n\t\t\trb.finalize();\n\t\t}\n\t\t// Fill in the random-seed field using a combination of\n\t\t// information from the user-specified seed and the read\n\t\t// sequence, qualities, and name\n\t\tra.seed = genRandSeed(ra.patFw, ra.qual, ra.name, seed_);\n\t\tif(!rb.empty()) {\n\t\t\trb.seed = genRandSeed(rb.patFw, 
rb.qual, rb.name, seed_);\n\t\t}\n\t}\n\treturn success;\n}\n\n/**\n * The main member function for dispensing patterns.\n */\nbool PatternSource::nextRead(\n\tRead& r,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done)\n{\n\t// nextPatternImpl does the reading from the ultimate source;\n\t// it is implemented in concrete subclasses\n\tnextReadImpl(r, rdid, endid, success, done);\n\tif(success) {\n\t\t// Construct the reversed versions of the fw and rc seqs\n\t\t// and quals\n\t\tr.finalize();\n\t\t// Fill in the random-seed field using a combination of\n\t\t// information from the user-specified seed and the read\n\t\t// sequence, qualities, and name\n\t\tr.seed = genRandSeed(r.patFw, r.qual, r.name, seed_);\n\t}\n\treturn success;\n}\n\n/**\n * Get the next paired or unpaired read from the wrapped\n * PairedPatternSource.\n */\nbool WrappedPatternSourcePerThread::nextReadPair(\n\tbool& success,\n\tbool& done,\n\tbool& paired,\n\tbool fixName)\n{\n\tPatternSourcePerThread::nextReadPair(success, done, paired, fixName);\n\tASSERT_ONLY(TReadId lastRdId = rdid_);\n\tbuf1_.reset();\n\tbuf2_.reset();\n\tpatsrc_.nextReadPair(buf1_, buf2_, rdid_, endid_, success, done, paired, fixName);\n\tassert(!success || rdid_ != lastRdId);\n\treturn success;\n}\n\n/**\n * The main member function for dispensing pairs of reads or\n * singleton reads.  
Returns true iff ra and rb contain a new\n * pair; returns false if ra contains a new unpaired read.\n */\nbool PairedSoloPatternSource::nextReadPair(\n\tRead& ra,\n\tRead& rb,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done,\n\tbool& paired,\n\tbool fixName)\n{\n\tuint32_t cur = cur_;\n\tsuccess = false;\n\twhile(cur < src_->size()) {\n\t\t// Patterns from srca_[cur_] are unpaired\n\t\tdo {\n\t\t\t(*src_)[cur]->nextReadPair(\n\t\t\t\tra, rb, rdid, endid, success, done, paired, fixName);\n\t\t} while(!success && !done);\n\t\tif(!success) {\n\t\t\tassert(done);\n\t\t\t// If patFw is empty, that's our signal that the\n\t\t\t// input dried up\n\t\t\tlock();\n\t\t\tif(cur + 1 > cur_) cur_++;\n\t\t\tcur = cur_;\n\t\t\tunlock();\n\t\t\tcontinue; // on to next pair of PatternSources\n\t\t}\n\t\tassert(success);\n\t\tra.seed = genRandSeed(ra.patFw, ra.qual, ra.name, seed_);\n\t\tif(!rb.empty()) {\n\t\t\trb.seed = genRandSeed(rb.patFw, rb.qual, rb.name, seed_);\n\t\t\tif(fixName) {\n\t\t\t\tra.fixMateName(1);\n\t\t\t\trb.fixMateName(2);\n\t\t\t}\n\t\t}\n\t\tra.rdid = rdid;\n\t\tra.endid = endid;\n\t\tif(!rb.empty()) {\n\t\t\trb.rdid = rdid;\n\t\t\trb.endid = endid+1;\n\t\t}\n\t\tra.mate = 1;\n\t\trb.mate = 2;\n\t\treturn true; // paired\n\t}\n\tassert_leq(cur, src_->size());\n\tdone = (cur == src_->size());\n\treturn false;\n}\n\n/**\n * The main member function for dispensing pairs of reads or\n * singleton reads.  
Returns true iff ra and rb contain a new\n * pair; returns false if ra contains a new unpaired read.\n */\nbool PairedDualPatternSource::nextReadPair(\n\tRead& ra,\n\tRead& rb,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done,\n\tbool& paired,\n\tbool fixName)\n{\n\t// 'cur' indexes the current pair of PatternSources\n\tuint32_t cur;\n\t{\n\t\tlock();\n\t\tcur = cur_;\n\t\tunlock();\n\t}\n\tsuccess = false;\n\tdone = true;\n\twhile(cur < srca_->size()) {\n\t\tif((*srcb_)[cur] == NULL) {\n\t\t\tpaired = false;\n\t\t\t// Patterns from srca_ are unpaired\n\t\t\tdo {\n\t\t\t\t(*srca_)[cur]->nextRead(ra, rdid, endid, success, done);\n\t\t\t} while(!success && !done);\n\t\t\tif(!success) {\n\t\t\t\tassert(done);\n\t\t\t\tlock();\n\t\t\t\tif(cur + 1 > cur_) cur_++;\n\t\t\t\tcur = cur_; // Move on to next PatternSource\n\t\t\t\tunlock();\n\t\t\t\tcontinue; // on to next pair of PatternSources\n\t\t\t}\n\t\t\tra.rdid = rdid;\n\t\t\tra.endid = endid;\n\t\t\tra.mate  = 0;\n\t\t\treturn success;\n\t\t} else {\n\t\t\tpaired = true;\n\t\t\t// Patterns from srca_[cur_] and srcb_[cur_] are paired\n\t\t\tTReadId rdid_a = 0, endid_a = 0;\n\t\t\tTReadId rdid_b = 0, endid_b = 0;\n\t\t\tbool success_a = false, done_a = false;\n\t\t\tbool success_b = false, done_b = false;\n\t\t\t// Lock to ensure that this thread gets parallel reads\n\t\t\t// in the two mate files\n\t\t\tlock();\n\t\t\tdo {\n\t\t\t\t(*srca_)[cur]->nextRead(ra, rdid_a, endid_a, success_a, done_a);\n\t\t\t} while(!success_a && !done_a);\n\t\t\tdo {\n\t\t\t\t(*srcb_)[cur]->nextRead(rb, rdid_b, endid_b, success_b, done_b);\n\t\t\t} while(!success_b && !done_b);\n\t\t\tif(!success_a && success_b) {\n\t\t\t\tcerr << \"Error, fewer reads in file specified with -1 than in file specified with -2\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t} else if(!success_a) {\n\t\t\t\tassert(done_a && done_b);\n\t\t\t\tif(cur + 1 > cur_) cur_++;\n\t\t\t\tcur = cur_; // Move on to next 
PatternSource\n\t\t\t\tunlock();\n\t\t\t\tcontinue; // on to next pair of PatternSources\n\t\t\t} else if(!success_b) {\n\t\t\t\tcerr << \"Error, fewer reads in file specified with -2 than in file specified with -1\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tassert_eq(rdid_a, rdid_b);\n\t\t\t//assert_eq(endid_a+1, endid_b);\n\t\t\tassert_eq(success_a, success_b);\n\t\t\tunlock();\n\t\t\tif(fixName) {\n\t\t\t\tra.fixMateName(1);\n\t\t\t\trb.fixMateName(2);\n\t\t\t}\n\t\t\trdid = rdid_a;\n\t\t\tendid = endid_a;\n\t\t\tsuccess = success_a;\n\t\t\tdone = done_a;\n\t\t\tra.rdid = rdid;\n\t\t\tra.endid = endid;\n\t\t\tif(!rb.empty()) {\n\t\t\t\trb.rdid = rdid;\n\t\t\t\trb.endid = endid+1;\n\t\t\t}\n\t\t\tra.mate = 1;\n\t\t\trb.mate = 2;\n\t\t\treturn success;\n\t\t}\n\t}\n\treturn success;\n}\n\n/**\n * Return the number of reads attempted.\n */\npair<TReadId, TReadId> PairedDualPatternSource::readCnt() const {\n\tuint64_t rets = 0llu, retp = 0llu;\n\tfor(size_t i = 0; i < srca_->size(); i++) {\n\t\tif((*srcb_)[i] == NULL) {\n\t\t\trets += (*srca_)[i]->readCnt();\n\t\t} else {\n\t\t\tassert_eq((*srca_)[i]->readCnt(), (*srcb_)[i]->readCnt());\n\t\t\tretp += (*srca_)[i]->readCnt();\n\t\t}\n\t}\n\treturn make_pair(rets, retp);\n}\n\n/**\n * Given the values for all of the various arguments used to specify\n * the read and quality input, create a list of pattern sources to\n * dispense them.\n */\nPairedPatternSource* PairedPatternSource::setupPatternSources(\n\tconst EList<string>& si,   // singles, from argv\n\tconst EList<string>& m1,   // mate1's, from -1 arg\n\tconst EList<string>& m2,   // mate2's, from -2 arg\n\tconst EList<string>& m12,  // both mates on each line, from --12 arg\n#ifdef USE_SRA\n    const EList<string>& sra_accs,\n#endif\n\tconst EList<string>& q,    // qualities associated with singles\n\tconst EList<string>& q1,   // qualities associated with m1\n\tconst EList<string>& q2,   // qualities associated with m2\n\tconst PatternParams& p,    // read-in 
parameters\n                                                              int nthreads,\n\tbool verbose)              // be talkative?\n{\n\tEList<PatternSource*>* a  = new EList<PatternSource*>();\n\tEList<PatternSource*>* b  = new EList<PatternSource*>();\n\tEList<PatternSource*>* ab = new EList<PatternSource*>();\n\t// Create list of pattern sources for paired reads appearing\n\t// interleaved in a single file\n\tfor(size_t i = 0; i < m12.size(); i++) {\n\t\tconst EList<string>* qs = &m12;\n\t\tEList<string> tmp;\n\t\tif(p.fileParallel) {\n\t\t\t// Feed query files one to each PatternSource\n\t\t\tqs = &tmp;\n\t\t\ttmp.push_back(m12[i]);\n\t\t\tassert_eq(1, tmp.size());\n\t\t}\n\t\tab->push_back(PatternSource::patsrcFromStrings(p, *qs, nthreads));\n\t\tif(!p.fileParallel) {\n\t\t\tbreak;\n\t\t}\n\t}\n    \n#ifdef USE_SRA\n    for(size_t i = 0; i < sra_accs.size(); i++) {\n        const EList<string>* qs = &sra_accs;\n        EList<string> tmp;\n        if(p.fileParallel) {\n            // Feed query files one to each PatternSource\n            qs = &tmp;\n            tmp.push_back(sra_accs[i]);\n            assert_eq(1, tmp.size());\n        }\n        ab->push_back(PatternSource::patsrcFromStrings(p, *qs, nthreads));\n        if(!p.fileParallel) {\n            break;\n        }\n    }\n#endif\n\n\t// Create list of pattern sources for paired reads\n\tfor(size_t i = 0; i < m1.size(); i++) {\n\t\tconst EList<string>* qs = &m1;\n\t\tEList<string> tmpSeq;\n\t\tEList<string> tmpQual;\n\t\tif(p.fileParallel) {\n\t\t\t// Feed query files one to each PatternSource\n\t\t\tqs = &tmpSeq;\n\t\t\ttmpSeq.push_back(m1[i]);\n\t\t\tassert_eq(1, tmpSeq.size());\n\t\t}\n\t\ta->push_back(PatternSource::patsrcFromStrings(p, *qs, nthreads));\n\t\tif(!p.fileParallel) {\n\t\t\tbreak;\n\t\t}\n\t}\n\n\t// Create list of pattern sources for paired reads\n\tfor(size_t i = 0; i < m2.size(); i++) {\n\t\tconst EList<string>* qs = &m2;\n\t\tEList<string> tmpSeq;\n\t\tEList<string> 
tmpQual;\n\t\tif(p.fileParallel) {\n\t\t\t// Feed query files one to each PatternSource\n\t\t\tqs = &tmpSeq;\n\t\t\ttmpSeq.push_back(m2[i]);\n\t\t\tassert_eq(1, tmpSeq.size());\n\t\t}\n\t\tb->push_back(PatternSource::patsrcFromStrings(p, *qs, nthreads));\n\t\tif(!p.fileParallel) {\n\t\t\tbreak;\n\t\t}\n\t}\n\t// All mates/mate files must be paired\n\tassert_eq(a->size(), b->size());\n\n\t// Create list of pattern sources for the unpaired reads\n\tfor(size_t i = 0; i < si.size(); i++) {\n\t\tconst EList<string>* qs = &si;\n\t\tPatternSource* patsrc = NULL;\n\t\tEList<string> tmpSeq;\n\t\tEList<string> tmpQual;\n\t\tif(p.fileParallel) {\n\t\t\t// Feed query files one to each PatternSource\n\t\t\tqs = &tmpSeq;\n\t\t\ttmpSeq.push_back(si[i]);\n\t\t\tassert_eq(1, tmpSeq.size());\n\t\t}\n\t\tpatsrc = PatternSource::patsrcFromStrings(p, *qs, nthreads);\n\t\tassert(patsrc != NULL);\n\t\ta->push_back(patsrc);\n\t\tb->push_back(NULL);\n\t\tif(!p.fileParallel) {\n\t\t\tbreak;\n\t\t}\n\t}\n\n\tPairedPatternSource *patsrc = NULL;\n#ifdef USE_SRA\n    if(m12.size() > 0 || sra_accs.size() > 0) {\n#else\n    if(m12.size() > 0) {\n#endif\n\t\tpatsrc = new PairedSoloPatternSource(ab, p);\n\t\tfor(size_t i = 0; i < a->size(); i++) delete (*a)[i];\n\t\tfor(size_t i = 0; i < b->size(); i++) delete (*b)[i];\n\t\tdelete a; delete b;\n\t} else {\n\t\tpatsrc = new PairedDualPatternSource(a, b, p);\n\t\tfor(size_t i = 0; i < ab->size(); i++) delete (*ab)[i];\n\t\tdelete ab;\n\t}\n\treturn patsrc;\n}\n\nVectorPatternSource::VectorPatternSource(\n\tconst EList<string>& v,\n\tconst PatternParams& p) :\n\tPatternSource(p),\n\tcur_(p.skip),\n\tskip_(p.skip),\n\tpaired_(false),\n\tv_(),\n\tquals_()\n{\n\tfor(size_t i = 0; i < v.size(); i++) {\n\t\tEList<string> ss;\n\t\ttokenize(v[i], \":\", ss, 2);\n\t\tassert_gt(ss.size(), 0);\n\t\tassert_leq(ss.size(), 2);\n\t\t// Initialize s\n\t\tstring s = ss[0];\n\t\tint mytrim5 = gTrim5;\n\t\tif(gColor && s.length() > 1) {\n\t\t\t// This may be a primer 
character.  If so, keep it in the\n\t\t\t// 'primer' field of the read buf and parse the rest of the\n\t\t\t// read without it.\n\t\t\tint c = toupper(s[0]);\n\t\t\tif(asc2dnacat[c] > 0) {\n\t\t\t\t// First char is a DNA char\n\t\t\t\tint c2 = toupper(s[1]);\n\t\t\t\t// Second char is a color char\n\t\t\t\tif(asc2colcat[c2] > 0) {\n\t\t\t\t\tmytrim5 += 2; // trim primer and first color\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(gColor) {\n\t\t\t// Convert '0'-'3' to 'A'-'T'\n\t\t\tfor(size_t i = 0; i < s.length(); i++) {\n\t\t\t\tif(s[i] >= '0' && s[i] <= '4') {\n\t\t\t\t\ts[i] = \"ACGTN\"[(int)s[i] - '0'];\n\t\t\t\t}\n\t\t\t\tif(s[i] == '.') s[i] = 'N';\n\t\t\t}\n\t\t}\n\t\tif(s.length() <= (size_t)(gTrim3 + mytrim5)) {\n\t\t\t// Entire read is trimmed away\n\t\t\ts.clear();\n\t\t} else {\n\t\t\t// Trim on 5' (high-quality) end\n\t\t\tif(mytrim5 > 0) {\n\t\t\t\ts.erase(0, mytrim5);\n\t\t\t}\n\t\t\t// Trim on 3' (low-quality) end\n\t\t\tif(gTrim3 > 0) {\n\t\t\t\ts.erase(s.length()-gTrim3);\n\t\t\t}\n\t\t}\n\t\t//  Initialize vq\n\t\tstring vq;\n\t\tif(ss.size() == 2) {\n\t\t\tvq = ss[1];\n\t\t}\n\t\t// Trim qualities\n\t\tif(vq.length() > (size_t)(gTrim3 + mytrim5)) {\n\t\t\t// Trim on 5' (high-quality) end\n\t\t\tif(mytrim5 > 0) {\n\t\t\t\tvq.erase(0, mytrim5);\n\t\t\t}\n\t\t\t// Trim on 3' (low-quality) end\n\t\t\tif(gTrim3 > 0) {\n\t\t\t\tvq.erase(vq.length()-gTrim3);\n\t\t\t}\n\t\t}\n\t\t// Pad quals with Is if necessary; this shouldn't happen\n\t\twhile(vq.length() < s.length()) {\n\t\t\tvq.push_back('I');\n\t\t}\n\t\t// Truncate quals to match length of read if necessary;\n\t\t// this shouldn't happen\n\t\tif(vq.length() > s.length()) {\n\t\t\tvq.erase(s.length());\n\t\t}\n\t\tassert_eq(vq.length(), s.length());\n\t\tv_.expand();\n\t\tv_.back().installChars(s);\n\t\tquals_.push_back(BTString(vq));\n\t\ttrimmed3_.push_back(gTrim3);\n\t\ttrimmed5_.push_back(mytrim5);\n\t\tostringstream os;\n\t\tos << 
(names_.size());\n\t\tnames_.push_back(BTString(os.str()));\n\t}\n\tassert_eq(v_.size(), quals_.size());\n}\n\t\nbool VectorPatternSource::nextReadImpl(\n\tRead& r,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done)\n{\n\t// Let Strings begin at the beginning of the respective bufs\n\tr.reset();\n\tlock();\n\tif(cur_ >= v_.size()) {\n\t\tunlock();\n\t\t// Clear all the Strings, as a signal to the caller that\n\t\t// we're out of reads\n\t\tr.reset();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\tassert(r.empty());\n\t\treturn false;\n\t}\n\t// Copy v_*, quals_* strings into the respective Strings\n\tr.color = gColor;\n\tr.patFw  = v_[cur_];\n\tr.qual = quals_[cur_];\n\tr.trimmed3 = trimmed3_[cur_];\n\tr.trimmed5 = trimmed5_[cur_];\n\tostringstream os;\n\tos << cur_;\n\tr.name = os.str();\n\tcur_++;\n\tdone = cur_ == v_.size();\n\trdid = endid = readCnt_;\n\treadCnt_++;\n\tunlock();\n\tsuccess = true;\n\treturn true;\n}\n\t\n/**\n * This is unused, but implementation is given for completeness.\n */\nbool VectorPatternSource::nextReadPairImpl(\n\tRead& ra,\n\tRead& rb,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done,\n\tbool& paired)\n{\n\t// Let Strings begin at the beginning of the respective bufs\n\tra.reset();\n\trb.reset();\n\tpaired = true;\n\tif(!paired_) {\n\t\tpaired_ = true;\n\t\tcur_ <<= 1;\n\t}\n\tlock();\n\tif(cur_ >= v_.size()-1) {\n\t\tunlock();\n\t\t// Clear all the Strings, as a signal to the caller that\n\t\t// we're out of reads\n\t\tra.reset();\n\t\trb.reset();\n\t\tassert(ra.empty());\n\t\tassert(rb.empty());\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\t// Copy v_*, quals_* strings into the respective Strings\n\tra.patFw  = v_[cur_];\n\tra.qual = quals_[cur_];\n\tra.trimmed3 = trimmed3_[cur_];\n\tra.trimmed5 = trimmed5_[cur_];\n\tcur_++;\n\trb.patFw  = v_[cur_];\n\trb.qual = quals_[cur_];\n\trb.trimmed3 = trimmed3_[cur_];\n\trb.trimmed5 = trimmed5_[cur_];\n\tostringstream os;\n\tos << 
readCnt_;\n\tra.name = os.str();\n\trb.name = os.str();\n\tra.color = rb.color = gColor;\n\tcur_++;\n\tdone = cur_ >= v_.size()-1;\n\trdid = endid = readCnt_;\n\treadCnt_++;\n\tunlock();\n\tsuccess = true;\n\treturn true;\n}\n\n/**\n * Parse a single quality string from fb and store qualities in r.\n * Assume the next character obtained via fb.get() is the first\n * character of the quality string.  When returning, the next\n * character returned by fb.peek() or fb.get() should be the first\n * character of the following line.\n */\nint parseQuals(\n\tRead& r,\n\tFileBuf& fb,\n\tint firstc,\n\tint readLen,\n\tint trim3,\n\tint trim5,\n\tbool intQuals,\n\tbool phred64,\n\tbool solexa64)\n{\n\tint c = firstc;\n\tassert(c != '\\n' && c != '\\r');\n\tr.qual.clear();\n\tif (intQuals) {\n\t\twhile (c != '\\r' && c != '\\n' && c != -1) {\n\t\t\tbool neg = false;\n\t\t\tint num = 0;\n\t\t\twhile(!isspace(c) && !fb.eof()) {\n\t\t\t\tif(c == '-') {\n\t\t\t\t\tneg = true;\n\t\t\t\t\tassert_eq(num, 0);\n\t\t\t\t} else {\n\t\t\t\t\tif(!isdigit(c)) {\n\t\t\t\t\t\tchar buf[2048];\n\t\t\t\t\t\tcerr << \"Warning: could not parse quality line:\" << endl;\n\t\t\t\t\t\tfb.getPastNewline();\n\t\t\t\t\t\tcerr << fb.copyLastN(buf);\n\t\t\t\t\t\tbuf[2047] = '\\0';\n\t\t\t\t\t\tcerr << buf;\n\t\t\t\t\t\tthrow 1;\n\t\t\t\t\t}\n\t\t\t\t\tassert(isdigit(c));\n\t\t\t\t\tnum *= 10;\n\t\t\t\t\tnum += (c - '0');\n\t\t\t\t}\n\t\t\t\tc = fb.get();\n\t\t\t}\n\t\t\tif(neg) num = 0;\n\t\t\t// Phred-33 ASCII encode it and add it to the back of the\n\t\t\t// quality string\n\t\t\tr.qual.append('!' 
+ num);\n\t\t\t// Skip over next stretch of whitespace\n\t\t\twhile(c != '\\r' && c != '\\n' && isspace(c) && !fb.eof()) {\n\t\t\t\tc = fb.get();\n\t\t\t}\n\t\t}\n\t} else {\n\t\twhile (c != '\\r' && c != '\\n' && c != -1) {\n\t\t\tr.qual.append(charToPhred33(c, solexa64, phred64));\n\t\t\tc = fb.get();\n\t\t\twhile(c != '\\r' && c != '\\n' && isspace(c) && !fb.eof()) {\n\t\t\t\tc = fb.get();\n\t\t\t}\n\t\t}\n\t}\n\tif ((int)r.qual.length() < readLen-1 ||\n\t    ((int)r.qual.length() < readLen && !r.color))\n\t{\n\t\ttooFewQualities(r.name);\n\t}\n\tr.qual.trimEnd(trim3);\n\tif(r.qual.length()-trim5 < r.patFw.length()) {\n\t\tassert(gColor && r.primer != -1);\n\t\tassert_gt(trim5, 0);\n\t\ttrim5--;\n\t}\n\tr.qual.trimBegin(trim5);\n\tif(r.qual.length() <= 0) return 0;\n\tassert_eq(r.qual.length(), r.patFw.length());\n\twhile(fb.peek() == '\\n' || fb.peek() == '\\r') fb.get();\n\treturn (int)r.qual.length();\n}\n\n/// Read another pattern from a FASTA input file\nbool FastaPatternSource::read(\n\tRead& r,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done)\n{\n\tint c, qc = 0;\n\tsuccess = true;\n\tdone = false;\n\tassert(fb_.isOpen());\n\tr.reset();\n\tr.color = gColor;\n\t// Pick off the first carat\n\tc = fb_.get();\n\tif(c < 0) {\n\t\tbail(r); success = false; done = true; return success;\n\t}\n\twhile(c == '#' || c == ';' || c == '\\r' || c == '\\n') {\n\t\tc = fb_.peekUptoNewline();\n\t\tfb_.resetLastN();\n\t\tc = fb_.get();\n\t}\n\tassert_eq(1, fb_.lastNLen());\n\n\t// Pick off the first carat\n\tif(first_) {\n\t\tif(c != '>') {\n\t\t\tcerr << \"Error: reads file does not look like a FASTA file\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tfirst_ = false;\n\t}\n\tassert_eq('>', c);\n\tc = fb_.get(); // get next char after '>'\n\n\t// Read to the end of the id line, sticking everything after the '>'\n\t// into *name\n\t//bool warning = false;\n\twhile(true) {\n\t\tif(c < 0 || qc < 0) {\n\t\t\tbail(r); success = false; done = true; return 
success;\n\t\t}\n\t\tif(c == '\\n' || c == '\\r') {\n\t\t\t// Break at end of line, after consuming all \\r's, \\n's\n\t\t\twhile(c == '\\n' || c == '\\r') {\n\t\t\t\tif(fb_.peek() == '>') {\n\t\t\t\t\t// Empty sequence\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tc = fb_.get();\n\t\t\t\tif(c < 0 || qc < 0) {\n\t\t\t\t\tbail(r); success = false; done = true; return success;\n\t\t\t\t}\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tr.name.append(c);\n\t\tif(fb_.peek() == '>') {\n\t\t\t// Empty sequence\n\t\t\tbreak;\n\t\t}\n\t\tc = fb_.get();\n\t}\n\tif(c == '>') {\n\t\t// Empty sequences!\n\t\tcerr << \"Warning: skipping empty FASTA read with name '\" << r.name << \"'\" << endl;\n\t\tfb_.resetLastN();\n\t\trdid = endid = readCnt_;\n\t\treadCnt_++;\n\t\tsuccess = true; done = false; return success;\n\t}\n\tassert_neq('>', c);\n\n\t// _in now points just past the first character of a sequence\n\t// line, and c holds the first character\n\tint begin = 0;\n\tint mytrim5 = gTrim5;\n\tif(gColor) {\n\t\t// This is the primer character, keep it in the\n\t\t// 'primer' field of the read buf and keep parsing\n\t\tc = toupper(c);\n\t\tif(asc2dnacat[c] > 0) {\n\t\t\t// First char is a DNA char\n\t\t\tint c2 = toupper(fb_.peek());\n\t\t\tif(asc2colcat[c2] > 0) {\n\t\t\t\t// Second char is a color char\n\t\t\t\tr.primer = c;\n\t\t\t\tr.trimc = c2;\n\t\t\t\tmytrim5 += 2;\n\t\t\t}\n\t\t}\n\t\tif(c < 0) {\n\t\t\tbail(r); success = false; done = true; return success;\n\t\t}\n\t}\n\twhile(c != '>' && c >= 0) {\n\t\tif(gColor) {\n\t\t\tif(c >= '0' && c <= '4') c = \"ACGTN\"[(int)c - '0'];\n\t\t\tif(c == '.') c = 'N';\n\t\t}\n\t\tif(asc2dnacat[c] > 0 && begin++ >= mytrim5) {\n\t\t\tr.patFw.append(asc2dna[c]);\n\t\t\tr.qual.append('I');\n\t\t}\n\t\tif(fb_.peek() == '>') break;\n\t\tc = fb_.get();\n\t}\n\tr.patFw.trimEnd(gTrim3);\n\tr.qual.trimEnd(gTrim3);\n\tr.trimmed3 = gTrim3;\n\tr.trimmed5 = mytrim5;\n\t// Set up a default name if one hasn't been set\n\tif(r.name.empty()) {\n\t\tchar 
cbuf[20];\n\t\titoa10<TReadId>(readCnt_, cbuf);\n\t\tr.name.install(cbuf);\n\t}\n\tassert_gt(r.name.length(), 0);\n\tr.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\tfb_.resetLastN();\n\trdid = endid = readCnt_;\n\treadCnt_++;\n\treturn success;\n}\n\n/// Read another pattern from a FASTQ input file\nbool FastqPatternSource::read(\n\tRead& r,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done)\n{\n\tint c;\n\tint dstLen = 0;\n\tsuccess = true;\n\tdone = false;\n\tr.reset();\n\tr.color = gColor;\n\tr.fuzzy = fuzzy_;\n\t// Pick off the first at\n\tif(first_) {\n\t\tc = fb_.get();\n\t\tif(c != '@') {\n\t\t\tc = getOverNewline(fb_);\n\t\t\tif(c < 0) {\n\t\t\t\tbail(r); success = false; done = true; return success;\n\t\t\t}\n\t\t}\n\t\tif(c != '@') {\n\t\t\tcerr << \"Error: reads file does not look like a FASTQ file\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq('@', c);\n\t\tfirst_ = false;\n\t}\n\n\t// Read to the end of the id line, sticking everything after the '@'\n\t// into *name\n\twhile(true) {\n\t\tc = fb_.get();\n\t\tif(c < 0) {\n\t\t\tbail(r); success = false; done = true; return success;\n\t\t}\n\t\tif(c == '\\n' || c == '\\r') {\n\t\t\t// Break at end of line, after consuming all \\r's, \\n's\n\t\t\twhile(c == '\\n' || c == '\\r') {\n\t\t\t\tc = fb_.get();\n\t\t\t\tif(c < 0) {\n\t\t\t\t\tbail(r); success = false; done = true; return success;\n\t\t\t\t}\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tr.name.append(c);\n\t}\n\t// fb_ now points just past the first character of a\n\t// sequence line, and c holds the first character\n\tint charsRead = 0;\n\tBTDnaString *sbuf = &r.patFw;\n\tint dstLens[] = {0, 0, 0, 0};\n\tint *dstLenCur = &dstLens[0];\n\tint mytrim5 = gTrim5;\n\tint altBufIdx = 0;\n\tif(gColor && c != '+') {\n\t\t// This may be a primer character.  
If so, keep it in the\n\t\t// 'primer' field of the read buf and parse the rest of the\n\t\t// read without it.\n\t\tc = toupper(c);\n\t\tif(asc2dnacat[c] > 0) {\n\t\t\t// First char is a DNA char\n\t\t\tint c2 = toupper(fb_.peek());\n\t\t\t// Second char is a color char\n\t\t\tif(asc2colcat[c2] > 0) {\n\t\t\t\tr.primer = c;\n\t\t\t\tr.trimc = c2;\n\t\t\t\tmytrim5 += 2; // trim primer and first color\n\t\t\t}\n\t\t}\n\t\tif(c < 0) {\n\t\t\tbail(r); success = false; done = true; return success;\n\t\t}\n\t}\n\tint trim5 = 0;\n\tif(c != '+') {\n\t\ttrim5 = mytrim5;\n\t\twhile(c != '+') {\n\t\t\t// Convert color numbers to letters if necessary\n\t\t\tif(c == '.') c = 'N';\n\t\t\tif(gColor) {\n\t\t\t\tif(c >= '0' && c <= '4') c = \"ACGTN\"[(int)c - '0'];\n\t\t\t}\n\t\t\tif(fuzzy_ && c == '-') c = 'A';\n\t\t\tif(isalpha(c)) {\n\t\t\t\t// If it's past the 5'-end trim point\n\t\t\t\tif(charsRead >= trim5) {\n\t\t\t\t\tsbuf->append(asc2dna[c]);\n\t\t\t\t\t(*dstLenCur)++;\n\t\t\t\t}\n\t\t\t\tcharsRead++;\n\t\t\t} else if(fuzzy_ && c == ' ') {\n\t\t\t\ttrim5 = 0; // disable 5' trimming for now\n\t\t\t\tif(charsRead == 0) {\n\t\t\t\t\tc = fb_.get();\n\t\t\t\t\tcontinue;\n\t\t\t\t}\n\t\t\t\tcharsRead = 0;\n\t\t\t\tif(altBufIdx >= 3) {\n\t\t\t\t\tcerr << \"At most 3 alternate sequence strings permitted; offending read: \" << r.name << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\t// Move on to the next alternate-sequence buffer\n\t\t\t\tsbuf = &r.altPatFw[altBufIdx++];\n\t\t\t\tdstLenCur = &dstLens[altBufIdx];\n\t\t\t}\n\t\t\tc = fb_.get();\n\t\t\tif(c < 0) {\n\t\t\t\tbail(r); success = false; done = true; return success;\n\t\t\t}\n\t\t}\n\t\tdstLen = dstLens[0];\n\t\tcharsRead = dstLen + mytrim5;\n\t}\n\t// Trim from 3' end\n\tif(gTrim3 > 0) {\n\t\tif((int)r.patFw.length() > gTrim3) {\n\t\t\tr.patFw.resize(r.patFw.length() - gTrim3);\n\t\t\tdstLen -= gTrim3;\n\t\t\tassert_eq((int)r.patFw.length(), dstLen);\n\t\t} else {\n\t\t\t// Trimmed the whole read; we won't be using this 
read,\n\t\t\t// but we proceed anyway so that fb_ is advanced\n\t\t\t// properly\n\t\t\tr.patFw.clear();\n\t\t\tdstLen = 0;\n\t\t}\n\t}\n\tassert_eq('+', c);\n\n\t// Chew up the optional name on the '+' line\n\tASSERT_ONLY(int pk =) peekToEndOfLine(fb_);\n\tif(charsRead == 0) {\n\t\tassert_eq('@', pk);\n\t\tfb_.get();\n\t\tfb_.resetLastN();\n\t\trdid = endid = readCnt_;\n\t\treadCnt_++;\n\t\treturn success;\n\t}\n\n\t// Now read the qualities\n\tif (intQuals_) {\n\t\tassert(!fuzzy_);\n\t\tint qualsRead = 0;\n\t\tchar buf[4096];\n\t\tif(gColor && r.primer != -1) {\n\t\t\t// In case the original quality string is one shorter\n\t\t\tmytrim5--;\n\t\t}\n\t\tqualToks_.clear();\n\t\ttokenizeQualLine(fb_, buf, 4096, qualToks_);\n\t\tfor(unsigned int j = 0; j < qualToks_.size(); ++j) {\n\t\t\tchar c = intToPhred33(atoi(qualToks_[j].c_str()), solQuals_);\n\t\t\tassert_geq(c, 33);\n\t\t\tif (qualsRead >= mytrim5) {\n\t\t\t\tr.qual.append(c);\n\t\t\t}\n\t\t\t++qualsRead;\n\t\t} // done reading integer quality lines\n\t\tif(gColor && r.primer != -1) mytrim5++;\n\t\tr.qual.trimEnd(gTrim3);\n\t\tif(r.qual.length() < r.patFw.length()) {\n\t\t\ttooFewQualities(r.name);\n\t\t} else if(r.qual.length() > r.patFw.length() + 1) {\n\t\t\ttooManyQualities(r.name);\n\t\t}\n\t\tif(r.qual.length() == r.patFw.length()+1 && gColor && r.primer != -1) {\n\t\t\tr.qual.remove(0);\n\t\t}\n\t\t// Trim qualities on 3' end\n\t\tif(r.qual.length() > r.patFw.length()) {\n\t\t\tr.qual.resize(r.patFw.length());\n\t\t\tassert_eq((int)r.qual.length(), dstLen);\n\t\t}\n\t\tpeekOverNewline(fb_);\n\t} else {\n\t\t// Non-integer qualities\n\t\taltBufIdx = 0;\n\t\ttrim5 = mytrim5;\n\t\tint qualsRead[4] = {0, 0, 0, 0};\n\t\tint *qualsReadCur = &qualsRead[0];\n\t\tBTString *qbuf = &r.qual;\n\t\tif(gColor && r.primer != -1) {\n\t\t\t// In case the original quality string is one shorter\n\t\t\ttrim5--;\n\t\t}\n\t\twhile(true) {\n\t\t\tc = fb_.get();\n\t\t\tif (!fuzzy_ && c == ' ') 
{\n\t\t\t\twrongQualityFormat(r.name);\n\t\t\t} else if(c == ' ') {\n\t\t\t\ttrim5 = 0; // disable 5' trimming for now\n\t\t\t\tif((*qualsReadCur) == 0) continue;\n\t\t\t\tif(altBufIdx >= 3) {\n\t\t\t\t\tcerr << \"At most 3 alternate quality strings permitted; offending read: \" << r.name << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t}\n\t\t\t\tqbuf = &r.altQual[altBufIdx++];\n\t\t\t\tqualsReadCur = &qualsRead[altBufIdx];\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tif(c < 0) {\n\t\t\t\tbreak; // let the file end just at the end of a quality line\n\t\t\t\t//bail(r); success = false; done = true; return success;\n\t\t\t}\n\t\t\tif (c != '\\r' && c != '\\n') {\n\t\t\t\tif (*qualsReadCur >= trim5) {\n\t\t\t\t\tc = charToPhred33(c, solQuals_, phred64Quals_);\n\t\t\t\t\tassert_geq(c, 33);\n\t\t\t\t\tqbuf->append(c);\n\t\t\t\t}\n\t\t\t\t(*qualsReadCur)++;\n\t\t\t} else {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t\tqualsRead[0] -= gTrim3;\n\t\tr.qual.trimEnd(gTrim3);\n\t\tif(r.qual.length() < r.patFw.length()) {\n\t\t\ttooFewQualities(r.name);\n\t\t} else if(r.qual.length() > r.patFw.length()+1) {\n\t\t\ttooManyQualities(r.name);\n\t\t}\n\t\tif(r.qual.length() == r.patFw.length()+1 && gColor && r.primer != -1) {\n\t\t\tr.qual.remove(0);\n\t\t}\n\n\t\tif(fuzzy_) {\n\t\t\t// Trim from 3' end of alternate basecall and quality strings\n\t\t\tif(gTrim3 > 0) {\n\t\t\t\tfor(int i = 0; i < 3; i++) {\n\t\t\t\t\tassert_eq(r.altQual[i].length(), r.altPatFw[i].length());\n\t\t\t\t\tif((int)r.altQual[i].length() > gTrim3) {\n\t\t\t\t\t\tr.altPatFw[i].resize(gTrim3);\n\t\t\t\t\t\tr.altQual[i].resize(gTrim3);\n\t\t\t\t\t} else {\n\t\t\t\t\t\tr.altPatFw[i].clear();\n\t\t\t\t\t\tr.altQual[i].clear();\n\t\t\t\t\t}\n\t\t\t\t\tqualsRead[i+1] = dstLens[i+1] =\n\t\t\t\t\t\tmax<int>(0, dstLens[i+1] - gTrim3);\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Shift to RHS, and install in Strings\n\t\t\tassert_eq(0, r.alts);\n\t\t\tfor(int i = 1; i < 4; i++) {\n\t\t\t\tif(qualsRead[i] == 0) continue;\n\t\t\t\tif(qualsRead[i] > dstLen) 
{\n\t\t\t\t\t// Shift everybody up\n\t\t\t\t\tint shiftAmt = qualsRead[i] - dstLen;\n\t\t\t\t\tfor(int j = 0; j < dstLen; j++) {\n\t\t\t\t\t\tr.altQual[i-1].set(r.altQual[i-1][j+shiftAmt], j);\n\t\t\t\t\t\tr.altPatFw[i-1].set(r.altPatFw[i-1][j+shiftAmt], j);\n\t\t\t\t\t}\n\t\t\t\t\tr.altQual[i-1].resize(dstLen);\n\t\t\t\t\tr.altPatFw[i-1].resize(dstLen);\n\t\t\t\t} else if (qualsRead[i] < dstLen) {\n\t\t\t\t\tr.altQual[i-1].resize(dstLen);\n\t\t\t\t\tr.altPatFw[i-1].resize(dstLen);\n\t\t\t\t\t// Shift everybody down\n\t\t\t\t\tint shiftAmt = dstLen - qualsRead[i];\n\t\t\t\t\tfor(int j = dstLen-1; j >= shiftAmt; j--) {\n\t\t\t\t\t\tr.altQual[i-1].set(r.altQual[i-1][j-shiftAmt], j);\n\t\t\t\t\t\tr.altPatFw[i-1].set(r.altPatFw[i-1][j-shiftAmt], j);\n\t\t\t\t\t}\n\t\t\t\t\t// Fill in unset positions\n\t\t\t\t\tfor(int j = 0; j < shiftAmt; j++) {\n\t\t\t\t\t\t// '!' - indicates no alternate basecall at\n\t\t\t\t\t\t// this position\n\t\t\t\t\t\tr.altQual[i-1].set(33, j);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tr.alts++;\n\t\t\t}\n\t\t}\n\n\t\tif(c == '\\r' || c == '\\n') {\n\t\t\tc = peekOverNewline(fb_);\n\t\t} else {\n\t\t\tc = peekToEndOfLine(fb_);\n\t\t}\n\t}\n\tr.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\tfb_.resetLastN();\n\n\tc = fb_.get();\n\t// Should either be at end of file or at beginning of next record\n\tassert(c == -1 || c == '@');\n\n\t// Set up a default name if one hasn't been set\n\tif(r.name.empty()) {\n\t\tchar cbuf[20];\n\t\titoa10<TReadId>(readCnt_, cbuf);\n\t\tr.name.install(cbuf);\n\t}\n\tr.trimmed3 = gTrim3;\n\tr.trimmed5 = mytrim5;\n\trdid = endid = readCnt_;\n\treadCnt_++;\n\treturn success;\n}\n\n/// Read another pattern from a tab-delimited input file\nbool TabbedPatternSource::read(\n\tRead& r,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done)\n{\n\tr.reset();\n\tr.color = gColor;\n\tsuccess = true;\n\tdone = false;\n\t// fb_ is about to dish out the first character of the\n\t// name field\n\tif(parseName(r, NULL, '\\t') == 
-1) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tr.reset();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tassert_neq('\\t', fb_.peek());\n\n\t// fb_ is about to dish out the first character of the\n\t// sequence field\n\tint charsRead = 0;\n\tint mytrim5 = gTrim5;\n\tint dstLen = parseSeq(r, charsRead, mytrim5, '\\t');\n\tassert_neq('\\t', fb_.peek());\n\tif(dstLen < 0) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tr.reset();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\n\t// fb_ is about to dish out the first character of the\n\t// quality-string field\n\tchar ct = 0;\n\tif(parseQuals(r, charsRead, dstLen, mytrim5, ct, '\\n') < 0) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tr.reset();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tr.trimmed3 = gTrim3;\n\tr.trimmed5 = mytrim5;\n\tassert_eq(ct, '\\n');\n\tassert_neq('\\n', fb_.peek());\n\tr.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\tfb_.resetLastN();\n\trdid = endid = readCnt_;\n\treadCnt_++;\n\treturn true;\n}\n\n/// Read another read, or pair of reads, from a tab-delimited input\n/// file; a line with only three fields is treated as unpaired\nbool TabbedPatternSource::readPair(\n\tRead& ra,\n\tRead& rb,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done,\n\tbool& paired)\n{\n\tsuccess = true;\n\tdone = false;\n\t\n\t// Skip over initial vertical whitespace\n\tif(fb_.peek() == '\\r' || fb_.peek() == '\\n') {\n\t\tfb_.peekUptoNewline();\n\t\tfb_.resetLastN();\n\t}\n\t\n\t// fb_ is about to dish out the first character of the\n\t// name field\n\tint mytrim5_1 = gTrim5;\n\tif(parseName(ra, &rb, '\\t') == -1) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tra.reset();\n\t\trb.reset();\n\t\tfb_.resetLastN();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tassert_neq('\\t', fb_.peek());\n\n\t// fb_ is about to dish out the first character of the\n\t// sequence field for the first mate\n\tint charsRead1 = 0;\n\tint dstLen1 = parseSeq(ra, charsRead1, 
mytrim5_1, '\\t');\n\tif(dstLen1 < 0) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tra.reset();\n\t\trb.reset();\n\t\tfb_.resetLastN();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tassert_neq('\\t', fb_.peek());\n\n\t// fb_ is about to dish out the first character of the\n\t// quality-string field\n\tchar ct = 0;\n\tif(parseQuals(ra, charsRead1, dstLen1, mytrim5_1, ct, '\\t', '\\n') < 0) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tra.reset();\n\t\trb.reset();\n\t\tfb_.resetLastN();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tra.trimmed3 = gTrim3;\n\tra.trimmed5 = mytrim5_1;\n\tassert(ct == '\\t' || ct == '\\n' || ct == '\\r' || ct == -1);\n\tif(ct == '\\r' || ct == '\\n' || ct == -1) {\n\t\t// Only had 3 fields prior to newline, so this must be an unpaired read\n\t\trb.reset();\n\t\tra.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\t\tfb_.resetLastN();\n\t\tsuccess = true;\n\t\tdone = false;\n\t\tpaired = false;\n\t\trdid = endid = readCnt_;\n\t\treadCnt_++;\n\t\treturn success;\n\t}\n\tpaired = true;\n\tassert_neq('\\t', fb_.peek());\n\t\n\t// Saw another tab after the third field, so this must be a pair\n\tif(secondName_) {\n\t\t// The second mate has its own name\n\t\tif(parseName(rb, NULL, '\\t') == -1) {\n\t\t\tpeekOverNewline(fb_); // skip rest of line\n\t\t\tra.reset();\n\t\t\trb.reset();\n\t\t\tfb_.resetLastN();\n\t\t\tsuccess = false;\n\t\t\tdone = true;\n\t\t\treturn false;\n\t\t}\n\t\tassert_neq('\\t', fb_.peek());\n\t}\n\n\t// fb_ about to give the first character of the second mate's sequence\n\tint charsRead2 = 0;\n\tint mytrim5_2 = gTrim5;\n\tint dstLen2 = parseSeq(rb, charsRead2, mytrim5_2, '\\t');\n\tif(dstLen2 < 0) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tra.reset();\n\t\trb.reset();\n\t\tfb_.resetLastN();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tassert_neq('\\t', fb_.peek());\n\n\t// fb_ is about to dish out the first character of 
the\n\t// quality-string field\n\tif(parseQuals(rb, charsRead2, dstLen2, mytrim5_2, ct, '\\n') < 0) {\n\t\tpeekOverNewline(fb_); // skip rest of line\n\t\tra.reset();\n\t\trb.reset();\n\t\tfb_.resetLastN();\n\t\tsuccess = false;\n\t\tdone = true;\n\t\treturn false;\n\t}\n\tra.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\tfb_.resetLastN();\n\trb.trimmed3 = gTrim3;\n\trb.trimmed5 = mytrim5_2;\n\trdid = endid = readCnt_;\n\treadCnt_++;\n\treturn true;\n}\n\n/**\n * Parse a name from fb_ and store in r.  Assume that the next\n * character obtained via fb_.get() is the first character of\n * the sequence and the string stops at the next char upto (could\n * be tab, newline, etc.).\n */\nint TabbedPatternSource::parseName(\n\tRead& r,\n\tRead* r2,\n\tchar upto /* = '\\t' */)\n{\n\t// Read the name out of the first field\n\tint c = 0;\n\tif(r2 != NULL) r2->name.clear();\n\tr.name.clear();\n\twhile(true) {\n\t\tif((c = fb_.get()) < 0) {\n\t\t\treturn -1;\n\t\t}\n\t\tif(c == upto) {\n\t\t\t// Finished with first field\n\t\t\tbreak;\n\t\t}\n\t\tif(c == '\\n' || c == '\\r') {\n\t\t\treturn -1;\n\t\t}\n\t\tif(r2 != NULL) r2->name.append(c);\n\t\tr.name.append(c);\n\t}\n\t// Set up a default name if one hasn't been set\n\tif(r.name.empty()) {\n\t\tchar cbuf[20];\n\t\titoa10<TReadId>(readCnt_, cbuf);\n\t\tr.name.install(cbuf);\n\t\tif(r2 != NULL) r2->name.install(cbuf);\n\t}\n\treturn (int)r.name.length();\n}\n\n/**\n * Parse a single sequence from fb_ and store in r.  Assume\n * that the next character obtained via fb_.get() is the first\n * character of the sequence and the sequence stops at the next\n * char upto (could be tab, newline, etc.).\n */\nint TabbedPatternSource::parseSeq(\n\tRead& r,\n\tint& charsRead,\n\tint& trim5,\n\tchar upto /*= '\\t'*/)\n{\n\tint begin = 0;\n\tint c = fb_.get();\n\tassert(c != upto);\n\tr.patFw.clear();\n\tr.color = gColor;\n\tif(gColor) {\n\t\t// This may be a primer character.  
If so, keep it in the\n\t\t// 'primer' field of the read buf and parse the rest of the\n\t\t// read without it.\n\t\tc = toupper(c);\n\t\tif(asc2dnacat[c] > 0) {\n\t\t\t// First char is a DNA char\n\t\t\tint c2 = toupper(fb_.peek());\n\t\t\t// Second char is a color char\n\t\t\tif(asc2colcat[c2] > 0) {\n\t\t\t\tr.primer = c;\n\t\t\t\tr.trimc = c2;\n\t\t\t\ttrim5 += 2; // trim primer and first color\n\t\t\t}\n\t\t}\n\t\tif(c < 0) { return -1; }\n\t}\n\twhile(c != upto) {\n\t\tif(gColor) {\n\t\t\tif(c >= '0' && c <= '4') c = \"ACGTN\"[(int)c - '0'];\n\t\t\tif(c == '.') c = 'N';\n\t\t}\n\t\tif(isalpha(c)) {\n\t\t\tassert_in(toupper(c), \"ACGTN\");\n\t\t\tif(begin++ >= trim5) {\n\t\t\t\tassert_neq(0, asc2dnacat[c]);\n\t\t\t\tr.patFw.append(asc2dna[c]);\n\t\t\t}\n\t\t\tcharsRead++;\n\t\t}\n\t\tif((c = fb_.get()) < 0) {\n\t\t\treturn -1;\n\t\t}\n\t}\n\tr.patFw.trimEnd(gTrim3);\n\treturn (int)r.patFw.length();\n}\n\n/**\n * Parse a single quality string from fb_ and store in r.\n * Assume that the next character obtained via fb_.get() is\n * the first character of the quality string and the string stops\n * at the next char upto (could be tab, newline, etc.).\n */\nint TabbedPatternSource::parseQuals(\n\tRead& r,\n\tint charsRead,\n\tint dstLen,\n\tint trim5,\n\tchar& c2,\n\tchar upto /*= '\\t'*/,\n\tchar upto2 /*= -1*/)\n{\n\tint qualsRead = 0;\n\tint c = 0;\n\tif (intQuals_) {\n\t\tchar buf[4096];\n\t\twhile (qualsRead < charsRead) {\n\t\t\tqualToks_.clear();\n\t\t\tif(!tokenizeQualLine(fb_, buf, 4096, qualToks_)) break;\n\t\t\tfor (unsigned int j = 0; j < qualToks_.size(); ++j) {\n\t\t\t\tchar c = intToPhred33(atoi(qualToks_[j].c_str()), solQuals_);\n\t\t\t\tassert_geq(c, 33);\n\t\t\t\tif (qualsRead >= trim5) {\n\t\t\t\t\tr.qual.append(c);\n\t\t\t\t}\n\t\t\t\t++qualsRead;\n\t\t\t}\n\t\t} // done reading integer quality lines\n\t\tif (charsRead > qualsRead) tooFewQualities(r.name);\n\t} else {\n\t\t// Non-integer qualities\n\t\twhile((qualsRead < dstLen + trim5) && c >= 
0) {\n\t\t\tc = fb_.get();\n\t\t\tc2 = c;\n\t\t\tif (c == ' ') wrongQualityFormat(r.name);\n\t\t\tif(c < 0) {\n\t\t\t\t// EOF occurred in the middle of a read - abort\n\t\t\t\treturn -1;\n\t\t\t}\n\t\t\tif(!isspace(c) && c != upto && (upto2 == -1 || c != upto2)) {\n\t\t\t\tif (qualsRead >= trim5) {\n\t\t\t\t\tc = charToPhred33(c, solQuals_, phred64Quals_);\n\t\t\t\t\tassert_geq(c, 33);\n\t\t\t\t\tr.qual.append(c);\n\t\t\t\t}\n\t\t\t\tqualsRead++;\n\t\t\t} else {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t\tif(qualsRead < dstLen + trim5) {\n\t\t\ttooFewQualities(r.name);\n\t\t} else if(qualsRead > dstLen + trim5) {\n\t\t\ttooManyQualities(r.name);\n\t\t}\n\t}\n\tr.qual.resize(dstLen);\n\twhile(c != upto && (upto2 == -1 || c != upto2) && c != -1) {\n\t\tc = fb_.get();\n\t\tc2 = c;\n\t}\n\treturn qualsRead;\n}\n\nvoid wrongQualityFormat(const BTString& read_name) {\n\tcerr << \"Error: Encountered one or more spaces while parsing the quality \"\n\t     << \"string for read \" << read_name << \".  If this is a FASTQ file \"\n\t\t << \"with integer (non-ASCII-encoded) qualities, try re-running with \"\n\t\t << \"the --integer-quals option.\" << endl;\n\tthrow 1;\n}\n\nvoid tooFewQualities(const BTString& read_name) {\n\tcerr << \"Error: Read \" << read_name << \" has more read characters than \"\n\t\t << \"quality values.\" << endl;\n\tthrow 1;\n}\n\nvoid tooManyQualities(const BTString& read_name) {\n\tcerr << \"Error: Read \" << read_name << \" has more quality values than read \"\n\t\t << \"characters.\" << endl;\n\tthrow 1;\n}\n\n#ifdef USE_SRA\n    \n    struct SRA_Read {\n        SStringExpandable<char, 64>      name;      // read name\n        SDnaStringExpandable<128, 2>     patFw;     // forward-strand sequence\n        SStringExpandable<char, 128, 2>  qual;      // quality values\n        \n        void reset() {\n            name.clear();\n            patFw.clear();\n            qual.clear();\n        }\n    };\n    \n    static const uint64_t buffer_size_per_thread = 
4096;\n    \n    struct SRA_Data {\n        uint64_t read_pos;\n        uint64_t write_pos;\n        uint64_t buffer_size;\n        bool     done;\n        EList<pair<SRA_Read, SRA_Read> > paired_reads;\n        \n        ngs::ReadIterator* sra_it;\n        \n        SRA_Data() {\n            read_pos = 0;\n            write_pos = 0;\n            buffer_size = buffer_size_per_thread;\n            done = false;\n            sra_it = NULL;\n        }\n        \n        bool isFull() {\n            assert_leq(read_pos, write_pos);\n            assert_geq(read_pos + buffer_size, write_pos);\n            return read_pos + buffer_size <= write_pos;\n        }\n        \n        bool isEmpty() {\n            assert_leq(read_pos, write_pos);\n            assert_geq(read_pos + buffer_size, write_pos);\n            return read_pos == write_pos;\n        }\n        \n        pair<SRA_Read, SRA_Read>& getPairForRead() {\n            assert(!isEmpty());\n            return paired_reads[read_pos % buffer_size];\n        }\n        \n        pair<SRA_Read, SRA_Read>& getPairForWrite() {\n            assert(!isFull());\n            return paired_reads[write_pos % buffer_size];\n        }\n        \n        void advanceReadPos() {\n            assert(!isEmpty());\n            read_pos++;\n        }\n        \n        void advanceWritePos() {\n            assert(!isFull());\n            write_pos++;\n        }\n    };\n    \n    static void SRA_IO_Worker(void *vp)\n    {\n        SRA_Data* sra_data = (SRA_Data*)vp;\n        assert(sra_data != NULL);\n        ngs::ReadIterator* sra_it = sra_data->sra_it;\n        assert(sra_it != NULL);\n        \n        while(!sra_data->done) {\n            while(sra_data->isFull()) {\n#if defined(_TTHREAD_WIN32_)\n                Sleep(1);\n#elif defined(_TTHREAD_POSIX_)\n                const static timespec ts = {0, 1000000};  // 1 millisecond\n                nanosleep(&ts, NULL);\n#endif\n            }\n            pair<SRA_Read, SRA_Read>& 
pair = sra_data->getPairForWrite();\n            SRA_Read& ra = pair.first;\n            SRA_Read& rb = pair.second;\n            bool exception_thrown = false;\n            try {\n                if(!sra_it->nextRead() || !sra_it->nextFragment()) {\n                    ra.reset();\n                    rb.reset();\n                    sra_data->done = true;\n                    return;\n                }\n                \n                // Read the name out of the first field\n                ngs::StringRef rname = sra_it->getReadId();\n                ra.name.install(rname.data(), rname.size());\n                assert(!ra.name.empty());\n                \n                ngs::StringRef ra_seq = sra_it->getFragmentBases();\n                if(gTrim5 + gTrim3 < (int)ra_seq.size()) {\n                    ra.patFw.installChars(ra_seq.data() + gTrim5, ra_seq.size() - gTrim5 - gTrim3);\n                }\n                ngs::StringRef ra_qual = sra_it->getFragmentQualities();\n                if(ra_seq.size() == ra_qual.size() && gTrim5 + gTrim3 < (int)ra_qual.size()) {\n                    ra.qual.install(ra_qual.data() + gTrim5, ra_qual.size() - gTrim5 - gTrim3);\n                } else {\n                    ra.qual.resize(ra.patFw.length());\n                    ra.qual.fill('I');\n                }\n                assert_eq(ra.patFw.length(), ra.qual.length());\n                \n                if(!sra_it->nextFragment()) {\n                    rb.reset();\n                } else {\n                    // rb.name = ra.name;\n                    ngs::StringRef rb_seq = sra_it->getFragmentBases();\n                    if(gTrim5 + gTrim3 < (int)rb_seq.size()) {\n                        rb.patFw.installChars(rb_seq.data() + gTrim5, rb_seq.size() - gTrim5 - gTrim3);\n                    }\n                    ngs::StringRef rb_qual = sra_it->getFragmentQualities();\n                    if(rb_seq.size() == rb_qual.size() && gTrim5 + gTrim3 < (int)rb_qual.size()) 
{\n                        rb.qual.install(rb_qual.data() + gTrim5, rb_qual.size() - gTrim5 - gTrim3);\n                    } else {\n                        rb.qual.resize(rb.patFw.length());\n                        rb.qual.fill('I');\n                    }\n                    assert_eq(rb.patFw.length(), rb.qual.length());\n                }\n                sra_data->advanceWritePos();\n            } catch(ngs::ErrorMsg & x) {\n                cerr << x.toString () << endl;\n                exception_thrown = true;\n            } catch(exception & x) {\n                cerr << x.what () << endl;\n                exception_thrown = true;\n            } catch(...) {\n                cerr << \"unknown exception\\n\";\n                exception_thrown = true;\n            }\n            \n            if(exception_thrown) {\n                ra.reset();\n                rb.reset();\n                sra_data->done = true;\n                cerr << \"An error happened while fetching SRA reads. Please rerun HISAT2. 
You may want to disable the SRA cache if you didn't (see the instructions at https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration).\\n\";\n                exit(1);\n            }\n        }\n    }\n    \n    SRAPatternSource::~SRAPatternSource() {\n        if(io_thread_) delete io_thread_;\n        if(sra_data_) delete sra_data_;\n        if(sra_it_) delete sra_it_;\n        if(sra_run_) delete sra_run_;\n    }\n    \n    /// Read another read, or pair of reads, from the buffer that the\n    /// SRA I/O worker thread fills\n    bool SRAPatternSource::readPair(\n                                    Read& ra,\n                                    Read& rb,\n                                    TReadId& rdid,\n                                    TReadId& endid,\n                                    bool& success,\n                                    bool& done,\n                                    bool& paired)\n    {\n        assert(sra_run_ != NULL && sra_it_ != NULL);\n        success = true;\n        done = false;\n        while(sra_data_->isEmpty()) {\n            if(sra_data_->done && sra_data_->isEmpty()) {\n                ra.reset();\n                rb.reset();\n                success = false;\n                done = true;\n                return false;\n            }\n            \n#if defined(_TTHREAD_WIN32_)\n            Sleep(1);\n#elif defined(_TTHREAD_POSIX_)\n            const static timespec ts = {0, 1000000}; // 1 millisecond\n            nanosleep(&ts, NULL);\n#endif\n        }\n        \n        pair<SRA_Read, SRA_Read>& pair = sra_data_->getPairForRead();\n        ra.name.install(pair.first.name.buf(), pair.first.name.length());\n        ra.patFw.install(pair.first.patFw.buf(), pair.first.patFw.length());\n        ra.qual.install(pair.first.qual.buf(), pair.first.qual.length());\n        ra.trimmed3 = gTrim3;\n        ra.trimmed5 = gTrim5;\n        if(pair.second.patFw.length() > 0) {\n            rb.name.install(pair.first.name.buf(), pair.first.name.length());\n            
rb.patFw.install(pair.second.patFw.buf(), pair.second.patFw.length());\n            rb.qual.install(pair.second.qual.buf(), pair.second.qual.length());\n            rb.trimmed3 = gTrim3;\n            rb.trimmed5 = gTrim5;\n            paired = true;\n        } else {\n            rb.reset();\n        }\n        sra_data_->advanceReadPos();\n        \n        rdid = endid = readCnt_;\n        readCnt_++;\n        \n        return true;\n    }\n    \n    void SRAPatternSource::open() {\n        string version = \"centrifuge-\";\n        version += CENTRIFUGE_VERSION;\n        ncbi::NGS::setAppVersionString(version.c_str());\n        assert(!sra_accs_.empty());\n        while(sra_acc_cur_ < sra_accs_.size()) {\n            // Open read\n            if(sra_it_) {\n                delete sra_it_;\n                sra_it_ = NULL;\n            }\n            if(sra_run_) {\n                delete sra_run_;\n                sra_run_ = NULL;\n            }\n            try {\n                // open requested accession using SRA implementation of the API\n                sra_run_ = new ngs::ReadCollection(ncbi::NGS::openReadCollection(sra_accs_[sra_acc_cur_]));\n                // compute window to iterate through\n                size_t MAX_ROW = sra_run_->getReadCount();\n                sra_it_ = new ngs::ReadIterator(sra_run_->getReadRange(1, MAX_ROW, ngs::Read::all));\n                \n                // create a buffer for SRA data\n                sra_data_ = new SRA_Data;\n                sra_data_->sra_it = sra_it_;\n                sra_data_->buffer_size = nthreads_ * buffer_size_per_thread;\n                sra_data_->paired_reads.resize(sra_data_->buffer_size);\n                \n                // create a thread for handling SRA data access\n                io_thread_ = new tthread::thread(SRA_IO_Worker, (void*)sra_data_);\n                // io_thread_->join();\n            } catch(...) 
{\n                if(!errs_[sra_acc_cur_]) {\n                    cerr << \"Warning: Could not access \\\"\" << sra_accs_[sra_acc_cur_].c_str() << \"\\\" for reading; skipping...\" << endl;\n                    errs_[sra_acc_cur_] = true;\n                }\n                sra_acc_cur_++;\n                continue;\n            }\n            return;\n        }\n        cerr << \"Error: No input SRA accessions were valid\" << endl;\n        exit(1);\n        return;\n    }\n    \n#endif\n"
  },
  {
    "path": "pat.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef PAT_H_\n#define PAT_H_\n\n#include <cassert>\n#include <cmath>\n#include <stdexcept>\n#include <vector>\n#include <string>\n#include <cstring>\n#include <ctype.h>\n#include <fstream>\n#include \"alphabet.h\"\n#include \"assert_helpers.h\"\n#include \"tokenize.h\"\n#include \"random_source.h\"\n#include \"threading.h\"\n#include \"filebuf.h\"\n#include \"qual.h\"\n#include \"search_globals.h\"\n#include \"sstring.h\"\n#include \"ds.h\"\n#include \"read.h\"\n#include \"util.h\"\n\n/**\n * Classes and routines for reading reads from various input sources.\n */\n\nusing namespace std;\n\n/**\n * Calculate a per-read random seed based on a combination of\n * the read data (incl. sequence, name, quals) and the global\n * seed in '_randSeed'.\n */\nstatic inline uint32_t genRandSeed(const BTDnaString& qry,\n                                   const BTString& qual,\n                                   const BTString& name,\n                                   uint32_t seed)\n{\n\t// Calculate a per-read random seed based on a combination of\n\t// the read data (incl. 
sequence, name, quals) and the global\n\t// seed\n\tuint32_t rseed = (seed + 101) * 59 * 61 * 67 * 71 * 73 * 79 * 83;\n\tsize_t qlen = qry.length();\n\t// Throw all the characters of the read into the random seed\n\tfor(size_t i = 0; i < qlen; i++) {\n\t\tint p = (int)qry[i];\n\t\tassert_leq(p, 4);\n\t\tsize_t off = ((i & 15) << 1);\n\t\trseed ^= (p << off);\n\t}\n\t// Throw all the quality values for the read into the random\n\t// seed\n\tfor(size_t i = 0; i < qlen; i++) {\n\t\tint p = (int)qual[i];\n\t\tassert_leq(p, 255);\n\t\tsize_t off = ((i & 3) << 3);\n\t\trseed ^= (p << off);\n\t}\n\t// Throw all the characters in the read name into the random\n\t// seed\n\tsize_t namelen = name.length();\n\tfor(size_t i = 0; i < namelen; i++) {\n\t\tint p = (int)name[i];\n\t\tif(p == '/') break;\n\t\tassert_leq(p, 255);\n\t\tsize_t off = ((i & 3) << 3);\n\t\trseed ^= (p << off);\n\t}\n\treturn rseed;\n}\n\n/**\n * Parameters affecting how reads are read in.\n */\nstruct PatternParams {\n\tPatternParams(\n\t\tint format_,\n\t\tbool fileParallel_,\n\t\tuint32_t seed_,\n\t\tbool useSpinlock_,\n\t\tbool solexa64_,\n\t\tbool phred64_,\n\t\tbool intQuals_,\n\t\tbool fuzzy_,\n\t\tint sampleLen_,\n\t\tint sampleFreq_,\n\t\tuint32_t skip_) :\n\t\tformat(format_),\n\t\tfileParallel(fileParallel_),\n\t\tseed(seed_),\n\t\tuseSpinlock(useSpinlock_),\n\t\tsolexa64(solexa64_),\n\t\tphred64(phred64_),\n\t\tintQuals(intQuals_),\n\t\tfuzzy(fuzzy_),\n\t\tsampleLen(sampleLen_),\n\t\tsampleFreq(sampleFreq_),\n\t\tskip(skip_) { }\n\n\tint format;           // file format\n\tbool fileParallel;    // true -> wrap files with separate PairedPatternSources\n\tuint32_t seed;        // pseudo-random seed\n\tbool useSpinlock;     // use spin locks instead of pthreads\n\tbool solexa64;        // true -> qualities are on solexa64 scale\n\tbool phred64;         // true -> qualities are on phred64 scale\n\tbool intQuals;        // true -> qualities are space-separated numbers\n\tbool fuzzy;           // 
true -> try to parse fuzzy fastq\n\tint sampleLen;        // length of sampled reads for FastaContinuous...\n\tint sampleFreq;       // frequency of sampled reads for FastaContinuous...\n\tuint32_t skip;        // skip the first 'skip' patterns\n};\n\n/**\n * Encapsulates a synchronized source of patterns; usually a file.\n * Optionally reverses reads and quality strings before returning them,\n * though that is usually more efficiently done by the concrete\n * subclass.  Concrete subclasses should delimit critical sections with\n * calls to lock() and unlock().\n */\nclass PatternSource {\n\npublic:\n\n\tPatternSource(const PatternParams& p) :\n\t\tseed_(p.seed),\n\t\treadCnt_(0),\n\t\tnumWrappers_(0),\n\t\tdoLocking_(true),\n\t\tuseSpinlock_(p.useSpinlock),\n\t\tmutex()\n\t{\n\t}\n\n\tvirtual ~PatternSource() { }\n\n\t/**\n\t * Call this whenever this PatternSource is wrapped by a new\n\t * WrappedPatternSourcePerThread.  This helps us keep track of\n\t * whether locks will be contended.\n\t */\n\tvoid addWrapper() {\n\t\tlock();\n\t\tnumWrappers_++;\n\t\tunlock();\n\t}\n\t\n\t/**\n\t * The main member function for dispensing patterns.\n\t *\n\t * Returns true iff a pair was parsed successfully.\n\t */\n\tvirtual bool nextReadPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired,\n\t\tbool fixName);\n\n\t/**\n\t * The main member function for dispensing patterns.\n\t */\n\tvirtual bool nextRead(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done);\n\n\t/**\n\t * Implementation to be provided by concrete subclasses.  An\n\t * implementation for this member is only relevant for formats that\n\t * can read in a pair of reads in a single transaction with a\n\t * single input source.  
If paired-end input is given as a pair of\n\t * parallel files, this member should throw an error and exit.\n\t */\n\tvirtual bool nextReadPairImpl(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired) = 0;\n\n\t/**\n\t * Implementation to be provided by concrete subclasses.  An\n\t * implementation for this member is only relevant for formats\n\t * where individual input sources look like single-end-read\n\t * sources, e.g., formats where paired-end reads are specified in\n\t * parallel read files.\n\t */\n\tvirtual bool nextReadImpl(\n\t\tRead& r,\n\t\tTReadId& rdid, \n\t\tTReadId& endid, \n\t\tbool& success,\n\t\tbool& done) = 0;\n\n\t/// Reset state to start over again with the first read\n\tvirtual void reset() { readCnt_ = 0; }\n\n\t/**\n\t * Concrete subclasses call lock() to enter a critical region.\n\t * What constitutes a critical region depends on the subclass.\n\t */\n\tvoid lock() {\n\t\tif(!doLocking_) return; // no contention\n        mutex.lock();\n\t}\n\n\t/**\n\t * Concrete subclasses call unlock() to exit a critical region\n\t * What constitutes a critical region depends on the subclass.\n\t */\n\tvoid unlock() {\n\t\tif(!doLocking_) return; // no contention\n        mutex.unlock();\n\t}\n\n\t/**\n\t * Return a new dynamically allocated PatternSource for the given\n\t * format, using the given list of strings as the filenames to read\n\t * from or as the sequences themselves (i.e. 
if -c was used).\n\t */\n\tstatic PatternSource* patsrcFromStrings(\n                                            const PatternParams& p,\n                                            const EList<string>& qs,\n                                            int nthreads);\n\n\t/**\n\t * Return the number of reads attempted.\n\t */\n\tTReadId readCnt() const { return readCnt_ - 1; }\n\nprotected:\n\n\tuint32_t seed_;\n\n\t/// The number of reads read by this PatternSource\n\tTReadId readCnt_;\n\n\tint numWrappers_;      /// # threads that own a wrapper for this PatternSource\n\tbool doLocking_;       /// override whether to lock (true = don't override)\n\t/// User can ask to use the normal pthreads-style lock even if\n\t/// spinlocks are enabled and compiled in.  This is sometimes better\n\t/// if we expect bad I/O latency on some reads.\n\tbool useSpinlock_;\n\tMUTEX_T mutex;\n};\n\n/**\n * Abstract parent class for synchronized sources of paired-end reads\n * (and possibly also single-end reads).\n */\nclass PairedPatternSource {\npublic:\n\tPairedPatternSource(const PatternParams& p) : mutex_m(), seed_(p.seed) {}\n\tvirtual ~PairedPatternSource() { }\n\n\tvirtual void addWrapper() = 0;\n\tvirtual void reset() = 0;\n\t\n\tvirtual bool nextReadPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired,\n\t\tbool fixName) = 0;\n\t\n\tvirtual pair<TReadId, TReadId> readCnt() const = 0;\n\n\t/**\n\t * Lock this PairedPatternSource, usually because one of its shared\n\t * fields is being updated.\n\t */\n\tvoid lock() {\n\t\tmutex_m.lock();\n\t}\n\n\t/**\n\t * Unlock this PairedPatternSource.\n\t */\n\tvoid unlock() {\n\t\tmutex_m.unlock();\n\t}\n\n\t/**\n\t * Given the values for all of the various arguments used to specify\n\t * the read and quality input, create a list of pattern sources to\n\t * dispense them.\n\t */\n\tstatic PairedPatternSource* setupPatternSources(\n\t\tconst EList<string>& si,    
// singles, from argv\n\t\tconst EList<string>& m1,    // mate1's, from -1 arg\n\t\tconst EList<string>& m2,    // mate2's, from -2 arg\n\t\tconst EList<string>& m12,   // both mates on each line, from --12 arg\n#ifdef USE_SRA\n        const EList<string>& sra_accs,\n#endif\n\t\tconst EList<string>& q,     // qualities associated with singles\n\t\tconst EList<string>& q1,    // qualities associated with m1\n\t\tconst EList<string>& q2,    // qualities associated with m2\n\t\tconst PatternParams& p,     // read-in params\n                                                    int nthreads,\n\t\tbool verbose);              // be talkative?\n\nprotected:\n\n\tMUTEX_T mutex_m; /// mutex for syncing over critical regions\n\tuint32_t seed_;\n};\n\n/**\n * Encapsulates a synchronized source of both paired-end reads and\n * unpaired reads, where the paired-end must come from parallel files.\n */\nclass PairedSoloPatternSource : public PairedPatternSource {\n\npublic:\n\n\tPairedSoloPatternSource(\n\t\tconst EList<PatternSource*>* src,\n\t\tconst PatternParams& p) :\n\t\tPairedPatternSource(p),\n\t\tcur_(0),\n\t\tsrc_(src)\n\t{\n\t\tassert(src_ != NULL);\n\t\tfor(size_t i = 0; i < src_->size(); i++) {\n\t\t\tassert((*src_)[i] != NULL);\n\t\t}\n\t}\n\n\tvirtual ~PairedSoloPatternSource() { delete src_; }\n\n\t/**\n\t * Call this whenever this PairedPatternSource is wrapped by a new\n\t * WrappedPatternSourcePerThread.  
This helps us keep track of\n\t * whether locks within PatternSources will be contended.\n\t */\n\tvirtual void addWrapper() {\n\t\tfor(size_t i = 0; i < src_->size(); i++) {\n\t\t\t(*src_)[i]->addWrapper();\n\t\t}\n\t}\n\n\t/**\n\t * Reset this object and all the PatternSources under it so that\n\t * the next call to nextReadPair gets the very first read pair.\n\t */\n\tvirtual void reset() {\n\t\tfor(size_t i = 0; i < src_->size(); i++) {\n\t\t\t(*src_)[i]->reset();\n\t\t}\n\t\tcur_ = 0;\n\t}\n\n\t/**\n\t * The main member function for dispensing pairs of reads or\n\t * singleton reads.  Returns true iff ra and rb contain a new\n\t * pair; returns false if ra contains a new unpaired read.\n\t */\n\tvirtual bool nextReadPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired,\n\t\tbool fixName);\n\n\t/**\n\t * Return the number of reads attempted.\n\t */\n\tvirtual pair<TReadId, TReadId> readCnt() const {\n\t\tuint64_t ret = 0llu;\n\t\tfor(size_t i = 0; i < src_->size(); i++) ret += (*src_)[i]->readCnt();\n\t\treturn make_pair(ret, 0llu);\n\t}\n\nprotected:\n\n\tvolatile uint32_t cur_; // current element in parallel srca_, srcb_ vectors\n\tconst EList<PatternSource*>* src_; /// PatternSources for paired-end reads\n};\n\n/**\n * Encapsulates a synchronized source of both paired-end reads and\n * unpaired reads, where the paired-end must come from parallel files.\n */\nclass PairedDualPatternSource : public PairedPatternSource {\n\npublic:\n\n\tPairedDualPatternSource(\n\t\tconst EList<PatternSource*>* srca,\n\t\tconst EList<PatternSource*>* srcb,\n\t\tconst PatternParams& p) :\n\t\tPairedPatternSource(p), cur_(0), srca_(srca), srcb_(srcb)\n\t{\n\t\tassert(srca_ != NULL);\n\t\tassert(srcb_ != NULL);\n\t\t// srca_ and srcb_ must be parallel\n\t\tassert_eq(srca_->size(), srcb_->size());\n\t\tfor(size_t i = 0; i < srca_->size(); i++) {\n\t\t\t// Can't have NULL first-mate sources.  
Second-mate sources\n\t\t\t// can be NULL, in the case when the corresponding first-\n\t\t\t// mate source is unpaired.\n\t\t\tassert((*srca_)[i] != NULL);\n\t\t\tfor(size_t j = 0; j < srcb_->size(); j++) {\n\t\t\t\tassert_neq((*srca_)[i], (*srcb_)[j]);\n\t\t\t}\n\t\t}\n\t}\n\n\tvirtual ~PairedDualPatternSource() {\n\t\tdelete srca_;\n\t\tdelete srcb_;\n\t}\n\n\t/**\n\t * Call this whenever this PairedPatternSource is wrapped by a new\n\t * WrappedPatternSourcePerThread.  This helps us keep track of\n\t * whether locks within PatternSources will be contended.\n\t */\n\tvirtual void addWrapper() {\n\t\tfor(size_t i = 0; i < srca_->size(); i++) {\n\t\t\t(*srca_)[i]->addWrapper();\n\t\t\tif((*srcb_)[i] != NULL) {\n\t\t\t\t(*srcb_)[i]->addWrapper();\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Reset this object and all the PatternSources under it so that\n\t * the next call to nextReadPair gets the very first read pair.\n\t */\n\tvirtual void reset() {\n\t\tfor(size_t i = 0; i < srca_->size(); i++) {\n\t\t\t(*srca_)[i]->reset();\n\t\t\tif((*srcb_)[i] != NULL) {\n\t\t\t\t(*srcb_)[i]->reset();\n\t\t\t}\n\t\t}\n\t\tcur_ = 0;\n\t}\n\n\t/**\n\t * The main member function for dispensing pairs of reads or\n\t * singleton reads.  
Returns true iff ra and rb contain a new\n\t * pair; returns false if ra contains a new unpaired read.\n\t */\n\tvirtual bool nextReadPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired,\n\t\tbool fixName);\n\t\n\t/**\n\t * Return the number of reads attempted.\n\t */\n\tvirtual pair<TReadId, TReadId> readCnt() const;\n\nprotected:\n\n\tvolatile uint32_t cur_; // current element in parallel srca_, srcb_ vectors\n\tconst EList<PatternSource*>* srca_; /// PatternSources for 1st mates and/or unpaired reads\n\tconst EList<PatternSource*>* srcb_; /// PatternSources for 2nd mates\n};\n\n/**\n * Encapsulates a single thread's interaction with the PatternSource.\n * Most notably, this class holds the buffers into which the\n * PatternSource will write sequences.  This class is *not* threadsafe\n * - it doesn't need to be since there's one per thread.  PatternSource\n * is thread-safe.\n */\nclass PatternSourcePerThread {\n\npublic:\n\n\tPatternSourcePerThread() :\n\t\tbuf1_(), buf2_(), rdid_(0xffffffff), endid_(0xffffffff) { }\n\n\tvirtual ~PatternSourcePerThread() { }\n\n\t/**\n\t * Read the next read pair.\n\t */\n\tvirtual bool nextReadPair(\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired,\n\t\tbool fixName)\n\t{\n\t\treturn success;\n\t}\n\n\tRead& bufa()             { return buf1_;    }\t\n\tRead& bufb()             { return buf2_;    }\n\tconst Read& bufa() const { return buf1_;    }\n\tconst Read& bufb() const { return buf2_;    }\n\n\tTReadId       rdid()  const { return rdid_;  }\n\tTReadId       endid() const { return endid_; }\n\tvirtual void  reset()       { rdid_ = endid_ = 0xffffffff;  }\n\t\n\t/**\n\t * Return the length of mate 1 or mate 2.\n\t */\n\tsize_t length(int mate) const {\n\t\treturn (mate == 1) ? 
buf1_.length() : buf2_.length();\n\t}\n\nprotected:\n\n\tRead  buf1_;    // read buffer for mate a\n\tRead  buf2_;    // read buffer for mate b\n\tTReadId rdid_;  // index of read just read\n\tTReadId endid_; // index of read just read\n};\n\n/**\n * Abstract parent factory for PatternSourcePerThreads.\n */\nclass PatternSourcePerThreadFactory {\npublic:\n\tvirtual ~PatternSourcePerThreadFactory() { }\n\tvirtual PatternSourcePerThread* create() const = 0;\n\tvirtual EList<PatternSourcePerThread*>* create(uint32_t n) const = 0;\n\n\t/// Free memory associated with a pattern source\n\tvirtual void destroy(PatternSourcePerThread* patsrc) const {\n\t\tassert(patsrc != NULL);\n\t\t// Free the PatternSourcePerThread\n\t\tdelete patsrc;\n\t}\n\n\t/// Free memory associated with a pattern source list\n\tvirtual void destroy(EList<PatternSourcePerThread*>* patsrcs) const {\n\t\tassert(patsrcs != NULL);\n\t\t// Free all of the PatternSourcePerThreads\n\t\tfor(size_t i = 0; i < patsrcs->size(); i++) {\n\t\t\tif((*patsrcs)[i] != NULL) {\n\t\t\t\tdelete (*patsrcs)[i];\n\t\t\t\t(*patsrcs)[i] = NULL;\n\t\t\t}\n\t\t}\n\t\t// Free the vector\n\t\tdelete patsrcs;\n\t}\n};\n\n/**\n * A per-thread wrapper for a PairedPatternSource.\n */\nclass WrappedPatternSourcePerThread : public PatternSourcePerThread {\npublic:\n\tWrappedPatternSourcePerThread(PairedPatternSource& __patsrc) :\n\t\tpatsrc_(__patsrc)\n\t{\n\t\tpatsrc_.addWrapper();\n\t}\n\n\t/**\n\t * Get the next paired or unpaired read from the wrapped\n\t * PairedPatternSource.\n\t */\n\tvirtual bool nextReadPair(\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired,\n\t\tbool fixName);\n\nprivate:\n\n\t/// Container for obtaining paired reads from PatternSources\n\tPairedPatternSource& patsrc_;\n};\n\n/**\n * Abstract parent factory for PatternSourcePerThreads.\n */\nclass WrappedPatternSourcePerThreadFactory : public PatternSourcePerThreadFactory {\npublic:\n\tWrappedPatternSourcePerThreadFactory(PairedPatternSource& patsrc) 
:\n\t\tpatsrc_(patsrc) { }\n\n\t/**\n\t * Create a new heap-allocated WrappedPatternSourcePerThreads.\n\t */\n\tvirtual PatternSourcePerThread* create() const {\n\t\treturn new WrappedPatternSourcePerThread(patsrc_);\n\t}\n\n\t/**\n\t * Create a new heap-allocated vector of heap-allocated\n\t * WrappedPatternSourcePerThreads.\n\t */\n\tvirtual EList<PatternSourcePerThread*>* create(uint32_t n) const {\n\t\tEList<PatternSourcePerThread*>* v = new EList<PatternSourcePerThread*>;\n\t\tfor(size_t i = 0; i < n; i++) {\n\t\t\tv->push_back(new WrappedPatternSourcePerThread(patsrc_));\n\t\t\tassert(v->back() != NULL);\n\t\t}\n\t\treturn v;\n\t}\n\nprivate:\n\t/// Container for obtaining paired reads from PatternSources\n\tPairedPatternSource& patsrc_;\n};\n\n/// Skip to the end of the current string of newline chars and return\n/// the first character after the newline chars, or -1 for EOF\nstatic inline int getOverNewline(FileBuf& in) {\n\tint c;\n\twhile(isspace(c = in.get()));\n\treturn c;\n}\n\n/// Skip to the end of the current string of newline chars such that\n/// the next call to get() returns the first character after the\n/// whitespace\nstatic inline int peekOverNewline(FileBuf& in) {\n\twhile(true) {\n\t\tint c = in.peek();\n\t\tif(c != '\\r' && c != '\\n') {\n\t\t\treturn c;\n\t\t}\n\t\tin.get();\n\t}\n}\n\n/// Skip to the end of the current line; return the first character\n/// of the next line or -1 for EOF\nstatic inline int getToEndOfLine(FileBuf& in) {\n\twhile(true) {\n\t\tint c = in.get(); if(c < 0) return -1;\n\t\tif(c == '\\n' || c == '\\r') {\n\t\t\twhile(c == '\\n' || c == '\\r') {\n\t\t\t\tc = in.get(); if(c < 0) return -1;\n\t\t\t}\n\t\t\t// c now holds first character of next line\n\t\t\treturn c;\n\t\t}\n\t}\n}\n\n/// Skip to the end of the current line such that the next call to\n/// get() returns the first character on the next line\nstatic inline int peekToEndOfLine(FileBuf& in) {\n\twhile(true) {\n\t\tint c = in.get(); if(c < 0) return 
c;\n\t\tif(c == '\\n' || c == '\\r') {\n\t\t\tc = in.peek();\n\t\t\twhile(c == '\\n' || c == '\\r') {\n\t\t\t\tin.get(); if(c < 0) return c; // consume \\r or \\n\n\t\t\t\tc = in.peek();\n\t\t\t}\n\t\t\t// next get() gets first character of next line\n\t\t\treturn c;\n\t\t}\n\t}\n}\n\nextern void wrongQualityFormat(const BTString& read_name);\nextern void tooFewQualities(const BTString& read_name);\nextern void tooManyQualities(const BTString& read_name);\n\n/**\n * Encapsulates a source of patterns which is an in-memory vector.\n */\nclass VectorPatternSource : public PatternSource {\n\npublic:\n\n\tVectorPatternSource(\n\t\tconst EList<string>& v,\n\t\tconst PatternParams& p);\n\t\n\tvirtual ~VectorPatternSource() { }\n\t\n\tvirtual bool nextReadImpl(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done);\n\t\n\t/**\n\t * This is unused, but implementation is given for completeness.\n\t */\n\tvirtual bool nextReadPairImpl(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired);\n\t\n\tvirtual void reset() {\n\t\tPatternSource::reset();\n\t\tcur_ = skip_;\n\t\tpaired_ = false;\n\t}\n\t\nprivate:\n\n\tsize_t cur_;\n\tuint32_t skip_;\n\tbool paired_;\n\tEList<BTDnaString> v_;  // forward sequences\n\tEList<BTString> quals_; // forward qualities\n\tEList<BTString> names_; // names\n\tEList<int> trimmed3_;   // names\n\tEList<int> trimmed5_;   // names\n};\n\n/**\n *\n */\nclass BufferedFilePatternSource : public PatternSource {\npublic:\n\tBufferedFilePatternSource(\n\t\tconst EList<string>& infiles,\n\t\tconst PatternParams& p) :\n\t\tPatternSource(p),\n\t\tinfiles_(infiles),\n\t\tfilecur_(0),\n\t\tfb_(),\n\t\tskip_(p.skip),\n\t\tfirst_(true)\n\t{\n\t\tassert_gt(infiles.size(), 0);\n\t\terrs_.resize(infiles_.size());\n\t\terrs_.fill(0, infiles_.size(), false);\n\t\tassert(!fb_.isOpen());\n\t\topen(); // open first file in the 
list\n\t\tfilecur_++;\n\t}\n\n\tvirtual ~BufferedFilePatternSource() {\n\t\tif(fb_.isOpen()) fb_.close();\n\t}\n\n\t/**\n\t * Fill Read with the sequence, quality and name for the next\n\t * read in the list of read files.  This function gets called by\n\t * all the search threads, so we must handle synchronization.\n\t */\n\tvirtual bool nextReadImpl(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done)\n\t{\n\t\t// We'll be manipulating our file handle/filecur_ state\n\t\tlock();\n\t\twhile(true) {\n\t\t\tdo { read(r, rdid, endid, success, done); }\n\t\t\twhile(!success && !done);\n\t\t\tif(!success && filecur_ < infiles_.size()) {\n\t\t\t\tassert(done);\n\t\t\t\topen();\n\t\t\t\tresetForNextFile(); // reset state to handle a fresh file\n\t\t\t\tfilecur_++;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tassert(r.repOk());\n\t\t// Leaving critical region\n\t\tunlock();\n\t\treturn success;\n\t}\n\t\n\t/**\n\t *\n\t */\n\tvirtual bool nextReadPairImpl(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired)\n\t{\n\t\t// We'll be manipulating our file handle/filecur_ state\n\t\tlock();\n\t\twhile(true) {\n\t\t\tdo { readPair(ra, rb, rdid, endid, success, done, paired); }\n\t\t\twhile(!success && !done);\n\t\t\tif(!success && filecur_ < infiles_.size()) {\n\t\t\t\tassert(done);\n\t\t\t\topen();\n\t\t\t\tresetForNextFile(); // reset state to handle a fresh file\n\t\t\t\tfilecur_++;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tbreak;\n\t\t}\n\t\tassert(ra.repOk());\n\t\tassert(rb.repOk());\n\t\t// Leaving critical region\n\t\tunlock();\n\t\treturn success;\n\t}\n\t\n\t/**\n\t * Reset state so that we start reading again from the\n\t * beginning of the first file.  
Should only be called by the\n\t * master thread.\n\t */\n\tvirtual void reset() {\n\t\tPatternSource::reset();\n\t\tfilecur_ = 0;\n\t\topen();\n\t\tfilecur_++;\n\t}\n\nprotected:\n\n\t/// Read another pattern from the input file; this is overridden\n\t/// to deal with specific file formats\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done) = 0;\n\t\n\t/// Read another pattern pair from the input file; this is\n\t/// overridden to deal with specific file formats\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired) = 0;\n\t\n\t/// Reset state to handle a fresh file\n\tvirtual void resetForNextFile() { }\n\t\n\tvoid open() {\n\t\tif(fb_.isOpen()) fb_.close();\n\t\twhile(filecur_ < infiles_.size()) {\n\t\t\t// Open read\n\t\t\tFILE *in;\n\t\t\tif(infiles_[filecur_] == \"-\") {\n\t\t\t\tin = stdin;\n\t\t\t} else if((in = fopen(infiles_[filecur_].c_str(), \"rb\")) == NULL) {\n\t\t\t\tif(!errs_[filecur_]) {\n\t\t\t\t\tcerr << \"Warning: Could not open read file \\\"\" << infiles_[filecur_].c_str() << \"\\\" for reading; skipping...\" << endl;\n\t\t\t\t\terrs_[filecur_] = true;\n\t\t\t\t}\n\t\t\t\tfilecur_++;\n\t\t\t\tcontinue;\n\t\t\t}\n\t\t\tfb_.newFile(in);\n\t\t\treturn;\n\t\t}\n\t\tcerr << \"Error: No input read files were valid\" << endl;\n\t\texit(1);\n\t\treturn;\n\t}\n\t\n\tEList<string> infiles_;  // filenames for read files\n\tEList<bool> errs_;       // whether we've already printed an error for each file\n\tsize_t filecur_;         // index into infiles_ of next file to read\n\tFileBuf fb_;             // read file currently being read from\n\tTReadId skip_;           // number of reads to skip\n\tbool first_;\n};\n\n/**\n * Parse a single quality string from fb and store qualities in r.\n * Assume the next character obtained via fb.get() is the first\n * character of the quality string.  
When returning, the next\n * character returned by fb.peek() or fb.get() should be the first\n * character of the following line.\n */\nint parseQuals(\n\tRead& r,\n\tFileBuf& fb,\n\tint firstc,\n\tint readLen,\n\tint trim3,\n\tint trim5,\n\tbool intQuals,\n\tbool phred64,\n\tbool solexa64);\n\n/**\n * Synchronized concrete pattern source for a list of FASTA or CSFASTA\n * (if color = true) files.\n */\nclass FastaPatternSource : public BufferedFilePatternSource {\npublic:\n\tFastaPatternSource(const EList<string>& infiles,\n\t                   const PatternParams& p) :\n\t\tBufferedFilePatternSource(infiles, p),\n\t\tfirst_(true), solexa64_(p.solexa64), phred64_(p.phred64), intQuals_(p.intQuals)\n\t{ }\n\tvirtual void reset() {\n\t\tfirst_ = true;\n\t\tBufferedFilePatternSource::reset();\n\t}\nprotected:\n\t/**\n\t * Scan to the next FASTA record (starting with >) and return the first\n\t * character of the record (which will always be >).\n\t */\n\tstatic int skipToNextFastaRecord(FileBuf& in) {\n\t\tint c;\n\t\twhile((c = in.get()) != '>') {\n\t\t\tif(in.eof()) return -1;\n\t\t}\n\t\treturn c;\n\t}\n\n\t/// Called when we have to bail without having parsed a read.\n\tvoid bail(Read& r) {\n\t\tr.reset();\n\t\tfb_.resetLastN();\n\t}\n\n\t/// Read another pattern from a FASTA input file\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done);\n\t\n\t/// Read another pair of patterns from a FASTA input file\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired)\n\t{\n\t\t// (For now, we shouldn't ever be here)\n\t\tcerr << \"In FastaPatternSource.readPair()\" << endl;\n\t\tthrow 1;\n\t\treturn false;\n\t}\n\t\n\tvirtual void resetForNextFile() {\n\t\tfirst_ = true;\n\t}\n\t\nprivate:\n\tbool first_;\n    \npublic:\n\tbool solexa64_;\n\tbool phred64_;\n\tbool intQuals_;\n};\n\n\n/**\n * Tokenize a line of 
space-separated integer quality values.\n */\nstatic inline bool tokenizeQualLine(\n\tFileBuf& filebuf,\n\tchar *buf,\n\tsize_t buflen,\n\tEList<string>& toks)\n{\n\tsize_t rd = filebuf.gets(buf, buflen);\n\tif(rd == 0) return false;\n\tassert(NULL == strrchr(buf, '\\n'));\n\ttokenize(string(buf), \" \", toks);\n\treturn true;\n}\n\n/**\n * Synchronized concrete pattern source for a list of files with tab-\n * delimited name, seq, qual fields (or, for paired-end reads,\n * basename, seq1, qual1, seq2, qual2).\n */\nclass TabbedPatternSource : public BufferedFilePatternSource {\n\npublic:\n\n\tTabbedPatternSource(\n\t\tconst EList<string>& infiles,\n\t\tconst PatternParams& p,\n\t\tbool  secondName) :\n\t\tBufferedFilePatternSource(infiles, p),\n\t\tsolQuals_(p.solexa64),\n\t\tphred64Quals_(p.phred64),\n\t\tintQuals_(p.intQuals),\n\t\tsecondName_(secondName) { }\n\nprotected:\n\n\t/// Read another pattern from a tab-delimited input file\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done);\n\n\t/// Read another pair of patterns from a tab-delimited input file\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired);\n\t\nprivate:\n\n\t/**\n\t * Parse a name from fb_ and store in r.  Assume that the next\n\t * character obtained via fb_.get() is the first character of\n\t * the name and the string stops at the next char upto (could\n\t * be tab, newline, etc.).\n\t */\n\tint parseName(Read& r, Read* r2, char upto = '\\t');\n\n\t/**\n\t * Parse a single sequence from fb_ and store in r.  
Assume\n\t * that the next character obtained via fb_.get() is the first\n\t * character of the sequence and the sequence stops at the next\n\t * char upto (could be tab, newline, etc.).\n\t */\n\tint parseSeq(Read& r, int& charsRead, int& trim5, char upto = '\\t');\n\n\t/**\n\t * Parse a single quality string from fb_ and store in r.\n\t * Assume that the next character obtained via fb_.get() is\n\t * the first character of the quality string and the string stops\n\t * at the next char upto (could be tab, newline, etc.).\n\t */\n\tint parseQuals(Read& r, int charsRead, int dstLen, int trim5,\n\t               char& c2, char upto = '\\t', char upto2 = -1);\n\n\tbool solQuals_;\n\tbool phred64Quals_;\n\tbool intQuals_;\n\tEList<string> qualToks_;\n\tbool secondName_;\n};\n\n/**\n * Synchronized concrete pattern source for Illumina Qseq files.  In\n * Qseq files, each read appears on a separate line and the tab-\n * delimited fields are:\n *\n * 1. Machine name\n * 2. Run number\n * 3. Lane number\n * 4. Tile number\n * 5. X coordinate of spot\n * 6. Y coordinate of spot\n * 7. Index: \"Index sequence or 0. For no indexing, or for a file that\n *    has not been demultiplexed yet, this field should have a value of\n *    0.\"\n * 8. Read number: 1 for unpaired, 1 or 2 for paired\n * 9. Sequence\n * 10. Quality\n * 11. Filter: 1 = passed, 0 = didn't\n */\nclass QseqPatternSource : public BufferedFilePatternSource {\n\npublic:\n\n\tQseqPatternSource(\n\t\tconst EList<string>& infiles,\n\t    const PatternParams& p) :\n\t\tBufferedFilePatternSource(infiles, p),\n\t\tsolQuals_(p.solexa64),\n\t\tphred64Quals_(p.phred64),\n\t\tintQuals_(p.intQuals) { }\n\nprotected:\n\n#define BAIL_UNPAIRED() { \\\n\tpeekOverNewline(fb_); \\\n\tr.reset(); \\\n\tsuccess = false; \\\n\tdone = true; \\\n\treturn success; \\\n}\n\n\t/**\n\t * Parse a name from fb_ and store in r.  
Assume that the next\n\t * character obtained via fb_.get() is the first character of\n\t * the sequence and the string stops at the next char upto (could\n\t * be tab, newline, etc.).\n\t */\n\tint parseName(\n\t\tRead& r,      // buffer for mate 1\n\t\tRead* r2,     // buffer for mate 2 (NULL if mate2 is read separately)\n\t\tbool append,     // true -> append characters, false -> skip them\n\t\tbool clearFirst, // clear the name buffer first\n\t\tbool warnEmpty,  // emit a warning if nothing was added to the name\n\t\tbool useDefault, // if nothing is read, put readCnt_ as a default value\n\t\tint upto);       // stop parsing when we first reach character 'upto'\n\n\t/**\n\t * Parse a single sequence from fb_ and store in r.  Assume\n\t * that the next character obtained via fb_.get() is the first\n\t * character of the sequence and the sequence stops at the next\n\t * char upto (could be tab, newline, etc.).\n\t */\n\tint parseSeq(\n\t\tRead& r,      // buffer for read\n\t\tint& charsRead,\n\t\tint& trim5,\n\t\tchar upto);\n\n\t/**\n\t * Parse a single quality string from fb_ and store in r.\n\t * Assume that the next character obtained via fb_.get() is\n\t * the first character of the quality string and the string stops\n\t * at the next char upto (could be tab, newline, etc.).\n\t */\n\tint parseQuals(\n\t\tRead& r,      // buffer for read\n\t\tint charsRead,\n\t\tint dstLen,\n\t\tint trim5,\n\t\tchar& c2,\n\t\tchar upto,\n\t\tchar upto2);\n\n\t/**\n\t * Read another pattern from a Qseq input file.\n\t */\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done);\n\n\t/**\n\t * Read a pair of patterns from 1 Qseq file.  
Note: this is never used.\n\t */\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired)\n\t{\n\t\t// (For now, we shouldn't ever be here)\n\t\tcerr << \"In QseqPatternSource.readPair()\" << endl;\n\t\tthrow 1;\n\t\treturn false;\n\t}\n\n\tbool solQuals_;\n\tbool phred64Quals_;\n\tbool intQuals_;\n\tEList<string> qualToks_;\n};\n\n/**\n * Synchronized concrete pattern source for a list of FASTA files where\n * reads need to be extracted from long continuous sequences.\n */\nclass FastaContinuousPatternSource : public BufferedFilePatternSource {\npublic:\n\tFastaContinuousPatternSource(const EList<string>& infiles, const PatternParams& p) :\n\t\tBufferedFilePatternSource(infiles, p),\n\t\tlength_(p.sampleLen), freq_(p.sampleFreq),\n\t\teat_(length_-1), beginning_(true),\n\t\tbufCur_(0), subReadCnt_(0llu)\n\t{\n\t\tresetForNextFile();\n\t}\n\n\tvirtual void reset() {\n\t\tBufferedFilePatternSource::reset();\n\t\tresetForNextFile();\n\t}\n\nprotected:\n\n\t/// Read another pattern from a FASTA input file\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done)\n\t{\n\t\tsuccess = true;\n\t\tdone = false;\n\t\tr.reset();\n\t\twhile(true) {\n\t\t\tr.color = gColor;\n\t\t\tint c = fb_.get();\n\t\t\tif(c < 0) { success = false; done = true; return success; }\n\t\t\tif(c == '>') {\n\t\t\t\tresetForNextFile();\n\t\t\t\tc = fb_.peek();\n\t\t\t\tbool sawSpace = false;\n\t\t\t\twhile(c != '\\n' && c != '\\r') {\n\t\t\t\t\tif(!sawSpace) {\n\t\t\t\t\t\tsawSpace = isspace(c);\n\t\t\t\t\t}\n\t\t\t\t\tif(!sawSpace) {\n\t\t\t\t\t\tnameBuf_.append(c);\n\t\t\t\t\t}\n\t\t\t\t\tfb_.get();\n\t\t\t\t\tc = fb_.peek();\n\t\t\t\t}\n\t\t\t\twhile(c == '\\n' || c == '\\r') {\n\t\t\t\t\tfb_.get();\n\t\t\t\t\tc = fb_.peek();\n\t\t\t\t}\n\t\t\t\tnameBuf_.append('_');\n\t\t\t} else {\n\t\t\t\tint cat = asc2dnacat[c];\n\t\t\t\tif(cat >= 2) c = 
'N';\n\t\t\t\tif(cat == 0) {\n\t\t\t\t\t// Encountered non-DNA, non-IUPAC char; skip it\n\t\t\t\t\tcontinue;\n\t\t\t\t} else {\n\t\t\t\t\t// DNA char\n\t\t\t\t\tbuf_[bufCur_++] = c;\n\t\t\t\t\tif(bufCur_ == 1024) bufCur_ = 0;\n\t\t\t\t\tif(eat_ > 0) {\n\t\t\t\t\t\teat_--;\n\t\t\t\t\t\t// Try to keep readCnt_ aligned with the offset\n\t\t\t\t\t\t// into the reference; that lets us see where\n\t\t\t\t\t\t// the sampling gaps are by looking at the read\n\t\t\t\t\t\t// name\n\t\t\t\t\t\tif(!beginning_) readCnt_++;\n\t\t\t\t\t\tcontinue;\n\t\t\t\t\t}\n\t\t\t\t\tfor(size_t i = 0; i < length_; i++) {\n\t\t\t\t\t\tif(length_ - i <= bufCur_) {\n\t\t\t\t\t\t\tc = buf_[bufCur_ - (length_ - i)];\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\t// Rotate\n\t\t\t\t\t\t\tc = buf_[bufCur_ - (length_ - i) + 1024];\n\t\t\t\t\t\t}\n\t\t\t\t\t\tr.patFw.append(asc2dna[c]);\n\t\t\t\t\t\tr.qual.append('I');\n\t\t\t\t\t}\n\t\t\t\t\t// Set up a default name if one hasn't been set\n\t\t\t\t\tr.name = nameBuf_;\n\t\t\t\t\tchar cbuf[20];\n\t\t\t\t\titoa10<TReadId>(readCnt_ - subReadCnt_, cbuf);\n\t\t\t\t\tr.name.append(cbuf);\n\t\t\t\t\teat_ = freq_-1;\n\t\t\t\t\treadCnt_++;\n\t\t\t\t\tbeginning_ = false;\n\t\t\t\t\trdid = endid = readCnt_-1;\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n\t\n\t/// Shouldn't ever be here; it's not sensible to obtain read pairs\n\t// from a continuous input.\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired)\n\t{\n\t\tcerr << \"In FastaContinuousPatternSource.readPair()\" << endl;\n\t\tthrow 1;\n\t\treturn false;\n\t}\n\n\t/**\n\t * Reset state to be read for the next file.\n\t */\n\tvirtual void resetForNextFile() {\n\t\teat_ = length_-1;\n\t\tbeginning_ = true;\n\t\tbufCur_ = 0;\n\t\tnameBuf_.clear();\n\t\tsubReadCnt_ = readCnt_;\n\t}\n\nprivate:\n\tsize_t length_;     /// length of reads to generate\n\tsize_t freq_;       /// frequency to sample 
reads\n\tsize_t eat_;        /// number of characters we need to skip before\n\t                    /// we have flushed all of the ambiguous or\n\t                    /// non-existent characters out of our read\n\t                    /// window\n\tbool beginning_;    /// skipping over the first read length?\n\tchar buf_[1024];    /// read buffer\n\tBTString nameBuf_;  /// read buffer for name of fasta record being\n\t                    /// split into mers\n\tsize_t bufCur_;     /// buffer cursor; points to where we should\n\t                    /// insert the next character\n\tuint64_t subReadCnt_;/// number to subtract from readCnt_ to get\n\t                    /// the pat id to output (so it resets to 0 for\n\t                    /// each new sequence)\n};\n\n/**\n * Read a FASTQ-format file.\n * See: http://maq.sourceforge.net/fastq.shtml\n */\nclass FastqPatternSource : public BufferedFilePatternSource {\n\npublic:\n\n\tFastqPatternSource(const EList<string>& infiles, const PatternParams& p) :\n\t\tBufferedFilePatternSource(infiles, p),\n\t\tfirst_(true),\n\t\tsolQuals_(p.solexa64),\n\t\tphred64Quals_(p.phred64),\n\t\tintQuals_(p.intQuals),\n\t\tfuzzy_(p.fuzzy)\n\t{ }\n\t\n\tvirtual void reset() {\n\t\tfirst_ = true;\n\t\tfb_.resetLastN();\n\t\tBufferedFilePatternSource::reset();\n\t}\n\t\nprotected:\n\n\t/**\n\t * Scan to the next FASTQ record (starting with @) and return the first\n\t * character of the record (which will always be @).  
Since the quality\n\t * line may start with @, we keep scanning until we've seen a line\n\t * beginning with @ where the line two lines back began with +.\n\t */\n\tstatic int skipToNextFastqRecord(FileBuf& in, bool sawPlus) {\n\t\tint line = 0;\n\t\tint plusLine = -1;\n\t\tint c = in.get();\n\t\tint firstc = c;\n\t\twhile(true) {\n\t\t\tif(line > 20) {\n\t\t\t\t// If we couldn't find our desired '@' in the first 20\n\t\t\t\t// lines, it's time to give up\n\t\t\t\tif(firstc == '>') {\n\t\t\t\t\t// That firstc is '>' may be a hint that this is\n\t\t\t\t\t// actually a FASTA file, so return it intact\n\t\t\t\t\treturn '>';\n\t\t\t\t}\n\t\t\t\t// Return an error\n\t\t\t\treturn -1;\n\t\t\t}\n\t\t\tif(c == -1) return -1;\n\t\t\tif(c == '\\n') {\n\t\t\t\tc = in.get();\n\t\t\t\tif(c == '@' && sawPlus && plusLine == (line-2)) {\n\t\t\t\t\treturn '@';\n\t\t\t\t}\n\t\t\t\telse if(c == '+') {\n\t\t\t\t\t// Saw a '+' at the beginning of a line; remember where\n\t\t\t\t\t// we saw it\n\t\t\t\t\tsawPlus = true;\n\t\t\t\t\tplusLine = line;\n\t\t\t\t}\n\t\t\t\telse if(c == -1) {\n\t\t\t\t\treturn -1;\n\t\t\t\t}\n\t\t\t\tline++;\n\t\t\t}\n\t\t\tc = in.get();\n\t\t}\n\t}\n\n\t/// Read another pattern from a FASTQ input file\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done);\n\t\n\t/// Read another read pair from a FASTQ input file\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired)\n\t{\n\t\t// (For now, we shouldn't ever be here)\n\t\tcerr << \"In FastqPatternSource.readPair()\" << endl;\n\t\tthrow 1;\n\t\treturn false;\n\t}\n\t\n\tvirtual void resetForNextFile() {\n\t\tfirst_ = true;\n\t}\n\t\nprivate:\n\n\t/**\n\t * Do things we need to do if we have to bail in the middle of a\n\t * read, usually because we reached the end of the input without\n\t * finishing.\n\t */\n\tvoid bail(Read& r) 
{\n\t\tr.patFw.clear();\n\t\tfb_.resetLastN();\n\t}\n\n\tbool first_;\n\tbool solQuals_;\n\tbool phred64Quals_;\n\tbool intQuals_;\n\tbool fuzzy_;\n\tEList<string> qualToks_;\n};\n\n/**\n * Read a Raw-format file (one sequence per line).  No quality strings\n * allowed.  All qualities are assumed to be 'I' (40 on the Phred-33\n * scale).\n */\nclass RawPatternSource : public BufferedFilePatternSource {\n\npublic:\n\n\tRawPatternSource(const EList<string>& infiles, const PatternParams& p) :\n\t\tBufferedFilePatternSource(infiles, p), first_(true) { }\n\n\tvirtual void reset() {\n\t\tfirst_ = true;\n\t\tBufferedFilePatternSource::reset();\n\t}\n\nprotected:\n\n\t/// Read another pattern from a Raw input file\n\tvirtual bool read(\n\t\tRead& r,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done)\n\t{\n\t\tint c;\n\t\tsuccess = true;\n\t\tdone = false;\n\t\tr.reset();\n\t\tc = getOverNewline(this->fb_);\n\t\tif(c < 0) {\n\t\t\tbail(r); success = false; done = true; return success;\n\t\t}\n\t\tassert(!isspace(c));\n\t\tr.color = gColor;\n\t\tint mytrim5 = gTrim5;\n\t\tif(first_) {\n\t\t\t// Check that the first character is sane for a raw file\n\t\t\tint cc = c;\n\t\t\tif(gColor) {\n\t\t\t\tif(cc >= '0' && cc <= '4') cc = \"ACGTN\"[(int)cc - '0'];\n\t\t\t\tif(cc == '.') cc = 'N';\n\t\t\t}\n\t\t\tif(asc2dnacat[cc] == 0) {\n\t\t\t\tcerr << \"Error: reads file does not look like a Raw file\" << endl;\n\t\t\t\tif(c == '>') {\n\t\t\t\t\tcerr << \"Reads file looks like a FASTA file; please use -f\" << endl;\n\t\t\t\t}\n\t\t\t\tif(c == '@') {\n\t\t\t\t\tcerr << \"Reads file looks like a FASTQ file; please use -q\" << endl;\n\t\t\t\t}\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tfirst_ = false;\n\t\t}\n\t\tif(gColor) {\n\t\t\t// This may be a primer character.  
If so, keep it in the\n\t\t\t// 'primer' field of the read buf and parse the rest of the\n\t\t\t// read without it.\n\t\t\tc = toupper(c);\n\t\t\tif(asc2dnacat[c] > 0) {\n\t\t\t\t// First char is a DNA char\n\t\t\t\tint c2 = toupper(fb_.peek());\n\t\t\t\t// Second char is a color char\n\t\t\t\tif(asc2colcat[c2] > 0) {\n\t\t\t\t\tr.primer = c;\n\t\t\t\t\tr.trimc = c2;\n\t\t\t\t\tmytrim5 += 2; // trim primer and first color\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(c < 0) {\n\t\t\t\tbail(r); success = false; done = true; return success;\n\t\t\t}\n\t\t}\n\t\t// _in now points just past the first character of a sequence\n\t\t// line, and c holds the first character\n\t\tint chs = 0;\n\t\twhile(!isspace(c) && c >= 0) {\n\t\t\tif(gColor) {\n\t\t\t\tif(c >= '0' && c <= '4') c = \"ACGTN\"[(int)c - '0'];\n\t\t\t\tif(c == '.') c = 'N';\n\t\t\t}\n\t\t\t// 5' trimming\n\t\t\tif(isalpha(c) && chs >= mytrim5) {\n\t\t\t\t//size_t len = chs - mytrim5;\n\t\t\t\t//if(len >= 1024) tooManyQualities(BTString(\"(no name)\"));\n\t\t\t\tr.patFw.append(asc2dna[c]);\n\t\t\t\tr.qual.append('I');\n\t\t\t}\n\t\t\tchs++;\n\t\t\tif(isspace(fb_.peek())) break;\n\t\t\tc = fb_.get();\n\t\t}\n\t\t// 3' trimming\n\t\tr.patFw.trimEnd(gTrim3);\n\t\tr.qual.trimEnd(gTrim3);\n\t\tc = peekToEndOfLine(fb_);\n\t\tr.trimmed3 = gTrim3;\n\t\tr.trimmed5 = mytrim5;\n\t\tr.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\t\tfb_.resetLastN();\n\n\t\t// Set up name\n\t\tchar cbuf[20];\n\t\titoa10<TReadId>(readCnt_, cbuf);\n\t\tr.name.install(cbuf);\n\t\treadCnt_++;\n\n\t\trdid = endid = readCnt_-1;\n\t\treturn success;\n\t}\n\t\n\t/// Read another read pair from a FASTQ input file\n\tvirtual bool readPair(\n\t\tRead& ra,\n\t\tRead& rb,\n\t\tTReadId& rdid,\n\t\tTReadId& endid,\n\t\tbool& success,\n\t\tbool& done,\n\t\tbool& paired)\n\t{\n\t\t// (For now, we shouldn't ever be here)\n\t\tcerr << \"In RawPatternSource.readPair()\" << endl;\n\t\tthrow 1;\n\t\treturn false;\n\t}\n\t\n\tvirtual void resetForNextFile() {\n\t\tfirst_ 
= true;\n\t}\n\t\nprivate:\n\n\t/**\n\t * Do things we need to do if we have to bail in the middle of a\n\t * read, usually because we reached the end of the input without\n\t * finishing.\n\t */\n\tvoid bail(Read& r) {\n\t\tr.patFw.clear();\n\t\tfb_.resetLastN();\n\t}\n\t\n\tbool first_;\n};\n\n#ifdef USE_SRA\n\nnamespace ngs {\n    class ReadCollection;\n    class ReadIterator;\n}\n\nnamespace tthread {\n    class thread;\n};\n\nstruct SRA_Data;\n\n/**\n *\n */\nclass SRAPatternSource : public PatternSource {\npublic:\n    SRAPatternSource(\n                     const EList<string>& sra_accs,\n                     const PatternParams& p,\n                     const size_t nthreads = 1) :\n    PatternSource(p),\n    sra_accs_(sra_accs),\n    sra_acc_cur_(0),\n    skip_(p.skip),\n    first_(true),\n    nthreads_(nthreads),\n    sra_run_(NULL),\n    sra_it_(NULL),\n    sra_data_(NULL),\n    io_thread_(NULL)\n    {\n        assert_gt(sra_accs_.size(), 0);\n        errs_.resize(sra_accs_.size());\n        errs_.fill(0, sra_accs_.size(), false);\n        open(); // open first file in the list\n        sra_acc_cur_++;\n    }\n    \n    virtual ~SRAPatternSource();\n    \n    /**\n     * Fill Read with the sequence, quality and name for the next\n     * read in the list of read files.  
This function gets called by\n     * all the search threads, so we must handle synchronization.\n     */\n    virtual bool nextReadImpl(\n                              Read& r,\n                              TReadId& rdid,\n                              TReadId& endid,\n                              bool& success,\n                              bool& done)\n    {\n        // We'll be manipulating our file handle/filecur_ state\n        lock();\n        while(true) {\n            do { read(r, rdid, endid, success, done); }\n            while(!success && !done);\n            if(!success && sra_acc_cur_ < sra_accs_.size()) {\n                assert(done);\n                open();\n                resetForNextFile(); // reset state to handle a fresh file\n                sra_acc_cur_++;\n                continue;\n            }\n            break;\n        }\n        assert(r.repOk());\n        // Leaving critical region\n        unlock();\n        return success;\n    }\n    \n    /**\n     *\n     */\n    virtual bool nextReadPairImpl(\n                                  Read& ra,\n                                  Read& rb,\n                                  TReadId& rdid,\n                                  TReadId& endid,\n                                  bool& success,\n                                  bool& done,\n                                  bool& paired)\n    {\n        // We'll be manipulating our file handle/filecur_ state\n        lock();\n        while(true) {\n            do { readPair(ra, rb, rdid, endid, success, done, paired); }\n            while(!success && !done);\n            if(!success && sra_acc_cur_ < sra_accs_.size()) {\n                assert(done);\n                open();\n                resetForNextFile(); // reset state to handle a fresh file\n                sra_acc_cur_++;\n                continue;\n            }\n            break;\n        }\n        assert(ra.repOk());\n        assert(rb.repOk());\n        // Leaving critical 
region\n        unlock();\n        return success;\n    }\n    \n    /**\n     * Reset state so that we read start reading again from the\n     * beginning of the first file.  Should only be called by the\n     * master thread.\n     */\n    virtual void reset() {\n        PatternSource::reset();\n        sra_acc_cur_ = 0,\n        open();\n        sra_acc_cur_++;\n    }\n    \n    /// Read another pattern from the input file; this is overridden\n    /// to deal with specific file formats\n    virtual bool read(\n                      Read& r,\n                      TReadId& rdid,\n                      TReadId& endid,\n                      bool& success,\n                      bool& done)\n    {\n        return true;\n    }\n    \n    /// Read another pattern pair from the input file; this is\n    /// overridden to deal with specific file formats\n    virtual bool readPair(\n                          Read& ra,\n                          Read& rb,\n                          TReadId& rdid,\n                          TReadId& endid,\n                          bool& success,\n                          bool& done,\n                          bool& paired);\n    \nprotected:\n    \n    /// Reset state to handle a fresh file\n    virtual void resetForNextFile() { }\n    \n    void open();\n    \n    EList<string> sra_accs_; // filenames for read files\n    EList<bool> errs_;       // whether we've already printed an error for each file\n    size_t sra_acc_cur_;     // index into infiles_ of next file to read\n    TReadId skip_;           // number of reads to skip\n    bool first_;\n    \n    size_t nthreads_;\n    \n    ngs::ReadCollection* sra_run_;\n    ngs::ReadIterator* sra_it_;\n    \n    SRA_Data* sra_data_;\n    tthread::thread* io_thread_;\n};\n\n#endif\n\n#endif /*PAT_H_*/\n"
  },
  {
    "path": "pe.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"assert_helpers.h\"\n#include \"pe.h\"\n\nusing namespace std;\n\n/**\n * Return a PE_TYPE flag indicating, given a PE_POLICY and coordinates\n * for a paired-end alignment, what type of alignment it is, i.e.,\n * whether it's:\n *\n * 1. Straightforwardly concordant\n * 2. Mates dovetail (one extends beyond the end of the other)\n * 3. One mate contains the other but they don't dovetail\n * 4. One mate overlaps the other but neither contains the other and\n *    they don't dovetail\n * 5. 
Discordant\n */\nint PairedEndPolicy::peClassifyPair(\n\tint64_t  off1,   // offset of mate 1\n\tsize_t   len1,   // length of mate 1\n\tbool     fw1,    // whether mate 1 aligned to Watson\n\tint64_t  off2,   // offset of mate 2\n\tsize_t   len2,   // length of mate 2\n\tbool     fw2)    // whether mate 2 aligned to Watson\n\tconst\n{\n\tassert_gt(len1, 0);\n\tassert_gt(len2, 0);\n\t// Expand the maximum fragment length if necessary to accommodate\n\t// the longer mate\n\tsize_t maxfrag = maxfrag_;\n\tif(len1 > maxfrag && expandToFit_) maxfrag = len1;\n\tif(len2 > maxfrag && expandToFit_) maxfrag = len2;\n\tsize_t minfrag = minfrag_;\n\tif(minfrag < 1) {\n\t\tminfrag = 1;\n\t}\n\tbool oneLeft = false;\n\tif(pol_ == PE_POLICY_FF) {\n\t\tif(fw1 != fw2) {\n\t\t\t// Bad combination of orientations\n\t\t\treturn PE_ALS_DISCORD;\n\t\t}\n\t\toneLeft = fw1;\n\t} else if(pol_ == PE_POLICY_RR) {\n\t\tif(fw1 != fw2) {\n\t\t\t// Bad combination of orientations\n\t\t\treturn PE_ALS_DISCORD;\n\t\t}\n\t\toneLeft = !fw1;\n\t} else if(pol_ == PE_POLICY_FR) {\n\t\tif(fw1 == fw2) {\n\t\t\t// Bad combination of orientations\n\t\t\treturn PE_ALS_DISCORD;\n\t\t}\n\t\toneLeft = fw1;\n\t} else if(pol_ == PE_POLICY_RF) {\n\t\tif(fw1 == fw2) {\n\t\t\t// Bad combination of orientations\n\t\t\treturn PE_ALS_DISCORD;\n\t\t}\n\t\toneLeft = !fw1;\n\t}\n\t// Calc implied fragment size\n\tint64_t fraglo = min<int64_t>(off1, off2);\n\tint64_t fraghi = max<int64_t>(off1+len1, off2+len2);\n\tassert_gt(fraghi, fraglo);\n\tsize_t frag = (size_t)(fraghi - fraglo);\n\tif(frag > maxfrag || frag < minfrag) {\n\t\t// Pair is discordant by virtue of the extents\n\t\treturn PE_ALS_DISCORD;\n\t}\n\tint64_t lo1 = off1;\n\tint64_t hi1 = off1 + len1 - 1;\n\tint64_t lo2 = off2;\n\tint64_t hi2 = off2 + len2 - 1;\n\tbool containment = false;\n\t// Check whether one mate entirely contains the other\n\tif((lo1 >= lo2 && hi1 <= hi2) ||\n\t   (lo2 >= lo1 && hi2 <= hi1))\n\t{\n\t\tcontainment = true;\n\t}\n\tint type = 
PE_ALS_NORMAL;\n\t// Check whether one mate overlaps the other\n\tbool olap = false;\n\tif((lo1 <= lo2 && hi1 >= lo2) ||\n\t   (lo1 <= hi2 && hi1 >= hi2) ||\n\t   containment)\n\t{\n\t\t// The mates overlap\n\t\tolap = true;\n\t\tif(!olapOk_) return PE_ALS_DISCORD;\n\t\ttype = PE_ALS_OVERLAP;\n\t}\n\t// Check if the mates are in the wrong relative orientation,\n\t// without any overlap\n\tif(!olap) {\n\t\tif((oneLeft && lo2 < lo1) || (!oneLeft && lo1 < lo2)) {\n\t\t\treturn PE_ALS_DISCORD;\n\t\t}\n\t}\n\t// If one mate contained the other, report that\n\tif(containment) {\n\t\tif(!containOk_) return PE_ALS_DISCORD;\n\t\ttype = PE_ALS_CONTAIN;\n\t}\n\t// Check whether there's dovetailing; i.e. does the left mate\n\t// extend past the right end of the right mate, or vice versa\n\tif(( oneLeft && (hi1 > hi2 || lo2 < lo1)) ||\n\t   (!oneLeft && (hi2 > hi1 || lo1 < lo2)))\n\t{\n\t\tif(!dovetailOk_) return PE_ALS_DISCORD;\n\t\ttype = PE_ALS_DOVETAIL;\n\t}\n\treturn type;\n}\n\n/**\n * Given details about how one mate aligns, and some details about the\n * reference sequence it aligned to, calculate a window and orientation s.t.\n * a paired-end alignment is concordant iff the opposite mate aligns in the\n * calculated window with the calculated orientation.  The \"window\" is really a\n * constraint on which positions the extreme end of the opposite mate can fall.\n * This is a different type of constraint from the one placed on seed-extend\n * dynamic programming problems.  That constraint requires that alignments at\n * one point pass through one of a set of \"core\" columns.\n *\n * When the opposite mate is to the left, we're constraining where its\n * left-hand extreme can fall, i.e., which cells in the top row of the matrix\n * it can end in.  When the opposite mate is to the right, we're constraining\n * where its right-hand extreme can fall, i.e., which cells in the bottom row\n * of the matrix it can end in.  
However, in practice we can only constrain\n * where we start the backtrace, i.e. where the RHS of the alignment falls.\n * See frameFindMateRect for details.\n *\n * This calculation does not consider gaps - the dynamic programming framer will\n * take gaps into account.\n *\n * Returns false if no concordant alignments are possible, true otherwise.\n */\nbool PairedEndPolicy::otherMate(\n\tbool     is1,       // true -> mate 1 aligned and we're looking\n\t\t\t\t\t    // for 2, false -> vice versa\n\tbool     fw,        // orientation of aligned mate\n\tint64_t  off,       // offset into the reference sequence\n\tint64_t  maxalcols, // max # columns spanned by alignment\n\tsize_t   reflen,    // length of reference sequence aligned to\n\tsize_t   len1,      // length of mate 1\n\tsize_t   len2,      // length of mate 2\n\tbool&    oleft,     // out: true iff opp mate must be to left of anchor\n\tint64_t& oll,       // out: leftmost Watson off for LHS of opp alignment\n\tint64_t& olr,       // out: rightmost Watson off for LHS of opp alignment\n\tint64_t& orl,       // out: leftmost Watson off for RHS of opp alignment\n\tint64_t& orr,       // out: rightmost Watson off for RHS of opp alignment\n\tbool&    ofw)       // out: true iff opp mate must be on Watson strand\n\tconst\n{\n\tassert_gt(len1, 0);\n\tassert_gt(len2, 0);\n\tassert_gt(maxfrag_, 0);\n\tassert_geq(minfrag_, 0);\n\tassert_geq(maxfrag_, minfrag_);\n\tassert(maxalcols == -1 || maxalcols > 0);\n\t\n\t// Calculate whether opposite mate should align to left or to right\n\t// of given mate, and what strand it should align to\n\tpePolicyMateDir(pol_, is1, fw, oleft, ofw);\n\t\n\tsize_t alen = is1 ? 
len1 : len2; // length of anchor (already-aligned) mate\n\t\n\t// Expand the maximum fragment length if necessary to accommodate\n\t// the longer mate\n\tsize_t maxfrag = maxfrag_;\n\tsize_t minfrag = minfrag_;\n\tif(minfrag < 1) {\n\t\tminfrag = 1;\n\t}\n\tif(len1 > maxfrag && expandToFit_) maxfrag = len1;\n\tif(len2 > maxfrag && expandToFit_) maxfrag = len2;\n\tif(!expandToFit_ && (len1 > maxfrag || len2 > maxfrag)) {\n\t\t// Not possible to find a concordant alignment; one of the\n\t\t// mates is too long\n\t\treturn false;\n\t}\n\t\n\t// Now calculate bounds within which a dynamic programming\n\t// algorithm should search for an alignment for the opposite mate\n\tif(oleft) {\n\t\t//    -----------FRAG MAX----------------\n\t\t//                 -------FRAG MIN-------\n\t\t//                               |-alen-|\n\t\t//                             Anchor mate\n\t\t//                               ^off\n\t\t//                  |------|\n\t\t//       Not concordant: LHS not outside min\n\t\t//                 |------|\n\t\t//                Concordant\n\t\t//      |------|\n\t\t//     Concordant\n\t\t//  |------|\n\t\t// Not concordant: LHS outside max\n\t\t\n\t\t//    -----------FRAG MAX----------------\n\t\t//                 -------FRAG MIN-------\n\t\t//                               |-alen-|\n\t\t//                             Anchor mate\n\t\t//                               ^off\n\t\t//    |------------|\n\t\t// LHS can't be outside this range\n\t\t//                               -----------FRAG MAX----------------\n\t\t//    |------------------------------------------------------------|\n\t\t// LHS can't be outside this range, assuming no restrictions on\n\t\t// flipping, dovetailing, containment, overlap, etc.\n\t\t//                                      |-------|\n\t\t//                                      maxalcols\n\t\t//    |-----------------------------------------|\n\t\t// LHS can't be outside this range, assuming no flipping\n\t\t//    
|---------------------------------|\n\t\t// LHS can't be outside this range, assuming no dovetailing\n\t\t//    |-------------------------|\n\t\t// LHS can't be outside this range, assuming no overlap\n\n\t\toll = off + alen - maxfrag;\n\t\tolr = off + alen - minfrag;\n\t\tassert_geq(olr, oll);\n\t\t\n\t\torl = oll;\n\t\torr = off + maxfrag - 1;\n\t\tassert_geq(olr, oll);\n\n\t\t// What if overlapping alignments are not allowed?\n\t\tif(!olapOk_) {\n\t\t\t// RHS can't be flush with or to the right of off\n\t\t\torr = min<int64_t>(orr, off-1);\n\t\t\tif(orr < olr) olr = orr;\n\t\t\tassert_leq(oll, olr);\n\t\t\tassert_leq(orl, orr);\n\t\t\tassert_geq(orr, olr);\n\t\t}\n\t\t// What if dovetail alignments are not allowed?\n\t\telse if(!dovetailOk_) {\n\t\t\t// RHS can't be past off+alen-1\n\t\t\torr = min<int64_t>(orr, off + alen - 1);\n\t\t\tassert_leq(oll, olr);\n\t\t\tassert_leq(orl, orr);\n\t\t}\n\t\t// What if flipped alignments are not allowed?\n\t\telse if(!flippingOk_ && maxalcols != -1) {\n\t\t\t// RHS can't be right of ???\n\t\t\torr = min<int64_t>(orr, off + alen - 1 + (maxalcols-1));\n\t\t\tassert_leq(oll, olr);\n\t\t\tassert_leq(orl, orr);\n\t\t}\n\t\tassert_geq(olr, oll);\n\t\tassert_geq(orr, orl);\n\t\tassert_geq(orr, olr);\n\t\tassert_geq(orl, oll);\n\t} else {\n\t\t//                             -----------FRAG MAX----------------\n\t\t//                             -------FRAG MIN-------\n\t\t//  -----------FRAG MAX----------------\n\t\t//                             |-alen-|\n\t\t//                           Anchor mate\n\t\t//                             ^off\n \t\t//                                          |------|\n\t\t//                            Not concordant: RHS not outside min\n\t\t//                                           |------|\n\t\t//                                          Concordant\n\t\t//                                                      |------|\n\t\t//                                                     Concordant\n\t\t// 
                                                         |------|\n\t\t//                                      Not concordant: RHS outside max\n\t\t//\n\n\t\t//                             -----------FRAG MAX----------------\n\t\t//                             -------FRAG MIN-------\n\t\t//  -----------FRAG MAX----------------\n\t\t//                             |-alen-|\n\t\t//                           Anchor mate\n\t\t//                             ^off\n\t\t//                                                  |------------|\n\t\t//                                      RHS can't be outside this range\n\t\t//  |------------------------------------------------------------|\n\t\t// LHS can't be outside this range, assuming no restrictions on\n\t\t// dovetailing, containment, overlap, etc.\n\t\t//                     |-------|\n\t\t//                     maxalcols\n\t\t//                     |-----------------------------------------|\n\t\t//             LHS can't be outside this range, assuming no flipping\n\t\t//                             |---------------------------------|\n\t\t//          LHS can't be outside this range, assuming no dovetailing\n\t\t//                                     |-------------------------|\n\t\t//              LHS can't be outside this range, assuming no overlap\n\t\t\n\t\torr = off + (maxfrag - 1);\n\t\torl  = off + (minfrag - 1);\n\t\tassert_geq(orr, orl);\n\t\t\n\t\toll = off + alen - maxfrag;\n\t\tolr = orr;\n\t\tassert_geq(olr, oll);\n\t\t\n\t\t// What if overlapping alignments are not allowed?\n\t\tif(!olapOk_) {\n\t\t\t// LHS can't be left of off+alen\n\t\t\toll = max<int64_t>(oll, off+alen);\n\t\t\tif(oll > orl) orl = oll;\n\t\t\tassert_leq(oll, olr);\n\t\t\tassert_leq(orl, orr);\n\t\t\tassert_geq(orl, oll);\n\t\t}\n\t\t// What if dovetail alignments are not allowed?\n\t\telse if(!dovetailOk_) {\n\t\t\t// LHS can't be left of off\n\t\t\toll = max<int64_t>(oll, off);\n\t\t\tassert_leq(oll, olr);\n\t\t\tassert_leq(orl, 
orr);\n\t\t}\n\t\t// What if flipped alignments are not allowed?\n\t\telse if(!flippingOk_ && maxalcols != -1) {\n\t\t\t// LHS can't be left of off - maxalcols + 1\n\t\t\toll = max<int64_t>(oll, off - maxalcols + 1);\n\t\t\tassert_leq(oll, olr);\n\t\t\tassert_leq(orl, orr);\n\t\t}\n\t\tassert_geq(olr, oll);\n\t\tassert_geq(orr, orl);\n\t\tassert_geq(orr, olr);\n\t\tassert_geq(orl, oll);\n\t}\n\n\t// Boundaries and orientation determined\n\treturn true;\n}\n\n#ifdef MAIN_PE\n\n#include <string>\n#include <sstream>\n\nvoid testCaseClassify(\n\tconst string& name,\n\tint      pol,\n\tsize_t   maxfrag,\n\tsize_t   minfrag,\n\tbool     local,\n\tbool     flip,\n\tbool     dove,\n\tbool     cont,\n\tbool     olap,\n\tbool     expand,\n\tint64_t  off1,\n\tsize_t   len1,\n\tbool     fw1,\n\tint64_t  off2,\n\tsize_t   len2,\n\tbool     fw2,\n\tint      expect_class)\n{\n\tPairedEndPolicy pepol(\n\t\tpol,\n\t\tmaxfrag,\n\t\tminfrag,\n\t\tlocal,\n\t\tflip,\n\t\tdove,\n\t\tcont,\n\t\tolap,\n\t\texpand);\n\tint ret = pepol.peClassifyPair(\n\t\toff1,   // offset of mate 1\n\t\tlen1,   // length of mate 1\n\t\tfw1,    // whether mate 1 aligned to Watson\n\t\toff2,   // offset of mate 2\n\t\tlen2,   // length of mate 2\n\t\tfw2);   // whether mate 2 aligned to Watson\n\tassert_eq(expect_class, ret);\n\tcout << \"peClassifyPair: \" << name << \"...PASSED\" << endl;\n}\n\nvoid testCaseOtherMate(\n\tconst string& name,\n\tint      pol,\n\tsize_t   maxfrag,\n\tsize_t   minfrag,\n\tbool     local,\n\tbool     flip,\n\tbool     dove,\n\tbool     cont,\n\tbool     olap,\n\tbool     expand,\n\tbool     is1,\n\tbool     fw,\n\tint64_t  off,\n\tint64_t  maxalcols,\n\tsize_t   reflen,\n\tsize_t   len1,\n\tsize_t   len2,\n\tbool     expect_ret,\n\tbool     expect_oleft,\n\tint64_t  expect_oll,\n\tint64_t  expect_olr,\n\tint64_t  expect_orl,\n\tint64_t  expect_orr,\n\tbool     expect_ofw)\n{\n\tPairedEndPolicy 
pepol(\n\t\tpol,\n\t\tmaxfrag,\n\t\tminfrag,\n\t\tlocal,\n\t\tflip,\n\t\tdove,\n\t\tcont,\n\t\tolap,\n\t\texpand);\n\tint64_t oll = 0, olr = 0;\n\tint64_t orl = 0, orr = 0;\n\tbool oleft = false, ofw = false;\n\tbool ret = pepol.otherMate(\n\t\tis1,\n\t\tfw,\n\t\toff,\n\t\tmaxalcols,\n\t\treflen,\n\t\tlen1,\n\t\tlen2,\n\t\toleft,\n\t\toll,\n\t\tolr,\n\t\torl,\n\t\torr,\n\t\tofw);\n\tassert(ret == expect_ret);\n\tif(ret) {\n\t\tassert_eq(expect_oleft, oleft);\n\t\tassert_eq(expect_oll, oll);\n\t\tassert_eq(expect_olr, olr);\n\t\tassert_eq(expect_orl, orl);\n\t\tassert_eq(expect_orr, orr);\n\t\tassert_eq(expect_ofw, ofw);\n\t}\n\tcout << \"otherMate: \" << name << \"...PASSED\" << endl;\n}\n\nint main(int argc, char **argv) {\n\n\t// Set of 8 cases where we look for the opposite mate to the right\n\t// of the anchor mate, with various combinations of policies and\n\t// anchor-mate orientations.\n\n\t// |--------|\n\t//           |--------|\n\t//           ^110     ^119\n\t// |------------------|\n\t//      min frag\n\t//                     |--------|\n\t//                     ^120     ^129\n\t// |----------------------------|\n\t//           max frag\n\t// ^\n\t// 100\n\n\t{\n\tint  policies[] = { PE_POLICY_FF, PE_POLICY_RR, PE_POLICY_FR, PE_POLICY_RF, PE_POLICY_FF, PE_POLICY_RR, PE_POLICY_FR, PE_POLICY_RF };\n\tbool is1[]      = { true,  true,   true,  true, false, false, false, false };\n\tbool fw[]       = { true,  false,  true, false, false,  true,  true, false };\n\tbool oleft[]    = { false, false, false, false, false, false, false, false };\n\tbool ofw[]      = { true,  false, false,  true, false,  true, false,  true };\n\n\tfor(int i = 0; i < 8; i++) {\n\t\tostringstream oss;\n\t\toss << \"Simple\";\n\t\toss << i;\n\t\ttestCaseOtherMate(\n\t\t\toss.str(),\n\t\t\tpolicies[i],  // policy\n\t\t\t30,           // maxfrag\n\t\t\t20,           // minfrag\n\t\t\tfalse,        // local\n\t\t\ttrue,         // flipping OK\n\t\t\ttrue,         // dovetail 
OK\n\t\t\ttrue,         // containment OK\n\t\t\ttrue,         // overlap OK\n\t\t\ttrue,         // expand-to-fit\n\t\t\tis1[i],       // mate 1 is anchor\n\t\t\tfw[i],        // anchor aligned to Watson\n\t\t\t100,          // anchor's offset into ref\n\t\t\t-1,           // max # alignment cols\n\t\t\t200,          // ref length\n\t\t\t10,           // mate 1 length\n\t\t\t10,           // mate 2 length\n\t\t\ttrue,         // expected return val from otherMate\n\t\t\toleft[i],     // whether to look for opposite to left\n\t\t\t80,           // expected leftmost pos for opp mate LHS\n\t\t\t129,          // expected rightmost pos for opp mate LHS\n\t\t\t119,          // expected leftmost pos for opp mate RHS\n\t\t\t129,          // expected rightmost pos for opp mate RHS\n\t\t\tofw[i]);      // expected orientation in which opposite mate must align\n\t}\n\t}\n\n\t// Set of 8 cases where we look for the opposite mate to the left\n\t// of the anchor mate, with various combinations of policies and\n\t// anchor-mate orientations.\n\n\t// |--------|\n\t// ^100     ^109\n\t//           |--------|\n\t//           ^110     ^119\n\t//           |------------------|\n\t//                 min frag\n\t//                     |-Anchor-|\n\t//                     ^120     ^129\n\t// |----------------------------|\n\t//           max frag\n\t// ^\n\t// 100\n\n\t{\n\tint  policies[] = { PE_POLICY_FF, PE_POLICY_RR, PE_POLICY_FR, PE_POLICY_RF, PE_POLICY_FF, PE_POLICY_RR, PE_POLICY_FR, PE_POLICY_RF };\n\tbool is1[]      = { false, false, false, false,  true,  true,  true,  true };\n\tbool fw[]       = {  true, false, false,  true, false,  true, false,  true };\n\tbool oleft[]    = {  true,  true,  true,  true,  true,  true,  true,  true };\n\tbool ofw[]      = {  true, false,  true, false, false,  true,  true, false };\n\t\n\tfor(int i = 0; i < 8; i++) {\n\t\tostringstream oss;\n\t\toss << \"Simple\";\n\t\toss << (i+8);\n\t\ttestCaseOtherMate(\n\t\t\toss.str(),\n\t\t\tpolicies[i],  
// policy\n\t\t\t30,           // maxfrag\n\t\t\t20,           // minfrag\n\t\t\tfalse,        // local\n\t\t\ttrue,         // flipping OK\n\t\t\ttrue,         // dovetail OK\n\t\t\ttrue,         // containment OK\n\t\t\ttrue,         // overlap OK\n\t\t\ttrue,         // expand-to-fit\n\t\t\tis1[i],       // mate 1 is anchor\n\t\t\tfw[i],        // anchor aligned to Watson\n\t\t\t120,          // anchor's offset into ref\n\t\t\t-1,           // max # alignment cols\n\t\t\t200,          // ref length\n\t\t\t10,           // mate 1 length\n\t\t\t10,           // mate 2 length\n\t\t\ttrue,         // expected return val from otherMate\n\t\t\toleft[i],     // whether to look for opposite to left\n\t\t\t100,          // expected leftmost pos for opp mate LHS\n\t\t\t110,          // expected rightmost pos for opp mate LHS\n\t\t\t100,          // expected leftmost pos for opp mate RHS\n\t\t\t149,          // expected rightmost pos for opp mate RHS\n\t\t\tofw[i]);      // expected orientation in which opposite mate must align\n\t}\n\t}\n\n\t// Case where min frag == max frag and opposite is to the left\n\n\t// |----------------------------|\n\t//      min frag\n\t//                     |--------|\n\t//                     ^120     ^129\n\t// |----------------------------|\n\t//           max frag\n\t// ^\n\t// 100\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax1\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t30,           // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\ttrue,         // dovetail OK\n\t\ttrue,         // containment OK\n\t\ttrue,         // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\tfalse,        // mate 1 is anchor\n\t\tfalse,        // anchor aligned to Watson\n\t\t120,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from 
otherMate\n\t\ttrue,         // whether to look for opposite to left\n\t\t100,          // expected leftmost pos for opp mate LHS\n\t\t100,          // expected rightmost pos for opp mate LHS\n\t\t100,          // expected leftmost pos for opp mate RHS\n\t\t149,          // expected rightmost pos for opp mate RHS\n\t\ttrue);        // expected orientation in which opposite mate must align\n\n\t// Case where min frag == max frag and opposite is to the right\n\n\t// |----------------------------|\n\t//      min frag                ^129\n\t// |--------|\n\t// ^100     ^109\n\t// |----------------------------|\n\t//           max frag\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax2\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t30,           // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\ttrue,         // dovetail OK\n\t\ttrue,         // containment OK\n\t\ttrue,         // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\ttrue,         // mate 1 is anchor\n\t\ttrue,         // anchor aligned to Watson\n\t\t100,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\tfalse,        // whether to look for opposite to left\n\t\t80,           // expected leftmost pos for opp mate LHS\n\t\t129,          // expected rightmost pos for opp mate LHS\n\t\t129,          // expected leftmost pos for opp mate RHS\n\t\t129,          // expected rightmost pos for opp mate RHS\n\t\tfalse);       // expected orientation in which opposite mate must align\n\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax4NoDove1\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t25,           // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\tfalse,        // dovetail OK\n\t\ttrue,         // containment OK\n\t\ttrue,         // 
overlap OK\n\t\ttrue,         // expand-to-fit\n\t\ttrue,         // mate 1 is anchor\n\t\ttrue,         // anchor aligned to Watson\n\t\t100,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\tfalse,        // whether to look for opposite to left\n\t\t100,          // expected leftmost pos for opp mate LHS\n\t\t129,          // expected rightmost pos for opp mate LHS\n\t\t124,          // expected leftmost pos for opp mate RHS\n\t\t129,          // expected rightmost pos for opp mate RHS\n\t\tfalse);       // expected orientation in which opposite mate must align\n\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax4NoCont1\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t25,           // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\tfalse,        // dovetail OK\n\t\tfalse,        // containment OK\n\t\ttrue,         // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\ttrue,         // mate 1 is anchor\n\t\ttrue,         // anchor aligned to Watson\n\t\t100,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\tfalse,        // whether to look for opposite to left\n\t\t100,          // expected leftmost pos for opp mate LHS\n\t\t129,          // expected rightmost pos for opp mate LHS\n\t\t124,          // expected leftmost pos for opp mate RHS\n\t\t129,          // expected rightmost pos for opp mate RHS\n\t\tfalse);       // expected orientation in which opposite mate must align\n\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax4NoOlap1\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t25,           // minfrag\n\t\tfalse,        // 
local\n\t\ttrue,         // flipping OK\n\t\tfalse,        // dovetail OK\n\t\tfalse,        // containment OK\n\t\tfalse,        // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\ttrue,         // mate 1 is anchor\n\t\ttrue,         // anchor aligned to Watson\n\t\t100,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\tfalse,        // whether to look for opposite to left\n\t\t110,          // expected leftmost pos for opp mate LHS\n\t\t129,          // expected rightmost pos for opp mate LHS\n\t\t124,          // expected leftmost pos for opp mate RHS\n\t\t129,          // expected rightmost pos for opp mate RHS\n\t\tfalse);       // expected orientation in which opposite mate must align\n\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax4NoDove2\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t25,           // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\tfalse,        // dovetail OK\n\t\ttrue,         // containment OK\n\t\ttrue,         // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\tfalse,        // mate 1 is anchor\n\t\tfalse,        // anchor aligned to Watson\n\t\t120,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\ttrue,         // whether to look for opposite to left\n\t\t100,          // expected leftmost pos for opp mate LHS\n\t\t105,          // expected rightmost pos for opp mate LHS\n\t\t100,          // expected leftmost pos for opp mate RHS\n\t\t129,          // expected rightmost pos for opp mate RHS\n\t\ttrue);        // expected orientation in which opposite mate must 
align\n\n\ttestCaseOtherMate(\n\t\t\"MinFragEqMax4NoOlap2\",\n\t\tPE_POLICY_FR, // policy\n\t\t30,           // maxfrag\n\t\t25,           // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\tfalse,        // dovetail OK\n\t\tfalse,        // containment OK\n\t\tfalse,        // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\tfalse,        // mate 1 is anchor\n\t\tfalse,        // anchor aligned to Watson\n\t\t120,          // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t200,          // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\ttrue,         // whether to look for opposite to left\n\t\t100,          // expected leftmost pos for opp mate LHS\n\t\t105,          // expected rightmost pos for opp mate LHS\n\t\t100,          // expected leftmost pos for opp mate RHS\n\t\t119,          // expected rightmost pos for opp mate RHS\n\t\ttrue);        // expected orientation in which opposite mate must align\n\n\t{\n\tint olls[] = { 110 };\n\tint olrs[] = { 299 };\n\tint orls[] = { 149 };\n\tint orrs[] = { 299 };\n\tfor(int i = 0; i < 1; i++) {\n\t\tostringstream oss;\n\t\toss << \"Overhang1_\";\n\t\toss << (i+1);\n\t\ttestCaseOtherMate(\n\t\t\toss.str(),\n\t\t\tPE_POLICY_FR, // policy\n\t\t\t200,          // maxfrag\n\t\t\t50,           // minfrag\n\t\t\tfalse,        // local\n\t\t\ttrue,         // flipping OK\n\t\t\ttrue,         // dovetail OK\n\t\t\ttrue,         // containment OK\n\t\t\tfalse,        // overlap OK\n\t\t\ttrue,         // expand-to-fit\n\t\t\ttrue,         // mate 1 is anchor\n\t\t\ttrue,         // anchor aligned to Watson\n\t\t\t100,          // anchor's offset into ref\n\t\t\t-1,           // max # alignment cols\n\t\t\t200,          // ref length\n\t\t\t10,           // mate 1 length\n\t\t\t10,           // mate 2 length\n\t\t\ttrue,         // expected return val from 
otherMate\n\t\t\tfalse,        // whether to look for opposite to left\n\t\t\tolls[i],      // expected leftmost pos for opp mate LHS\n\t\t\tolrs[i],      // expected rightmost pos for opp mate LHS\n\t\t\torls[i],      // expected leftmost pos for opp mate RHS\n\t\t\torrs[i],      // expected rightmost pos for opp mate RHS\n\t\t\tfalse);       // expected orientation in which opposite mate must align\n\t}\n\t}\n\n\t{\n\tint olls[] = { -100 };\n\tint olrs[] = {   50 };\n\tint orls[] = { -100 };\n\tint orrs[] = {   89 };\n\tfor(int i = 0; i < 1; i++) {\n\t\tostringstream oss;\n\t\toss << \"Overhang2_\";\n\t\toss << (i+1);\n\t\ttestCaseOtherMate(\n\t\t\toss.str(),\n\t\t\tPE_POLICY_FR, // policy\n\t\t\t200,          // maxfrag\n\t\t\t50,           // minfrag\n\t\t\tfalse,        // local\n\t\t\ttrue,         // flipping OK\n\t\t\ttrue,         // dovetail OK\n\t\t\ttrue,         // containment OK\n\t\t\tfalse,        // overlap OK\n\t\t\ttrue,         // expand-to-fit\n\t\t\ttrue,         // mate 1 is anchor\n\t\t\tfalse,        // anchor aligned to Watson\n\t\t\t90,           // anchor's offset into ref\n\t\t\t-1,           // max # alignment cols\n\t\t\t200,          // ref length\n\t\t\t10,           // mate 1 length\n\t\t\t10,           // mate 2 length\n\t\t\ttrue,         // expected return val from otherMate\n\t\t\ttrue,         // whether to look for opposite to left\n\t\t\tolls[i],      // expected leftmost pos for opp mate LHS\n\t\t\tolrs[i],      // expected rightmost pos for opp mate LHS\n\t\t\torls[i],      // expected leftmost pos for opp mate RHS\n\t\t\torrs[i],      // expected rightmost pos for opp mate RHS\n\t\t\ttrue);        // expected orientation in which opposite mate must align\n\t}\n\t}\n\n\t{\n\tint mate2offs[] = {           150,            149,            149,            100,              99,           299,              1,            250,            250 };\n\tint mate2lens[] = {            50,             50,             51,            100,   
          101,             1,             50,             50,             51 };\n\tint peExpects[] = { PE_ALS_NORMAL, PE_ALS_DISCORD, PE_ALS_OVERLAP, PE_ALS_CONTAIN, PE_ALS_DOVETAIL, PE_ALS_NORMAL, PE_ALS_DISCORD,  PE_ALS_NORMAL, PE_ALS_DISCORD };\n\n\tfor(int i = 0; i < 9; i++) {\n\t\tostringstream oss;\n\t\toss << \"Simple1_\";\n\t\toss << (i);\n\t\ttestCaseClassify(\n\t\t\toss.str(),\n\t\t\tPE_POLICY_FR, // policy\n\t\t\t200,          // maxfrag\n\t\t\t100,          // minfrag\n\t\t\tfalse,        // local\n\t\t\ttrue,         // flipping OK\n\t\t\ttrue,         // dovetail OK\n\t\t\ttrue,         // containment OK\n\t\t\ttrue,         // overlap OK\n\t\t\ttrue,         // expand-to-fit\n\t\t\t100,          // offset of mate 1\n\t\t\t50,           // length of mate 1\n\t\t\ttrue,         // whether mate 1 aligned to Watson\n\t\t\tmate2offs[i], // offset of mate 2\n\t\t\tmate2lens[i], // length of mate 2\n\t\t\tfalse,        // whether mate 2 aligned to Watson\n\t\t\tpeExpects[i]);// expectation for PE_ALS flag returned\n\t}\n\t}\n\n\t{\n\tint mate1offs[] = {           200,            201,            200,            200,             200,           100,            400,            100,             99 };\n\tint mate1lens[] = {            50,             49,             51,            100,             101,             1,             50,             50,             51 };\n\tint peExpects[] = { PE_ALS_NORMAL, PE_ALS_DISCORD, PE_ALS_OVERLAP, PE_ALS_CONTAIN, PE_ALS_DOVETAIL, PE_ALS_NORMAL, PE_ALS_DISCORD,  PE_ALS_NORMAL, PE_ALS_DISCORD };\n\n\tfor(int i = 0; i < 9; i++) {\n\t\tostringstream oss;\n\t\toss << \"Simple2_\";\n\t\toss << (i);\n\t\ttestCaseClassify(\n\t\t\toss.str(),\n\t\t\tPE_POLICY_FR, // policy\n\t\t\t200,          // maxfrag\n\t\t\t100,          // minfrag\n\t\t\tfalse,        // local\n\t\t\ttrue,         // flipping OK\n\t\t\ttrue,         // dovetail OK\n\t\t\ttrue,         // containment OK\n\t\t\ttrue,         // overlap OK\n\t\t\ttrue,         // 
expand-to-fit\n\t\t\tmate1offs[i], // offset of mate 1\n\t\t\tmate1lens[i], // length of mate 1\n\t\t\ttrue,         // whether mate 1 aligned to Watson\n\t\t\t250,          // offset of mate 2\n\t\t\t50,           // length of mate 2\n\t\t\tfalse,        // whether mate 2 aligned to Watson\n\t\t\tpeExpects[i]);// expectation for PE_ALS flag returned\n\t}\n\t}\n\n\ttestCaseOtherMate(\n\t\t\"Regression1\",\n\t\tPE_POLICY_FF, // policy\n\t\t50,           // maxfrag\n\t\t0,            // minfrag\n\t\tfalse,        // local\n\t\ttrue,         // flipping OK\n\t\ttrue,         // dovetail OK\n\t\ttrue,         // containment OK\n\t\ttrue,         // overlap OK\n\t\ttrue,         // expand-to-fit\n\t\ttrue,         // mate 1 is anchor\n\t\tfalse,        // anchor aligned to Watson\n\t\t3,            // anchor's offset into ref\n\t\t-1,           // max # alignment cols\n\t\t53,           // ref length\n\t\t10,           // mate 1 length\n\t\t10,           // mate 2 length\n\t\ttrue,         // expected return val from otherMate\n\t\ttrue,         // whether to look for opposite to left\n\t\t-37,          // expected leftmost pos for opp mate LHS\n\t\t13,           // expected rightmost pos for opp mate LHS\n\t\t-37,          // expected leftmost pos for opp mate RHS\n\t\t52,           // expected rightmost pos for opp mate RHS\n\t\tfalse);       // expected orientation in which opposite mate must align\n}\n\n#endif /*def MAIN_PE*/\n"
  },
  {
    "path": "pe.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/*\n *  pe.h\n *\n * A class encapsulating a paired-end policy and routines for\n * identifying intervals according to the policy.  For instance,\n * contains a routine that, given a policy and details about a match\n * for one mate, returns details about where to search for the other\n * mate.\n */\n\n#ifndef PE_H_\n#define PE_H_\n\n#include <iostream>\n#include <stdint.h>\n\n// In description below \"To the left\" = \"Upstream of w/r/t the Watson strand\"\n\n// The 4 possible policies describing how mates 1 and 2 should be\n// oriented with respect to the reference genome and each other\nenum {\n\t// (fw) Both mates from Watson with 1 to the left, or\n\t// (rc) Both mates from Crick with 2 to the left\n\tPE_POLICY_FF = 1,\n\n\t// (fw) Both mates from Crick with 1 to the left, or\n\t// (rc) Both mates from Watson with 2 to the left\n\tPE_POLICY_RR,\n\t\n\t// (fw) Mate 1 from Watson and mate 2 from Crick with 1 to the left, or\n\t// (rc) Mate 2 from Watson and mate 1 from Crick with 2 to the left\n\tPE_POLICY_FR,\n\t\n\t// (fw) Mate 1 from Crick and mate 2 from Watson with 1 to the left, or\n\t// (rc) Mate 2 from Crick and mate 1 from Watson with 2 to the left\n\tPE_POLICY_RF\n};\n\n// Various distinct ways 
that the mates might align with respect to\n// each other in a concordant alignment.  We distinguish between them\n// because in some cases a user may want to consider some of these\n// categories to be discordant, even if the alignment otherwise\n// conforms to the paired-end policy.\n\nenum {\n\t// Describes a paired-end alignment where the mates\n\t// straightforwardly conform to the paired-end policy without any\n\t// overlap between the mates\n\tPE_ALS_NORMAL = 1,\n\n\t// Describes a paired-end alignment where the mates overlap, but\n\t// neither contains the other and they do not dovetail, and the\n\t// alignment conforms to the paired-end policy\n\tPE_ALS_OVERLAP,\n\t\n\t// Describes a paired-end alignment where the mates conform to the\n\t// paired-end policy, and one mate strictly contains the other but\n\t// they don't dovetail.  We distinguish this from a \"normal\"\n\t// concordant alignment because some users may wish to categorize\n\t// such an alignment as discordant.\n\tPE_ALS_CONTAIN,\n\t\n\t// Describes a paired-end alignment where the mates conform to the\n\t// paired-end policy, but mates \"fall off\" each other.  E.g. if the\n\t// policy is FR and any of these happen:\n\t// 1:     >>>>>   >>>>>\n\t// 2:  <<<<<<    <<<<<<\n\t// And the overall extent is consistent with the minimum fragment\n\t// length, this is a dovetail alignment.  
We distinguish this from\n\t// a \"normal\" concordant alignment because some users may wish to\n\t// categorize such an alignment as discordant.\n\tPE_ALS_DOVETAIL,\n\t\n\t// The mates are clearly discordant, owing to their orientations\n\t// and/or implied fragment length\n\tPE_ALS_DISCORD\n};\n\n/**\n * Return true iff the orientations and relative positions of mates 1\n * and 2 are compatible with the given PE_POLICY.\n */\nstatic inline bool pePolicyCompat(\n\tint policy,   // PE_POLICY\n\tbool oneLeft, // true iff mate 1 is to the left of mate 2\n\tbool oneWat,  // true iff mate 1 aligned to Watson strand\n\tbool twoWat)  // true iff mate 2 aligned to Watson strand\n{\n\tswitch(policy) {\n\t\tcase PE_POLICY_FF:\n\t\t\treturn oneWat == twoWat && oneWat == oneLeft;\n\t\tcase PE_POLICY_RR:\n\t\t\treturn oneWat == twoWat && oneWat != oneLeft;\n\t\tcase PE_POLICY_FR:\n\t\t\treturn oneWat != twoWat && oneWat == oneLeft;\n\t\tcase PE_POLICY_RF:\n\t\t\treturn oneWat != twoWat && oneWat != oneLeft;\n\t\tdefault: {\n\t\t\tstd::cerr << \"Bad PE_POLICY: \" << policy << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t}\n\tthrow 1;\n}\n\n/**\n * Given that the given mate aligns in the given orientation, set 'left'\n * to true iff the other mate must appear \"to the left\" of the given mate,\n * and set 'mfw' to true iff the other mate must align to Watson, in order\n * for the alignment to be concordant.\n */\nstatic inline void pePolicyMateDir(\n\tint   policy,// in: PE_POLICY\n\tbool  is1,   // in: true iff mate 1 is the one that already aligned\n\tbool  fw,    // in: true iff already-aligned mate aligned to Watson\n\tbool& left,  // out: set =true iff other mate must be to the left\n\tbool& mfw)   // out: set =true iff other mate must align to Watson\n{\n\tswitch(policy) {\n\t\tcase PE_POLICY_FF: {\n\t\t\tleft = (is1 != fw);\n\t\t\tmfw = fw;\n\t\t\tbreak;\n\t\t}\n\t\tcase PE_POLICY_RR: {\n\t\t\tleft = (is1 == fw);\n\t\t\tmfw = fw;\n\t\t\tbreak;\n\t\t}\n\t\tcase PE_POLICY_FR: {\n\t\t\tleft = !fw;\n\t\t\tmfw = !fw;\n\t\t\tbreak;\n\t\t}\n\t\tcase 
PE_POLICY_RF: {\n\t\t\tleft = fw;\n\t\t\tmfw = !fw;\n\t\t\tbreak;\n\t\t}\n\t\tdefault: {\n\t\t\tstd::cerr << \"Error: No such PE_POLICY: \" << policy << std::endl;\n\t\t\tthrow 1;\n\t\t}\n\t}\n\treturn;\n}\n\n/**\n * Encapsulates paired-end alignment parameters.\n */\nclass PairedEndPolicy {\n\npublic:\n\n\tPairedEndPolicy() { reset(); }\n\t\n\tPairedEndPolicy(\n\t\tint pol,\n\t\tsize_t maxfrag,\n\t\tsize_t minfrag,\n\t\tbool local,\n\t\tbool flippingOk,\n\t\tbool dovetailOk,\n\t\tbool containOk,\n\t\tbool olapOk,\n\t\tbool expandToFit)\n\t{\n\t\tinit(\n\t\t\tpol,\n\t\t\tmaxfrag,\n\t\t\tminfrag,\n\t\t\tlocal,\n\t\t\tflippingOk,\n\t\t\tdovetailOk,\n\t\t\tcontainOk,\n\t\t\tolapOk,\n\t\t\texpandToFit);\n\t}\n\n\t/** \n\t * Initialize with nonsense values.\n\t */\n\tvoid reset() {\n\t\tinit(-1, 0xffffffff, 0xffffffff, false, false, false, false, false, false);\n\t}\n\n\t/**\n\t * Initialize given policy, maximum & minimum fragment lengths.\n\t */\n\tvoid init(\n\t\tint pol,\n\t\tsize_t maxfrag,\n\t\tsize_t minfrag,\n\t\tbool local,\n\t\tbool flippingOk,\n\t\tbool dovetailOk,\n\t\tbool containOk,\n\t\tbool olapOk,\n\t\tbool expandToFit)\n\t{\n\t\tpol_         = pol;\n\t\tmaxfrag_     = maxfrag;\n\t\tminfrag_     = minfrag;\n\t\tlocal_       = local;\n\t\tflippingOk_  = flippingOk;\n\t\tdovetailOk_  = dovetailOk;\n\t\tcontainOk_   = containOk;\n\t\tolapOk_      = olapOk;\n\t\texpandToFit_ = expandToFit;\n\t}\n\n/**\n * Given details about how one mate aligns, and some details about the\n * reference sequence it aligned to, calculate a window and orientation s.t.\n * a paired-end alignment is concordant iff the opposite mate aligns in the\n * calculated window with the calculated orientation.  The calculation does not\n * consider gaps.  
The dynamic programming framer will take gaps into account.\n *\n * Returns false if no concordant alignments are possible, true otherwise.\n */\nbool otherMate(\n\tbool     is1,       // true -> mate 1 aligned and we're looking\n\t\t\t\t\t\t// for 2, false -> vice versa\n\tbool     fw,        // orientation of aligned mate\n\tint64_t  off,       // offset into the reference sequence\n\tint64_t  maxalcols, // max # columns spanned by alignment\n\tsize_t   reflen,    // length of reference sequence aligned to\n\tsize_t   len1,      // length of mate 1\n\tsize_t   len2,      // length of mate 2\n\tbool&    oleft,     // out: true iff opp mate must be to left of anchor\n\tint64_t& oll,       // out: leftmost Watson off for LHS of opp alignment\n\tint64_t& olr,       // out: rightmost Watson off for LHS of opp alignment\n\tint64_t& orl,       // out: leftmost Watson off for RHS of opp alignment\n\tint64_t& orr,       // out: rightmost Watson off for RHS of opp alignment\n\tbool&    ofw)       // out: true iff opp mate must be on Watson strand\n\tconst;\n\n\t/**\n\t * Return a PE_ALS flag indicating, given a PE_POLICY and coordinates\n\t * for a paired-end alignment, what type of alignment it is, i.e.,\n\t * whether it's:\n\t *\n\t * 1. Straightforwardly concordant\n\t * 2. Mates dovetail (one extends beyond the end of the other)\n\t * 3. One mate contains the other but they don't dovetail\n\t * 4. One mate overlaps the other but neither contains the other and\n\t *    they don't dovetail\n\t * 5. 
Discordant\n\t */\n\tint peClassifyPair(\n\t\tint64_t  off1,   // offset of mate 1\n\t\tsize_t   len1,   // length of mate 1\n\t\tbool     fw1,    // whether mate 1 aligned to Watson\n\t\tint64_t  off2,   // offset of mate 2\n\t\tsize_t   len2,   // length of mate 2\n\t\tbool     fw2)    // whether mate 2 aligned to Watson\n\t\tconst;\n\n\tint    policy()     const { return pol_;     }\n\tsize_t maxFragLen() const { return maxfrag_; }\n\tsize_t minFragLen() const { return minfrag_; }\n\nprotected:\n\n\t// Use local alignment to search for the opposite mate, rather than\n\t// a type of alignment that requires the read to align end-to-end\n\tbool local_;\n\n\t// Policy governing how mates should be oriented with respect to\n\t// each other and the reference genome\n\tint pol_;\n\t\n\t// true iff settings are such that mates that violate the expected relative\n\t// orientation but are still consistent with maximum fragment length are OK\n\tbool flippingOk_;\n\n\t// true iff settings are such that dovetailed mates should be\n\t// considered concordant.\n\tbool dovetailOk_;\n\n\t// true iff paired-end alignments where one mate's alignment is\n\t// strictly contained within the other's should be considered\n\t// concordant\n\tbool containOk_;\n\n\t// true iff paired-end alignments where one mate's alignment\n\t// overlaps the other's should be considered concordant\n\tbool olapOk_;\n\t\n\t// What to do when a mate length is > maxfrag_?  If expandToFit_ is\n\t// true, we temporarily increase maxfrag_ to equal the mate length.\n\t// Otherwise we say that any paired-end alignment involving the\n\t// long mate is discordant.\n\tbool expandToFit_;\n\t\n\t// Maximum fragment size to consider\n\tsize_t maxfrag_;\n\n\t// Minimum fragment size to consider\n\tsize_t minfrag_;\n};\n\n#endif /*ndef PE_H_*/\n"
  },
  {
    "path": "presets.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include \"presets.h\"\n#include \"opts.h\"\n\nusing namespace std;\n\nvoid PresetsV0::apply(\n\tconst std::string& preset,\n\tstd::string& policy,\n\tEList<std::pair<int, std::string> >& opts)\n{\n\t// Presets:                 Same as:\n\t//  For --end-to-end:\n\t//   --very-fast            -M 5 -R 1 -N 0 -L 22 -i S,1,2.50\n\t//   --fast                 -M 10 -R 2 -N 0 -L 22 -i S,1,2.50\n\t//   --sensitive            -M 15 -R 2 -N 0 -L 22 -i S,1,1.15\n\t//   --very-sensitive       -M 25 -R 3 -N 0 -L 19 -i S,1,0.50\n\tif(preset == \"very-fast\") {\n\t\tpolicy += \";SEED=0,22\";\n\t\tpolicy += \";DPS=5\";\n\t\tpolicy += \";ROUNDS=1\";\n\t\tpolicy += \";IVAL=S,0,2.50\";\n\t} else if(preset == \"fast\") {\n\t\tpolicy += \";SEED=0,22\";\n\t\tpolicy += \";DPS=10\";\n\t\tpolicy += \";ROUNDS=2\";\n\t\tpolicy += \";IVAL=S,0,2.50\";\n\t} else if(preset == \"sensitive\") {\n\t\tpolicy += \";SEED=0,22\";\n\t\tpolicy += \";DPS=15\";\n\t\tpolicy += \";ROUNDS=2\";\n\t\tpolicy += \";IVAL=S,1,1.15\";\n\t} else if(preset == \"very-sensitive\") {\n\t\tpolicy += \";SEED=0,20\";\n\t\tpolicy += \";DPS=20\";\n\t\tpolicy += \";ROUNDS=3\";\n\t\tpolicy += \";IVAL=S,1,0.50\";\n\t}\n\t//  For --local:\n\t//   
--very-fast-local      -M 1 -N 0 -L 25 -i S,1,2.00\n\t//   --fast-local           -M 2 -N 0 -L 22 -i S,1,1.75\n\t//   --sensitive-local      -M 2 -N 0 -L 20 -i S,1,0.75 (default)\n\t//   --very-sensitive-local -M 3 -N 0 -L 20 -i S,1,0.50\n\telse if(preset == \"very-fast-local\") {\n\t\tpolicy += \";SEED=0,25\";\n\t\tpolicy += \";DPS=5\";\n\t\tpolicy += \";ROUNDS=1\";\n\t\tpolicy += \";IVAL=S,1,2.00\";\n\t} else if(preset == \"fast-local\") {\n\t\tpolicy += \";SEED=0,22\";\n\t\tpolicy += \";DPS=10\";\n\t\tpolicy += \";ROUNDS=2\";\n\t\tpolicy += \";IVAL=S,1,1.75\";\n\t} else if(preset == \"sensitive-local\") {\n\t\tpolicy += \";SEED=0,20\";\n\t\tpolicy += \";DPS=15\";\n\t\tpolicy += \";ROUNDS=2\";\n\t\tpolicy += \";IVAL=S,1,0.75\";\n\t} else if(preset == \"very-sensitive-local\") {\n\t\tpolicy += \";SEED=0,20\";\n\t\tpolicy += \";DPS=20\";\n\t\tpolicy += \";ROUNDS=3\";\n\t\tpolicy += \";IVAL=S,1,0.50\";\n\t}\n\telse {\n\t\tcerr << \"Unknown preset: \" << preset.c_str() << endl;\n\t}\n}\n"
  },
  {
    "path": "presets.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/**\n * presets.h\n *\n * Maps simple command-line options to more complicated combinations of\n * options for ease-of-use.\n */\n\n#ifndef PRESETS_H_\n#define PRESETS_H_\n\n#include <string>\n#include <utility>\n#include \"ds.h\"\n\nclass Presets {\npublic:\n\t\n\tPresets() { }\n\t\n\tvirtual ~Presets() { }\n\t\n\tvirtual void apply(\n\t\tconst std::string& preset,\n\t\tstd::string& policy,\n\t\tEList<std::pair<int, std::string> >& opts) = 0;\n\t\n\tvirtual const char * name() = 0;\n};\n\n/**\n * Initial collection of presets: 8/14/2011 prior to first Bowtie 2 release.\n */\nclass PresetsV0 : public Presets {\npublic:\n\t\n\tPresetsV0() : Presets() { }\n\t\n\tvirtual ~PresetsV0() { }\n\t\n\tvirtual void apply(\n\t\tconst std::string& preset,\n\t\tstd::string& policy,\n\t\tEList<std::pair<int, std::string> >& opts);\n\n\tvirtual const char * name() { return \"V0\"; }\n};\n\n#endif /*ndef PRESETS_H_*/\n"
  },
  {
    "path": "processor_support.h",
    "content": "#ifndef PROCESSOR_SUPPORT_H_\n#define PROCESSOR_SUPPORT_H_\n\n// Utility class ProcessorSupport provides POPCNTenabled() to determine\n// processor support for POPCNT instruction. It uses CPUID to\n// retrieve the processor capabilities.\n// for Intel ICC compiler __cpuid() is an intrinsic \n// for Microsoft compiler __cpuid() is provided by #include <intrin.h>\n// for GCC compiler __get_cpuid() is provided by #include <cpuid.h>\n\n// Intel compiler defines __GNUC__, so this is needed to disambiguate\n\n#if defined(__INTEL_COMPILER)\n#   define USING_INTEL_COMPILER\n#elif defined(__GNUC__)\n#   define USING_GCC_COMPILER\n#   include <cpuid.h>\n#elif defined(_MSC_VER)\n// __MSC_VER defined by Microsoft compiler\n#define USING MSC_COMPILER\n#endif\n\nstruct regs_t {unsigned int EAX, EBX, ECX, EDX;};\n#define BIT(n) ((1<<n))\n\nclass ProcessorSupport {\n\n#ifdef POPCNT_CAPABILITY \n\npublic: \n    ProcessorSupport() { } \n    bool POPCNTenabled()\n    {\n    // from: Intel® 64 and IA-32 Architectures Software Developer’s Manual, 325462-036US,March 2013\n    //Before an application attempts to use the POPCNT instruction, it must check that the\n    //processor supports SSE4.2\n    //“(if CPUID.01H:ECX.SSE4_2[bit 20] = 1) and POPCNT (if CPUID.01H:ECX.POPCNT[bit 23] = 1)”\n    //\n    // see p.272 of http://download.intel.com/products/processor/manual/253667.pdf available at\n    // http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html\n    // Also http://en.wikipedia.org/wiki/SSE4 talks about available on Intel & AMD processors\n\n    regs_t regs;\n\n    try {\n#if ( defined(USING_INTEL_COMPILER) || defined(USING_MSC_COMPILER) )\n        __cpuid((void *) &regs,0); // test if __cpuid() works, if not catch the exception\n        __cpuid((void *) &regs,0x1); // POPCNT bit is bit 23 in ECX\n#elif defined(USING_GCC_COMPILER)\n        __get_cpuid(0x1, &regs.EAX, &regs.EBX, &regs.ECX, &regs.EDX);\n#else\n        
std::cerr << \"ERROR: please define __cpuid() for this build.\\n\"; \n        assert(0);\n#endif\n        if( !( (regs.ECX & BIT(20)) && (regs.ECX & BIT(23)) ) ) return false;\n    }\n    catch (int e) {\n        return false;\n    }\n    return true;\n    }\n\n#endif // POPCNT_CAPABILITY\n};\n\n#endif /*PROCESSOR_SUPPORT_H_*/\n\n\n\n\n"
  },
  {
    "path": "qual.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n/// An array that transforms Phred qualities into their maq-like\n/// equivalents by dividing by ten and rounding to the nearest 10,\n/// but saturating at 3.\nunsigned char qualRounds[] = {\n\t0, 0, 0, 0, 0,                          //   0 -   4\n\t10, 10, 10, 10, 10, 10, 10, 10, 10, 10, //   5 -  14\n\t20, 20, 20, 20, 20, 20, 20, 20, 20, 20, //  15 -  24\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  25 -  34\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  35 -  44\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  45 -  54\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  55 -  64\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  65 -  74\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  75 -  84\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  85 -  94\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, //  95 - 104\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 105 - 114\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 115 - 124\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 125 - 134\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 135 - 144\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 145 - 154\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 155 - 164\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 165 - 174\n\t30, 30, 30, 30, 30, 30, 30, 
30, 30, 30, // 175 - 184\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 185 - 194\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 195 - 204\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 205 - 214\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 215 - 224\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 225 - 234\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 235 - 244\n\t30, 30, 30, 30, 30, 30, 30, 30, 30, 30, // 245 - 254\n\t30                                      // 255\n};\n\n/**\n * Lookup table for converting from Solexa-scaled (log-odds) quality\n * values to Phred-scaled quality values.\n */\nunsigned char solToPhred[] = {\n\t/* -10 */ 0, 1, 1, 1, 1, 1, 1, 2, 2, 3,\n\t/* 0 */ 3, 4, 4, 5, 5, 6, 7, 8, 9, 10,\n\t/* 10 */ 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,\n\t/* 20 */ 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,\n\t/* 30 */ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,\n\t/* 40 */ 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,\n\t/* 50 */ 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,\n\t/* 60 */ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,\n\t/* 70 */ 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,\n\t/* 80 */ 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,\n\t/* 90 */ 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,\n\t/* 100 */ 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,\n\t/* 110 */ 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,\n\t/* 120 */ 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n\t/* 130 */ 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,\n\t/* 140 */ 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,\n\t/* 150 */ 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,\n\t/* 160 */ 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,\n\t/* 170 */ 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,\n\t/* 180 */ 180, 181, 182, 183, 184, 185, 186, 187, 188, 189,\n\t/* 190 */ 190, 191, 192, 193, 194, 195, 196, 197, 198, 199,\n\t/* 200 */ 200, 201, 202, 203, 204, 205, 206, 207, 208, 209,\n\t/* 210 */ 210, 211, 212, 213, 214, 215, 216, 217, 218, 219,\n\t/* 220 */ 220, 221, 222, 223, 224, 225, 226, 227, 228, 
229,\n\t/* 230 */ 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,\n\t/* 240 */ 240, 241, 242, 243, 244, 245, 246, 247, 248, 249,\n\t/* 250 */ 250, 251, 252, 253, 254, 255\n};\n"
  },
  {
    "path": "qual.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef QUAL_H_\n#define QUAL_H_\n\n#include <stdexcept>\n#include \"search_globals.h\"\n#include \"sstring.h\"\n\nextern unsigned char qualRounds[];\nextern unsigned char solToPhred[];\n\n/// Translate a Phred-encoded ASCII character into a Phred quality\nstatic inline uint8_t phredcToPhredq(char c) {\n\treturn ((uint8_t)c >= 33 ? 
((uint8_t)c - 33) : 0);\n}\n\n/**\n * Convert a Solexa-scaled quality value into a Phred-scale quality\n * value.\n *\n * p = probability that base is miscalled\n * Qphred = -10 * log10 (p)\n * Qsolexa = -10 * log10 (p / (1 - p))\n * See: http://en.wikipedia.org/wiki/FASTQ_format\n *\n */\nstatic inline uint8_t solexaToPhred(int sol) {\n\tassert_lt(sol, 256);\n\tif(sol < -10) return 0;\n\treturn solToPhred[sol+10];\n}\n\nclass SimplePhredPenalty {\npublic:\n\tstatic uint8_t mmPenalty (uint8_t qual) {\n\t\treturn qual;\n\t}\n\tstatic uint8_t delPenalty(uint8_t qual) {\n\t\treturn qual;\n\t}\n\tstatic uint8_t insPenalty(uint8_t qual_left, uint8_t qual_right) {\n\t\treturn std::max(qual_left, qual_right);\n\t}\n};\n\nclass MaqPhredPenalty {\npublic:\n\tstatic uint8_t mmPenalty (uint8_t qual) {\n\t\treturn qualRounds[qual];\n\t}\n\tstatic uint8_t delPenalty(uint8_t qual) {\n\t\treturn qualRounds[qual];\n\t}\n\tstatic uint8_t insPenalty(uint8_t qual_left, uint8_t qual_right) {\n\t\treturn qualRounds[std::max(qual_left, qual_right)];\n\t}\n};\n\nstatic inline uint8_t mmPenalty(bool maq, uint8_t qual) {\n\tif(maq) {\n\t\treturn MaqPhredPenalty::mmPenalty(qual);\n\t} else {\n\t\treturn SimplePhredPenalty::mmPenalty(qual);\n\t}\n}\n\nstatic inline uint8_t delPenalty(bool maq, uint8_t qual) {\n\tif(maq) {\n\t\treturn MaqPhredPenalty::delPenalty(qual);\n\t} else {\n\t\treturn SimplePhredPenalty::delPenalty(qual);\n\t}\n}\n\nstatic inline uint8_t insPenalty(bool maq, uint8_t qual_left, uint8_t qual_right) {\n\tif(maq) {\n\t\treturn MaqPhredPenalty::insPenalty(qual_left, qual_right);\n\t} else {\n\t\treturn SimplePhredPenalty::insPenalty(qual_left, qual_right);\n\t}\n}\n\n/**\n * Take an ASCII-encoded quality value and convert it to a Phred33\n * ASCII char.\n */\ninline static char charToPhred33(char c, bool solQuals, bool phred64Quals) {\n\tusing namespace std;\n\tif(c == ' ') {\n\t\tstd::cerr << \"Saw a space but expected an ASCII-encoded quality value.\" << endl\n\t\t       
   << \"Are quality values formatted as integers?  If so, try --integer-quals.\" << endl;\n\t\tthrow 1;\n\t}\n\tif (solQuals) {\n\t\t// Convert solexa-scaled chars to phred\n\t\t// http://maq.sourceforge.net/fastq.shtml\n\t\tchar cc = solexaToPhred((int)c - 64) + 33;\n\t\tif (cc < 33) {\n\t\t\tstd::cerr << \"Saw ASCII character \"\n\t\t\t          << ((int)c)\n\t\t\t          << \" but expected 64-based Solexa qual (converts to \" << (int)cc << \").\" << endl\n\t\t\t          << \"Try not specifying --solexa-quals.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tc = cc;\n\t}\n\telse if(phred64Quals) {\n\t\tif (c < 64) {\n\t\t\tcerr << \"Saw ASCII character \"\n\t\t\t     << ((int)c)\n\t\t\t     << \" but expected 64-based Phred qual.\" << endl\n\t\t\t     << \"Try not specifying --solexa1.3-quals/--phred64-quals.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\t// Convert to 33-based phred\n\t\tc -= (64-33);\n\t}\n\telse {\n\t\t// Keep the phred quality\n\t\tif (c < 33) {\n\t\t\tcerr << \"Saw ASCII character \"\n\t\t\t     << ((int)c)\n\t\t\t     << \" but expected 33-based Phred qual.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t}\n\treturn c;\n}\n\n/**\n * Take an integer quality value and convert it to a Phred33 ASCII\n * char.\n */\ninline static char intToPhred33(int iQ, bool solQuals) {\n\tusing namespace std;\n\tint pQ;\n\tif (solQuals) {\n\t\t// Convert from solexa quality to phred\n\t\t// quality and translate to ASCII\n\t\t// http://maq.sourceforge.net/qual.shtml\n\t\tpQ = solexaToPhred((int)iQ) + 33;\n\t} else {\n\t\t// Keep the phred quality and translate\n\t\t// to ASCII\n\t\tpQ = (iQ <= 93 ? 
iQ : 93) + 33;\n\t}\n\tif (pQ < 33) {\n\t\tcerr << \"Saw negative Phred quality \" << ((int)pQ-33) << \".\" << endl;\n\t\tthrow 1;\n\t}\n\tassert_geq(pQ, 0);\n\treturn (char)pQ;\n}\n\ninline static uint8_t roundPenalty(uint8_t p) {\n\tif(gNoMaqRound) return p;\n\treturn qualRounds[p];\n}\n\n/**\n * Fill the q[] array with the penalties that are determined by\n * subtracting the quality values of the alternate basecalls from\n * the quality of the primary basecall.\n */\ninline static uint8_t penaltiesAt(size_t off, uint8_t *q,\n                                  int alts,\n                                  const BTString&    qual,\n                                  const BTDnaString *altQry,\n                                  const BTString    *altQual)\n{\n\tuint8_t primQ = qual[off]; // qual of primary call\n\tuint8_t bestPenalty = roundPenalty(phredcToPhredq(primQ));\n\t// By default, any mismatch incurs a penalty equal to the quality\n\t// of the called base\n\tq[0] = q[1] = q[2] = q[3] = bestPenalty;\n\tfor(int i = 0; i < alts; i++) {\n\t\tuint8_t altQ = altQual[i][off]; // qual of alt call\n\t\tif(altQ == 33) break; // no alt call\n\t\tassert_leq(altQ, primQ);\n\t\tuint8_t pen = roundPenalty(primQ - altQ);\n\t\tif(pen < bestPenalty) {\n\t\t\tbestPenalty = pen;\n\t\t}\n\t\t// Get the base\n\t\tint altC = (int)altQry[i][off];\n\t\tassert_lt(altC, 4);\n\t\tq[altC] = pen;\n\t}\n\t// Return the best penalty so that the caller can evaluate whether\n\t// any of the penalties are within-budget\n\treturn bestPenalty;\n}\n\n/**\n * Return the lowest penalty at the given offset, determined by\n * subtracting the quality values of the alternate basecalls from\n * the quality of the primary basecall.  Unlike penaltiesAt(), no\n * per-base q[] array is filled.\n */\ninline static uint8_t loPenaltyAt(size_t off, int alts,\n                                  const BTString&    qual,\n                                  const BTString    *altQual)\n{\n\tuint8_t primQ = qual[off]; // qual of primary call\n\tuint8_t bestPenalty = 
roundPenalty(phredcToPhredq(primQ));\n\tfor(int i = 0; i < alts; i++) {\n\t\tuint8_t altQ = altQual[i][off]; // qual of alt call\n\t\tif(altQ == 33) break; // no more alt calls at this position\n\t\tassert_leq(altQ, primQ);\n\t\tuint8_t pen = roundPenalty(primQ - altQ);\n\t\tif(pen < bestPenalty) {\n\t\t\tbestPenalty = pen;\n\t\t}\n\t}\n\treturn bestPenalty;\n}\n\n#endif /*QUAL_H_*/\n"
  },
  {
    "path": "random_source.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"random_source.h\"\n#include \"random_util.h\"\n\n#ifdef MERSENNE_TWISTER\n\nvoid RandomSource::gen_state() {\n\tfor(int i = 0; i < (n - m); ++i) {\n\t\tstate_[i] = state_[i + m] ^ twiddle(state_[i], state_[i + 1]);\n\t}\n\tfor(int i = n - m; i < (n - 1); ++i) {\n\t\tstate_[i] = state_[i + m - n] ^ twiddle(state_[i], state_[i + 1]);\n\t}\n\tstate_[n - 1] = state_[m - 1] ^ twiddle(state_[n - 1], state_[0]);\n\tp_ = 0; // reset position\n}\n\nvoid RandomSource::init(uint32_t s) {  // init by 32 bit seed\n\treset();\n\tstate_[0] = s;\n\tfor(int i = 1; i < n; ++i) {\n\t\tstate_[i] = 1812433253UL * (state_[i - 1] ^ (state_[i - 1] >> 30)) + i;\n\t}\n\tp_ = n; // force gen_state() to be called for next random number\n\tinited_ = true;\n}\n\nvoid RandomSource::init(const uint32_t* array, int size) { // init by array\n\tinit(19650218UL);\n\tint i = 1, j = 0;\n\tfor(int k = ((n > size) ? 
n : size); k; --k) {\n\t\tstate_[i] = (state_[i] ^ ((state_[i - 1] ^ (state_[i - 1] >> 30)) * 1664525UL)) + array[j] + j; // non linear\n\t\t++j; j %= size;\n\t\tif((++i) == n) { state_[0] = state_[n - 1]; i = 1; }\n\t}\n\tfor(int k = n - 1; k; --k) {\n\t\tstate_[i] = (state_[i] ^ ((state_[i - 1] ^ (state_[i - 1] >> 30)) * 1566083941UL)) - i;\n\t\tif((++i) == n) { state_[0] = state_[n - 1]; i = 1; }\n\t}\n\tstate_[0] = 0x80000000UL; // MSB is 1; assuring non-zero initial array\n\tp_ = n; // force gen_state() to be called for next random number\n\tinited_ = true;\n}\n\n#endif\n\n#ifdef MAIN_RANDOM_SOURCE\n\nusing namespace std;\n\nint main(void) {\n\tcerr << \"Test 1\" << endl;\n\t{\n\t\tRandomSource rnd;\n\t\tint cnts[32];\n\t\tfor(size_t i = 0; i < 32; i++) {\n\t\t\tcnts[i] = 0;\n\t\t}\n\t\tfor(uint32_t j = 0; j < 10; j++) {\n\t\t\trnd.init(j);\n\t\t\tfor(size_t i = 0; i < 10000; i++) {\n\t\t\t\tuint32_t rndi = rnd.nextU32();\n\t\t\t\tfor(size_t i = 0; i < 32; i++) {\n\t\t\t\t\tif((rndi & 1) != 0) {\n\t\t\t\t\t\tcnts[i]++;\n\t\t\t\t\t}\n\t\t\t\t\trndi >>= 1;\n\t\t\t\t}\n\t\t\t}\n\t\t\tfor(size_t i = 0; i < 32; i++) {\n\t\t\t\tcerr << i << \": \" << cnts[i] << endl;\n\t\t\t}\n\t\t}\n\t}\n\n\tcerr << \"Test 2\" << endl;\n\t{\n\t\tint cnts[4][4];\n\t\tfor(size_t i = 0; i < 4; i++) {\n\t\t\tfor(size_t j = 0; j < 4; j++) {\n\t\t\t\tcnts[i][j] = 0;\n\t\t\t}\n\t\t}\n\t\tRandomSource rnd;\n\t\tRandom1toN rn1n;\n\t\tfor(size_t i = 0; i < 100; i++) {\n\t\t\trnd.init((uint32_t)i);\n\t\t\trn1n.init(4, true);\n\t\t\tuint32_t ri = rn1n.next(rnd);\n\t\t\tcnts[ri][0]++;\n\t\t\tri = rn1n.next(rnd);\n\t\t\tcnts[ri][1]++;\n\t\t\tri = rn1n.next(rnd);\n\t\t\tcnts[ri][2]++;\n\t\t\tri = rn1n.next(rnd);\n\t\t\tcnts[ri][3]++;\n\t\t}\n\t\tfor(size_t i = 0; i < 4; i++) {\n\t\t\tfor(size_t j = 0; j < 4; j++) {\n\t\t\t\tcerr << cnts[i][j];\n\t\t\t\tif(j < 3) {\n\t\t\t\t\tcerr << \", \";\n\t\t\t\t}\n\t\t\t}\n\t\t\tcerr << endl;\n\t\t}\n\t}\n}\n\n#endif\n"
  },
  {
    "path": "random_source.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef RANDOM_GEN_H_\n#define RANDOM_GEN_H_\n\n#include <stdint.h>\n#include \"assert_helpers.h\"\n\n//#define MERSENNE_TWISTER\n\n#ifndef MERSENNE_TWISTER\n\n/**\n * Simple pseudo-random linear congruential generator, a la Numerical\n * Recipes.\n */\nclass RandomSource {\npublic:\n\tstatic const uint32_t DEFUALT_A = 1664525;\n\tstatic const uint32_t DEFUALT_C = 1013904223;\n\n\tRandomSource() :\n\t\ta(DEFUALT_A), c(DEFUALT_C), inited_(false) { }\n\tRandomSource(uint32_t _last) :\n\t\ta(DEFUALT_A), c(DEFUALT_C), last(_last), inited_(true) { }\n\tRandomSource(uint32_t _a, uint32_t _c) :\n\t\ta(_a), c(_c), inited_(false) { }\n\n\tvoid init(uint32_t seed = 0) {\n\t\tlast = seed;\n\t\tinited_ = true;\n\t\tlastOff = 30;\n\t}\n\n\tuint32_t nextU32() {\n\t\tassert(inited_);\n\t\tuint32_t ret;\n\t\tlast = a * last + c;\n\t\tret = last >> 16;\n\t\tlast = a * last + c;\n\t\tret ^= last;\n\t\tlastOff = 0;\n\t\treturn ret;\n\t}\n    \n    uint64_t nextU64() {\n\t\tassert(inited_);\n\t\tuint64_t first = nextU32();\n\t\tfirst = first << 32;\n\t\tuint64_t second = nextU32();\n\t\treturn first | second;\n\t}\n\n\t/**\n\t * Return a pseudo-random unsigned 32-bit integer sampled uniformly\n\t * from [lo, hi].\n\t 
*/\n\tuint32_t nextU32Range(uint32_t lo, uint32_t hi) {\n\t\tuint32_t ret = lo;\n\t\tif(hi > lo) {\n\t\t\tret += (nextU32() % (hi-lo+1));\n\t\t}\n\t\treturn ret;\n\t}\n\n\t/**\n\t * Get next 2-bit unsigned integer.\n\t */\n\tuint32_t nextU2() {\n\t\tassert(inited_);\n\t\tif(lastOff > 30) {\n\t\t\tnextU32();\n\t\t}\n\t\tuint32_t ret = (last >> lastOff) & 3;\n\t\tlastOff += 2;\n\t\treturn ret;\n\t}\n\n\t/**\n\t * Get next boolean.\n\t */\n\tbool nextBool() {\n\t\tassert(inited_);\n\t\tif(lastOff > 31) {\n\t\t\tnextU32();\n\t\t}\n\t\tuint32_t ret = (last >> lastOff) & 1;\n\t\tlastOff++;\n\t\treturn ret;\n\t}\n\t\n\t/**\n\t * Return an unsigned int chosen by picking randomly from among\n\t * options weighted by probabilities supplied as the elements of the\n\t * 'weights' array of length 'numWeights'.  The weights should add\n\t * to 1.\n\t */\n\tuint32_t nextFromProbs(\n\t\tconst float* weights,\n\t\tsize_t numWeights)\n\t{\n\t\tfloat f = nextFloat();\n\t\tfloat tot = 0.0f; // total weight seen so far\n\t\tfor(uint32_t i = 0; i < numWeights; i++) {\n\t\t\ttot += weights[i];\n\t\t\tif(f < tot) return i;\n\t\t}\n\t\treturn (uint32_t)(numWeights-1);\n\t}\n\n\tfloat nextFloat() {\n\t\tassert(inited_);\n\t\treturn (float)nextU32() / (float)0xffffffff;\n\t}\n\n\tstatic uint32_t nextU32(uint32_t last,\n\t                        uint32_t a = DEFUALT_A,\n\t                        uint32_t c = DEFUALT_C)\n\t{\n\t\treturn (a * last) + c;\n\t}\n\t\n\tuint32_t currentA() const { return a; }\n\tuint32_t currentC() const { return c; }\n\tuint32_t currentLast() const { return last; }\n\nprivate:\n\tuint32_t a;\n\tuint32_t c;\n\tuint32_t last;\n\tuint32_t lastOff;\n\tbool inited_;\n};\n\n#else\n\nclass RandomSource { // Mersenne Twister random number generator\n\npublic:\n\n\t// default constructor: leaves the generator unseeded; call init() before use\n\tRandomSource() {\n\t\treset();\n\t}\n\t\n\t// constructor with 32 bit int as seed\n\tRandomSource(uint32_t s) 
{\n\t\tinit(s);\n\t}\n\t\n\t// constructor with an array of 'size' 32-bit ints as seed\n\tRandomSource(const uint32_t* array, int size) {\n\t\tinit(array, size);\n\t}\n\t\n\tvoid reset() {\n\t\tstate_[0] = 0;\n\t\tp_ = 0;\n\t\tinited_ = false;\n\t}\n\t\n\tvirtual ~RandomSource() { }\n\t\n\t// the two seed functions\n\tvoid init(uint32_t); // seed with 32 bit integer\n\tvoid init(const uint32_t*, int size); // seed with array\n\n\t/**\n\t * Return next pseudo-random boolean.\n\t */\n\tbool nextBool() {\n\t\treturn (nextU32() & 1) == 0;\n\t}\n\t\n\t/**\n\t * Get next unsigned 32-bit integer.\n\t */\n\tinline uint32_t nextU32() {\n\t\tassert(inited_);\n\t\tif(p_ == n) {\n\t\t\tgen_state(); // new state vector needed\n\t\t}\n\t\t// gen_state() is split off to be non-inline, because it is only called once\n\t\t// in every 624 calls and otherwise nextU32() would become too big to get inlined\n\t\tuint32_t x = state_[p_++];\n\t\tx ^= (x >> 11);\n\t\tx ^= (x << 7) & 0x9D2C5680UL;\n\t\tx ^= (x << 15) & 0xEFC60000UL;\n\t\tx ^= (x >> 18);\n\t\treturn x;\n\t}\n\t\n\t/**\n\t * Return next float between 0 and 1.\n\t */\n\tfloat nextFloat() {\n\t\tassert(inited_);\n\t\treturn (float)nextU32() / (float)0xffffffff;\n\t}\n\t\nprotected: // used by derived classes, otherwise not accessible\n\n\tstatic const int n = 624, m = 397; // compile time constants\n\n\t// per-instance generator state\n\tuint32_t state_[n]; // state vector array\n\tint p_; // position in state array\n\t\n\tbool inited_; // true if init function has been called\n\t\n\t// private functions used to generate the pseudo random numbers\n\tuint32_t twiddle(uint32_t u, uint32_t v) {\n\t\treturn (((u & 0x80000000UL) | (v & 0x7FFFFFFFUL)) >> 1) ^ ((v & 1UL) ? 0x9908B0DFUL : 0x0UL);\n\t}\n\t\n\tvoid gen_state(); // generate new state\n\t\n};\n\n#endif\n\n#endif /*RANDOM_GEN_H_*/\n"
  },
  {
    "path": "random_util.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"random_util.h\"\n\nconst size_t Random1toN::SWAPLIST_THRESH = 128;\nconst size_t Random1toN::CONVERSION_THRESH = 16;\nconst float Random1toN::CONVERSION_FRAC = 0.10f;\n"
  },
  {
    "path": "random_util.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef RANDOM_UTIL_H_\n#define RANDOM_UTIL_H_\n\n#include <algorithm>\n#include \"random_source.h\"\n#include \"ds.h\"\n\n/**\n * Return a random integer in [1, N].  Each time it's called it samples again\n * without replacement.  done() indicates when all elements have been given\n * out.\n */\nclass Random1toN {\n\n\ttypedef uint32_t T;\n\npublic:\n\n\t// A set with fewer than this many elements should kick us into swap-list\n\t// mode immediately.  
Otherwise we start in seen-list mode and then\n\t// possibly proceed to swap-list mode later.\n\tstatic const size_t SWAPLIST_THRESH;\n\t\n\t// Convert seen-list to swap-list after this many entries in the seen-list.\n\tstatic const size_t CONVERSION_THRESH;\n\n\t// Convert seen-list to swap-list after this (this times n_) many entries\n\t// in the seen-list.\n\tstatic const float CONVERSION_FRAC;\n\n\tRandom1toN(int cat = 0) :\n\t\tsz_(0), n_(0), cur_(0),\n\t\tlist_(SWAPLIST_THRESH, cat), seen_(CONVERSION_THRESH, cat),\n\t\tthresh_(0) {}\n\t\n\tRandom1toN(size_t n, int cat = 0) :\n\t\tsz_(0), n_(n), cur_(0),\n\t\tlist_(SWAPLIST_THRESH, cat), seen_(CONVERSION_THRESH, cat),\n\t\tthresh_(0) {}\n\n\t/**\n\t * Initialize the set of pseudo-randoms to be given out without replacement.\n\t */\n\tvoid init(size_t n, bool withoutReplacement) {\n\t\tsz_ = n_ = n;\n\t\tconverted_ = false;\n\t\tswaplist_ = n < SWAPLIST_THRESH || withoutReplacement;\n\t\tcur_ = 0;\n\t\tlist_.clear();\n\t\tseen_.clear();\n\t\tthresh_ = std::max(CONVERSION_THRESH, (size_t)(CONVERSION_FRAC * n));\n\t}\n\t\n\t/**\n\t * Reset in preparation for giving out a fresh collection of pseudo-randoms\n\t * without replacement.\n\t */\n\tvoid reset() {\n\t\tsz_ = n_ = cur_ = 0; swaplist_ = converted_ = false;\n\t\tlist_.clear(); seen_.clear();\n\t\tthresh_ = 0;\n\t}\n\n\t/**\n\t * Get next pseudo-random element without replacement.\n\t */\n\tT next(RandomSource& rnd) {\n\t\tassert(!done());\n\t\tif(cur_ == 0 && !converted_) {\n\t\t\t// This is the first call to next()\n\t\t\tif(n_ == 1) {\n\t\t\t\t// Trivial case: set of 1\n\t\t\t\tcur_ = 1;\n\t\t\t\treturn 0;\n\t\t\t}\n\t\t\tif(swaplist_) {\n\t\t\t\t// The set is small, so we go immediately to the random\n\t\t\t\t// swapping list\n\t\t\t\tlist_.resize(n_);\n\t\t\t\tfor(size_t i = 0; i < n_; i++) {\n\t\t\t\t\tlist_[i] = (T)i;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(swaplist_) {\n\t\t\t// Get next pseudo-random using the swap-list\n\t\t\tsize_t r = cur_ + 
(rnd.nextU32() % (n_ - cur_));\n\t\t\tif(r != cur_) {\n\t\t\t\tstd::swap(list_[cur_], list_[r]);\n\t\t\t}\n\t\t\treturn list_[cur_++];\n\t\t} else {\n\t\t\tassert(!converted_);\n\t\t\t// Get next pseudo-random but reject it if it's in the seen-list\n\t\t\tbool again = true;\n\t\t\tT rn = 0;\n\t\t\tsize_t seenSz = seen_.size();\n\t\t\twhile(again) {\n\t\t\t\trn = rnd.nextU32() % (T)n_;\n\t\t\t\tagain = false;\n\t\t\t\tfor(size_t i = 0; i < seenSz; i++) {\n\t\t\t\t\tif(seen_[i] == rn) {\n\t\t\t\t\t\tagain = true;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Add it to the seen-list\n\t\t\tseen_.push_back(rn);\n\t\t\tcur_++;\n\t\t\tassert_leq(cur_, n_);\n\t\t\t// Move on to using the swap-list?\n\t\t\tassert_gt(thresh_, 0);\n\t\t\tif(seen_.size() >= thresh_ && cur_ < n_) {\n\t\t\t\t// Add all elements not already in the seen list to the\n\t\t\t\t// swap-list\n\t\t\t\tassert(!seen_.empty());\n\t\t\t\tseen_.sort();\n\t\t\t\tlist_.resize(n_ - cur_);\n\t\t\t\tsize_t prev = 0;\n\t\t\t\tsize_t cur = 0;\n\t\t\t\tfor(size_t i = 0; i <= seenSz; i++) {\n\t\t\t\t\t// Add all the elements between the previous element and\n\t\t\t\t\t// this one\n\t\t\t\t\tfor(size_t j = prev; j < seen_[i]; j++) {\n\t\t\t\t\t\tlist_[cur++] = (T)j;\n\t\t\t\t\t}\n\t\t\t\t\tprev = seen_[i]+1;\n\t\t\t\t}\n\t\t\t\tfor(size_t j = prev; j < n_; j++) {\n\t\t\t\t\tlist_[cur++] = (T)j;\n\t\t\t\t}\n\t\t\t\tassert_eq(cur, n_ - cur_);\n\t\t\t\tseen_.clear();\n\t\t\t\tcur_ = 0;\n\t\t\t\tn_ = list_.size();\n\t\t\t\tconverted_ = true;\n\t\t\t\tswaplist_ = true;\n\t\t\t}\n\t\t\treturn rn;\n\t\t}\n\t}\n\t\n\t/**\n\t * Return true iff the generator was initialized.\n\t */\n\tbool inited() const { return n_ > 0; }\n\t\n\t/**\n\t * Set so that there are no pseudo-randoms remaining.\n\t */\n\tvoid setDone() { assert(inited()); cur_ = n_; }\n\t\n\t/**\n\t * Return true iff all pseudo-randoms have already been given out.\n\t */\n\tbool done() const { return inited() && cur_ >= n_; }\n\n\t/**\n\t * Return 
the total number of pseudo-randoms we are initialized to give\n\t * out, including ones already given out.\n\t */\n\tsize_t size() const { return n_; }\n\t\n\t/**\n\t * Return the number of pseudo-randoms left to give out.\n\t */\n\tsize_t left() const { return n_ - cur_; }\n\n\t/**\n\t * Return the total size occupied by this object's swap-list and\n\t * seen-list.\n\t */\n\tsize_t totalSizeBytes() const {\n\t\treturn list_.totalSizeBytes() +\n\t\t       seen_.totalSizeBytes();\n\t}\n\n\t/**\n\t * Return the total capacity of this object's swap-list and\n\t * seen-list.\n\t */\n\tsize_t totalCapacityBytes() const {\n\t\treturn list_.totalCapacityBytes() +\n\t\t       seen_.totalCapacityBytes();\n\t}\n\nprotected:\n\n\tsize_t   sz_;        // domain to pick elts from\n\tsize_t   n_;         // number of elements in active list\n\tbool     swaplist_;  // if small, use swapping\n\tbool     converted_; // true iff seen-list was converted to swap-list\n\tsize_t   cur_;       // # times next() was called\n\tEList<T> list_;      // pseudo-random swapping list\n\tEList<T> seen_;      // prior to swaplist_ mode, list of\n\t                     // pseudo-randoms given out\n\tsize_t   thresh_;    // conversion threshold for this instantiation, which\n\t                     // depends both on CONVERSION_THRESH and on n_\n};\n\n#endif\n"
  },
  {
    "path": "read.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef READ_H_\n#define READ_H_\n\n#include <stdint.h>\n#include <sys/time.h>\n#include \"ds.h\"\n#include \"sstring.h\"\n#include \"filebuf.h\"\n#include \"util.h\"\n\nenum rna_strandness_format {\n    RNA_STRANDNESS_UNKNOWN = 0,\n    RNA_STRANDNESS_F,\n    RNA_STRANDNESS_R,\n    RNA_STRANDNESS_FR,\n    RNA_STRANDNESS_RF\n};\n\ntypedef uint64_t TReadId;\ntypedef size_t TReadOff;\ntypedef int64_t TAlScore;\n\nclass HitSet;\n\n/**\n * A buffer for keeping all relevant information about a single read.\n */\nstruct Read {\n\n\tRead() { reset(); }\n\t\n\tRead(const char *nm, const char *seq, const char *ql) { init(nm, seq, ql); }\n\n\tvoid reset() {\n\t\trdid = 0;\n\t\tendid = 0;\n\t\talts = 0;\n\t\ttrimmed5 = trimmed3 = 0;\n\t\treadOrigBuf.clear();\n\t\tpatFw.clear();\n\t\tpatRc.clear();\n\t\tqual.clear();\n\t\tpatFwRev.clear();\n\t\tpatRcRev.clear();\n\t\tqualRev.clear();\n\t\tname.clear();\n\t\tfor(int j = 0; j < 3; j++) {\n\t\t\taltPatFw[j].clear();\n\t\t\taltPatFwRev[j].clear();\n\t\t\taltPatRc[j].clear();\n\t\t\taltPatRcRev[j].clear();\n\t\t\taltQual[j].clear();\n\t\t\taltQualRev[j].clear();\n\t\t}\n\t\tcolor = fuzzy = false;\n\t\tprimer = '?';\n\t\ttrimc = '?';\n\t\tfilter = '?';\n\t\tseed = 0;\n\t\tns_ = 
0;\n\t}\n\t\n\t/**\n\t * Finish initializing a new read.\n\t */\n\tvoid finalize() {\n\t\tfor(size_t i = 0; i < patFw.length(); i++) {\n\t\t\tif((int)patFw[i] > 3) {\n\t\t\t\tns_++;\n\t\t\t}\n\t\t}\n\t\tconstructRevComps();\n\t\tconstructReverses();\n\t}\n\n\t/**\n\t * Simple init function, used for testing.\n\t */\n\tvoid init(\n\t\tconst char *nm,\n\t\tconst char *seq,\n\t\tconst char *ql)\n\t{\n\t\treset();\n\t\tpatFw.installChars(seq);\n\t\tqual.install(ql);\n\t\tfor(size_t i = 0; i < patFw.length(); i++) {\n\t\t\tif((int)patFw[i] > 3) {\n\t\t\t\tns_++;\n\t\t\t}\n\t\t}\n\t\tconstructRevComps();\n\t\tconstructReverses();\n\t\tif(nm != NULL) name.install(nm);\n\t}\n\n\t/// Return true iff the read (pair) is empty\n\tbool empty() const {\n\t\treturn patFw.empty();\n\t}\n\n\t/// Return length of the read in the buffer\n\tsize_t length() const {\n\t\treturn patFw.length();\n\t}\n\t\n\t/**\n\t * Return the number of Ns in the read.\n\t */\n\tsize_t ns() const {\n\t\treturn ns_;\n\t}\n\n\t/**\n\t * Construct reverse complement of the pattern and the fuzzy\n\t * alternative patterns.  If read is in colorspace, just reverse\n\t * them.\n\t */\n\tvoid constructRevComps() {\n\t\tif(color) {\n\t\t\tpatRc.installReverse(patFw);\n\t\t\tfor(int j = 0; j < alts; j++) {\n\t\t\t\taltPatRc[j].installReverse(altPatFw[j]);\n\t\t\t}\n\t\t} else {\n\t\t\tpatRc.installReverseComp(patFw);\n\t\t\tfor(int j = 0; j < alts; j++) {\n\t\t\t\taltPatRc[j].installReverseComp(altPatFw[j]);\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Given patFw, patRc, and qual, construct the *Rev versions in\n\t * place.  
Assumes constructRevComps() was called previously.\n\t */\n\tvoid constructReverses() {\n\t\tpatFwRev.installReverse(patFw);\n\t\tpatRcRev.installReverse(patRc);\n\t\tqualRev.installReverse(qual);\n\t\tfor(int j = 0; j < alts; j++) {\n\t\t\taltPatFwRev[j].installReverse(altPatFw[j]);\n\t\t\taltPatRcRev[j].installReverse(altPatRc[j]);\n\t\t\taltQualRev[j].installReverse(altQual[j]);\n\t\t}\n\t}\n\n\t/**\n\t * Append a \"/1\" or \"/2\" string onto the end of the name buf if\n\t * it's not already there.\n\t */\n\tvoid fixMateName(int i) {\n\t\tassert(i == 1 || i == 2);\n\t\tsize_t namelen = name.length();\n\t\tbool append = false;\n\t\tif(namelen < 2) {\n\t\t\t// Name is too short to possibly have /1 or /2 on the end\n\t\t\tappend = true;\n\t\t} else {\n\t\t\tif(i == 1) {\n\t\t\t\t// append = true iff mate name does not already end in /1\n\t\t\t\tappend =\n\t\t\t\t\tname[namelen-2] != '/' ||\n\t\t\t\t\tname[namelen-1] != '1';\n\t\t\t} else {\n\t\t\t\t// append = true iff mate name does not already end in /2\n\t\t\t\tappend =\n\t\t\t\t\tname[namelen-2] != '/' ||\n\t\t\t\t\tname[namelen-1] != '2';\n\t\t\t}\n\t\t}\n\t\tif(append) {\n\t\t\tname.append('/');\n\t\t\tname.append(\"012\"[i]);\n\t\t}\n\t}\n\n\t/**\n\t * Dump basic information about this read to the given ostream.\n\t */\n\tvoid dump(std::ostream& os) const {\n\t\tusing namespace std;\n\t\tos << name << ' ';\n\t\tif(color) {\n\t\t\tos << patFw.toZBufXForm(\"0123.\");\n\t\t} else {\n\t\t\tos << patFw;\n\t\t}\n\t\tos << ' ';\n\t\t// Print out the fuzzy alternative sequences\n\t\tfor(int j = 0; j < 3; j++) {\n\t\t\tbool started = false;\n\t\t\tif(!altQual[j].empty()) {\n\t\t\t\tfor(size_t i = 0; i < length(); i++) {\n\t\t\t\t\tif(altQual[j][i] != '!') {\n\t\t\t\t\t\tstarted = true;\n\t\t\t\t\t}\n\t\t\t\t\tif(started) {\n\t\t\t\t\t\tif(altQual[j][i] == '!') {\n\t\t\t\t\t\t\tos << '-';\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tif(color) {\n\t\t\t\t\t\t\t\tos << \"0123.\"[(int)altPatFw[j][i]];\n\t\t\t\t\t\t\t} else 
{\n\t\t\t\t\t\t\t\tos << altPatFw[j][i];\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tos << \" \";\n\t\t}\n\t\tos << qual.toZBuf() << \" \";\n\t\t// Print out the fuzzy alternative quality strings\n\t\tfor(int j = 0; j < 3; j++) {\n\t\t\tbool started = false;\n\t\t\tif(!altQual[j].empty()) {\n\t\t\t\tfor(size_t i = 0; i < length(); i++) {\n\t\t\t\t\tif(altQual[j][i] != '!') {\n\t\t\t\t\t\tstarted = true;\n\t\t\t\t\t}\n\t\t\t\t\tif(started) {\n\t\t\t\t\t\tos << altQual[j][i];\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tif(j == 2) {\n\t\t\t\tos << endl;\n\t\t\t} else {\n\t\t\t\tos << \" \";\n\t\t\t}\n\t\t}\n\t}\n\t\n\t/**\n\t * Check whether two reads are the same in the sense that they will\n\t * lead to us finding the same set of alignments.\n\t */\n\tstatic bool same(\n\t\tconst BTDnaString& seq1,\n\t\tconst BTString&    qual1,\n\t\tconst BTDnaString& seq2,\n\t\tconst BTString&    qual2,\n\t\tbool qualitiesMatter)\n\t{\n\t\tif(seq1.length() != seq2.length()) {\n\t\t\treturn false;\n\t\t}\n\t\tfor(size_t i = 0; i < seq1.length(); i++) {\n\t\t\tif(seq1[i] != seq2[i]) return false;\n\t\t}\n\t\tif(qualitiesMatter) {\n\t\t\tif(qual1.length() != qual2.length()) {\n\t\t\t\treturn false;\n\t\t\t}\n\t\t\tfor(size_t i = 0; i < qual1.length(); i++) {\n\t\t\t\tif(qual1[i] != qual2[i]) return false;\n\t\t\t}\n\t\t}\n\t\treturn true;\n\t}\n\n\t/**\n\t * Get the nucleotide and quality value at the given offset from 5' end.\n\t * If 'fw' is false, get the reverse complement.\n\t */\n\tstd::pair<int, int> get(TReadOff off5p, bool fw) const {\n\t\tassert_lt(off5p, length());\n\t\tint c = (int)patFw[off5p];\n        int q = qual[off5p];\n        assert_geq(q, 33);\n\t\treturn make_pair((!fw && c < 4) ? 
(c ^ 3) : c, q - 33);\n\t}\n\t\n\t/**\n\t * Get the nucleotide at the given offset from 5' end.\n\t * If 'fw' is false, get the reverse complement.\n\t */\n\tint getc(TReadOff off5p, bool fw) const {\n\t\tassert_lt(off5p, length());\n\t\tint c = (int)patFw[off5p];\n\t\treturn (!fw && c < 4) ? (c ^ 3) : c;\n\t}\n\t\n\t/**\n\t * Get the quality value at the given offset from 5' end.\n\t */\n\tint getq(TReadOff off5p) const {\n\t\tassert_lt(off5p, length());\n        int q = qual[off5p];\n        assert_geq(q, 33);\n\t\treturn q-33;\n\t}\n\n#ifndef NDEBUG\n\t/**\n\t * Check that read info is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tif(patFw.empty()) return true;\n\t\tassert_eq(qual.length(), patFw.length());\n\t\treturn true;\n\t}\n#endif\n\n\tBTDnaString patFw;            // forward-strand sequence\n\tBTDnaString patRc;            // reverse-complement sequence\n\tBTString    qual;             // quality values\n\n\tBTDnaString altPatFw[3];\n\tBTDnaString altPatRc[3];\n\tBTString    altQual[3];\n\n\tBTDnaString patFwRev;\n\tBTDnaString patRcRev;\n\tBTString    qualRev;\n\n\tBTDnaString altPatFwRev[3];\n\tBTDnaString altPatRcRev[3];\n\tBTString    altQualRev[3];\n\n\t// For remembering the exact input text used to define a read\n\tSStringExpandable<char> readOrigBuf;\n\n\tBTString name;      // read name\n\tTReadId  rdid;      // 0-based id based on pair's offset in read file(s)\n\tTReadId  endid;     // 0-based id based on pair's offset in read file(s)\n\t                    // and which mate (\"end\") this is\n\tint      mate;      // 0 = single-end, 1 = mate1, 2 = mate2\n\tuint32_t seed;      // random seed\n\tsize_t   ns_;       // # Ns\n\tint      alts;      // number of alternatives\n\tbool     fuzzy;     // whether to employ fuzziness\n\tbool     color;     // whether read is in color space\n\tchar     primer;    // primer base, for csfasta files\n\tchar     trimc;     // trimmed color, for csfasta files\n\tchar     filter;    // if read 
format permits filter char, set it here\n\tint      trimmed5;  // amount actually trimmed off 5' end\n\tint      trimmed3;  // amount actually trimmed off 3' end\n\tHitSet  *hitset;    // holds previously-found hits; for chaining\n};\n\n/**\n * A string of FmStringOps represents a string of tasks performed by the\n * best-first alignment search.  We model the search as a series of FM ops\n * interspersed with reported alignments.\n */\nstruct FmStringOp {\n\tbool alignment;  // true -> found an alignment\n\tTAlScore pen;    // penalty of the FM op or alignment\n\tsize_t n;        // number of FM ops (only relevant for non-alignment)\n};\n\n/**\n * A string that summarizes the progress of an FM-index-assisted best-first\n * search.  Useful for trying to figure out what the aligner is spending its\n * time doing for a given read.\n */\nstruct FmString {\n\n\t/**\n\t * Add one or more FM index ops to the op string\n\t */\n\tvoid add(bool alignment, TAlScore pen, size_t nops) {\n\t\tif(ops.empty() || ops.back().pen != pen) {\n\t\t\tops.expand();\n\t\t\tops.back().alignment = alignment;\n\t\t\tops.back().pen = pen;\n\t\t\tops.back().n = 0;\n\t\t}\n\t\tops.back().n++;\n\t}\n\t\n\t/**\n\t * Reset FmString to uninitialized state.\n\t */\n\tvoid reset() {\n\t\tpen = std::numeric_limits<TAlScore>::max();\n\t\tops.clear();\n\t}\n\n\t/**\n\t * Print a :Z optional field where certain characters (whitespace, colon\n\t * and percent) are escaped using % escapes.\n\t */\n\tvoid print(BTString& o, char *buf) const {\n\t\tfor(size_t i = 0; i < ops.size(); i++) {\n\t\t\tif(i > 0) {\n\t\t\t\to.append(';');\n\t\t\t}\n\t\t\tif(ops[i].alignment) {\n\t\t\t\to.append(\"A,\");\n\t\t\t\titoa10(ops[i].pen, buf);\n\t\t\t\to.append(buf);\n\t\t\t} else {\n\t\t\t\to.append(\"F,\");\n\t\t\t\titoa10(ops[i].pen, buf); o.append(buf);\n\t\t\t\to.append(',');\n\t\t\t\titoa10(ops[i].n, buf); o.append(buf);\n\t\t\t}\n\t\t}\n\t}\n\n\tTAlScore pen;          // current penalty\n\tEList<FmStringOp> ops; // op 
string\n};\n\n/**\n * Key per-read metrics.  These are used for thresholds, allowing us to bail\n * for unproductive reads.  They are also the basis of what's printed when the\n * user specifies --read-times.\n */\nstruct PerReadMetrics {\n\n\tPerReadMetrics() { reset(); }\n\n\tvoid reset() {\n\t\tnExIters =\n\t\tnExDps   = nExDpSuccs   = nExDpFails   =\n\t\tnMateDps = nMateDpSuccs = nMateDpFails =\n\t\tnExUgs   = nExUgSuccs   = nExUgFails   =\n\t\tnMateUgs = nMateUgSuccs = nMateUgFails =\n\t\tnExEes   = nExEeSuccs   = nExEeFails   =\n\t\tnRedundants =\n\t\tnEeFmops = nSdFmops = nExFmops =\n\t\tnDpFail = nDpFailStreak = nDpLastSucc =\n\t\tnUgFail = nUgFailStreak = nUgLastSucc =\n\t\tnEeFail = nEeFailStreak = nEeLastSucc =\n\t\tnFilt = 0;\n\t\tnFtabs = 0;\n\t\tnRedSkip = 0;\n\t\tnRedFail = 0;\n\t\tnRedIns = 0;\n\t\tdoFmString = false;\n\t\tnSeedRanges = nSeedElts = 0;\n\t\tnSeedRangesFw = nSeedEltsFw = 0;\n\t\tnSeedRangesRc = nSeedEltsRc = 0;\n\t\tseedMedian = seedMean = 0;\n\t\tbestLtMinscMate1 =\n\t\tbestLtMinscMate2 = std::numeric_limits<TAlScore>::min();\n\t\tfmString.reset();\n\t}\n\n\tstruct timeval  tv_beg; // timer start to measure how long alignment takes\n\tstruct timezone tz_beg; // timer start to measure how long alignment takes\n\n\tuint64_t nExIters;      // iterations of seed hit extend loop\n\n\tuint64_t nExDps;        // # extend DPs run on this read\n\tuint64_t nExDpSuccs;    // # extend DPs that succeeded on this read\n\tuint64_t nExDpFails;    // # extend DPs that failed on this read\n\t\n\tuint64_t nExUgs;        // # extend ungapped alignments run on this read\n\tuint64_t nExUgSuccs;    // # extend ungapped alignments that succeeded\n\tuint64_t nExUgFails;    // # extend ungapped alignments that failed\n\n\tuint64_t nExEes;        // # extend end-to-end alignments run on this read\n\tuint64_t nExEeSuccs;    // # extend end-to-end alignments that succeeded\n\tuint64_t nExEeFails;    // # extend end-to-end alignments that failed\n\n\tuint64_t nMateDps;      // 
# mate DPs run on this read\n\tuint64_t nMateDpSuccs;  // # mate DPs that succeeded on this read\n\tuint64_t nMateDpFails;  // # mate DPs that failed on this read\n\t\n\tuint64_t nMateUgs;      // # mate ungapped alignments run on this read\n\tuint64_t nMateUgSuccs;  // # mate ungapped alignments that succeeded\n\tuint64_t nMateUgFails;  // # mate ungapped alignments that failed\n\n\tuint64_t nRedundants;   // # redundant seed hits\n\t\n\tuint64_t nSeedRanges;   // # BW ranges found for seeds\n\tuint64_t nSeedElts;     // # BW elements found for seeds\n\n\tuint64_t nSeedRangesFw; // # BW ranges found for seeds from fw read\n\tuint64_t nSeedEltsFw;   // # BW elements found for seeds from fw read\n\n\tuint64_t nSeedRangesRc; // # BW ranges found for seeds from rc read\n\tuint64_t nSeedEltsRc;   // # BW elements found for seeds from rc read\n\t\n\tuint64_t seedMedian;    // median seed hit count\n\tuint64_t seedMean;      // rounded mean seed hit count\n\t\n\tuint64_t nEeFmops;      // FM Index ops for end-to-end alignment\n\tuint64_t nSdFmops;      // FM Index ops used to align seeds\n\tuint64_t nExFmops;      // FM Index ops used to resolve offsets\n\t\n\tuint64_t nFtabs;        // # ftab lookups\n\tuint64_t nRedSkip;      // # times redundant path was detected and aborted\n\tuint64_t nRedFail;      // # times a path was deemed non-redundant\n\tuint64_t nRedIns;       // # times a path was added to redundancy list\n\t\n\tuint64_t nDpFail;       // number of dp failures in a row up until now\n\tuint64_t nDpFailStreak; // longest streak of dp failures\n\tuint64_t nDpLastSucc;   // index of last dp attempt that succeeded\n\t\n\tuint64_t nUgFail;       // number of ungap failures in a row up until now\n\tuint64_t nUgFailStreak; // longest streak of ungap failures\n\tuint64_t nUgLastSucc;   // index of last ungap attempt that succeeded\n\n\tuint64_t nEeFail;       // number of ungap failures in a row up until now\n\tuint64_t nEeFailStreak; // longest streak of ungap 
failures\n\tuint64_t nEeLastSucc;   // index of last ungap attempt that succeeded\n\t\n\tuint64_t nFilt;         // # mates filtered\n\t\n\tTAlScore bestLtMinscMate1; // best invalid score observed for mate 1\n\tTAlScore bestLtMinscMate2; // best invalid score observed for mate 2\n\t\n\t// For collecting information to go into an FM string\n\tbool doFmString;\n\tFmString fmString;\n};\n\n#endif /*READ_H_*/\n"
  },
  {
    "path": "read_qseq.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"pat.h\"\n\n/**\n * Parse a name from fb_ and store in r.  Assume that the next\n * character obtained via fb_.get() is the first character of\n * the sequence and the string stops at the next char upto (could\n * be tab, newline, etc.).\n */\nint QseqPatternSource::parseName(\n\tRead& r,      // buffer for mate 1\n\tRead* r2,     // buffer for mate 2 (NULL if mate2 is read separately)\n\tbool append,     // true -> append characters, false -> skip them\n\tbool clearFirst, // clear the name buffer first\n\tbool warnEmpty,  // emit a warning if nothing was added to the name\n\tbool useDefault, // if nothing is read, put readCnt_ as a default value\n\tint upto)        // stop parsing when we first reach character 'upto'\n{\n\tif(clearFirst) {\n\t\tif(r2 != NULL) r2->name.clear();\n\t\tr.name.clear();\n\t}\n\twhile(true) {\n\t\tint c;\n\t\tif((c = fb_.get()) < 0) {\n\t\t\t// EOF reached in the middle of the name\n\t\t\treturn -1;\n\t\t}\n\t\tif(c == '\\n' || c == '\\r') {\n\t\t\t// EOL reached in the middle of the name\n\t\t\treturn -1;\n\t\t}\n\t\tif(c == upto) {\n\t\t\t// Finished with field\n\t\t\tbreak;\n\t\t}\n\t\tif(append) {\n\t\t\tif(r2 != NULL) 
r2->name.append(c);\n\t\t\tr.name.append(c);\n\t\t}\n\t}\n\t// Set up a default name if one hasn't been set\n\tif(r.name.empty() && useDefault && append) {\n\t\tchar cbuf[20];\n\t\titoa10(readCnt_, cbuf);\n\t\tr.name.append(cbuf);\n\t\tif(r2 != NULL) r2->name.append(cbuf);\n\t}\n\tif(r.name.empty() && warnEmpty) {\n\t\tcerr << \"Warning: read had an empty name field\" << endl;\n\t}\n\treturn (int)r.name.length();\n}\n\n/**\n * Parse a single sequence from fb_ and store in r.  Assume\n * that the next character obtained via fb_.get() is the first\n * character of the sequence and the sequence stops at the next\n * char upto (could be tab, newline, etc.).\n */\nint QseqPatternSource::parseSeq(\n\tRead& r,\n\tint& charsRead,\n\tint& trim5,\n\tchar upto)\n{\n\tint begin = 0;\n\tint c = fb_.get();\n\tassert(c != upto);\n\tr.patFw.clear();\n\tr.color = gColor;\n\tif(gColor) {\n\t\t// NOTE: clearly this is not relevant for Illumina output, but\n\t\t// I'm keeping it here in case there's some reason to put SOLiD\n\t\t// data in this format in the future.\n\t\n\t\t// This may be a primer character.  
If so, keep it in the\n\t\t// 'primer' field of the read buf and parse the rest of the\n\t\t// read without it.\n\t\tc = toupper(c);\n\t\tif(asc2dnacat[c] > 0) {\n\t\t\t// First char is a DNA char\n\t\t\tint c2 = toupper(fb_.peek());\n\t\t\t// Second char is a color char\n\t\t\tif(asc2colcat[c2] > 0) {\n\t\t\t\tr.primer = c;\n\t\t\t\tr.trimc = c2;\n\t\t\t\ttrim5 += 2; // trim primer and first color\n\t\t\t}\n\t\t}\n\t\tif(c < 0) { return -1; }\n\t}\n\twhile(c != upto) {\n\t\tif(c == '.') c = 'N';\n\t\tif(gColor) {\n\t\t\tif(c >= '0' && c <= '4') c = \"ACGTN\"[(int)c - '0'];\n\t\t}\n\t\tif(isalpha(c)) {\n\t\t\tassert_in(toupper(c), \"ACGTN\");\n\t\t\tif(begin++ >= trim5) {\n\t\t\t\tassert_neq(0, asc2dnacat[c]);\n\t\t\t\tr.patFw.append(asc2dna[c]);\n\t\t\t}\n\t\t\tcharsRead++;\n\t\t}\n\t\tif((c = fb_.get()) < 0) {\n\t\t\treturn -1;\n\t\t}\n\t}\n\tr.patFw.trimEnd(gTrim3);\n\treturn (int)r.patFw.length();\n}\n\n/**\n * Parse a single quality string from fb_ and store in r.\n * Assume that the next character obtained via fb_.get() is\n * the first character of the quality string and the string stops\n * at the next char upto (could be tab, newline, etc.).\n */\nint QseqPatternSource::parseQuals(\n\tRead& r,\n\tint charsRead,\n\tint dstLen,\n\tint trim5,\n\tchar& c2,\n\tchar upto = '\\t',\n\tchar upto2 = -1)\n{\n\tint qualsRead = 0;\n\tint c = 0;\n\tif (intQuals_) {\n\t\t// Probably not relevant\n\t\tchar buf[4096];\n\t\twhile (qualsRead < charsRead) {\n\t\t\tqualToks_.clear();\n\t\t\tif(!tokenizeQualLine(fb_, buf, 4096, qualToks_)) break;\n\t\t\tfor (unsigned int j = 0; j < qualToks_.size(); ++j) {\n\t\t\t\tchar c = intToPhred33(atoi(qualToks_[j].c_str()), solQuals_);\n\t\t\t\tassert_geq(c, 33);\n\t\t\t\tif (qualsRead >= trim5) {\n\t\t\t\t\tr.qual.append(c);\n\t\t\t\t}\n\t\t\t\t++qualsRead;\n\t\t\t}\n\t\t} // done reading integer quality lines\n\t\tif (charsRead > qualsRead) tooFewQualities(r.name);\n\t} else {\n\t\t// Non-integer qualities\n\t\twhile((qualsRead < 
dstLen + trim5) && c >= 0) {\n\t\t\tc = fb_.get();\n\t\t\tc2 = c;\n\t\t\tif (c == ' ') wrongQualityFormat(r.name);\n\t\t\tif(c < 0) {\n\t\t\t\t// EOF occurred in the middle of a read - abort\n\t\t\t\treturn -1;\n\t\t\t}\n\t\t\tif(!isspace(c) && c != upto && (upto2 == -1 || c != upto2)) {\n\t\t\t\tif (qualsRead >= trim5) {\n\t\t\t\t\tc = charToPhred33(c, solQuals_, phred64Quals_);\n\t\t\t\t\tassert_geq(c, 33);\n\t\t\t\t\tr.qual.append(c);\n\t\t\t\t}\n\t\t\t\tqualsRead++;\n\t\t\t} else {\n\t\t\t\tbreak;\n\t\t\t}\n\t\t}\n\t}\n\tif(r.qual.length() < (size_t)dstLen) {\n\t\ttooFewQualities(r.name);\n\t}\n\t// TODO: How to detect too many qualities??\n\tr.qual.resize(dstLen);\n\twhile(c != -1 && c != upto && (upto2 == -1 || c != upto2)) {\n\t\tc = fb_.get();\n\t\tc2 = c;\n\t}\n\treturn qualsRead;\n}\n\n/**\n * Read another pattern from a Qseq input file.\n */\nbool QseqPatternSource::read(\n\tRead& r,\n\tTReadId& rdid,\n\tTReadId& endid,\n\tbool& success,\n\tbool& done)\n{\n\tr.reset();\n\tr.color = gColor;\n\tsuccess = true;\n\tdone = false;\n\treadCnt_++;\n\trdid = endid = readCnt_-1;\n\tpeekOverNewline(fb_);\n\tfb_.resetLastN();\n\t// 1. Machine name\n\tif(parseName(r, NULL, true, true,  true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('_');\n\t// 2. Run number\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('_');\n\t// 3. Lane number\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('_');\n\t// 4. Tile number\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('_');\n\t// 5. X coordinate of spot\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('_');\n\t// 6. 
Y coordinate of spot\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('_');\n\t// 7. Index\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\tassert_neq('\\t', fb_.peek());\n\tr.name.append('/');\n\t// 8. Mate number\n\tif(parseName(r, NULL, true, false, true, false, '\\t') == -1) BAIL_UNPAIRED();\n\t// Empty sequence??\n\tif(fb_.peek() == '\\t') {\n\t\t// Get tab that separates seq from qual\n\t\tASSERT_ONLY(int c =) fb_.get();\n\t\tassert_eq('\\t', c);\n\t\tassert_eq('\\t', fb_.peek());\n\t\t// Get tab that separates qual from filter\n\t\tASSERT_ONLY(c =) fb_.get();\n\t\tassert_eq('\\t', c);\n\t\t// Next char is first char of filter flag\n\t\tassert_neq('\\t', fb_.peek());\n\t\tfb_.resetLastN();\n\t\tcerr << \"Warning: skipping empty QSEQ read with name '\" << r.name << \"'\" << endl;\n\t} else {\n\t\tassert_neq('\\t', fb_.peek());\n\t\tint charsRead = 0;\n\t\tint mytrim5 = gTrim5;\n\t\t// 9. Sequence\n\t\tint dstLen = parseSeq(r, charsRead, mytrim5, '\\t');\n\t\tassert_neq('\\t', fb_.peek());\n\t\tif(dstLen < 0) BAIL_UNPAIRED();\n\t\tchar ct = 0;\n\t\t// 10. Qualities\n\t\tif(parseQuals(r, charsRead, dstLen, mytrim5, ct, '\\t', -1) < 0) BAIL_UNPAIRED();\n\t\tr.trimmed3 = gTrim3;\n\t\tr.trimmed5 = mytrim5;\n\t\tif(ct != '\\t') {\n\t\t\tcerr << \"Error: QSEQ with name \" << r.name << \" did not have tab after qualities\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(ct, '\\t');\n\t}\n\t// 11. 
Filter flag\n\tint filt = fb_.get();\n\tif(filt == -1) BAIL_UNPAIRED();\n\tr.filter = filt;\n\tif(filt != '0' && filt != '1') {\n\t\t// Bad value for filt\n\t}\n\tif(fb_.peek() != -1 && fb_.peek() != '\\n') {\n\t\t// Bad value right after the filt field\n\t}\n\tfb_.get();\n\tr.readOrigBuf.install(fb_.lastN(), fb_.lastNLen());\n\tfb_.resetLastN();\n\tif(r.qual.length() < r.patFw.length()) {\n\t\ttooFewQualities(r.name);\n\t} else if(r.qual.length() > r.patFw.length()) {\n\t\ttooManyQualities(r.name);\n\t}\n#ifndef NDEBUG\n\tassert_eq(r.patFw.length(), r.qual.length());\n\tfor(size_t i = 0; i < r.qual.length(); i++) {\n\t\tassert_geq((int)r.qual[i], 33);\n\t}\n#endif\n\treturn true;\n}\n"
  },
  {
    "path": "ref_coord.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"ref_coord.h\"\n#include <iostream>\n\nusing namespace std;\n\nostream& operator<<(ostream& out, const Interval& c) {\n\tout << c.upstream() << \"+\" << c.len();\n\treturn out;\n}\n\nostream& operator<<(ostream& out, const Coord& c) {\n\tout << c.ref() << \":\" << c.off();\n\treturn out;\n}\n"
  },
  {
    "path": "ref_coord.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef REF_COORD_H_\n#define REF_COORD_H_\n\n#include <stdint.h>\n#include <iostream>\n#include <limits>\n#include \"assert_helpers.h\"\n\ntypedef int64_t TRefId;\ntypedef int64_t TRefOff;\n\n/**\n * Encapsulates a reference coordinate; i.e. identifiers for\n *  (a) a reference sequence, and\n *  (b) a 0-based offset into that sequence.\n */\nclass Coord {\n\npublic:\n\n\tCoord() { reset(); }\n\n\tCoord(const Coord& c) { init(c); }\n\t\n\tCoord(TRefId rf, TRefOff of, bool fw) { init(rf, of, fw); }\n\n\t/**\n\t * Copy given fields into this Coord.\n\t */\n\tvoid init(TRefId rf, TRefOff of, bool fw) {\n\t\tref_ = rf;\n\t\toff_ = of;\n\t\torient_ = (fw ? 1 : 0);\n\t}\n\n\t/**\n\t * Copy contents of given Coord into this one.\n\t */\n\tvoid init(const Coord& c) {\n\t\tref_ = c.ref_;\n\t\toff_ = c.off_;\n\t\torient_ = c.orient_;\n\t}\n\t\n\t/**\n\t * Return true iff this Coord is identical to the given Coord.\n\t */\n\tbool operator==(const Coord& o) const {\n\t\tassert(inited());\n\t\tassert(o.inited());\n\t\treturn ref_ == o.ref_ && off_ == o.off_ && fw() == o.fw();\n\t}\n\n\t/**\n\t * Return true iff this Coord is less than the given Coord.  
One Coord is\n\t * less than another if (a) its reference id is less, (b) its orientation is\n\t * less, or (c) its offset is less.\n\t */\n\tbool operator<(const Coord& o) const {\n\t\tif(ref_ < o.ref_) return true;\n\t\tif(ref_ > o.ref_) return false;\n\t\tif(orient_ < o.orient_) return true;\n\t\tif(orient_ > o.orient_) return false;\n\t\tif(off_ < o.off_) return true;\n\t\tif(off_ > o.off_) return false;\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return the opposite result from operator<.\n\t */\n\tbool operator>=(const Coord& o) const {\n\t\treturn !((*this) < o);\n\t}\n\t\n\t/**\n\t * Return true iff this Coord is greater than the given Coord.  One Coord\n\t * is greater than another if (a) its reference id is greater, (b) its\n\t * orientation is greater, or (c) its offset is greater.\n\t */\n\tbool operator>(const Coord& o) const {\n\t\tif(ref_ > o.ref_) return true;\n\t\tif(ref_ < o.ref_) return false;\n\t\tif(orient_ > o.orient_) return true;\n\t\tif(orient_ < o.orient_) return false;\n\t\tif(off_ > o.off_) return true;\n\t\tif(off_ < o.off_) return false;\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return the opposite result from operator>.\n\t */\n\tbool operator<=(const Coord& o) const {\n\t\treturn !((*this) > o);\n\t}\n\t\n\t/**\n\t * Reset this coord to uninitialized state.\n\t */\n\tvoid reset() {\n\t\tref_ = std::numeric_limits<TRefId>::max();\n\t\toff_ = std::numeric_limits<TRefOff>::max();\n\t\torient_ = -1;\n\t}\n\t\n\t/**\n\t * Return true iff this Coord is initialized (i.e. 
ref and off have both\n\t * been set since the last call to reset()).\n\t */\n\tbool inited() const {\n\t\tif(ref_ != std::numeric_limits<TRefId>::max() &&\n\t\t   off_ != std::numeric_limits<TRefOff>::max())\n\t\t{\n\t\t\tassert(orient_ == 0 || orient_ == 1);\n\t\t\treturn true;\n\t\t}\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Get orientation of the Coord.\n\t */\n\tbool fw() const {\n\t\tassert(inited());\n\t\tassert(orient_ == 0 || orient_ == 1);\n\t\treturn orient_ == 1;\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that coord is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tif(ref_ != std::numeric_limits<TRefId>::max() &&\n\t\t   off_ != std::numeric_limits<TRefOff>::max())\n\t\t{\n\t\t\tassert(orient_ == 0 || orient_ == 1);\n\t\t}\n\t\treturn true;\n\t}\n#endif\n\t\n\t/**\n\t * Check whether an interval defined by this coord and having\n\t * length 'len' is contained within an interval defined by\n\t * 'inbegin' and 'inend'.\n\t */\n\tbool within(int64_t len, int64_t inbegin, int64_t inend) const {\n\t\treturn off_ >= inbegin && off_ + len <= inend;\n\t}\n\t\n\tinline TRefId  ref()    const { return ref_; }\n\tinline TRefOff off()    const { return off_; }\n\tinline int     orient() const { return orient_; }\n\t\n\tinline void setRef(TRefId  id)  { ref_ = id;  }\n\tinline void setOff(TRefOff off) { off_ = off; }\n\n\tinline void adjustOff(TRefOff off) { off_ += off; }\n\nprotected:\n\n\tTRefId  ref_;    // which reference?\n\tTRefOff off_;    // 0-based offset into reference\n\tint     orient_; // true -> Watson strand\n};\n\nstd::ostream& operator<<(std::ostream& out, const Coord& c);\n\n/**\n * Encapsulates a reference interval, which consists of a Coord and a length.\n */\nclass Interval {\n\npublic:\n\t\n\tInterval() { reset(); }\n\t\n\texplicit Interval(const Coord& upstream, TRefOff len) {\n\t\tinit(upstream, len);\n\t}\n\n\texplicit Interval(TRefId rf, TRefOff of, bool fw, TRefOff len) {\n\t\tinit(rf, of, fw, len);\n\t}\n\n\tvoid init(const 
Coord& upstream, TRefOff len) {\n\t\tupstream_ = upstream;\n\t\tlen_ = len;\n\t}\n\t\n\tvoid init(TRefId rf, TRefOff of, bool fw, TRefOff len) {\n\t\tupstream_.init(rf, of, fw);\n\t\tlen_ = len;\n\t}\n\t\n\t/**\n\t * Set offset.\n\t */\n\tvoid setOff(TRefOff of) {\n\t\tupstream_.setOff(of);\n\t}\n\n\t/**\n\t * Set length.\n\t */\n\tvoid setLen(TRefOff len) {\n\t\tlen_ = len;\n\t}\n\n\t/**\n\t * Reset this interval to uninitialized state.\n\t */\n\tvoid reset() {\n\t\tupstream_.reset();\n\t\tlen_ = 0;\n\t}\n\t\n\t/**\n\t * Return true iff this Interval is initialized.\n\t */\n\tbool inited() const {\n\t\tif(upstream_.inited()) {\n\t\t\tassert_gt(len_, 0);\n\t\t\treturn true;\n\t\t} else {\n\t\t\treturn false;\n\t\t}\n\t}\n\t\n\t/**\n\t * Return true iff this Interval is equal to the given Interval,\n\t * i.e. if they cover the same set of positions.\n\t */\n\tbool operator==(const Interval& o) const {\n\t\treturn upstream_ == o.upstream_ &&\n\t\t       len_ == o.len_;\n\t}\n\n\t/**\n\t * Return true iff this Interval is less than the given Interval.\n\t * One interval is less than another if its upstream location is\n\t * prior to the other's or, if their upstream locations are equal,\n\t * if its length is less than the other's.\n\t */\n\tbool operator<(const Interval& o) const {\n\t\tif(upstream_ < o.upstream_) return true;\n\t\tif(upstream_ > o.upstream_) return false;\n\t\tif(len_ < o.len_) return true;\n\t\treturn false;\n\t}\n\t\n\t/**\n\t * Return opposite result from operator<.\n\t */\n\tbool operator>=(const Interval& o) const {\n\t\treturn !((*this) < o);\n\t}\n\n\t/**\n\t * Return true iff this Interval is greater than the given\n\t * Interval.  
One interval is greater than another if its upstream\n\t * location is after the other's or, if their upstream locations\n\t * are equal, if its length is greater than the other's.\n\t */\n\tbool operator>(const Interval& o) const {\n\t\tif(upstream_ > o.upstream_) return true;\n\t\tif(upstream_ < o.upstream_) return false;\n\t\tif(len_ > o.len_) return true;\n\t\treturn false;\n\t}\n\n\t/**\n\t * Return opposite result from operator>.\n\t */\n\tbool operator<=(const Interval& o) const {\n\t\treturn !((*this) > o);\n\t}\n\t\n\t/**\n\t * Set upstream Coord.\n\t */\n\tvoid setUpstream(const Coord& c) {\n\t\tupstream_ = c;\n\t}\n\n\t/**\n\t * Set length.\n\t */\n\tvoid setLength(TRefOff l) {\n\t\tlen_ = l;\n\t}\n\t\n\tinline TRefId  ref()    const { return upstream_.ref(); }\n\tinline TRefOff off()    const { return upstream_.off(); }\n\tinline TRefOff dnoff()  const { return upstream_.off() + len_; }\n\tinline int     orient() const { return upstream_.orient(); }\n\n\t/**\n\t * Return a Coord encoding the coordinate just past the downstream edge of\n\t * the interval.\n\t */\n\tinline Coord downstream() const {\n\t\treturn Coord(\n\t\t\tupstream_.ref(),\n\t\t\tupstream_.off() + len_,\n\t\t\tupstream_.orient());\n\t}\n\t\n\t/**\n\t * Return true iff the given Coord is inside this Interval.\n\t */\n\tinline bool contains(const Coord& c) const {\n\t\treturn\n\t\t\tc.ref()    == ref() &&\n\t\t\tc.orient() == orient() &&\n\t\t\tc.off()    >= off() &&\n\t\t\tc.off()    <  dnoff();\n\t}\n\n\t/**\n\t * Return true iff the given Coord is inside this Interval, without\n\t * requiring orientations to match.\n\t */\n\tinline bool containsIgnoreOrient(const Coord& c) const {\n\t\treturn\n\t\t\tc.ref()    == ref() &&\n\t\t\tc.off()    >= off() &&\n\t\t\tc.off()    <  dnoff();\n\t}\n\n\t/**\n\t * Return true iff the given Interval is inside this Interval.\n\t */\n\tinline bool contains(const Interval& c) const {\n\t\treturn\n\t\t\tc.ref()    == ref() &&\n\t\t\tc.orient() == 
orient() &&\n\t\t\tc.off()    >= off() &&\n\t\t\tc.dnoff()  <= dnoff();\n\t}\n\n\t/**\n\t * Return true iff the given Interval is inside this Interval, without\n\t * requiring orientations to match.\n\t */\n\tinline bool containsIgnoreOrient(const Interval& c) const {\n\t\treturn\n\t\t\tc.ref()    == ref() &&\n\t\t\tc.off()    >= off() &&\n\t\t\tc.dnoff()  <= dnoff();\n\t}\n\n\t/**\n\t * Return true iff the given Interval overlaps this Interval.\n\t */\n\tinline bool overlaps(const Interval& c) const {\n\t\treturn\n\t\t\tc.ref()    == upstream_.ref() &&\n\t\t\tc.orient() == upstream_.orient() &&\n\t\t\t((off() <= c.off()   && dnoff() > c.off())   ||\n\t\t\t (off() <= c.dnoff() && dnoff() > c.dnoff()) ||\n\t\t\t (c.off() <= off()   && c.dnoff() > off())   ||\n\t\t\t (c.off() <= dnoff() && c.dnoff() > dnoff()));\n\t}\n\n\t/**\n\t * Return true iff the given Interval overlaps this Interval, without\n\t * requiring orientations to match.\n\t */\n\tinline bool overlapsIgnoreOrient(const Interval& c) const {\n\t\treturn\n\t\t\tc.ref()    == upstream_.ref() &&\n\t\t\t((off() <= c.off()   && dnoff() > c.off())   ||\n\t\t\t (off() <= c.dnoff() && dnoff() > c.dnoff()) ||\n\t\t\t (c.off() <= off()   && c.dnoff() > off())   ||\n\t\t\t (c.off() <= dnoff() && c.dnoff() > dnoff()));\n\t}\n\t\n\tinline const Coord&  upstream()   const { return upstream_; }\n\tinline TRefOff       len()      const { return len_;      }\n\n#ifndef NDEBUG\n\t/**\n\t * Check that the Interval is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert(upstream_.repOk());\n\t\tassert_geq(len_, 0);\n\t\treturn true;\n\t}\n#endif\n\n\tinline void adjustOff(TRefOff off) { upstream_.adjustOff(off); }\n\nprotected:\n\n\tCoord   upstream_;\n\tTRefOff len_;\n};\n\nstd::ostream& operator<<(std::ostream& out, const Interval& c);\n\n#endif /*ndef REF_COORD_H_*/\n"
  },
  {
    "path": "ref_read.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"ref_read.h\"\n\n/**\n * Reads past the next ambiguous or unambiguous stretch of sequence\n * from the given FASTA file and returns its length.  Does not do\n * anything with the sequence characters themselves; this is purely for\n * measuring lengths.\n */\nRefRecord fastaRefReadSize(\n\tFileBuf& in,\n\tconst RefReadInParams& rparms,\n\tbool first,\n\tBitpairOutFileBuf* bpout,\n\tint& n_empty_ref_sequences)\n{\n\tint c;\n\tstatic int lastc = '>'; // last character seen\n\n\t// RefRecord params\n\tTIndexOffU len = 0; // 'len' counts toward total length\n\t// 'off' counts number of ambiguous characters before first\n\t// unambiguous character\n\tsize_t off = 0;\n\n\t// Pick off the first carat and any preceding whitespace\n\tif(first) {\n\t\tassert(!in.eof());\n\t\tlastc = '>';\n\t\tc = in.getPastWhitespace();\n\t\tif(in.eof()) {\n\t\t\t// Got eof right away; emit warning\n\t\t\tcerr << \"Warning: Empty input file\" << endl;\n\t\t\tlastc = -1;\n\t\t\treturn RefRecord(0, 0, true);\n\t\t}\n\t\tassert(c == '>');\n\t}\n\n\tfirst = true;\n\t// Skip to the end of the id line; if the next line is either\n\t// another id line or a comment line, keep skipping\n\tif(lastc == '>') {\n\t\t// Skip to the end of 
the name line\n\t\tdo {\n\t\t\tif((c = in.getPastNewline()) == -1) {\n\t\t\t\t// No more input\n\t\t\t\tcerr << \"Warning: Encountered empty reference sequence at end of file\" << endl;\n\t\t\t\tlastc = -1;\n\t\t\t\treturn RefRecord(0, 0, true);\n\t\t\t}\n\t\t\tif(c == '>') {\n\t\t\t\t++ n_empty_ref_sequences;\n\t\t\t\t//cerr << \"Warning: Encountered empty reference sequence\" << endl;\n\t\t\t}\n\t\t\t// continue until a non-name, non-comment line\n\t\t} while (c == '>');\n\t} else {\n\t\tfirst = false; // not the first in a sequence\n\t\toff = 1; // The gap has already been consumed, so count it\n\t\tif((c = in.get()) == -1) {\n\t\t\t// Don't emit a warning, since this might legitimately be\n\t\t\t// a gap on the end of the final sequence in the file\n\t\t\tlastc = -1;\n\t\t\treturn RefRecord((TIndexOffU)off, (TIndexOffU)len, first);\n\t\t}\n\t}\n\n\t// Now skip to the first DNA character, counting gap characters\n\t// as we go\n\tint lc = -1; // last-DNA char variable for color conversion\n\twhile(true) {\n\t\tint cat = asc2dnacat[c];\n\t\tif(rparms.nsToAs && cat >= 2) c = 'A';\n\t\tif(cat == 1) {\n\t\t\t// This is a DNA character\n\t\t\tif(rparms.color) {\n\t\t\t\tif(lc != -1) {\n\t\t\t\t\t// Got two consecutive unambiguous DNAs\n\t\t\t\t\tbreak; // to read-in loop\n\t\t\t\t}\n\t\t\t\t// Keep going; we need two consecutive unambiguous DNAs\n\t\t\t\tlc = asc2dna[(int)c];\n\t\t\t\t// The 'if(off > 0)' takes care of the case where\n\t\t\t\t// the reference is entirely unambiguous and we don't\n\t\t\t\t// want to incorrectly increment off.\n\t\t\t\tif(off > 0) off++;\n\t\t\t} else {\n\t\t\t\tbreak; // to read-in loop\n\t\t\t}\n\t\t} else if(cat >= 2) {\n\t\t\tif(lc != -1 && off == 0) off++;\n\t\t\tlc = -1;\n\t\t\toff++; // skip over gap character and increment\n\t\t} else if(c == '>') {\n\t\t\tif(off > 0 && lastc == '>') {\n\t\t\t\tcerr << \"Warning: Encountered reference sequence with only gaps\" << endl;\n\t\t\t} else if(lastc == '>') {\n\t\t\t\tcerr << 
\"Warning: Encountered empty reference sequence\" << endl;\n\t\t\t}\n\t\t\tlastc = '>';\n\t\t\t//return RefRecord(off, 0, false);\n\t\t\treturn RefRecord((TIndexOffU)off, 0, first);\n\t\t}\n\t\tc = in.get();\n\t\tif(c == -1) {\n\t\t\t// End-of-file\n\t\t\tif(off > 0 && lastc == '>') {\n\t\t\t\tcerr << \"Warning: Encountered reference sequence with only gaps\" << endl;\n\t\t\t} else if(lastc == '>') {\n\t\t\t\tcerr << \"Warning: Encountered empty reference sequence\" << endl;\n\t\t\t}\n\t\t\tlastc = -1;\n\t\t\t//return RefRecord(off, 0, false);\n\t\t\treturn RefRecord((TIndexOffU)off, 0, first);\n\t\t}\n\t}\n\tassert(!rparms.color || (lc != -1));\n\tassert_eq(1, asc2dnacat[c]); // C must be unambiguous base\n\tif(off > 0 && rparms.color && first) {\n\t\t// Handle the case where the first record has ambiguous\n\t\t// characters but we're in color space; one of those counts is\n\t\t// spurious\n\t\toff--;\n\t}\n\n\t// in now points just past the first character of a sequence\n\t// line, and c holds the first character\n\twhile(c != -1 && c != '>') {\n\t\tif(rparms.nsToAs && asc2dnacat[c] >= 2) c = 'A';\n\t\tuint8_t cat = asc2dnacat[c];\n\t\tint cc = toupper(c);\n\t\tif(rparms.bisulfite && cc == 'C') c = cc = 'T';\n\t\tif(cat == 1) {\n\t\t\t// It's a DNA character\n\t\t\tassert(cc == 'A' || cc == 'C' || cc == 'G' || cc == 'T');\n\t\t\t// Check for overflow\n\t\t\tif((TIndexOffU)(len + 1) < len) {\n\t\t\t\tthrow RefTooLongException();\n\t\t\t}\n\t\t\t// Consume it\n\t\t\tlen++;\n\t\t\t// Output it\n\t\t\tif(bpout != NULL) {\n\t\t\t\tif(rparms.color) {\n\t\t\t\t\t// output color\n\t\t\t\t\tbpout->write(dinuc2color[asc2dna[(int)c]][lc]);\n\t\t\t\t} else if(!rparms.color) {\n\t\t\t\t\t// output nucleotide\n\t\t\t\t\tbpout->write(asc2dna[c]);\n\t\t\t\t}\n\t\t\t}\n\t\t\tlc = asc2dna[(int)c];\n\t\t} else if(cat >= 2) {\n\t\t\t// It's an N or a gap\n\t\t\tlastc = c;\n\t\t\tassert(cc != 'A' && cc != 'C' && cc != 'G' && cc != 'T');\n\t\t\treturn RefRecord((TIndexOffU)off, 
(TIndexOffU)len, first);\n\t\t} else {\n\t\t\t// Not DNA and not a gap, ignore it\n#ifndef NDEBUG\n\t\t\tif(!isspace(c)) {\n\t\t\t\tcerr << \"Unexpected character in sequence: \";\n\t\t\t\tif(isprint(c)) {\n\t\t\t\t\tcerr << ((char)c) << endl;\n\t\t\t\t} else {\n\t\t\t\t\tcerr << \"(\" << c << \")\" << endl;\n\t\t\t\t}\n\t\t\t}\n#endif\n\t\t}\n\t\tc = in.get();\n\t}\n\n\tlastc = c;\n\treturn RefRecord((TIndexOffU)off, (TIndexOffU)len, first);\n}\n\n#if 0\nstatic void\nprintRecords(ostream& os, const EList<RefRecord>& l) {\n\tfor(size_t i = 0; i < l.size(); i++) {\n\t\tos << l[i].first << \", \" << l[i].off << \", \" << l[i].len << endl;\n\t}\n}\n#endif\n\n/**\n * Reverse the 'src' list of RefRecords into the 'dst' list.  Don't\n * modify 'src'.\n */\nvoid reverseRefRecords(\n\tconst EList<RefRecord>& src,\n\tEList<RefRecord>& dst,\n\tbool recursive,\n\tbool verbose)\n{\n\tdst.clear();\n\t{\n\t\tEList<RefRecord> cur;\n\t\tfor(int i = (int)src.size()-1; i >= 0; i--) {\n\t\t\tbool first = (i == (int)src.size()-1 || src[i+1].first);\n\t\t\t// Clause after the || on next line is to deal with empty FASTA\n\t\t\t// records at the end of the 'src' list, which would be wrongly\n\t\t\t// omitted otherwise.\n\t\t\tif(src[i].len || (first && src[i].off == 0)) {\n\t\t\t\tcur.push_back(RefRecord(0, src[i].len, first));\n\t\t\t\tfirst = false;\n\t\t\t}\n\t\t\tif(src[i].off) cur.push_back(RefRecord(src[i].off, 0, first));\n\t\t}\n\t\tfor(int i = 0; i < (int)cur.size(); i++) {\n\t\t\tassert(cur[i].off == 0 || cur[i].len == 0);\n\t\t\tif(i < (int)cur.size()-1 && cur[i].off != 0 && !cur[i+1].first) {\n\t\t\t\tdst.push_back(RefRecord(cur[i].off, cur[i+1].len, cur[i].first));\n\t\t\t\ti++;\n\t\t\t} else {\n\t\t\t\tdst.push_back(cur[i]);\n\t\t\t}\n\t\t}\n\t}\n\t//if(verbose) {\n\t//\tcout << \"Source: \" << endl;\n\t//\tprintRecords(cout, src);\n\t//\tcout << \"Dest: \" << endl;\n\t//\tprintRecords(cout, dst);\n\t//}\n#ifndef NDEBUG\n\tsize_t srcnfirst = 0, dstnfirst = 0;\n\tfor(size_t 
i = 0; i < src.size(); i++) {\n\t\tif(src[i].first) {\n\t\t\tsrcnfirst++;\n\t\t}\n\t}\n\tfor(size_t i = 0; i < dst.size(); i++) {\n\t\tif(dst[i].first) {\n\t\t\tdstnfirst++;\n\t\t}\n\t}\n\tassert_eq(srcnfirst, dstnfirst);\n\tif(!recursive) {\n\t\tEList<RefRecord> tmp;\n\t\treverseRefRecords(dst, tmp, true);\n\t\tassert_eq(tmp.size(), src.size());\n\t\tfor(size_t i = 0; i < src.size(); i++) {\n\t\t\tassert_eq(src[i].len, tmp[i].len);\n\t\t\tassert_eq(src[i].off, tmp[i].off);\n\t\t\tassert_eq(src[i].first, tmp[i].first);\n\t\t}\n\t}\n#endif\n}\n\n/**\n * Calculate a vector containing the sizes of all of the patterns in\n * all of the given input files, in order.  Returns the total size of\n * all references combined.  Rewinds each istream before returning.\n */\nstd::pair<size_t, size_t>\nfastaRefReadSizes(\n\tEList<FileBuf*>& in,\n\tEList<RefRecord>& recs,\n\tconst RefReadInParams& rparms,\n\tBitpairOutFileBuf* bpout,\n\tTIndexOff& numSeqs)\n{\n\tTIndexOffU unambigTot = 0;\n\tsize_t bothTot = 0;\n\tassert_gt(in.size(), 0);\n\tint n_empty_ref_sequences = 0;\n\n\t// For each input istream\n\tfor(size_t i = 0; i < in.size(); i++) {\n\t\tbool first = true;\n\t\tassert(!in[i]->eof());\n\t\t// For each pattern in this istream\n\t\twhile(!in[i]->eof()) {\n\t\t\tRefRecord rec;\n\t\t\ttry {\n\t\t\t\trec = fastaRefReadSize(*in[i], rparms, first, bpout, n_empty_ref_sequences);\n\t\t\t\tif((unambigTot + rec.len) < unambigTot) {\n\t\t\t\t\tthrow RefTooLongException();\n\t\t\t\t}\n\t\t\t}\n\t\t\tcatch(RefTooLongException& e) {\n\t\t\t\tcerr << e.what() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\t// Add the length of this record.\n\t\t\tif(rec.first) numSeqs++;\n\t\t\tunambigTot += rec.len;\n\t\t\tbothTot += rec.len;\n\t\t\tbothTot += rec.off;\n\t\t\tfirst = false;\n\t\t\tif(rec.len == 0 && rec.off == 0 && !rec.first) continue;\n\t\t\trecs.push_back(rec);\n\t\t}\n\t\t// Reset the input stream\n\t\tin[i]->reset();\n\t\tassert(!in[i]->eof());\n#ifndef NDEBUG\n\t\t// Check that it's 
really reset\n\t\tint c = in[i]->get();\n\t\tassert_eq('>', c);\n\t\tin[i]->reset();\n\t\tassert(!in[i]->eof());\n#endif\n\t}\n\tif (n_empty_ref_sequences > 0) {\n\t\tcerr << \"Warning: Encountered \" << n_empty_ref_sequences << \" empty reference sequence(s)\" << endl;\n\t}\n\n\tassert_geq(bothTot, 0);\n\tassert_geq(unambigTot, 0);\n\treturn make_pair(\n\t\tunambigTot, // total number of unambiguous DNA characters read\n\t\tbothTot); // total number of DNA characters read, incl. ambiguous ones\n}\n"
  },
  {
    "path": "ref_read.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef REF_READ_H_\n#define REF_READ_H_\n\n#include <iostream>\n#include <cassert>\n#include <string>\n#include <ctype.h>\n#include <fstream>\n#include <stdexcept>\n#include \"alphabet.h\"\n#include \"assert_helpers.h\"\n#include \"filebuf.h\"\n#include \"word_io.h\"\n#include \"ds.h\"\n#include \"endian_swap.h\"\n\nusing namespace std;\n\nclass RefTooLongException : public exception {\n\npublic:\n\tRefTooLongException() {\n#ifdef BOWTIE_64BIT_INDEX\n\t\t// This should never happen!\n\t\tmsg = \"Error: Reference sequence has more than 2^64-1 characters!  \"\n\t\t      \"Please divide the reference into smaller chunks and index each \"\n\t\t\t  \"independently.\";\n#else\n\t\tmsg = \"Error: Reference sequence has more than 2^32-1 characters!  \"\n\t\t      \"Please build a large index by passing the --large-index option \"\n\t\t\t  \"to centrifuge-build\";\n#endif\n\t}\n\t\n\t~RefTooLongException() throw() {}\n\t\n\tconst char* what() const throw() {\n\t\treturn msg.c_str();\n\t}\n\nprotected:\n\t\n\tstring msg;\n\t\n};\n\n/**\n * Encapsulates a stretch of the reference containing only unambiguous\n * characters.  
From an ordered list of RefRecords, one can (almost)\n * deduce the \"shape\" of the reference sequences (almost because we\n * lose information about stretches of ambiguous characters at the end\n * of reference sequences).\n */\nstruct RefRecord {\n\tRefRecord() : off(), len(), first() { }\n\tRefRecord(TIndexOffU _off, TIndexOffU _len, bool _first) :\n\t\toff(_off), len(_len), first(_first)\n\t{ }\n\n\tRefRecord(FILE *in, bool swap) {\n\t\tassert(in != NULL);\n\t\tif(!fread(&off, OFF_SIZE, 1, in)) {\n\t\t\tcerr << \"Error reading RefRecord offset from FILE\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tif(swap) off = endianSwapIndex(off);\n\t\tif(!fread(&len, OFF_SIZE, 1, in)) {\n\t\t\tcerr << \"Error reading RefRecord length from FILE\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tif(swap) len = endianSwapIndex(len);\n\t\tfirst = fgetc(in) ? true : false;\n\t}\n\n\tvoid write(std::ostream& out, bool be) {\n\t\twriteIndex<TIndexOffU>(out, off, be);\n\t\twriteIndex<TIndexOffU>(out, len, be);\n\t\tout.put(first ? 
1 : 0);\n\t}\n\n\tTIndexOffU off; /// Offset of the first character in the record\n\tTIndexOffU len; /// Length of the record\n\tbool   first; /// Whether this record is the first for a reference sequence\n};\n\nenum {\n\tREF_READ_FORWARD = 0, // don't reverse reference sequence\n\tREF_READ_REVERSE,     // reverse entire reference sequence\n\tREF_READ_REVERSE_EACH // reverse each unambiguous stretch of reference\n};\n\n/**\n * Parameters governing treatment of references as they're read in.\n */\nstruct RefReadInParams {\n\tRefReadInParams(bool col, int r, bool nsToA, bool bisulf) :\n\t\tcolor(col), reverse(r), nsToAs(nsToA), bisulfite(bisulf) { }\n\t// extract colors from reference\n\tbool color;\n\t// reverse each reference sequence before passing it along\n\tint reverse;\n\t// convert ambiguous characters to As\n\tbool nsToAs;\n\t// bisulfite-convert the reference\n\tbool bisulfite;\n};\n\nextern RefRecord\nfastaRefReadSize(\n\tFileBuf& in,\n\tconst RefReadInParams& rparms,\n\tbool first,\n\tBitpairOutFileBuf* bpout = NULL);\n\nextern std::pair<size_t, size_t>\nfastaRefReadSizes(\n\tEList<FileBuf*>& in,\n\tEList<RefRecord>& recs,\n\tconst RefReadInParams& rparms,\n\tBitpairOutFileBuf* bpout,\n\tTIndexOff& numSeqs);\n\nextern void\nreverseRefRecords(\n\tconst EList<RefRecord>& src,\n\tEList<RefRecord>& dst,\n\tbool recursive = false,\n\tbool verbose = false);\n\n/**\n * Reads the next sequence from the given FASTA file and appends it to\n * the end of dst, optionally reversing it.\n */\ntemplate <typename TStr>\nstatic RefRecord fastaRefReadAppend(\n\tFileBuf& in,             // input file\n\tbool first,              // true iff this is the first record in the file\n\tTStr& dst,               // destination buf for parsed characters\n\tTIndexOffU& dstoff,          // index of next character in dst to assign\n\tRefReadInParams& rparms, // \n\tstring* name = NULL)     // put parsed FASTA name here\n{\n\tint c;\n\tstatic int lastc = '>';\n\tif(first) {\n\t\tc = 
in.getPastWhitespace();\n\t\tif(c != '>') {\n\t\t\tcerr << \"Reference file does not seem to be a FASTA file\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tlastc = c;\n\t}\n\tassert_neq(-1, lastc);\n\n\t// RefRecord params\n\tsize_t len = 0;\n\tsize_t off = 0;\n\tfirst = true;\n\n\tsize_t ilen = dstoff;\n\n\t// Chew up the id line; if the next line is either\n\t// another id line or a comment line, keep chewing\n\tint lc = -1; // last-DNA char variable for color conversion\n\tc = lastc;\n\tif(c == '>' || c == '#') {\n\t\tdo {\n\t\t\twhile (c == '#') {\n\t\t\t\tif((c = in.getPastNewline()) == -1) {\n\t\t\t\t\tlastc = -1;\n\t\t\t\t\tgoto bail;\n\t\t\t\t}\n\t\t\t}\n\t\t\tassert_eq('>', c);\n\t\t\twhile(true) {\n\t\t\t\tc = in.get();\n\t\t\t\tif(c == -1) {\n\t\t\t\t\tlastc = -1;\n\t\t\t\t\tgoto bail;\n\t\t\t\t}\n\t\t\t\tif(c == '\\n' || c == '\\r') {\n\t\t\t\t\twhile(c == '\\r' || c == '\\n') c = in.get();\n\t\t\t\t\tif(c == -1) {\n\t\t\t\t\t\tlastc = -1;\n\t\t\t\t\t\tgoto bail;\n\t\t\t\t\t}\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t\tif (name) name->push_back(c);\n\t\t\t}\n\t\t\t// c holds the first character on the line after the name\n\t\t\t// line\n\t\t\tif(c == '>') {\n\t\t\t\t// If there's another name line immediately after this one,\n\t\t\t\t// discard the previous name and start fresh with the new one\n\t\t\t\tif (name) name->clear();\n\t\t\t}\n\t\t} while (c == '>' || c == '#');\n\t} else {\n\t\tASSERT_ONLY(int cc = toupper(c));\n\t\tassert(cc != 'A' && cc != 'C' && cc != 'G' && cc != 'T');\n\t\tfirst = false;\n\t}\n\n\t// Skip over an initial stretch of gaps or ambiguous characters.\n\t// For colorspace we skip until we see two consecutive unambiguous\n\t// characters (i.e. 
the first unambiguous color).\n\twhile(true) {\n\t\tint cat = asc2dnacat[c];\n\t\tif(rparms.nsToAs && cat >= 2) {\n\t\t\tc = 'A';\n\t\t}\n\t\tint cc = toupper(c);\n\t\tif(rparms.bisulfite && cc == 'C') c = cc = 'T';\n\t\tif(cat == 1) {\n\t\t\t// This is a DNA character\n\t\t\tif(rparms.color) {\n\t\t\t\tif(lc != -1) {\n\t\t\t\t\t// Got two consecutive unambiguous DNAs\n\t\t\t\t\tbreak; // to read-in loop\n\t\t\t\t}\n\t\t\t\t// Keep going; we need two consecutive unambiguous DNAs\n\t\t\t\tlc = asc2dna[(int)c];\n\t\t\t\t// The 'if(off > 0)' takes care of the case where\n\t\t\t\t// the reference is entirely unambiguous and we don't\n\t\t\t\t// want to incorrectly increment off.\n\t\t\t\tif(off > 0) off++;\n\t\t\t} else {\n\t\t\t\tbreak; // to read-in loop\n\t\t\t}\n\t\t} else if(cat >= 2) {\n\t\t\tif(lc != -1 && off == 0) {\n\t\t\t\toff++;\n\t\t\t}\n\t\t\tlc = -1;\n\t\t\toff++; // skip it\n\t\t} else if(c == '>') {\n\t\t\tlastc = '>';\n\t\t\tgoto bail;\n\t\t}\n\t\tc = in.get();\n\t\tif(c == -1) {\n\t\t\tlastc = -1;\n\t\t\tgoto bail;\n\t\t}\n\t}\n\tif(first && rparms.color && off > 0) {\n\t\t// Handle the case where the first record has ambiguous\n\t\t// characters but we're in color space; one of those counts is\n\t\t// spurious\n\t\toff--;\n\t}\n\tassert(!rparms.color || lc != -1);\n\tassert_eq(1, asc2dnacat[c]);\n\n\t// in now points just past the first character of a sequence\n\t// line, and c holds the first character\n\twhile(true) {\n\t\t// Note: can't have a comment in the middle of a sequence,\n\t\t// though a comment can end a sequence\n\t\tint cat = asc2dnacat[c];\n\t\tassert_neq(2, cat);\n\t\tif(cat == 1) {\n\t\t\t// Consume it\n\t\t\tif(!rparms.color || lc != -1) len++;\n\t\t\t// Add it to reference buffer\n\t\t\tif(rparms.color) {\n\t\t\t\tdst.set((char)dinuc2color[asc2dna[(int)c]][lc], dstoff++);\n\t\t\t} else if(!rparms.color) {\n\t\t\t\tdst.set(asc2dna[c], dstoff++);\n\t\t\t}\n\t\t\tassert_lt((int)dst[dstoff-1], 4);\n\t\t\tlc = 
asc2dna[(int)c];\n\t\t}\n\t\tc = in.get();\n\t\tif(rparms.nsToAs && asc2dnacat[c] >= 2) c = 'A';\n\t\tif (c == -1 || c == '>' || c == '#' || asc2dnacat[c] >= 2) {\n\t\t\tlastc = c;\n\t\t\tbreak;\n\t\t}\n\t\tif(rparms.bisulfite && toupper(c) == 'C') c = 'T';\n\t}\n\n  bail:\n\t// Optionally reverse the portion that we just appended.\n\t// ilen = length of buffer before this last sequence was appended.\n\tif(rparms.reverse == REF_READ_REVERSE_EACH) {\n\t\t// Find limits of the portion we just appended\n\t\tsize_t nlen = dstoff;\n\t\tdst.reverseWindow(ilen, nlen);\n\t}\n\treturn RefRecord((TIndexOffU)off, (TIndexOffU)len, first);\n}\n\n#endif /*ndef REF_READ_H_*/\n"
  },
  {
    "path": "reference.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <string>\n#include <string.h>\n#include \"reference.h\"\n#include \"mem_ids.h\"\n\nusing namespace std;\n\n/**\n * Load from .3.gEbwt_ext/.4.gEbwt_ext Bowtie index files.\n */\nBitPairReference::BitPairReference(\n\tconst string& in,\n\tbool color,\n\tbool sanity,\n\tEList<string>* infiles,\n\tEList<SString<char> >* origs,\n\tbool infilesSeq,\n\tbool useMm,\n\tbool useShmem,\n\tbool mmSweep,\n\tbool verbose,\n\tbool startVerbose) :\n\tbuf_(NULL),\n\tsanityBuf_(NULL),\n\tloaded_(true),\n\tsanity_(sanity),\n\tuseMm_(useMm),\n\tuseShmem_(useShmem),\n\tverbose_(verbose)\n{\n\tstring s3 = in + \".3.\" + gEbwt_ext;\n\tstring s4 = in + \".4.\" + gEbwt_ext;\n\t\n\tFILE *f3, *f4;\n\tif((f3 = fopen(s3.c_str(), \"rb\")) == NULL) {\n\t    cerr << \"Could not open reference-string index file \" << s3 << \" for reading.\" << endl;\n\t\tcerr << \"This is most likely because your index was built with an older version\" << endl\n\t\t<< \"(<= 0.9.8.1) of bowtie-build.  
Please re-run bowtie-build to generate a new\" << endl\n\t\t<< \"index (or download one from the Bowtie website) and try again.\" << endl;\n\t\tloaded_ = false;\n\t\treturn;\n\t}\n    if((f4 = fopen(s4.c_str(), \"rb\"))  == NULL) {\n        cerr << \"Could not open reference-string index file \" << s4 << \" for reading.\" << endl;\n\t\tloaded_ = false;\n\t\treturn;\n\t}\n#ifdef BOWTIE_MM\n    char *mmFile = NULL;\n\tif(useMm_) {\n\t\tif(verbose_ || startVerbose) {\n\t\t\tcerr << \"  Memory-mapping reference index file \" << s4.c_str() << \": \";\n\t\t\tlogTime(cerr);\n\t\t}\n\t\tstruct stat sbuf;\n\t\tif (stat(s4.c_str(), &sbuf) == -1) {\n\t\t\tperror(\"stat\");\n\t\t\tcerr << \"Error: Could not stat index file \" << s4.c_str() << \" prior to memory-mapping\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tmmFile = (char*)mmap((void *)0, (size_t)sbuf.st_size,\n\t\t\t\t     PROT_READ, MAP_SHARED, fileno(f4), 0);\n\t\tif(mmFile == (void *)(-1) || mmFile == NULL) {\n\t\t\tperror(\"mmap\");\n\t\t\tcerr << \"Error: Could not memory-map the index file \" << s4.c_str() << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tif(mmSweep) {\n\t\t\tTIndexOff sum = 0;\n\t\t\tfor(off_t i = 0; i < sbuf.st_size; i += 1024) {\n\t\t\t\tsum += (TIndexOff) mmFile[i];\n\t\t\t}\n\t\t\tif(startVerbose) {\n\t\t\t\tcerr << \"  Swept the memory-mapped ref index file; checksum: \" << sum << \": \";\n\t\t\t\tlogTime(cerr);\n\t\t\t}\n\t\t}\n\t}\n#endif\n\t\n\t// Read endianness sentinel, set 'swap'\n\tuint32_t one;\n\tbool swap = false;\n\tone = readIndex<int32_t>(f3, swap);\n\tif(one != 1) {\n\t\tif(useMm_) {\n\t\t\tcerr << \"Error: Can't use memory-mapped files when the index is the opposite endianness\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_eq(0x1000000, one);\n\t\tswap = true; // have to endian swap U32s\n\t}\n\t\n\t// Read # records\n\tTIndexOffU sz;\n\tsz = readIndex<TIndexOffU>(f3, swap);\n\tif(sz == 0) {\n\t\tcerr << \"Error: number of reference records is 0 in \" << s3.c_str() << endl;\n\t\tthrow 
1;\n\t}\n\t\n\t// Read records\n\tnrefs_ = 0;\n\t\n\t// Cumulative count of all unambiguous characters on a per-\n\t// stretch 8-bit alignment (i.e. count of bytes we need to\n\t// allocate in buf_)\n\tTIndexOffU cumsz = 0;\n\tTIndexOffU cumlen = 0;\n\t// For each unambiguous stretch...\n\tfor(TIndexOffU i = 0; i < sz; i++) {\n\t\trecs_.push_back(RefRecord(f3, swap));\n\t\tif(recs_.back().first) {\n\t\t\t// This is the first record for this reference sequence (and the\n\t\t\t// last record for the one before)\n\t\t\trefRecOffs_.push_back((TIndexOffU)recs_.size()-1);\n\t\t\t// refOffs_ links each reference sequence with the total number of\n\t\t\t// unambiguous characters preceding it in the pasted reference\n\t\t\trefOffs_.push_back(cumsz);\n\t\t\tif(nrefs_ > 0) {\n\t\t\t\t// refLens_ links each reference sequence with the total number\n\t\t\t\t// of ambiguous and unambiguous characters in it.\n\t\t\t\trefLens_.push_back(cumlen);\n\t\t\t}\n\t\t\tcumlen = 0;\n\t\t\tnrefs_++;\n\t\t} else if(i == 0) {\n\t\t\tcerr << \"First record in reference index file was not marked as \"\n\t\t\t     << \"'first'\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tcumUnambig_.push_back(cumsz);\n\t\tcumRefOff_.push_back(cumlen);\n\t\tcumsz += recs_.back().len;\n\t\tcumlen += recs_.back().off;\n\t\tcumlen += recs_.back().len;\n\t}\n\tif(verbose_ || startVerbose) {\n\t\tcerr << \"Read \" << nrefs_ << \" reference strings from \"\n\t\t     << sz << \" records: \";\n\t\tlogTime(cerr);\n\t}\n\t// Store a cap entry for the end of the last reference seq\n\trefRecOffs_.push_back((TIndexOffU)recs_.size());\n\trefOffs_.push_back(cumsz);\n\trefLens_.push_back(cumlen);\n\tcumUnambig_.push_back(cumsz);\n\tcumRefOff_.push_back(cumlen);\n\tbufSz_ = cumsz;\n\tassert_eq(nrefs_, refLens_.size());\n\tassert_eq(sz, recs_.size());\n\tif (f3 != NULL) fclose(f3); // done with .3.gEbwt_ext file\n\t// Round cumsz up to nearest byte boundary\n\tif((cumsz & 3) != 0) {\n\t\tcumsz += (4 - (cumsz & 3));\n\t}\n\tbufAllocSz_ = 
cumsz >> 2;\n\tassert_eq(0, cumsz & 3); // should be rounded up to nearest 4\n\tif(useMm_) {\n#ifdef BOWTIE_MM\n\t\tbuf_ = (uint8_t*)mmFile;\n\t\tif(sanity_) {\n\t\t\tFILE *ftmp = fopen(s4.c_str(), \"rb\");\n\t\t\tsanityBuf_ = new uint8_t[cumsz >> 2];\n\t\t\tsize_t ret = fread(sanityBuf_, 1, cumsz >> 2, ftmp);\n\t\t\tif(ret != (cumsz >> 2)) {\n\t\t\t\tcerr << \"Only read \" << ret << \" bytes (out of \" << (cumsz >> 2) << \") from reference index file \" << s4.c_str() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\tfclose(ftmp);\n\t\t\tfor(size_t i = 0; i < (cumsz >> 2); i++) {\n\t\t\t\tassert_eq(sanityBuf_[i], buf_[i]);\n\t\t\t}\n\t\t}\n#else\n\t\tcerr << \"Shouldn't be at \" << __FILE__ << \":\" << __LINE__ << \" without BOWTIE_MM defined\" << endl;\n\t\tthrow 1;\n#endif\n\t} else {\n\t\tbool shmemLeader = true;\n\t\tif(!useShmem_) {\n\t\t\t// Allocate a buffer to hold the reference string\n\t\t\ttry {\n\t\t\t\tbuf_ = new uint8_t[cumsz >> 2];\n\t\t\t\tif(buf_ == NULL) throw std::bad_alloc();\n\t\t\t} catch(std::bad_alloc& e) {\n\t\t\t\tcerr << \"Error: Ran out of memory allocating space for the bitpacked reference.  Please\" << endl\n\t\t\t\t<< \"re-run on a computer with more memory.\" << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t} else {\n\t\t\tshmemLeader = ALLOC_SHARED_U8(\n\t\t\t\t\t\t\t\t\t\t  (s4 + \"[ref]\"), (cumsz >> 2), &buf_,\n\t\t\t\t\t\t\t\t\t\t  \"ref\", (verbose_ || startVerbose));\n\t\t}\n\t\tif(shmemLeader) {\n\t\t\t// Open the bitpair-encoded reference file\n\t\t\tFILE *f4 = fopen(s4.c_str(), \"rb\");\n\t\t\tif(f4 == NULL) {\n\t\t\t\tcerr << \"Could not open reference-string index file \" << s4.c_str() << \" for reading.\" << endl;\n\t\t\t\tcerr << \"This is most likely because your index was built with an older version\" << endl\n\t\t\t\t<< \"(<= 0.9.8.1) of bowtie-build.  
Please re-run bowtie-build to generate a new\" << endl\n\t\t\t\t<< \"index (or download one from the Bowtie website) and try again.\" << endl;\n\t\t\t\tloaded_ = false;\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t// Read the whole thing in\n\t\t\tsize_t ret = fread(buf_, 1, cumsz >> 2, f4);\n\t\t\t// Didn't read all of it?\n\t\t\tif(ret != (cumsz >> 2)) {\n\t\t\t\tcerr << \"Only read \" << ret << \" bytes (out of \" << (cumsz >> 2) << \") from reference index file \" << s4.c_str() << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t\t// Make sure there's no more\n\t\t\tchar c;\n\t\t\tret = fread(&c, 1, 1, f4);\n\t\t\tassert_eq(0, ret); // should have failed\n\t\t\tfclose(f4);\n#ifdef BOWTIE_SHARED_MEM\n\t\t\tif(useShmem_) NOTIFY_SHARED(buf_, (cumsz >> 2));\n#endif\n\t\t} else {\n#ifdef BOWTIE_SHARED_MEM\n\t\t\tif(useShmem_) WAIT_SHARED(buf_, (cumsz >> 2));\n#endif\n\t\t}\n\t}\n\t\n\t// Populate byteToU32_\n\tbool big = currentlyBigEndian();\n\tfor(int i = 0; i < 256; i++) {\n\t\tuint32_t word = 0;\n\t\tif(big) {\n\t\t\tword |= ((i >> 0) & 3) << 24;\n\t\t\tword |= ((i >> 2) & 3) << 16;\n\t\t\tword |= ((i >> 4) & 3) << 8;\n\t\t\tword |= ((i >> 6) & 3) << 0;\n\t\t} else {\n\t\t\tword |= ((i >> 0) & 3) << 0;\n\t\t\tword |= ((i >> 2) & 3) << 8;\n\t\t\tword |= ((i >> 4) & 3) << 16;\n\t\t\tword |= ((i >> 6) & 3) << 24;\n\t\t}\n\t\tbyteToU32_[i] = word;\n\t}\n\t\n#ifndef NDEBUG\n\tif(sanity_) {\n\t\t// Compare the sequence we just read from the compact index\n\t\t// file to the true reference sequence.\n\t\tEList<SString<char> > *os; // for holding references\n\t\tEList<SString<char> > osv(DEBUG_CAT); // for holding ref seqs\n\t\tEList<SString<char> > osn(DEBUG_CAT); // for holding ref names\n\t\tEList<size_t> osvLen(DEBUG_CAT); // for holding ref seq lens\n\t\tEList<size_t> osnLen(DEBUG_CAT); // for holding ref name lens\n\t\tSStringExpandable<uint32_t> tmp_destU32_;\n\t\tif(infiles != NULL) {\n\t\t\tif(infilesSeq) {\n\t\t\t\tfor(size_t i = 0; i < infiles->size(); i++) {\n\t\t\t\t\t// Remove 
initial backslash; that's almost\n\t\t\t\t\t// certainly being used to protect the first\n\t\t\t\t\t// character of the sequence from getopts (e.g.,\n\t\t\t\t\t// when the first char is -)\n\t\t\t\t\tif((*infiles)[i].at(0) == '\\\\') {\n\t\t\t\t\t\t(*infiles)[i].erase(0, 1);\n\t\t\t\t\t}\n\t\t\t\t\tosv.push_back(SString<char>((*infiles)[i]));\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tparseFastas(*infiles, osn, osnLen, osv, osvLen);\n\t\t\t}\n\t\t\tos = &osv;\n\t\t} else {\n\t\t\tassert(origs != NULL);\n\t\t\tos = origs;\n\t\t}\n\t\t\n\t\t// Go through the loaded reference files base-by-base and\n\t\t// sanity check against what we get by calling getBase and\n\t\t// getStretch\n\t\tfor(size_t i = 0; i < os->size(); i++) {\n\t\t\tsize_t olen = ((*os)[i]).length();\n\t\t\tsize_t olenU32 = (olen + 12) / 4;\n\t\t\tuint32_t *buf = new uint32_t[olenU32];\n\t\t\tuint8_t *bufadj = (uint8_t*)buf;\n\t\t\tbufadj += getStretch(buf, i, 0, olen, tmp_destU32_);\n\t\t\tfor(size_t j = 0; j < olen; j++) {\n\t\t\t\tassert_eq((int)(*os)[i][j], (int)bufadj[j]);\n\t\t\t\tassert_eq((int)(*os)[i][j], (int)getBase(i, j));\n\t\t\t}\n\t\t\tdelete[] buf;\n\t\t}\n\t}\n#endif\n}\n\nBitPairReference::~BitPairReference() {\n\tif(buf_ != NULL && !useMm_ && !useShmem_) delete[] buf_;\n\tif(sanityBuf_ != NULL) delete[] sanityBuf_;\n}\n\n/**\n * Return a single base of the reference.  Calling this repeatedly\n * is not an efficient way to retrieve bases from the reference;\n * use loadStretch() instead.\n *\n * This implementation scans linearly through the records for the\n * unambiguous stretches of the target reference sequence.  
When\n * there are many records, binary search would be more appropriate.\n */\nint BitPairReference::getBase(size_t tidx, size_t toff) const {\n\tuint64_t reci = refRecOffs_[tidx];   // first record for target reference sequence\n\tuint64_t recf = refRecOffs_[tidx+1]; // last record (exclusive) for target seq\n\tassert_gt(recf, reci);\n\tuint64_t bufOff = refOffs_[tidx];\n\tuint64_t off = 0;\n\t// For all records pertaining to the target reference sequence...\n\tfor(uint64_t i = reci; i < recf; i++) {\n\t\tassert_geq(toff, off);\n\t\toff += recs_[i].off;\n\t\tif(toff < off) {\n\t\t\treturn 4;\n\t\t}\n\t\tassert_geq(toff, off);\n\t\tuint64_t recOff = off + recs_[i].len;\n\t\tif(toff < recOff) {\n\t\t\ttoff -= off;\n\t\t\tbufOff += (uint64_t)toff;\n\t\t\tassert_lt(bufOff, bufSz_);\n\t\t\tconst uint64_t bufElt = (bufOff) >> 2;\n\t\t\tconst uint64_t shift = (bufOff & 3) << 1;\n\t\t\treturn ((buf_[bufElt] >> shift) & 3);\n\t\t}\n\t\tbufOff += recs_[i].len;\n\t\toff = recOff;\n\t\tassert_geq(toff, off);\n\t} // end for loop over records\n\treturn 4;\n}\n\n/**\n * Load a stretch of the reference string into memory at 'dest'.\n *\n * This implementation scans linearly through the records for the\n * unambiguous stretches of the target reference sequence.  
When\n * there are many records, binary search would be more appropriate.\n */\nint BitPairReference::getStretchNaive(\n\tuint32_t *destU32,\n\tsize_t tidx,\n\tsize_t toff,\n\tsize_t count) const\n{\n\tuint8_t *dest = (uint8_t*)destU32;\n\tuint64_t reci = refRecOffs_[tidx];   // first record for target reference sequence\n\tuint64_t recf = refRecOffs_[tidx+1]; // last record (exclusive) for target seq\n\tassert_gt(recf, reci);\n\tuint64_t cur = 0;\n\tuint64_t bufOff = refOffs_[tidx];\n\tuint64_t off = 0;\n\t// For all records pertaining to the target reference sequence...\n\tfor(uint64_t i = reci; i < recf; i++) {\n\t\tassert_geq(toff, off);\n\t\toff += recs_[i].off;\n\t\tfor(; toff < off && count > 0; toff++) {\n\t\t\tdest[cur++] = 4;\n\t\t\tcount--;\n\t\t}\n\t\tif(count == 0) break;\n\t\tassert_geq(toff, off);\n\t\tif(toff < off + recs_[i].len) {\n\t\t\tbufOff += (TIndexOffU)(toff - off); // move bufOff pointer forward\n\t\t} else {\n\t\t\tbufOff += recs_[i].len;\n\t\t}\n\t\toff += recs_[i].len;\n\t\tfor(; toff < off && count > 0; toff++) {\n\t\t\tassert_lt(bufOff, bufSz_);\n\t\t\tconst uint64_t bufElt = (bufOff) >> 2;\n\t\t\tconst uint64_t shift = (bufOff & 3) << 1;\n\t\t\tdest[cur++] = (buf_[bufElt] >> shift) & 3;\n\t\t\tbufOff++;\n\t\t\tcount--;\n\t\t}\n\t\tif(count == 0) break;\n\t\tassert_geq(toff, off);\n\t} // end for loop over records\n\t// If any chars are left after scanning all the records,\n\t// they must be ambiguous\n\twhile(count > 0) {\n\t\tcount--;\n\t\tdest[cur++] = 4;\n\t}\n\tassert_eq(0, count);\n\treturn 0;\n}\n\n/**\n * Load a stretch of the reference string into memory at 'dest'.\n */\nint BitPairReference::getStretch(\n\tuint32_t *destU32,\n\tsize_t tidx,\n\tsize_t toff,\n\tsize_t count\n\tASSERT_ONLY(, SStringExpandable<uint32_t>& destU32_2)) const\n{\n\tASSERT_ONLY(size_t origCount = count);\n\tASSERT_ONLY(size_t origToff = toff);\n\tif(count == 0) return 0;\n\tuint8_t *dest = (uint8_t*)destU32;\n#ifndef 
NDEBUG\n\tdestU32_2.clear();\n\tuint8_t *dest_2 = NULL;\n\tint off2;\n\tif((rand() % 10) == 0) {\n\t\tdestU32_2.resize((origCount >> 2) + 2);\n\t\toff2 = getStretchNaive(destU32_2.wbuf(), tidx, origToff, origCount);\n\t\tdest_2 = ((uint8_t*)destU32_2.wbuf()) + off2;\n\t}\n#endif\n\tdestU32[0] = 0x04040404; // Add Ns, which we might end up using later\n\tuint64_t reci = refRecOffs_[tidx];   // first record for target reference sequence\n\tuint64_t recf = refRecOffs_[tidx+1]; // last record (exclusive) for target seq\n\tassert_gt(recf, reci);\n\tuint64_t cur = 4; // keep a cushion of 4 bases at the beginning\n\tuint64_t bufOff = refOffs_[tidx];\n\tuint64_t off = 0;\n\tint64_t offset = 4;\n\tbool firstStretch = true;\n\tbool binarySearched = false;\n\tuint64_t left  = reci;\n\tuint64_t right = recf;\n\tuint64_t mid   = 0;\n\t// For all records pertaining to the target reference sequence...\n\tfor(uint64_t i = reci; i < recf; i++) {\n\t\tuint64_t origBufOff = bufOff;\n\t\tassert_geq(toff, off);\n\t\tif (firstStretch && recf > reci + 16){\n\t\t\t// binary search finds smallest i s.t. 
toff >= cumRefOff_[i]\n\t\t\twhile (left < right-1) {\n\t\t\t\tmid = left + ((right - left) >> 1);\n\t\t\t\tif (cumRefOff_[mid] <= toff)\n\t\t\t\t\tleft = mid;\n\t\t\t\telse\n\t\t\t\t\tright = mid;\n\t\t\t}\n\t\t\toff = cumRefOff_[left];\n\t\t\tbufOff = cumUnambig_[left];\n\t\t\torigBufOff = bufOff;\n\t\t\ti = left;\n\t\t\tassert(cumRefOff_[i+1] == 0 || cumRefOff_[i+1] > toff);\n\t\t\tbinarySearched = true;\n\t\t}\n\t\toff += recs_[i].off; // skip Ns at beginning of stretch\n\t\tassert_gt(count, 0);\n\t\tif(toff < off) {\n\t\t\tsize_t cpycnt = min((size_t)(off - toff), count);\n\t\t\tmemset(&dest[cur], 4, cpycnt);\n\t\t\tcount -= cpycnt;\n\t\t\ttoff += cpycnt;\n\t\t\tcur += cpycnt;\n\t\t\tif(count == 0) break;\n\t\t}\n\t\tassert_geq(toff, off);\n\t\tif(toff < off + recs_[i].len) {\n\t\t\tbufOff += toff - off; // move bufOff pointer forward\n\t\t} else {\n\t\t\tbufOff += recs_[i].len;\n\t\t}\n\t\toff += recs_[i].len;\n\t\tassert(off == cumRefOff_[i+1] || cumRefOff_[i+1] == 0);\n\t\tassert(!binarySearched || toff < off);\n\t\t_unused(binarySearched); //make production build happy\n\t\tif(toff < off) {\n\t\t\tif(firstStretch) {\n\t\t\t\tif(toff + 8 < off && count > 8) {\n\t\t\t\t\t// We already added some Ns, so we have to do\n\t\t\t\t\t// a fixup at the beginning of the buffer so\n\t\t\t\t\t// that we can start clobbering at cur >> 2\n\t\t\t\t\tif(cur & 3) {\n\t\t\t\t\t\toffset -= (cur & 3);\n\t\t\t\t\t}\n\t\t\t\t\tuint64_t curU32 = cur >> 2;\n\t\t\t\t\t// Do the initial few bases\n\t\t\t\t\tif(bufOff & 3) {\n\t\t\t\t\t\tconst uint64_t bufElt = (bufOff) >> 2;\n\t\t\t\t\t\tconst int64_t low2 = bufOff & 3;\n\t\t\t\t\t\t// Lots of cache misses on the following line\n\t\t\t\t\t\tdestU32[curU32] = byteToU32_[buf_[bufElt]];\n\t\t\t\t\t\tfor(int j = 0; j < low2; j++) {\n\t\t\t\t\t\t\t((char *)(&destU32[curU32]))[j] = 4;\n\t\t\t\t\t\t}\n\t\t\t\t\t\tcurU32++;\n\t\t\t\t\t\toffset += low2;\n\t\t\t\t\t\tconst int64_t chars = 4 - low2;\n\t\t\t\t\t\tcount -= 
chars;\n\t\t\t\t\t\tbufOff += chars;\n\t\t\t\t\t\ttoff += chars;\n\t\t\t\t\t}\n\t\t\t\t\tassert_eq(0, bufOff & 3);\n\t\t\t\t\tuint64_t bufOffU32 = bufOff >> 2;\n\t\t\t\t\tuint64_t countLim = count >> 2;\n\t\t\t\t\tuint64_t offLim = ((off - (toff + 4)) >> 2);\n\t\t\t\t\tuint64_t lim = min(countLim, offLim);\n\t\t\t\t\t// Do the fast thing for as far as possible\n\t\t\t\t\tfor(uint64_t j = 0; j < lim; j++) {\n\t\t\t\t\t\t// Lots of cache misses on the following line\n\t\t\t\t\t\tdestU32[curU32] = byteToU32_[buf_[bufOffU32++]];\n#ifndef NDEBUG\n\t\t\t\t\t\tif(dest_2 != NULL) {\n\t\t\t\t\t\t\tassert_eq(dest[(curU32 << 2) + 0], dest_2[(curU32 << 2) - offset + 0]);\n\t\t\t\t\t\t\tassert_eq(dest[(curU32 << 2) + 1], dest_2[(curU32 << 2) - offset + 1]);\n\t\t\t\t\t\t\tassert_eq(dest[(curU32 << 2) + 2], dest_2[(curU32 << 2) - offset + 2]);\n\t\t\t\t\t\t\tassert_eq(dest[(curU32 << 2) + 3], dest_2[(curU32 << 2) - offset + 3]);\n\t\t\t\t\t\t}\n#endif\n\t\t\t\t\t\tcurU32++;\n\t\t\t\t\t}\n\t\t\t\t\ttoff += (lim << 2);\n\t\t\t\t\tassert_leq(toff, off);\n\t\t\t\t\tassert_leq((lim << 2), count);\n\t\t\t\t\tcount -= (lim << 2);\n\t\t\t\t\tbufOff = bufOffU32 << 2;\n\t\t\t\t\tcur = curU32 << 2;\n\t\t\t\t}\n\t\t\t\t// Do the slow thing for the rest\n\t\t\t\tfor(; toff < off && count > 0; toff++) {\n\t\t\t\t\tassert_lt(bufOff, bufSz_);\n\t\t\t\t\tconst uint64_t bufElt = (bufOff) >> 2;\n\t\t\t\t\tconst uint64_t shift = (bufOff & 3) << 1;\n\t\t\t\t\tdest[cur++] = (buf_[bufElt] >> shift) & 3;\n\t\t\t\t\tbufOff++;\n\t\t\t\t\tcount--;\n\t\t\t\t}\n\t\t\t\tfirstStretch = false;\n\t\t\t} else {\n\t\t\t\t// Do the slow thing\n\t\t\t\tfor(; toff < off && count > 0; toff++) {\n\t\t\t\t\tassert_lt(bufOff, bufSz_);\n\t\t\t\t\tconst uint64_t bufElt = (bufOff) >> 2;\n\t\t\t\t\tconst uint64_t shift = (bufOff & 3) << 1;\n\t\t\t\t\tdest[cur++] = (buf_[bufElt] >> shift) & 3;\n\t\t\t\t\tbufOff++;\n\t\t\t\t\tcount--;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif(count == 0) break;\n\t\tassert_eq(recs_[i].len, bufOff - 
origBufOff);\n\t\t_unused(origBufOff); // make production build happy\n\t\tassert_geq(toff, off);\n\t} // end for loop over records\n\t// If any chars are left after scanning all the records,\n\t// they must be ambiguous\n\twhile(count > 0) {\n\t\tcount--;\n\t\tdest[cur++] = 4;\n\t}\n\tassert_eq(0, count);\n\treturn (int)offset;\n}\n\n\n/**\n * Parse the input fasta files, populating the szs list and writing the\n * .3.gEbwt_ext and .4.gEbwt_ext portions of the index as we go.\n */\npair<size_t, size_t>\nBitPairReference::szsFromFasta(\n\tEList<FileBuf*>& is,\n\tconst string& outfile,\n\tbool bigEndian,\n\tconst RefReadInParams& refparams,\n\tEList<RefRecord>& szs,\n\tbool sanity)\n{\n\tRefReadInParams parms = refparams;\n\tstd::pair<size_t, size_t> sztot;\n\tif(!outfile.empty()) {\n\t\tstring file3 = outfile + \".3.\" + gEbwt_ext;\n\t\tstring file4 = outfile + \".4.\" + gEbwt_ext;\n\t\t// Open output stream for the '.3.gEbwt_ext' file which will\n\t\t// hold the size records.\n\t\tofstream fout3(file3.c_str(), ios::binary);\n\t\tif(!fout3.good()) {\n\t\t\tcerr << \"Could not open index file for writing: \\\"\" << file3.c_str() << \"\\\"\" << endl\n\t\t\t\t << \"Please make sure the directory exists and that permissions allow writing by\" << endl\n\t\t\t\t << \"Bowtie.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tBitpairOutFileBuf bpout(file4.c_str());\n\t\t// Read in the sizes of all the unambiguous stretches of the genome\n\t\t// into a vector of RefRecords.  
The input streams are reset once\n\t\t// it's done.\n\t\twriteIndex<int32_t>(fout3, 1, bigEndian); // endianness sentinel\n\t\tbool color = parms.color;\n\t\tif(color) {\n\t\t\tparms.color = false;\n\t\t\t// Make sure the .3.gEbwt_ext and .4.gEbwt_ext files contain\n\t\t\t// nucleotides; not colors\n\t\t\tTIndexOff numSeqs = 0;\n\t\t\tASSERT_ONLY(std::pair<size_t, size_t> sztot2 =)\n\t\t\tfastaRefReadSizes(is, szs, parms, &bpout, numSeqs);\n\t\t\tparms.color = true;\n\t\t\twriteIndex<TIndexOffU>(fout3, (TIndexOffU)szs.size(), bigEndian); // write # records\n\t\t\tfor(size_t i = 0; i < szs.size(); i++) {\n\t\t\t\tszs[i].write(fout3, bigEndian);\n\t\t\t}\n\t\t\tszs.clear();\n\t\t\t// Now read in the colorspace size records; these are\n\t\t\t// the ones that were indexed\n\t\t\tTIndexOff numSeqs2 = 0;\n\t\t\tsztot = fastaRefReadSizes(is, szs, parms, NULL, numSeqs2);\n\t\t\tassert_eq(numSeqs, numSeqs2);\n\t\t\tassert_eq(sztot2.second, sztot.second + numSeqs);\n\t\t} else {\n\t\t\tTIndexOff numSeqs = 0;\n\t\t\tsztot = fastaRefReadSizes(is, szs, parms, &bpout, numSeqs);\n\t\t\twriteIndex<TIndexOffU>(fout3, (TIndexOffU)szs.size(), bigEndian); // write # records\n\t\t\tfor(size_t i = 0; i < szs.size(); i++) szs[i].write(fout3, bigEndian);\n\t\t}\n\t\tif(sztot.first == 0) {\n\t\t\tcerr << \"Error: No unambiguous stretches of characters in the input.  
Aborting...\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tassert_gt(sztot.first, 0);\n\t\tassert_gt(sztot.second, 0);\n\t\tbpout.close();\n\t\tfout3.close();\n\t} else {\n\t\t// Read in the sizes of all the unambiguous stretches of the\n\t\t// genome into a vector of RefRecords\n\t\tTIndexOff numSeqs = 0;\n\t\tsztot = fastaRefReadSizes(is, szs, parms, NULL, numSeqs);\n#ifndef NDEBUG\n\t\tif(parms.color) {\n\t\t\tparms.color = false;\n\t\t\tEList<RefRecord> szs2(EBWTB_CAT);\n\t\t\tTIndexOff numSeqs2 = 0;\n\t\t\tASSERT_ONLY(std::pair<size_t, size_t> sztot2 =)\n\t\t\tfastaRefReadSizes(is, szs2, parms, NULL, numSeqs2);\n\t\t\tassert_eq(numSeqs, numSeqs2);\n\t\t\t// One less color than base\n\t\t\tassert_geq(sztot2.second, sztot.second + numSeqs);\n\t\t\tparms.color = true;\n\t\t}\n#endif\n\t}\n\treturn sztot;\n}\n"
  },
  {
    "path": "reference.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef REFERENCE_H_\n#define REFERENCE_H_\n\n#include <stdexcept>\n#include <fcntl.h>\n#include <sys/stat.h>\n#include <utility>\n#ifdef BOWTIE_MM\n#include <sys/mman.h>\n#include <sys/shm.h>\n#endif\n#include \"endian_swap.h\"\n#include \"ref_read.h\"\n#include \"sequence_io.h\"\n#include \"mm.h\"\n#include \"shmem.h\"\n#include \"timer.h\"\n#include \"sstring.h\"\n#include \"btypes.h\"\n\n\n/**\n * Concrete reference representation that bulk-loads the reference from\n * the bit-pair-compacted binary file and stores it in memory also in\n * bit-pair-compacted format.  The user may request reference\n * characters either on a per-character bases or by \"stretch\" using\n * getBase(...) and getStretch(...) respectively.\n *\n * Most of the complexity in this class is due to the fact that we want\n * to represent references with ambiguous (non-A/C/G/T) characters but\n * we don't want to use more than two bits per base.  This means we\n * need a way to encode the ambiguous stretches of the reference in a\n * way that is external to the bitpair sequence.  To accomplish this,\n * we use the RefRecords vector, which is stored in the .3.ebwt index\n * file.  
The bitpairs themselves are stored in the .4.ebwt index file.\n *\n * Once it has been loaded, a BitPairReference is read-only, and is\n * safe for many threads to access at once.\n */\nclass BitPairReference {\n\npublic:\n\t/**\n\t * Load from .3.ebwt/.4.ebwt Bowtie index files.\n\t */\n\tBitPairReference(\n\t\tconst string& in,\n\t\tbool color,\n\t\tbool sanity = false,\n\t\tEList<string>* infiles = NULL,\n\t\tEList<SString<char> >* origs = NULL,\n\t\tbool infilesSeq = false,\n\t\tbool useMm = false,\n\t\tbool useShmem = false,\n\t\tbool mmSweep = false,\n\t\tbool verbose = false,\n\t\tbool startVerbose = false);\n\n\t~BitPairReference();\n\n\t/**\n\t * Return a single base of the reference.  Calling this repeatedly\n\t * is not an efficient way to retrieve bases from the reference;\n\t * use getStretch() instead.\n\t *\n\t * This implementation scans linearly through the records for the\n\t * unambiguous stretches of the target reference sequence.  When\n\t * there are many records, binary search would be more appropriate.\n\t */\n\tint getBase(size_t tidx, size_t toff) const;\n\n\t/**\n\t * Load a stretch of the reference string into memory at 'dest'.\n\t *\n\t * This implementation scans linearly through the records for the\n\t * unambiguous stretches of the target reference sequence.  When\n\t * there are many records, binary search would be more appropriate.\n\t */\n\tint getStretchNaive(\n\t\tuint32_t *destU32,\n\t\tsize_t tidx,\n\t\tsize_t toff,\n\t\tsize_t count) const;\n\n\t/**\n\t * Load a stretch of the reference string into memory at 'dest'.\n\t *\n\t * This implementation scans linearly through the records for the\n\t * unambiguous stretches of the target reference sequence.  
When\n\t * there are many records, binary search would be more appropriate.\n\t */\n\tint getStretch(\n\t\tuint32_t *destU32,\n\t\tsize_t tidx,\n\t\tsize_t toff,\n\t\tsize_t count\n\t\tASSERT_ONLY(, SStringExpandable<uint32_t>& destU32_2)) const;\n\n\t/**\n\t * Return the number of reference sequences.\n\t */\n\tTIndexOffU numRefs() const {\n\t\treturn nrefs_;\n\t}\n\n\t/**\n\t * Return the approximate length of a reference sequence (it might leave\n\t * off some Ns on the end).\n\t *\n\t * TODO: Is it still true that it might leave off Ns?\n\t */\n\tTIndexOffU approxLen(TIndexOffU elt) const {\n\t\tassert_lt(elt, nrefs_);\n\t\treturn refLens_[elt];\n\t}\n\n\t/**\n\t * Return true iff buf_ and all the vectors are populated.\n\t */\n\tbool loaded() const {\n\t\treturn loaded_;\n\t}\n\t\n\t/**\n\t * Given a reference sequence id, return its offset into the pasted\n\t * reference string; i.e., return the number of unambiguous nucleotides\n\t * preceding it.\n\t */\n\tTIndexOffU pastedOffset(TIndexOffU idx) const {\n\t\treturn refOffs_[idx];\n\t}\n\n\t/**\n\t * Parse the input fasta files, populating the szs list and writing the\n\t * .3.ebwt and .4.ebwt portions of the index as we go.\n\t */\n\tstatic std::pair<size_t, size_t>\n\tszsFromFasta(\n\t\tEList<FileBuf*>& is,\n\t\tconst string& outfile,\n\t\tbool bigEndian,\n\t\tconst RefReadInParams& refparams,\n\t\tEList<RefRecord>& szs,\n\t\tbool sanity);\n\t\nprotected:\n\n\tuint32_t byteToU32_[256];\n\n\tEList<RefRecord> recs_;       /// records describing unambiguous stretches\n\t// following two lists are purely for the binary search in getStretch\n\tEList<TIndexOffU> cumUnambig_; // # unambig ref chars up to each record\n\tEList<TIndexOffU> cumRefOff_;  // # ref chars up to each record\n\tEList<TIndexOffU> refLens_;    /// approx lens of ref seqs (excludes trailing ambig chars)\n\tEList<TIndexOffU> refOffs_;    /// buf_ begin offsets per ref seq\n\tEList<TIndexOffU> refRecOffs_; /// record begin/end offsets per ref 
seq\n\tuint8_t *buf_;      /// the whole reference as a big bitpacked byte array\n\tuint8_t *sanityBuf_;/// for sanity-checking buf_\n\tTIndexOffU bufSz_;    /// size of buf_\n\tTIndexOffU bufAllocSz_;\n\tTIndexOffU nrefs_;    /// the number of reference sequences\n\tbool     loaded_;   /// whether it's loaded\n\tbool     sanity_;   /// do sanity checking\n\tbool     useMm_;    /// load the reference as a memory-mapped file\n\tbool     useShmem_; /// load the reference into shared memory\n\tbool     verbose_;\n\tASSERT_ONLY(SStringExpandable<uint32_t> tmp_destU32_);\n};\n\n#endif\n"
  },
  {
    "path": "scoring.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include \"scoring.h\"\n\nusing namespace std;\n\n/**\n * Return true iff a read of length 'rdlen' passes the score filter, i.e.,\n * has enough characters to rise above the minimum score threshold.\n */\nbool Scoring::scoreFilter(\n\tint64_t minsc,\n\tsize_t rdlen) const\n{\n\tint64_t sc = (int64_t)(rdlen * match(30));\n\treturn sc >= minsc;\n}\n\n/**\n * Given the score floor for valid alignments and the length of the read,\n * calculate the maximum possible number of read gaps that could occur in a\n * valid alignment.\n */\nint Scoring::maxReadGaps(\n\tint64_t minsc,\n\tsize_t rdlen) const\n{\n\t// Score if all characters match.  
TODO: remove assumption that match bonus\n\t// is independent of quality value.\n\tint64_t sc = (int64_t)(rdlen * match(30));\n\tassert_geq(sc, minsc);\n\t// Now convert matches to read gaps until sc falls below minsc\n\tbool first = true;\n\tint num = 0;\n\twhile(sc >= minsc) {\n\t\tif(first) {\n\t\t\tfirst = false;\n\t\t\t// Subtract both penalties\n\t\t\tsc -= readGapOpen();\n\t\t} else {\n\t\t\t// Subtract just the extension penalty\n\t\t\tsc -= readGapExtend();\n\t\t}\n\t\tnum++;\n\t}\n\tassert_gt(num, 0);\n\treturn num-1;\n}\n\n/**\n * Given the score floor for valid alignments and the length of the read,\n * calculate the maximum possible number of reference gaps that could occur\n * in a valid alignment.\n */\nint Scoring::maxRefGaps(\n\tint64_t minsc,\n\tsize_t rdlen) const\n{\n\t// Score if all characters match.  TODO: remove assumption that match bonus\n\t// is independent of quality value.\n\tint64_t sc = (int64_t)(rdlen * match(30));\n\tassert_geq(sc, minsc);\n\t// Now convert matches to reference gaps until sc falls below minsc\n\tbool first = true;\n\tint num = 0;\n\twhile(sc >= minsc) {\n\t\tsc -= match(30);\n\t\tif(first) {\n\t\t\tfirst = false;\n\t\t\t// Subtract both penalties\n\t\t\tsc -= refGapOpen();\n\t\t} else {\n\t\t\t// Subtract just the extension penalty\n\t\t\tsc -= refGapExtend();\n\t\t}\n\t\tnum++;\n\t}\n\tassert_gt(num, 0);\n\treturn num-1;\n}\n\n/**\n * Given a read sequence, return true iff the read passes the N filter.\n * The N filter rejects reads with more Ns than the ceiling given by nCeil.\n */\nbool Scoring::nFilter(const BTDnaString& rd, size_t& ns) const {\n\tsize_t rdlen = rd.length();\n\tsize_t maxns = nCeil.f<size_t>((double)rdlen);\n\tassert_geq(rd.length(), 0);\n\tfor(size_t i = 0; i < rdlen; i++) {\n\t\tif(rd[i] == 4) {\n\t\t\tns++;\n\t\t\tif(ns > maxns) {\n\t\t\t\treturn false; // doesn't pass\n\t\t\t}\n\t\t}\n\t}\n\treturn true; // passes\n}\n\n/**\n * Given a read sequence, return true iff the read passes the N filter.\n * The N 
filter rejects reads with more Ns than the ceiling given by nCeil.\n *\n * For paired-end reads, there is a question of how to apply the filter.\n * The filter could be applied to both mates separately, which might then\n * prevent paired-end alignment.  Or the filter could be applied to the\n * reads as though they're concatenated together.  The latter approach has\n * pros and cons.  The pro is that we can use paired-end information to\n * recover alignments for mates that would not have passed the N filter on\n * their own.  The con is that we might not want to do that, since the\n * non-N portion of the bad mate might contain particularly unreliable\n * information.\n */\nvoid Scoring::nFilterPair(\n\tconst BTDnaString* rd1, // mate 1\n\tconst BTDnaString* rd2, // mate 2\n\tsize_t& ns1,            // # Ns in mate 1\n\tsize_t& ns2,            // # Ns in mate 2\n\tbool& filt1,            // true -> mate 1 passes the filter\n\tbool& filt2)            // true -> mate 2 passes the filter\n\tconst\n{\n\t// Both fail to pass by default\n\tfilt1 = filt2 = false;\n\tif(rd1 != NULL && rd2 != NULL && ncatpair) {\n\t\tsize_t rdlen1 = rd1->length();\n\t\tsize_t rdlen2 = rd2->length();\n\t\tsize_t maxns = nCeil.f<size_t>((double)(rdlen1 + rdlen2));\n\t\tfor(size_t i = 0; i < rdlen1; i++) {\n\t\t\tif((*rd1)[i] == 4) ns1++;\n\t\t\tif(ns1 > maxns) {\n\t\t\t\t// doesn't pass\n\t\t\t\treturn;\n\t\t\t}\n\t\t}\n\t\tfor(size_t i = 0; i < rdlen2; i++) {\n\t\t\tif((*rd2)[i] == 4) ns2++;\n\t\t\tif(ns2 > maxns) {\n\t\t\t\t// doesn't pass\n\t\t\t\treturn;\n\t\t\t}\n\t\t}\n\t\t// Both pass\n\t\tfilt1 = filt2 = true;\n\t} else {\n\t\tif(rd1 != NULL) filt1 = nFilter(*rd1, ns1);\n\t\tif(rd2 != NULL) filt2 = nFilter(*rd2, ns2);\n\t}\n}\n\n#ifdef SCORING_MAIN\n\nint main() {\n\t{\n\t\tcout << \"Case 1: Simple 1 ... 
\";\n\t\tScoring sc = Scoring::base1();\n\t\tassert_eq(COST_MODEL_CONSTANT, sc.matchType);\n\t\t\n\t\tassert_eq(0, sc.maxRefGaps(0, 10));  // 10 - 1 - 15 = -6\n\t\tassert_eq(0, sc.maxRefGaps(0, 11));  // 11 - 1 - 15 = -5\n\t\tassert_eq(0, sc.maxRefGaps(0, 12));  // 12 - 1 - 15 = -4\n\t\tassert_eq(0, sc.maxRefGaps(0, 13));  // 13 - 1 - 15 = -3\n\t\tassert_eq(0, sc.maxRefGaps(0, 14));  // 14 - 1 - 15 = -2\n\t\tassert_eq(0, sc.maxRefGaps(0, 15));  // 15 - 1 - 15 = -1\n\t\tassert_eq(1, sc.maxRefGaps(0, 16));  // 16 - 1 - 15 =  0\n\t\tassert_eq(1, sc.maxRefGaps(0, 17));  // 17 - 2 - 19 = -4\n\t\tassert_eq(1, sc.maxRefGaps(0, 18));  // 18 - 2 - 19 = -3\n\t\tassert_eq(1, sc.maxRefGaps(0, 19));  // 19 - 2 - 19 = -2\n\t\tassert_eq(1, sc.maxRefGaps(0, 20));  // 20 - 2 - 19 = -1\n\t\tassert_eq(2, sc.maxRefGaps(0, 21));  // 21 - 2 - 19 =  0\n\t\t\n\t\tassert_eq(0, sc.maxReadGaps(0, 10));   // 10 - 0 - 15 = -5\n\t\tassert_eq(0, sc.maxReadGaps(0, 11));   // 11 - 0 - 15 = -4\n\t\tassert_eq(0, sc.maxReadGaps(0, 12));   // 12 - 0 - 15 = -3\n\t\tassert_eq(0, sc.maxReadGaps(0, 13));   // 13 - 0 - 15 = -2\n\t\tassert_eq(0, sc.maxReadGaps(0, 14));   // 14 - 0 - 15 = -1\n\t\tassert_eq(1, sc.maxReadGaps(0, 15));   // 15 - 0 - 15 =  0\n\t\tassert_eq(1, sc.maxReadGaps(0, 16));   // 16 - 0 - 19 = -3\n\t\tassert_eq(1, sc.maxReadGaps(0, 17));   // 17 - 0 - 19 = -2\n\t\tassert_eq(1, sc.maxReadGaps(0, 18));   // 18 - 0 - 19 = -1\n\t\tassert_eq(2, sc.maxReadGaps(0, 19));   // 19 - 0 - 19 =  0\n\t\tassert_eq(2, sc.maxReadGaps(0, 20));   // 20 - 0 - 23 = -3\n\t\tassert_eq(2, sc.maxReadGaps(0, 21));   // 21 - 0 - 23 = -2\n\t\t\n\t\t// N ceiling: const=2, linear=0.1\n\t\tassert_eq(1, sc.nCeil(1));\n\t\tassert_eq(2, sc.nCeil(3));\n\t\tassert_eq(2, sc.nCeil(5));\n\t\tassert_eq(2, sc.nCeil(7));\n\t\tassert_eq(2, sc.nCeil(9));\n\t\tassert_eq(3, sc.nCeil(10));\n\t\tfor(int i = 0; i < 30; i++) {\n\t\t\tassert_eq(3, sc.n(i));\n\t\t\tassert_eq(3, sc.mm(i));\n\t\t}\n\t\tassert_eq(5, sc.gapbar);\n\t\tcout << 
\"PASSED\" << endl;\n\t}\n\t{\n\t\tcout << \"Case 2: Simple 2 ... \";\n\t\tScoring sc(\n\t\t\t4,               // reward for a match\n\t\t\tCOST_MODEL_QUAL, // how to penalize mismatches\n\t\t\t0,               // constant if mm pelanty is a constant\n\t\t\t30,              // penalty for nuc mm in decoded colorspace als\n\t\t\t-3.0f,           // constant coeff for minimum score\n\t\t\t-3.0f,           // linear coeff for minimum score\n\t\t\tDEFAULT_FLOOR_CONST,  // constant coeff for score floor\n\t\t\tDEFAULT_FLOOR_LINEAR, // linear coeff for score floor\n\t\t\t3.0f,            // max # ref Ns allowed in alignment; const coeff\n\t\t\t0.4f,            // max # ref Ns allowed in alignment; linear coeff\n\t\t\tCOST_MODEL_QUAL, // how to penalize Ns in the read\n\t\t\t0,               // constant if N pelanty is a constant\n\t\t\ttrue,            // whether to concatenate mates before N filtering\n\t\t\t25,              // constant coeff for cost of gap in the read\n\t\t\t25,              // constant coeff for cost of gap in the ref\n\t\t\t10,              // coeff of linear term for cost of gap in read\n\t\t\t10,              // coeff of linear term for cost of gap in ref\n\t\t\t5,               // 5 rows @ top/bot diagonal-entrance-only\n\t\t\t-1,              // no restriction on row\n\t\t\tfalse            // score prioritized over row\n\t\t);\n\n\t\tassert_eq(COST_MODEL_CONSTANT, sc.matchType);\n\t\tassert_eq(4, sc.matchConst);\n\t\tassert_eq(COST_MODEL_QUAL, sc.mmcostType);\n\t\tassert_eq(COST_MODEL_QUAL, sc.npenType);\n\t\t\n\t\tassert_eq(0, sc.maxRefGaps(0, 8));  // 32 - 4 - 35 = -7\n\t\tassert_eq(0, sc.maxRefGaps(0, 9));  // 36 - 4 - 35 = -3\n\t\tassert_eq(1, sc.maxRefGaps(0, 10)); // 40 - 4 - 35 =  1\n\t\tassert_eq(1, sc.maxRefGaps(0, 11)); // 44 - 8 - 45 = -9\n\t\tassert_eq(1, sc.maxRefGaps(0, 12)); // 48 - 8 - 45 = -5\n\t\tassert_eq(1, sc.maxRefGaps(0, 13)); // 52 - 8 - 45 = -1\n\t\tassert_eq(2, sc.maxRefGaps(0, 14)); // 56 - 8 - 45 =  
3\n\t\t\n\t\tassert_eq(0, sc.maxReadGaps(0, 8));   // 32 - 0 - 35 = -3\n\t\tassert_eq(1, sc.maxReadGaps(0, 9));   // 36 - 0 - 35 =  1\n\t\tassert_eq(1, sc.maxReadGaps(0, 10));  // 40 - 0 - 45 = -5\n\t\tassert_eq(1, sc.maxReadGaps(0, 11));  // 44 - 0 - 45 = -1\n\t\tassert_eq(2, sc.maxReadGaps(0, 12));  // 48 - 0 - 45 =  3\n\t\tassert_eq(2, sc.maxReadGaps(0, 13));  // 52 - 0 - 55 = -3\n\t\tassert_eq(3, sc.maxReadGaps(0, 14));  // 56 - 0 - 55 =  1\n\n\t\t// N ceiling: const=3, linear=0.4\n\t\tassert_eq(1, sc.nCeil(1));\n\t\tassert_eq(2, sc.nCeil(2));\n\t\tassert_eq(3, sc.nCeil(3));\n\t\tassert_eq(4, sc.nCeil(4));\n\t\tassert_eq(5, sc.nCeil(5));\n\t\tassert_eq(5, sc.nCeil(6));\n\t\tassert_eq(5, sc.nCeil(7));\n\t\tassert_eq(6, sc.nCeil(8));\n\t\tassert_eq(6, sc.nCeil(9));\n\n\t\tfor(int i = 0; i < 256; i++) {\n\t\t\tassert_eq(i, sc.n(i));\n\t\t\tassert_eq(i, sc.mm(i));\n\t\t}\n\n\t\tassert_eq(5, sc.gapbar);\n\n\t\tcout << \"PASSED\" << endl;\n\t}\n}\n\n#endif /*def SCORING_MAIN*/\n"
  },
  {
    "path": "scoring.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SCORING_H_\n#define SCORING_H_\n\n#include <limits>\n#include \"qual.h\"\n#include \"simple_func.h\"\n\n// Default type of bonus to added for matches\n#define DEFAULT_MATCH_BONUS_TYPE COST_MODEL_CONSTANT\n// When match bonus type is constant, use this constant\n#define DEFAULT_MATCH_BONUS 0\n// Same settings but different defaults for --local mode\n#define DEFAULT_MATCH_BONUS_TYPE_LOCAL COST_MODEL_CONSTANT\n#define DEFAULT_MATCH_BONUS_LOCAL 2\n\n// Default type of penalty to assess against mismatches\n#define DEFAULT_MM_PENALTY_TYPE COST_MODEL_QUAL\n// Default type of penalty to assess against mismatches\n#define DEFAULT_MM_PENALTY_TYPE_IGNORE_QUALS COST_MODEL_CONSTANT\n// When mismatch penalty type is constant, use this constant\n#define DEFAULT_MM_PENALTY_MAX 6\n#define DEFAULT_MM_PENALTY_MIN 2\n\n// Default type of penalty to assess against mismatches\n#define DEFAULT_N_PENALTY_TYPE COST_MODEL_CONSTANT\n// When mismatch penalty type is constant, use this constant\n#define DEFAULT_N_PENALTY 1\n\n// Constant coefficient b in linear function f(x) = ax + b determining\n// minimum valid score f when read length is x\n#define DEFAULT_MIN_CONST (-0.6f)\n// Linear coefficient a\n#define DEFAULT_MIN_LINEAR 
(-0.6f)\n// Different defaults for --local mode\n#define DEFAULT_MIN_CONST_LOCAL (0.0f)\n#define DEFAULT_MIN_LINEAR_LOCAL (10.0f)\n\n// Constant coefficient b in linear function f(x) = ax + b determining\n// maximum permitted number of Ns f in a read before it is filtered &\n// the maximum number of Ns in an alignment before it is considered\n// invalid.\n#define DEFAULT_N_CEIL_CONST 0.0f\n// Linear coefficient a\n#define DEFAULT_N_CEIL_LINEAR 0.15f\n\n// Default for whether to concatenate mates before the N filter (as opposed to\n// filtering each mate separately)\n#define DEFAULT_N_CAT_PAIR false\n\n// Default read gap penalties for when homopolymer calling is reliable\n#define DEFAULT_READ_GAP_CONST 5\n#define DEFAULT_READ_GAP_LINEAR 3\n\n// Default read gap penalties for when homopolymer calling is not reliable\n#define DEFAULT_READ_GAP_CONST_BADHPOLY 3\n#define DEFAULT_READ_GAP_LINEAR_BADHPOLY 1\n\n// Default reference gap penalties for when homopolymer calling is reliable\n#define DEFAULT_REF_GAP_CONST 5\n#define DEFAULT_REF_GAP_LINEAR 3\n\n// Default reference gap penalties for when homopolymer calling is not reliable\n#define DEFAULT_REF_GAP_CONST_BADHPOLY 3\n#define DEFAULT_REF_GAP_LINEAR_BADHPOLY 1\n\nenum {\n\tCOST_MODEL_ROUNDED_QUAL = 1,\n\tCOST_MODEL_QUAL,\n\tCOST_MODEL_CONSTANT\n};\n\n/**\n * How to penalize various types of sequence dissimilarity, and other settings\n * that govern how dynamic programming tables should be filled in and how to\n * backtrace to find solutions.\n */\nclass Scoring {\n\n\t/**\n\t * Init an array that maps quality to penalty or bonus according to 'type',\n\t * 'consMin' and 'consMax'\n\t */\n\ttemplate<typename T>\n\tvoid initPens(\n\t\tT *pens,     // array to fill\n\t\tint type,    // penalty type; qual | rounded qual | constant\n\t\tint consMin, // minimum penalty; used when penalty type is qual\n\t\tint consMax) // maximum penalty; also the constant when type is constant\n\t{\n\t\tif(type == COST_MODEL_ROUNDED_QUAL) {\n\t\t\tfor(int i = 0; i < 256; 
i++) {\n\t\t\t\tpens[i] = (T)qualRounds[i];\n\t\t\t}\n\t\t} else if(type == COST_MODEL_QUAL) {\n\t\t\tassert_neq(consMin, 0);\n\t\t\tassert_neq(consMax, 0);\n\t\t\tfor(int i = 0; i < 256; i++) {\n\t\t\t\tint ii = min(i, 40); // TODO: Bit hacky, this\n\t\t\t\tfloat frac = (float)ii / 40.0f;\n\t\t\t\tpens[i] = consMin + (T)(frac * (consMax-consMin));\n\t\t\t\tassert_gt(pens[i], 0);\n\t\t\t\t//if(pens[i] == 0) {\n\t\t\t\t//\tpens[i] = ((consMax > 0) ? (T)1 : (T)-1);\n\t\t\t\t//}\n\t\t\t}\n\t\t} else if(type == COST_MODEL_CONSTANT) {\n\t\t\tfor(int i = 0; i < 256; i++) {\n\t\t\t\tpens[i] = (T)consMax;\n\t\t\t}\n\t\t} else {\n\t\t\tthrow 1;\n\t\t}\n\t}\n\npublic:\n\n\tScoring(\n\t\tint   mat,          // reward for a match\n\t\tint   mmcType,      // how to penalize mismatches\n\t    int   mmpMax_,      // maximum mismatch penalty\n\t    int   mmpMin_,      // minimum mismatch penalty\n\t\tconst SimpleFunc& scoreMin_,   // minimum score for valid alignment; const coeff\n\t\tconst SimpleFunc& nCeil_,      // max # ref Ns allowed in alignment; const coeff\n\t    int   nType,        // how to penalize Ns in the read\n\t    int   n,            // constant if N pelanty is a constant\n\t\tbool  ncat,         // whether to concatenate mates before N filtering\n\t    int   rdGpConst,    // constant coeff for cost of gap in the read\n\t    int   rfGpConst,    // constant coeff for cost of gap in the ref\n\t    int   rdGpLinear,   // coeff of linear term for cost of gap in read\n\t    int   rfGpLinear,   // coeff of linear term for cost of gap in ref\n\t\tint   gapbar_,      // # rows at top/bot can only be entered diagonally\n        int   cp_ = 0,      // canonical splicing penalty\n        int   ncp_ = 12,    // non-canonical splicing penalty\n        int   csp_ = 24,    // conflicting splice site penalty\n        const SimpleFunc* ip_ = NULL)      // penalty as to intron length\n\t{\n\t\tmatchType    = COST_MODEL_CONSTANT;\n\t\tmatchConst   = mat;\n\t\tmmcostType   = 
mmcType;\n\t\tmmpMax       = mmpMax_;\n\t\tmmpMin       = mmpMin_;\n\t\tscoreMin     = scoreMin_;\n\t\tnCeil        = nCeil_;\n\t\tnpenType     = nType;\n\t\tnpen         = n;\n\t\tncatpair     = ncat;\n\t\trdGapConst   = rdGpConst;\n\t\trfGapConst   = rfGpConst;\n\t\trdGapLinear  = rdGpLinear;\n\t\trfGapLinear  = rfGpLinear;\n\t\tqualsMatter_ = mmcostType != COST_MODEL_CONSTANT;\n\t\tgapbar       = gapbar_;\n\t\tmonotone     = matchType == COST_MODEL_CONSTANT && matchConst == 0;\n\t\tinitPens<int>(mmpens, mmcostType, mmpMin_, mmpMax_);\n\t\tinitPens<int>(npens, npenType, npen, npen);\n\t\tinitPens<float>(matchBonuses, matchType, matchConst, matchConst);\n        cp = cp_;\n        ncp = ncp_;\n        csp = csp_;\n        if(ip_ != NULL) ip = *ip_;\n\t\tassert(repOk());\n\t}\n\t\n\t/**\n\t * Set a constant match bonus.\n\t */\n\tvoid setMatchBonus(int bonus) {\n\t\tmatchType  = COST_MODEL_CONSTANT;\n\t\tmatchConst = bonus;\n\t\tinitPens<float>(matchBonuses, matchType, matchConst, matchConst);\n\t\tassert(repOk());\n\t}\n\t\n\t/**\n\t * Set the mismatch penalty.\n\t */\n\tvoid setMmPen(int mmType_, int mmpMax_, int mmpMin_) {\n\t\tmmcostType = mmType_;\n\t\tmmpMax     = mmpMax_;\n\t\tmmpMin     = mmpMin_;\n\t\tinitPens<int>(mmpens, mmcostType, mmpMin, mmpMax);\n\t}\n\t\n\t/**\n\t * Set the N penalty.\n\t */\n\tvoid setNPen(int nType, int n) {\n\t\tnpenType     = nType;\n\t\tnpen         = n;\n\t\tinitPens<int>(npens, npenType, npen, npen);\n\t}\n\t\n#ifndef NDEBUG\n\t/**\n\t * Check that scoring scheme is internally consistent.\n\t */\n\tbool repOk() const {\n\t\tassert_geq(matchConst, 0);\n\t\tassert_gt(rdGapConst, 0);\n\t\tassert_gt(rdGapLinear, 0);\n\t\tassert_gt(rfGapConst, 0);\n\t\tassert_gt(rfGapLinear, 0);\n        return true;\n\t}\n#endif\n\n\t/**\n\t * Return a linear function of x where 'cnst' is the constant coefficiant\n\t * and 'lin' is the linear coefficient.\n\t */\n\tstatic float linearFunc(int64_t x, float cnst, float lin) {\n\t\treturn 
(float)((double)cnst + ((double)lin * x));\n\t}\n\n\t/**\n\t * Return the penalty incurred by a mismatch at an alignment column\n\t * with read character 'rdc' reference mask 'refm' and quality 'q'.\n\t *\n\t * qs should be clamped to 63 on the high end before this query.\n\t */\n\tinline int mm(int rdc, int refm, int q) const {\n\t\tassert_range(0, 255, q);\n\t\treturn (rdc > 3 || refm > 15) ? npens[q] : mmpens[q];\n\t}\n\t\n\t/**\n\t * Return the score of the given read character with the given quality\n\t * aligning to the given reference mask.  Take Ns into account.\n\t */\n\tinline int score(int rdc, int refm, int q) const {\n\t\tassert_range(0, 255, q);\n\t\tif(rdc > 3 || refm > 15) {\n\t\t\treturn -npens[q];\n\t\t}\n\t\tif((refm & (1 << rdc)) != 0) {\n\t\t\treturn (int)matchBonuses[q];\n\t\t} else {\n\t\t\treturn -mmpens[q];\n\t\t}\n\t}\n\n\t/**\n\t * Return the score of the given read character with the given quality\n\t * aligning to the given reference mask.  Take Ns into account.  Increment\n\t * a counter if it's an N.\n\t */\n\tinline int score(int rdc, int refm, int q, int& ns) const {\n\t\tassert_range(0, 255, q);\n\t\tif(rdc > 3 || refm > 15) {\n\t\t\tns++;\n\t\t\treturn -npens[q];\n\t\t}\n\t\tif((refm & (1 << rdc)) != 0) {\n\t\t\treturn (int)matchBonuses[q];\n\t\t} else {\n\t\t\treturn -mmpens[q];\n\t\t}\n\t}\n\n\t/**\n\t * Return the penalty incurred by a mismatch at an alignment column\n\t * with read character 'rdc' and quality 'q'.  We assume the\n\t * reference character is non-N.\n\t */\n\tinline int mm(int rdc, int q) const {\n\t\tassert_range(0, 255, q);\n\t\treturn (rdc > 3) ? npens[q] : mmpens[q];\n\t}\n\t\n\t/**\n\t * Return the marginal penalty incurred by a mismatch at a read\n\t * position with quality 'q'.\n\t */\n\tinline int mm(int q) const {\n\t\tassert_geq(q, 0);\n\t\treturn q < 255 ? 
mmpens[q] : mmpens[255];\n\t}\n\n\t/**\n\t * Return the marginal bonus awarded for a match at a read\n\t * position with quality 30.\n\t */\n\tinline int64_t match() const {\n\t\treturn match(30);\n\t}\n\n\t/**\n\t * Return the marginal bonus awarded for a match at a read\n\t * position with quality 'q', rounded to the nearest integer.\n\t */\n\tinline int64_t match(int q) const {\n\t\tassert_geq(q, 0);\n\t\treturn (int64_t)((q < 255 ? matchBonuses[q] : matchBonuses[255]) + 0.5f);\n\t}\n\t\n\t/**\n\t * Return the best score achievable by a read of length 'rdlen'.\n\t */\n\tinline int64_t perfectScore(size_t rdlen) const {\n\t\tif(monotone) {\n\t\t\treturn 0;\n\t\t} else {\n\t\t\treturn rdlen * match(30);\n\t\t}\n\t}\n\n\t/**\n\t * Return true iff the penalties are such that two reads with the\n\t * same sequence but different qualities might yield different\n\t * alignments.\n\t */\n\tinline bool qualitiesMatter() const { return qualsMatter_; }\n\t\n\t/**\n\t * Return the marginal penalty incurred by an N mismatch at a read\n\t * position with quality 'q'.\n\t */\n\tinline int n(int q) const {\n\t\tassert_geq(q, 0);\n\t\treturn q < 255 ? 
npens[q] : npens[255];\n\t}\n\n\t\n\t/**\n\t * Return the marginal penalty incurred by a gap in the read,\n\t * given that this is the 'ext'th extension of the gap (0 = open,\n\t * 1 = first, etc).\n\t */\n\tinline int ins(int ext) const {\n\t\tassert_geq(ext, 0);\n\t\tif(ext == 0) return readGapOpen();\n\t\treturn readGapExtend();\n\t}\n\n\t/**\n\t * Return the marginal penalty incurred by a gap in the reference,\n\t * given that this is the 'ext'th extension of the gap (0 = open,\n\t * 1 = first, etc).\n\t */\n\tinline int del(int ext) const {\n\t\tassert_geq(ext, 0);\n\t\tif(ext == 0) return refGapOpen();\n\t\treturn refGapExtend();\n\t}\n\n\t/**\n\t * Return true iff a read of length 'rdlen' passes the score filter, i.e.,\n\t * has enough characters to rise above the minimum score threshold.\n\t */\n\tbool scoreFilter(\n\t\tint64_t minsc,\n\t\tsize_t rdlen) const;\n\n\t/**\n\t * Given the score floor for valid alignments and the length of the read,\n\t * calculate the maximum possible number of read gaps that could occur in a\n\t * valid alignment.\n\t */\n\tint maxReadGaps(\n\t\tint64_t minsc,\n\t\tsize_t rdlen) const;\n\n\t/**\n\t * Given the score floor for valid alignments and the length of the read,\n\t * calculate the maximum possible number of reference gaps that could occur\n\t * in a valid alignment.\n\t */\n\tint maxRefGaps(\n\t\tint64_t minsc,\n\t\tsize_t rdlen) const;\n    \n\t/**\n\t * Given a read sequence, return true iff the read passes the N filter.\n\t * The N filter rejects reads with more than the number of Ns calculated by\n\t * taking nCeilConst + nCeilLinear * read length.\n\t */\n\tbool nFilter(const BTDnaString& rd, size_t& ns) const;\n\n\t/**\n\t * Given a read sequence, return true iff the read passes the N filter.\n\t * The N filter rejects reads with more than the number of Ns calculated by\n\t * taking nCeilConst + nCeilLinear * read length.\n\t *\n\t * For paired-end reads, there is a\tquestion of how to apply the filter.\n\t * 
The filter could be applied to both mates separately, which might then\n\t * prevent paired-end alignment.  Or the filter could be applied to the\n\t * reads as though they're concatenated together.  The latter approach has\n\t * pros and cons.  The pro is that we can use paired-end information to\n\t * recover alignments for mates that would not have passed the N filter on\n\t * their own.  The con is that we might not want to do that, since the\n\t * non-N portion of the bad mate might contain particularly unreliable\n\t * information.\n\t */\n\tvoid nFilterPair(\n\t\tconst BTDnaString* rd1, // mate 1\n\t\tconst BTDnaString* rd2, // mate 2\n\t\tsize_t& ns1,            // # Ns in mate 1\n\t\tsize_t& ns2,            // # Ns in mate 2\n\t\tbool& filt1,            // true -> mate 1 rejected by filter\n\t\tbool& filt2)            // true -> mate 2 rejected by filter\n\t\tconst;\n\t\n\t/**\n\t * The penalty associated with opening a new read gap.\n\t */\n\tinline int readGapOpen() const { \n\t\treturn rdGapConst + rdGapLinear;\n\t}\n\n\t/**\n\t * The penalty associated with opening a new ref gap.\n\t */\n\tinline int refGapOpen() const { \n\t\treturn rfGapConst + rfGapLinear;\n\t}\n\n\t/**\n\t * The penalty associated with extending a read gap by one character.\n\t */\n\tinline int readGapExtend() const { \n\t\treturn rdGapLinear;\n\t}\n\n\t/**\n\t * The penalty associated with extending a ref gap by one character.\n\t */\n\tinline int refGapExtend() const { \n\t\treturn rfGapLinear;\n\t}\n    \n    // avg. known score: -22.96, avg. random score: -33.70\n    inline int canSpl(int intronlen = 0, int minanchor = 100, float probscore = 0.0f) const {\n        int penintron = (intronlen > 0 ? 
ip.f<int>((double)intronlen) : 0);\n        if(penintron < 0) penintron = 0;\n        if(minanchor < 10 && probscore < -24.0f + (10 - minanchor)) {\n            return 10000;\n        }\n        return penintron + cp;\n    }\n    \n    inline int noncanSpl(int intronlen = 0, int minanchor = 100, float probscore = 0.0f) const {\n        if(minanchor < 14) return 10000;\n        int penintron = (intronlen > 0 ? ip.f<int>((double)intronlen) : 0);\n        if(penintron < 0) penintron = 0;\n        return penintron + ncp;\n    }\n    \n    inline int conflictSpl() const { return csp; }\n\n\tint     matchType;    // how to reward matches\n\tint     matchConst;   // reward for a match\n\tint     mmcostType;   // based on qual? rounded? just a constant?\n\tint     mmpMax;       // maximum mismatch penalty\n\tint     mmpMin;       // minimum mismatch penalty\n\tSimpleFunc scoreMin;  // minimum score for valid alignment, constant coeff\n\tSimpleFunc nCeil;     // max # Ns involved in alignment, constant coeff\n\tint     npenType;     // N: based on qual? rounded? 
just a constant?\n\tint     npen;         // N: if npenType=constant, this is the const\n\tbool    ncatpair;     // true -> do N filtering on concatenated pair\n\tint     rdGapConst;   // constant term coefficient in read gap cost\n\tint     rfGapConst;   // constant term coefficient in ref gap cost\n\tint     rdGapLinear;  // linear term coefficient in read gap cost\n\tint     rfGapLinear;  // linear term coefficient in ref gap cost\n\tint     gapbar;       // # rows at top/bot can only be entered diagonally\n\tbool    monotone;     // scores can only go down?\n\tfloat   matchBonuses[256]; // map from qualities to match bonus\n\tint     mmpens[256];       // map from qualities to mm penalty\n\tint     npens[256];        // map from N qualities to penalty\n    int     cp;           // canonical splicing penalty\n    int     ncp;          // non-canonical splicing penalty\n    int     csp;          // conflicting splice site penalty\n    SimpleFunc     ip;           // intron length penalty\n\n\tstatic Scoring base1() {\n\t\tconst double DMAX = std::numeric_limits<double>::max();\n\t\tSimpleFunc scoreMin(SIMPLE_FUNC_LINEAR, 0.0f, DMAX, 37.0f, 0.3f);\n\t\tSimpleFunc nCeil(SIMPLE_FUNC_LINEAR, 0.0f, DMAX, 2.0f, 0.1f);\n\t\treturn Scoring(\n\t\t\t1,                       // reward for a match\n\t\t\tCOST_MODEL_CONSTANT,     // how to penalize mismatches\n\t\t\t3,                       // max mismatch penalty\n\t\t\t3,                       // min mismatch penalty\n\t\t\tscoreMin,                // score min: 37 + 0.3x\n\t\t\tnCeil,                   // n ceiling: 2 + 0.1x\n\t\t\tCOST_MODEL_CONSTANT,     // how to penalize Ns in the read\n\t\t\t3,                       // constant if N penalty is a constant\n\t\t\tfalse,                   // concatenate mates before N filtering?\n\t\t\t11,                      // constant coeff for gap in read\n\t\t\t11,                      // constant coeff for gap in ref\n\t\t\t4,                       // linear coeff for gap in 
read\n\t\t\t4,                       // linear coeff for gap in ref\n\t\t\t5);                      // 5 rows @ top/bot diagonal-entrance-only\n\t}\n\nprotected:\n\n\tbool qualsMatter_;\n};\n\n#endif /*SCORING_H_*/\n"
  },
  {
    "path": "search_globals.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SEARCH_GLOBALS_H_\n#define SEARCH_GLOBALS_H_\n\n#include <stdint.h>\n\n// declared in ebwt_search.cpp\nextern bool     gColor;\nextern bool     gColorExEnds;\nextern bool     gReportOverhangs;\nextern bool     gColorSeq;\nextern bool     gColorEdit;\nextern bool     gColorQual;\nextern bool     gNoMaqRound;\nextern bool     gStrandFix;\nextern bool     gRangeMode;\nextern int      gVerbose;\nextern int      gQuiet;\nextern bool     gNofw;\nextern bool     gNorc;\nextern bool     gMate1fw;\nextern bool     gMate2fw;\nextern int      gMinInsert;\nextern int      gMaxInsert;\nextern int      gTrim5;\nextern int      gTrim3;\nextern int      gGapBarrier;\nextern int      gAllowRedundant;\n\n#endif /* SEARCH_GLOBALS_H_ */\n"
  },
  {
    "path": "sequence_io.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SEQUENCE_IO_H_\n#define SEQUENCE_IO_H_\n\n#include <string>\n#include <stdexcept>\n#include <fstream>\n#include <stdio.h>\n#include \"assert_helpers.h\"\n#include \"ds.h\"\n#include \"filebuf.h\"\n#include \"sstring.h\"\n\nusing namespace std;\n\n/**\n * Parse the fasta file 'infile'.  Store \n */\ntemplate<typename TFnStr>\nstatic void parseFastaLens(\n\tconst TFnStr&  infile,   // filename\n\tEList<size_t>& namelens, // destination for fasta name lengths\n\tEList<size_t>& seqlens)  // destination for fasta sequence lengths\n{\n\tFILE *in = fopen(sstr_to_cstr(infile), \"r\");\n\tif(in == NULL) {\n\t\tcerr << \"Could not open sequence file\" << endl;\n\t\tthrow 1;\n\t}\n\tFileBuf fb(in);\n\twhile(!fb.eof()) {\n\t\tnamelens.expand(); namelens.back() = 0;\n\t\tseqlens.expand();  seqlens.back() = 0;\n\t\tfb.parseFastaRecordLength(namelens.back(), seqlens.back());\n\t\tif(seqlens.back() == 0) {\n\t\t\t// Couldn't read a record.  We're probably done with this file.\n\t\t\tnamelens.pop_back();\n\t\t\tseqlens.pop_back();\n\t\t\tcontinue;\n\t\t}\n\t}\n\tfb.close();\n}\n\n/**\n * Parse the fasta file 'infile'.  
Store each name record in 'names', each\n * sequence record in 'seqs', and the length of each name and sequence in\n * 'namelens' and 'seqlens' respectively.\n */\ntemplate<typename TFnStr, typename TNameStr, typename TSeqStr>\nstatic void parseFasta(\n\tconst TFnStr&    infile,   // filename\n\tEList<TNameStr>& names,    // destination for fasta names\n\tEList<size_t>&   namelens, // destination for fasta name lengths\n\tEList<TSeqStr>&  seqs,     // destination for fasta sequences\n\tEList<size_t>&   seqlens)  // destination for fasta sequence lengths\n{\n\tassert_eq(namelens.size(), seqlens.size());\n\tassert_eq(names.size(),    namelens.size());\n\tassert_eq(seqs.size(),     seqlens.size());\n\tsize_t cur = namelens.size();\n\tparseFastaLens(infile, namelens, seqlens);\n\tFILE *in = fopen(sstr_to_cstr(infile), \"r\");\n\tif(in == NULL) {\n\t\tcerr << \"Could not open sequence file\" << endl;\n\t\tthrow 1;\n\t}\n\tFileBuf fb(in);\n\twhile(!fb.eof()) {\n\t\t// Add a new empty record to the end\n\t\tnames.expand();\n\t\tseqs.expand();\n\t\tnames.back() = new char[namelens[cur]+1];\n\t\tseqs.back() = new char[seqlens[cur]+1];\n\t\tfb.parseFastaRecord(names.back(), seqs.back());\n\t\tif(seqs.back().empty()) {\n\t\t\t// Couldn't read a record.  
We're probably done with this file.\n\t\t\tnames.pop_back();\n\t\t\tseqs.pop_back();\n\t\t\tcontinue;\n\t\t}\n\t}\n\tfb.close();\n}\n\n/**\n * Read the set of FASTA files named in 'infiles'.  Append the extracted names\n * and sequences to 'names' and 'seqs', and their lengths to 'namelens' and\n * 'seqlens'.\n */\ntemplate <typename TFnStr, typename TNameStr, typename TSeqStr>\nstatic void parseFastas(\n\tconst EList<TFnStr>& infiles, // filenames\n\tEList<TNameStr>& names,    // destination for fasta names\n\tEList<size_t>&   namelens, // destination for fasta name lengths\n\tEList<TSeqStr>&  seqs,     // destination for fasta sequences\n\tEList<size_t>&   seqlens)  // destination for fasta sequence lengths\n{\n\tfor(size_t i = 0; i < infiles.size(); i++) {\n\t\tparseFasta<TFnStr, TNameStr, TSeqStr>(\n\t\t\tinfiles[i],\n\t\t\tnames,\n\t\t\tnamelens,\n\t\t\tseqs,\n\t\t\tseqlens);\n\t}\n}\n\n#endif /*SEQUENCE_IO_H_*/\n"
  },
  {
    "path": "shmem.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifdef BOWTIE_SHARED_MEM\n\n#include <iostream>\n#include <string>\n#include <unistd.h>\n#include <sys/shm.h>\n#include <errno.h>\n#include \"shmem.h\"\n\nusing namespace std;\n\n/**\n * Notify other users of a shared-memory chunk that the leader has\n * finished initializing it.\n */\nvoid notifySharedMem(void *mem, size_t len) {\n\t((volatile uint32_t*)((char*)mem + len))[0] = SHMEM_INIT;\n}\n\n/**\n * Wait until the leader of a shared-memory chunk has finished\n * initializing it.\n */\nvoid waitSharedMem(void *mem, size_t len) {\n\twhile(((volatile uint32_t*)((char*)mem + len))[0] != SHMEM_INIT) {\n\t\tsleep(1);\n\t}\n}\n\n#endif\n"
  },
  {
    "path": "shmem.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SHMEM_H_\n#define SHMEM_H_\n\n#ifdef BOWTIE_SHARED_MEM\n\n#include <string>\n#include <sys/shm.h>\n#include <unistd.h>\n#include <sys/shm.h>\n#include <errno.h>\n#include <stdint.h>\n#include <stdexcept>\n#include \"str_util.h\"\n#include \"btypes.h\"\n\nextern void notifySharedMem(void *mem, size_t len);\n\nextern void waitSharedMem(void *mem, size_t len);\n\n#define ALLOC_SHARED_U allocSharedMem<TIndexOffU>\n#define ALLOC_SHARED_U8 allocSharedMem<uint8_t>\n#define ALLOC_SHARED_U32 allocSharedMem<uint32_t>\n#define FREE_SHARED shmdt\n#define NOTIFY_SHARED notifySharedMem\n#define WAIT_SHARED waitSharedMem\n\n#define SHMEM_UNINIT  0xafba4242\n#define SHMEM_INIT    0xffaa6161\n\n/**\n * Tries to allocate a shared-memory chunk for a given file of a given size.\n */\ntemplate <typename T>\nbool allocSharedMem(std::string fname,\n                    size_t len,\n                    T ** dst,\n                    const char *memName,\n                    bool verbose)\n{\n\tusing namespace std;\n\tint shmid = -1;\n\t// Calculate key given string\n\tkey_t key = (key_t)hash_string(fname);\n\tshmid_ds ds;\n\tint ret;\n\t// Reserve 4 bytes at the end for silly synchronization\n\tsize_t shmemLen = len + 
4;\n\tif(verbose) {\n\t\tcerr << \"Reading \" << len << \"+4 bytes into shared memory for \" << memName << endl;\n\t}\n\tT *ptr = NULL;\n\twhile(true) {\n\t\t// Create the shared-memory block\n\t\tif((shmid = shmget(key, shmemLen, IPC_CREAT | 0666)) < 0) {\n\t\t\tif(errno == ENOMEM) {\n\t\t\t\tcerr << \"Out of memory allocating shared area \" << memName << endl;\n\t\t\t} else if(errno == EACCES) {\n\t\t\t\tcerr << \"EACCES\" << endl;\n\t\t\t} else if(errno == EEXIST) {\n\t\t\t\tcerr << \"EEXIST\" << endl;\n\t\t\t} else if(errno == EINVAL) {\n\t\t\t\tcerr << \"Warning: shared-memory chunk's segment size doesn't match expected size (\" << (shmemLen) << \")\" << endl\n\t\t\t\t\t << \"Deleting old shared memory block and trying again.\" << endl;\n\t\t\t\tshmid = shmget(key, 0, 0);\n\t\t\t\tif((ret = shmctl(shmid, IPC_RMID, &ds)) < 0) {\n\t\t\t\t\tcerr << \"shmctl returned \" << ret\n\t\t\t\t\t\t << \" for IPC_RMID, errno is \" << errno\n\t\t\t\t\t\t << \", shmid is \" << shmid << endl;\n\t\t\t\t\tthrow 1;\n\t\t\t\t} else {\n\t\t\t\t\tcerr << \"Deleted shared mem chunk with shmid \" << shmid << endl;\n\t\t\t\t}\n\t\t\t\tcontinue;\n\t\t\t} else if(errno == ENOENT) {\n\t\t\t\tcerr << \"ENOENT\" << endl;\n\t\t\t} else if(errno == ENOSPC) {\n\t\t\t\tcerr << \"ENOSPC\" << endl;\n\t\t\t} else {\n\t\t\t\tcerr << \"shmget returned \" << shmid << \" and errno is \" << errno << endl;\n\t\t\t}\n\t\t\tthrow 1;\n\t\t}\n\t\tptr = (T*)shmat(shmid, 0, 0);\n\t\tif(ptr == (void*)-1) {\n\t\t\tcerr << \"Failed to attach \" << memName << \" to shared memory with shmat().\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\tif(ptr == NULL) {\n\t\t\tcerr << memName << \" pointer returned by shmat() was NULL.\" << endl;\n\t\t\tthrow 1;\n\t\t}\n\t\t// Did I create it, or did I just attach to one created by\n\t\t// another process?\n\t\tif((ret = shmctl(shmid, IPC_STAT, &ds)) < 0) {\n\t\t\tcerr << \"shmctl returned \" << ret << \" for IPC_STAT and errno is \" << errno << endl;\n\t\t\tthrow 
1;\n\t\t}\n\t\tif(ds.shm_segsz != shmemLen) {\n\t\t\tcerr << \"Warning: shared-memory chunk's segment size (\" << ds.shm_segsz\n\t\t\t\t << \") doesn't match expected size (\" << shmemLen << \")\" << endl\n\t\t\t\t << \"Deleting old shared memory block and trying again.\" << endl;\n\t\t\tif((ret = shmctl(shmid, IPC_RMID, &ds)) < 0) {\n\t\t\t\tcerr << \"shmctl returned \" << ret << \" for IPC_RMID and errno is \" << errno << endl;\n\t\t\t\tthrow 1;\n\t\t\t}\n\t\t} else {\n\t\t\tbreak;\n\t\t}\n\t} // while(true)\n\t*dst = ptr;\n\tbool initid = (((volatile uint32_t*)((char*)ptr + len))[0] == SHMEM_INIT);\n\tif(ds.shm_cpid == getpid() && !initid) {\n\t\tif(verbose) {\n\t\t\tcerr << \"  I (pid = \" << getpid() << \") created the \"\n\t\t\t     << \"shared memory for \" << memName << endl;\n\t\t}\n\t\t// Set this value just off the end of the chunk to\n\t\t// indicate that the data hasn't been read yet.\n\t\t((volatile uint32_t*)((char*)ptr + len))[0] = SHMEM_UNINIT;\n\t\treturn true;\n\t} else {\n\t\tif(verbose) {\n\t\t\tcerr << \"  I (pid = \" << getpid()\n\t\t\t     << \") did not create the shared memory for \"\n\t\t\t     << memName << \".  Pid \" << ds.shm_cpid << \" did.\" << endl;\n\t\t}\n\t\treturn false;\n\t}\n}\n\n#else\n\n#define ALLOC_SHARED_U(...) 0\n#define ALLOC_SHARED_U8(...) 0\n#define ALLOC_SHARED_U32(...) 0\n#define FREE_SHARED(...)\n#define NOTIFY_SHARED(...)\n#define WAIT_SHARED(...)\n\n#endif /*BOWTIE_SHARED_MEM*/\n\n#endif /* SHMEM_H_ */\n"
  },
  {
    "path": "simple_func.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include <iostream>\n#include \"simple_func.h\"\n#include \"ds.h\"\n#include \"mem_ids.h\"\n\nint SimpleFunc::parseType(const std::string& otype) {\n\tstring type = otype;\n\tif(type == \"C\" || type == \"Constant\") {\n\t\treturn SIMPLE_FUNC_CONST;\n\t} else if(type == \"L\" || type == \"Linear\") {\n\t\treturn SIMPLE_FUNC_LINEAR;\n\t} else if(type == \"S\" || type == \"Sqrt\") {\n\t\treturn SIMPLE_FUNC_SQRT;\n\t} else if(type == \"G\" || type == \"Log\") {\n\t\treturn SIMPLE_FUNC_LOG;\n\t}\n\tstd::cerr << \"Error: Bad function type '\" << otype.c_str()\n\t\t\t  << \"'.  
Should be C (constant), L (linear), \"\n\t\t\t  << \"S (square root) or G (natural log).\" << std::endl;\n\tthrow 1;\n}\n\nSimpleFunc SimpleFunc::parse(\n\tconst std::string& s,\n\tdouble defaultConst,\n\tdouble defaultLinear,\n\tdouble defaultMin,\n\tdouble defaultMax)\n{\n\t// Separate value into comma-separated tokens\n\tEList<string> ctoks(MISC_CAT);\n\tstring ctok;\n\tistringstream css(s);\n\tSimpleFunc fv;\n\twhile(getline(css, ctok, ',')) {\n\t\tctoks.push_back(ctok);\n\t}\n\tif(ctoks.size() >= 1) {\n\t\tfv.setType(parseType(ctoks[0]));\n\t}\n\tif(ctoks.size() >= 2) {\n\t\tdouble co;\n\t\tistringstream tmpss(ctoks[1]);\n\t\ttmpss >> co;\n\t\tfv.setConst(co);\n\t} else {\n\t\tfv.setConst(defaultConst);\n\t}\n\tif(ctoks.size() >= 3) {\n\t\tdouble ce;\n\t\tistringstream tmpss(ctoks[2]);\n\t\ttmpss >> ce;\n\t\tfv.setCoeff(ce);\n\t} else {\n\t\tfv.setCoeff(defaultLinear);\n\t}\n\tif(ctoks.size() >= 4) {\n\t\tdouble mn;\n\t\tistringstream tmpss(ctoks[3]);\n\t\ttmpss >> mn;\n\t\tfv.setMin(mn);\n\t} else {\n\t\tfv.setMin(defaultMin);\n\t}\n\tif(ctoks.size() >= 5) {\n\t\tdouble mx;\n\t\tistringstream tmpss(ctoks[4]);\n\t\ttmpss >> mx;\n\t\tfv.setMax(mx);\n\t} else {\n\t\tfv.setMax(defaultMax);\n\t}\n\treturn fv;\n}\n"
  },
  {
    "path": "simple_func.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SIMPLE_FUNC_H_\n#define SIMPLE_FUNC_H_\n\n#include <math.h>\n#include <cassert>\n#include <limits>\n#include \"tokenize.h\"\n\n#define SIMPLE_FUNC_CONST  1\n#define SIMPLE_FUNC_LINEAR 2\n#define SIMPLE_FUNC_SQRT   3\n#define SIMPLE_FUNC_LOG    4\n\n/**\n * A simple function of one argument, parmeterized by I, X, C and L: min\n * value, max value, constant term, and coefficient respectively:\n *\n * 1. Constant:    f(x) = max(I, min(X, C + L * 0))\n * 2. Linear:      f(x) = max(I, min(X, C + L * x))\n * 3. Square root: f(x) = max(I, min(X, C + L * sqrt(x)))\n * 4. 
Log:         f(x) = max(I, min(X, C + L * ln(x)))\n *\n * Clearly, the return value of the Constant function doesn't depend on x.\n */\nclass SimpleFunc {\n\npublic:\n\n\tSimpleFunc() : type_(0), I_(0.0), X_(0.0), C_(0.0), L_(0.0) { }\n\n\tSimpleFunc(int type, double I, double X, double C, double L) {\n\t\tinit(type, I, X, C, L);\n\t}\n\t\n\tvoid init(int type, double I, double X, double C, double L) {\n\t\ttype_ = type; I_ = I; X_ = X; C_ = C; L_ = L;\n\t}\n\n\tvoid init(int type, double C, double L) {\n\t\ttype_ = type; C_ = C; L_ = L;\n\t\tI_ = -std::numeric_limits<double>::max();\n\t\tX_ = std::numeric_limits<double>::max();\n\t}\n\t\n\tvoid setType (int type ) { type_ = type; }\n\tvoid setMin  (double mn) { I_ = mn; }\n\tvoid setMax  (double mx) { X_ = mx; }\n\tvoid setConst(double co) { C_ = co; }\n\tvoid setCoeff(double ce) { L_ = ce; }\n\n\tint    getType () const { return type_; }\n\tdouble getMin  () const { return I_; }\n\tdouble getMax  () const { return X_; }\n\tdouble getConst() const { return C_; }\n\tdouble getCoeff() const { return L_; }\n\t\n\tvoid mult(double x) {\n\t\tif(I_ < std::numeric_limits<double>::max()) {\n\t\t\tI_ *= x; X_ *= x; C_ *= x; L_ *= x;\n\t\t}\n\t}\n\t\n\tbool initialized() const { return type_ != 0; }\n\tvoid reset() { type_ = 0; }\n\t\n\ttemplate<typename T>\n\tT f(double x) const {\n\t\tassert(type_ >= SIMPLE_FUNC_CONST && type_ <= SIMPLE_FUNC_LOG);\n\t\tdouble X;\n\t\tif(type_ == SIMPLE_FUNC_CONST) {\n\t\t\tX = 0.0;\n\t\t} else if(type_ == SIMPLE_FUNC_LINEAR) {\n\t\t\tX = x;\n\t\t} else if(type_ == SIMPLE_FUNC_SQRT) {\n\t\t\tX = sqrt(x);\n\t\t} else if(type_ == SIMPLE_FUNC_LOG) {\n\t\t\tX = log(x);\n\t\t} else {\n\t\t\tthrow 1;\n\t\t}\n\t\tdouble ret = std::max(I_, std::min(X_, C_ + L_ * X));\n\t\tif(ret == std::numeric_limits<double>::max()) {\n\t\t\treturn std::numeric_limits<T>::max();\n\t\t} else if(ret == std::numeric_limits<double>::min()) {\n\t\t\treturn std::numeric_limits<T>::min();\n\t\t} else {\n\t\t\treturn 
(T)ret;\n\t\t}\n\t}\n\t\n\tstatic int parseType(const std::string& otype);\n\t\n\tstatic SimpleFunc parse(\n\t\tconst std::string& s,\n\t\tdouble defaultConst = 0.0,\n\t\tdouble defaultLinear = 0.0,\n\t\tdouble defaultMin = 0.0,\n\t\tdouble defaultMax = std::numeric_limits<double>::max());\n\nprotected:\n\n\tint type_;\n\tdouble I_, X_, C_, L_;\n};\n\n#endif /*ndef SIMPLE_FUNC_H_*/\n"
  },
  {
    "path": "sse_util.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#include \"sse_util.h\"\n#include \"aligner_swsse.h\"\n#include \"limit.h\"\n\n/**\n * Given a column of filled-in cells, save the checkpointed cells in cs_.\n */\nvoid Checkpointer::commitCol(\n\t__m128i *pvH,\n\t__m128i *pvE,\n\t__m128i *pvF,\n\tsize_t coli)\n{\n}\n"
  },
  {
    "path": "sse_util.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SSE_UTIL_H_\n#define SSE_UTIL_H_\n\n#include \"assert_helpers.h\"\n#include \"ds.h\"\n#include \"limit.h\"\n#include <iostream>\n#include <emmintrin.h>\n\nclass EList_m128i {\npublic:\n\n\t/**\n\t * Allocate initial default of S elements.\n\t */\n\texplicit EList_m128i(int cat = 0) :\n\t\tcat_(cat), last_alloc_(NULL), list_(NULL), sz_(0), cur_(0)\n\t{\n\t\tassert_geq(cat, 0);\n\t}\n\n\t/**\n\t * Destructor.\n\t */\n\t~EList_m128i() { free(); }\n\n\t/**\n\t * Return number of elements.\n\t */\n\tinline size_t size() const { return cur_; }\n\n\t/**\n\t * Return number of elements allocated.\n\t */\n\tinline size_t capacity() const { return sz_; }\n\t\n\t/**\n\t * Ensure that there is sufficient capacity to expand to include\n\t * 'thresh' more elements without having to expand.\n\t */\n\tinline void ensure(size_t thresh) {\n\t\tif(list_ == NULL) lazyInit();\n\t\texpandCopy(cur_ + thresh);\n\t}\n\n\t/**\n\t * Ensure that there is sufficient capacity to include 'newsz' elements.\n\t * If there isn't enough capacity right now, expand capacity to exactly\n\t * equal 'newsz'.\n\t */\n\tinline void reserveExact(size_t newsz) {\n\t\tif(list_ == NULL) 
lazyInitExact(newsz);\n\t\texpandCopyExact(newsz);\n\t}\n\n\t/**\n\t * Return true iff there are no elements.\n\t */\n\tinline bool empty() const { return cur_ == 0; }\n\t\n\t/**\n\t * Return true iff list hasn't been initialized yet.\n\t */\n\tinline bool null() const { return list_ == NULL; }\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) {\n\t\t\texpandCopy(sz);\n\t\t}\n\t\tcur_ = sz;\n\t}\n\t\n\t/**\n\t * Zero out contents of vector.\n\t */\n\tvoid zero() {\n\t\tif(cur_ > 0) {\n\t\t\tmemset(list_, 0, cur_ * sizeof(__m128i));\n\t\t}\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to at least sz\n\t * and set cur_ to requested sz.  Do not copy the elements over.\n\t */\n\tvoid resizeNoCopy(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInit();\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) {\n\t\t\texpandNoCopy(sz);\n\t\t}\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * If size is less than requested size, resize up to exactly sz and set\n\t * cur_ to requested sz.\n\t */\n\tvoid resizeExact(size_t sz) {\n\t\tif(sz > 0 && list_ == NULL) lazyInitExact(sz);\n\t\tif(sz <= cur_) {\n\t\t\tcur_ = sz;\n\t\t\treturn;\n\t\t}\n\t\tif(sz_ < sz) expandCopyExact(sz);\n\t\tcur_ = sz;\n\t}\n\n\t/**\n\t * Make the stack empty.\n\t */\n\tvoid clear() {\n\t\tcur_ = 0; // re-use stack memory\n\t\t// Don't clear heap; re-use it\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline __m128i& operator[](size_t i) {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline __m128i operator[](size_t i) const {\n\t\tassert_lt(i, cur_);\n\t\treturn list_[i];\n\t}\n\n\t/**\n\t * Return a reference to the ith element.\n\t */\n\tinline 
__m128i& get(size_t i) {\n\t\treturn operator[](i);\n\t}\n\t\n\t/**\n\t * Return a copy of the ith element.\n\t */\n\tinline __m128i get(size_t i) const {\n\t\treturn operator[](i);\n\t}\n\n\t/**\n\t * Return a pointer to the beginning of the buffer.\n\t */\n\t__m128i *ptr() { return list_; }\n\n\t/**\n\t * Return a const pointer to the beginning of the buffer.\n\t */\n\tconst __m128i *ptr() const { return list_; }\n\n\t/**\n\t * Return memory category.\n\t */\n\tint cat() const { return cat_; }\n\nprivate:\n\n\t/**\n\t * Initialize memory for EList.\n\t */\n\tvoid lazyInit() {\n\t\tassert(list_ == NULL);\n\t\tlist_ = alloc(sz_);\n\t}\n\n\t/**\n\t * Initialize exactly the prescribed number of elements for EList.\n\t */\n\tvoid lazyInitExact(size_t sz) {\n\t\tassert_gt(sz, 0);\n\t\tassert(list_ == NULL);\n\t\tsz_ = sz;\n\t\tlist_ = alloc(sz);\n\t}\n\n\t/**\n\t * Allocate an __m128i array with room for sz elements and return a\n\t * 16-byte-aligned pointer into it.  Also, tally into the global memory\n\t * tally.\n\t */\n\t__m128i *alloc(size_t sz) {\n\t\t__m128i* last_alloc_;\n\t\ttry {\n\t\t\tlast_alloc_ = new __m128i[sz + 2];\n\t\t} catch(std::bad_alloc& e) {\n\t\t\tstd::cerr << \"Error: Out of memory allocating \" << sz << \" __m128i's for DP matrix: '\" << e.what() << \"'\" << std::endl;\n\t\t\tthrow e;\n\t\t}\n\t\t__m128i* tmp = last_alloc_;\n\t\tsize_t tmpint = (size_t)tmp;\n\t\t// Align it!\n\t\tif((tmpint & 0xf) != 0) {\n\t\t\ttmpint += 15;\n\t\t\ttmpint &= (~0xf);\n\t\t\ttmp = reinterpret_cast<__m128i*>(tmpint);\n\t\t}\n\t\tassert_eq(0, (tmpint & 0xf)); // should be 16-byte aligned\n\t\tassert(tmp != NULL);\n\t\tgMemTally.add(cat_, sz);\n\t\treturn tmp;\n\t}\n\n\t/**\n\t * Free the backing allocation and remove it from the global 
memory tally.\n\t */\n\tvoid free() {\n\t\tif(list_ != NULL) {\n\t\t\tdelete[] last_alloc_;\n\t\t\tgMemTally.del(cat_, sz_);\n\t\t\tlist_ = NULL;\n\t\t\tsz_ = cur_ = 0;\n\t\t}\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.  Size\n\t * increases exponentially with number of expansions.  Copy old contents\n\t * into new buffer using operator=.\n\t */\n\tvoid expandCopy(size_t thresh) {\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\texpandCopyExact(newsz);\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has exactly 'newsz' elements.  Copy\n\t * old contents into new buffer using operator=.\n\t */\n\tvoid expandCopyExact(size_t newsz) {\n\t\tif(newsz <= sz_) return;\n\t\t__m128i* tmp = alloc(newsz);\n\t\tassert(tmp != NULL);\n\t\tsize_t cur = cur_;\n\t\tif(list_ != NULL) {\n \t\t\tfor(size_t i = 0; i < cur_; i++) {\n\t\t\t\t// Note: operator= is used\n\t\t\t\ttmp[i] = list_[i];\n\t\t\t}\n\t\t\tfree();\n\t\t}\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tcur_ = cur;\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has at least 'thresh' elements.\n\t * Size increases exponentially with number of expansions.  Don't copy old\n\t * contents into the new buffer.\n\t */\n\tvoid expandNoCopy(size_t thresh) {\n\t\tassert(list_ != NULL);\n\t\tif(thresh <= sz_) return;\n\t\tsize_t newsz = (sz_ * 2)+1;\n\t\twhile(newsz < thresh) newsz *= 2;\n\t\texpandNoCopyExact(newsz);\n\t}\n\n\t/**\n\t * Expand the list_ buffer until it has exactly 'newsz' elements.  
Don't\n\t * copy old contents into the new buffer.\n\t */\n\tvoid expandNoCopyExact(size_t newsz) {\n\t\tassert(list_ != NULL);\n\t\tassert_gt(newsz, 0);\n\t\tfree();\n\t\t__m128i* tmp = alloc(newsz);\n\t\tassert(tmp != NULL);\n\t\tlist_ = tmp;\n\t\tsz_ = newsz;\n\t\tassert_gt(sz_, 0);\n\t}\n\n\tint      cat_;        // memory category, for accounting purposes\n\t__m128i* last_alloc_; // what new[] originally returns\n\t__m128i *list_;       // list ptr, aligned version of what new[] returns\n\tsize_t   sz_;         // capacity\n\tsize_t   cur_;        // occupancy (AKA size)\n};\n\nstruct  CpQuad {\n\tCpQuad() { reset(); }\n\t\n\tvoid reset() { sc[0] = sc[1] = sc[2] = sc[3] = 0; }\n\t\n\tbool operator==(const CpQuad& o) const {\n\t\treturn sc[0] == o.sc[0] &&\n\t\t       sc[1] == o.sc[1] &&\n\t\t\t   sc[2] == o.sc[2] &&\n\t\t\t   sc[3] == o.sc[3];\n\t}\n\n\tint16_t sc[4];\n};\n\n/**\n * Encapsulates a collection of checkpoints.  Assumes the scheme is to\n * checkpoint adjacent pairs of anti-diagonals.\n */\nclass Checkpointer {\n\npublic:\n\n\tCheckpointer() { reset(); }\n\t\n\t/**\n\t * Set the checkpointer up for a new rectangle.\n\t */\n\tvoid init(\n\t\tsize_t nrow,          // # of rows\n\t\tsize_t ncol,          // # of columns\n\t\tsize_t perpow2,       // checkpoint every 1 << perpow2 diags (& next)\n\t\tint64_t perfectScore, // what is a perfect score?  for sanity checks\n\t\tbool is8,             // 8-bit?\n\t\tbool doTri,           // triangle shaped?\n\t\tbool local,           // is alignment local?  
for sanity checks\n\t\tbool debug)           // gather debug checkpoints?\n\t{\n\t\tassert_gt(perpow2, 0);\n\t\tnrow_ = nrow;\n\t\tncol_ = ncol;\n\t\tperpow2_ = perpow2;\n\t\tper_ = 1 << perpow2;\n\t\tlomask_ = ~(0xffffffff << perpow2);\n\t\tperf_ = perfectScore;\n\t\tlocal_ = local;\n\t\tndiag_ = (ncol + nrow - 1 + 1) / per_;\n\t\tlocol_ = MAX_SIZE_T;\n\t\thicol_ = MIN_SIZE_T;\n//\t\tdebug_ = debug;\n\t\tdebug_ = true;\n\t\tcommitMap_.clear();\n\t\tfirstCommit_ = true;\n\t\tsize_t perword = (is8 ? 16 : 8);\n\t\tis8_ = is8;\n\t\tniter_ = ((nrow_ + perword - 1) / perword);\n\t\tif(doTri) {\n\t\t\t// Save a pair of anti-diagonals every per_ anti-diagonals for\n\t\t\t// backtrace purposes\n\t\t\tqdiag1s_.resize(ndiag_ * nrow_);\n\t\t\tqdiag2s_.resize(ndiag_ * nrow_);\n\t\t} else {\n\t\t\t// Save every per_ columns and rows for backtrace purposes\n\t\t\tqrows_.resize((nrow_ / per_) * ncol_);\n\t\t\tqcols_.resize((ncol_ / per_) * (niter_ << 2));\n\t\t}\n\t\tif(debug_) {\n\t\t\t// Save all columns for debug purposes\n\t\t\tqcolsD_.resize(ncol_ * (niter_ << 2));\n\t\t}\n\t}\n\t\n\t/**\n\t * Return true iff we've been collecting debug cells.\n\t */\n\tbool debug() const { return debug_; }\n\t\n\t/**\n\t * Check whether the given score matches the saved score at row, col, hef.\n\t */\n\tint64_t debugCell(size_t row, size_t col, int hef) const {\n\t\tassert(debug_);\n\t\tconst __m128i* ptr = qcolsD_.ptr() + hef;\n\t\t// Fast forward to appropriate column\n\t\tptr += ((col * niter_) << 2);\n\t\tsize_t mod = row % niter_; // which m128i\n\t\tsize_t div = row / niter_; // offset into m128i\n\t\t// Fast forward to appropriate word\n\t\tptr += (mod << 2);\n\t\t// Extract score\n\t\tint16_t sc = (is8_ ? 
((uint8_t*)ptr)[div] : ((int16_t*)ptr)[div]);\n\t\tint64_t asc = MIN_I64;\n\t\t// Convert score\n\t\tif(is8_) {\n\t\t\tif(local_) {\n\t\t\t\tasc = sc;\n\t\t\t} else {\n\t\t\t\tif(sc == 0) asc = MIN_I64;\n\t\t\t\telse asc = sc - 0xff;\n\t\t\t}\n\t\t} else {\n\t\t\tif(local_) {\n\t\t\t\tasc = sc + 0x8000;\n\t\t\t} else {\n\t\t\t\tif(sc != MIN_I16) asc = sc - 0x7fff;\n\t\t\t}\n\t\t}\n\t\treturn asc;\n\t}\n\t\n\t/**\n\t * Return true iff the given row/col is checkpointed.\n\t */\n\tbool isCheckpointed(size_t row, size_t col) const {\n\t\tassert_leq(col, hicol_);\n\t\tassert_geq(col, locol_);\n\t\tsize_t mod = (row + col) & lomask_;\n\t\tassert_lt(mod, per_);\n\t\treturn mod >= per_ - 2;\n\t}\n\n\t/**\n\t * Return the checkpointed H, E, or F score from the given cell.\n\t */\n\tinline int64_t scoreTriangle(size_t row, size_t col, int hef) const {\n\t\tassert(isCheckpointed(row, col));\n\t\tbool diag1 = ((row + col) & lomask_) == per_ - 2;\n\t\tsize_t off = (row + col) >> perpow2_;\n\t\tif(diag1) {\n\t\t\tif(qdiag1s_[off * nrow_ + row].sc[hef] == MIN_I16) {\n\t\t\t\treturn MIN_I64;\n\t\t\t} else {\n\t\t\t\treturn qdiag1s_[off * nrow_ + row].sc[hef];\n\t\t\t}\n\t\t} else {\n\t\t\tif(qdiag2s_[off * nrow_ + row].sc[hef] == MIN_I16) {\n\t\t\t\treturn MIN_I64;\n\t\t\t} else {\n\t\t\t\treturn qdiag2s_[off * nrow_ + row].sc[hef];\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Return the checkpointed H, E, or F score from the given cell.\n\t */\n\tinline int64_t scoreSquare(size_t row, size_t col, int hef) const {\n\t\t// Is it in a checkpointed row?  
Note that checkpointed rows don't\n\t\t// necessarily have the horizontal contributions calculated, so we want\n\t\t// to use the column info in that case.\n\t\tif((row & lomask_) == lomask_ && hef != 1) {\n\t\t\tint64_t sc = qrows_[(row >> perpow2_) * ncol_ + col].sc[hef];\n\t\t\tif(sc == MIN_I16) return MIN_I64;\n\t\t\treturn sc;\n\t\t}\n\t\thef--;\n\t\tif(hef == -1) hef = 2;\n\t\t// It must be in a checkpointed column\n\t\tassert_eq(lomask_, (col & lomask_));\n\t\t// Fast forward to appropriate column\n\t\tconst __m128i* ptr = qcols_.ptr() + hef;\n\t\tptr += (((col >> perpow2_) * niter_) << 2);\n\t\tsize_t mod = row % niter_; // which m128i\n\t\tsize_t div = row / niter_; // offset into m128i\n\t\t// Fast forward to appropriate word\n\t\tptr += (mod << 2);\n\t\t// Extract score\n\t\tint16_t sc = (is8_ ? ((uint8_t*)ptr)[div] : ((int16_t*)ptr)[div]);\n\t\tint64_t asc = MIN_I64;\n\t\t// Convert score\n\t\tif(is8_) {\n\t\t\tif(local_) {\n\t\t\t\tasc = sc;\n\t\t\t} else {\n\t\t\t\tif(sc == 0) asc = MIN_I64;\n\t\t\t\telse asc = sc - 0xff;\n\t\t\t}\n\t\t} else {\n\t\t\tif(local_) {\n\t\t\t\tasc = sc + 0x8000;\n\t\t\t} else {\n\t\t\t\tif(sc != MIN_I16) asc = sc - 0x7fff;\n\t\t\t}\n\t\t}\n\t\treturn asc;\n\t}\n\n\t/**\n\t * Given a column of filled-in cells, save the checkpointed cells in cs_.\n\t */\n\tvoid commitCol(__m128i *pvH, __m128i *pvE, __m128i *pvF, size_t coli);\n\t\n\t/**\n\t * Reset the state of the Checkpointer.\n\t */\n\tvoid reset() {\n\t\tperpow2_ = per_ = lomask_ = nrow_ = ncol_ = 0;\n\t\tlocal_ = false;\n\t\tniter_ = ndiag_ = locol_ = hicol_ = 0;\n\t\tperf_ = 0;\n\t\tfirstCommit_ = true;\n\t\tis8_ = debug_ = false;\n\t}\n\t\n\t/**\n\t * Return true iff the Checkpointer has been initialized.\n\t */\n\tbool inited() const {\n\t\treturn nrow_ > 0;\n\t}\n\t\n\tsize_t per()     const { return per_;     }\n\tsize_t perpow2() const { return perpow2_; }\n\tsize_t lomask()  const { return lomask_;  }\n\tsize_t locol()   const { return locol_;   }\n\tsize_t 
hicol()   const { return hicol_;   }\n\tsize_t nrow()    const { return nrow_;    }\n\tsize_t ncol()    const { return ncol_;    }\n\t\n\tconst CpQuad* qdiag1sPtr() const { return qdiag1s_.ptr(); }\n\tconst CpQuad* qdiag2sPtr() const { return qdiag2s_.ptr(); }\n\n\tsize_t   perpow2_;   // 1 << perpow2_ - 2 is the # of uncheckpointed\n\t                     // anti-diags between checkpointed anti-diag pairs\n\tsize_t   per_;       // 1 << perpow2_\n\tsize_t   lomask_;    // mask for extracting low bits\n\tsize_t   nrow_;      // # rows in current rectangle\n\tsize_t   ncol_;      // # cols in current rectangle\n\tint64_t  perf_;      // perfect score\n\tbool     local_;     // local alignment?\n\t\n\tsize_t   ndiag_;     // # of double-diags\n\t\n\tsize_t   locol_;     // leftmost column committed\n\tsize_t   hicol_;     // rightmost column committed\n\n\t// Map for committing scores from vector columns to checkpointed diagonals\n\tEList<size_t> commitMap_;\n\tbool          firstCommit_;\n\t\n\tEList<CpQuad> qdiag1s_; // checkpoint H/E/F values for diagonal 1\n\tEList<CpQuad> qdiag2s_; // checkpoint H/E/F values for diagonal 2\n\n\tEList<CpQuad> qrows_;   // checkpoint H/E/F values for rows\n\t\n\t// We store columns in this way to reduce overhead of populating them\n\tbool          is8_;     // true -> fill used 8-bit cells\n\tsize_t        niter_;   // # __m128i words per column\n\tEList_m128i   qcols_;   // checkpoint E/F/H values for select columns\n\t\n\tbool          debug_;   // get debug checkpoints? (i.e. fill qcolsD_?)\n\tEList_m128i   qcolsD_;  // checkpoint E/F/H values for all columns (debug)\n};\n\n#endif\n"
  },
  {
    "path": "sstring.cpp",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifdef MAIN_SSTRING\n\n#include <string.h>\n#include <iostream>\n#include \"ds.h\"\n#include \"sstring.h\"\n\nusing namespace std;\n\nint main(void) {\n\tcerr << \"Test inter-class comparison operators...\";\n\t{\n\t\tSString<int> s(2);\n\t\ts.set('a', 0);\n\t\ts.set('b', 1);\n\t\tassert(sstr_eq(s, (const char *)\"ab\"));\n\t\tassert(!sstr_neq(s, (const char *)\"ab\"));\n\t\tassert(!sstr_lt(s, (const char *)\"ab\"));\n\t\tassert(!sstr_gt(s, (const char *)\"ab\"));\n\t\tassert(sstr_leq(s, (const char *)\"ab\"));\n\t\tassert(sstr_geq(s, (const char *)\"ab\"));\n\t\t\n\t\tSStringExpandable<int> s2;\n\t\ts2.append('a');\n\t\ts2.append('b');\n\t\tassert(sstr_eq(s, s2));\n\t\tassert(sstr_eq(s2, (const char *)\"ab\"));\n\t\tassert(!sstr_neq(s, s2));\n\t\tassert(!sstr_neq(s2, (const char *)\"ab\"));\n\t\tassert(!sstr_lt(s, s2));\n\t\tassert(!sstr_lt(s2, (const char *)\"ab\"));\n\t\tassert(!sstr_gt(s, s2));\n\t\tassert(!sstr_gt(s2, (const char *)\"ab\"));\n\t\tassert(sstr_leq(s, s2));\n\t\tassert(sstr_leq(s2, (const char *)\"ab\"));\n\t\tassert(sstr_geq(s, s2));\n\t\tassert(sstr_geq(s2, (const char *)\"ab\"));\n\n\t\tSStringFixed<int, 12> s3;\n\t\ts3.append('a');\n\t\ts3.append('b');\n\t\tassert(sstr_eq(s, 
s3));\n\t\tassert(sstr_eq(s2, s3));\n\t\tassert(sstr_eq(s3, (const char *)\"ab\"));\n\t\tassert(!sstr_neq(s, s3));\n\t\tassert(!sstr_neq(s2, s3));\n\t\tassert(!sstr_neq(s3, (const char *)\"ab\"));\n\t\tassert(!sstr_lt(s, s3));\n\t\tassert(!sstr_lt(s2, s3));\n\t\tassert(!sstr_lt(s3, (const char *)\"ab\"));\n\t\tassert(!sstr_gt(s, s3));\n\t\tassert(!sstr_gt(s2, s3));\n\t\tassert(!sstr_gt(s3, (const char *)\"ab\"));\n\t\tassert(sstr_geq(s, s3));\n\t\tassert(sstr_geq(s2, s3));\n\t\tassert(sstr_geq(s3, (const char *)\"ab\"));\n\t\tassert(sstr_leq(s, s3));\n\t\tassert(sstr_leq(s2, s3));\n\t\tassert(sstr_leq(s3, (const char *)\"ab\"));\n\t}\n\tcerr << \"PASSED\" << endl;\n\t\n\tcerr << \"Test flag for whether to consider end-of-word < other chars ...\";\n\t{\n\t\tSString<char> ss(\"String\");\n\t\tSString<char> sl(\"String1\");\n\t\tassert(sstr_lt(ss, sl));\n\t\tassert(sstr_gt(ss, sl, false));\n\t\tassert(sstr_leq(ss, sl));\n\t\tassert(sstr_geq(ss, sl, false));\n\t}\n\tcerr << \"PASSED\" << endl;\n\t\n\tcerr << \"Test toZBuf and toZBufXForm ...\";\n\t{\n\t\tSString<uint32_t> s(10);\n\t\tfor(int i = 0; i < 10; i++) {\n\t\t\ts[i] = (uint32_t)i;\n\t\t}\n\t\tassert(strcmp(s.toZBufXForm(\"0123456789\"), \"0123456789\") == 0);\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Test S2bDnaString ...\";\n\t{\n\t\tconst char *str =\n\t\t\t\"ACGTACGTAC\" \"ACGTACGTAC\" \"ACGTACGTAC\"\n\t\t\t\"ACGTACGTAC\" \"ACGTACGTAC\" \"ACGTACGTAC\";\n\t\tconst char *gs =\n\t\t\t\"GGGGGGGGGG\" \"GGGGGGGGGG\" \"GGGGGGGGGG\"\n\t\t\t\"GGGGGGGGGG\" \"GGGGGGGGGG\" \"GGGGGGGGGG\";\n\t\tfor(size_t i = 0; i < 60; i++) {\n\t\t\tS2bDnaString s(str, i, true);\n\t\t\tS2bDnaString sr;\n\t\t\tBTDnaString s2(str, i, true);\n\t\t\tassert(sstr_eq(s, s2));\n\t\t\tif(i >= 10) {\n\t\t\t\tBTDnaString s3;\n\t\t\t\ts.windowGetDna(s3, true, false, 3, 4);\n\t\t\t\tassert(sstr_eq(s3.toZBuf(), (const char*)\"TACG\"));\n\t\t\t\ts.windowGetDna(s3, false, false, 3, 4);\n\t\t\t\tassert(sstr_eq(s3.toZBuf(), (const 
char*)\"CGTA\"));\n\t\t\t\tassert_eq('A', s.toChar(0));\n\t\t\t\tassert_eq('G', s.toChar(2));\n\t\t\t\tassert_eq('A', s.toChar(4));\n\t\t\t\tassert_eq('G', s.toChar(6));\n\t\t\t\tassert_eq('A', s.toChar(8));\n\t\t\t\t\n\t\t\t\ts.reverseWindow(1, 8);\n\t\t\t\ts2.reverseWindow(1, 8);\n\t\t\t\t\n\t\t\t\tassert_eq('A', s.toChar(1));\n\t\t\t\tassert_eq('T', s.toChar(2));\n\t\t\t\tassert_eq('G', s.toChar(3));\n\t\t\t\tassert_eq('C', s.toChar(4));\n\t\t\t\tassert_eq('A', s.toChar(5));\n\t\t\t\tassert_eq('T', s.toChar(6));\n\t\t\t\tassert_eq('G', s.toChar(7));\n\t\t\t\tassert_eq('C', s.toChar(8));\n\t\t\t\tassert(sstr_eq(s, s2));\n\n\t\t\t\ts.reverseWindow(1, 8);\n\t\t\t\ts2.reverseWindow(1, 8);\n\t\t\t\tassert(sstr_eq(s, s2));\n\t\t\t}\n\t\t\tif(i > 1) {\n\t\t\t\ts.reverse();\n\t\t\t\tsr.installReverseChars(str, i);\n\t\t\t\ts2.reverse();\n\t\t\t\tassert(sstr_eq(s, s2));\n\t\t\t\tassert(sstr_eq(sr, s2));\n\t\t\t\ts.reverse();\n\t\t\t\tsr.reverse();\n\t\t\t\tassert(sstr_neq(s, s2));\n\t\t\t\tassert(sstr_neq(sr, s2));\n\t\t\t\ts.fill(2);\n\t\t\t\ts2.reverse();\n\t\t\t\tassert(sstr_leq(s, gs));\n\t\t\t\tassert(sstr_gt(s, s2));\n\t\t\t\tassert(sstr_gt(s, sr));\n\t\t\t\ts2.fill(2);\n\t\t\t\tsr.fill(2);\n\t\t\t\tassert(sstr_eq(s, s2));\n\t\t\t\tassert(sstr_eq(s, sr));\n\t\t\t}\n\t\t}\n\t\tS2bDnaString s(str, true);\n\t\tS2bDnaString sr;\n\t\tBTDnaString s2(str, true);\n\t\tassert(sstr_eq(s2.toZBuf(), str));\n\t\tassert(sstr_eq(s, s2));\n\t\ts.reverse();\n\t\tsr.installReverseChars(str);\n\t\ts2.reverse();\n\t\tassert(sstr_eq(s, s2));\n\t\tassert(sstr_eq(sr, s2));\n\t\ts.reverse();\n\t\tsr.reverse();\n\t\tassert(sstr_neq(s, s2));\n\t\tassert(sstr_neq(sr, s2));\n\t}\n\tcerr << \"PASSED\" << endl;\n\n\tcerr << \"Test operator=() ...\";\n\t{\n\t\tS2bDnaString s;\n\t\ts.installChars(string(\"gtcagtca\"));\n\t\tassert(sstr_eq(s.toZBuf(), (const char *)\"GTCAGTCA\"));\n\t}\n\tcerr << \"PASSED\" << endl;\n\t\n\tcerr << \"Conversions from string ...\";\n\t{\n\t\tSStringExpandable<char> 
se(string(\"hello\"));\n\t\tEList<SStringExpandable<char> > sel;\n\t\tsel.push_back(SStringExpandable<char>(string(\"hello\")));\n\t}\n\tcerr << \"PASSED\" << endl;\n\t\n\tcerr << \"PASSED\" << endl;\n}\n\n#endif /*def MAIN_SSTRING*/\n"
  },
  {
    "path": "sstring.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef SSTRING_H_\n#define SSTRING_H_\n\n#include <string.h>\n#include <iostream>\n#include <stdlib.h>     /* exit, EXIT_FAILURE */\n#include <bitset>\n#include <vector>\n#include \"assert_helpers.h\"\n#include \"alphabet.h\"\n#include \"random_source.h\"\n\n/**\n * Four kinds of strings defined here:\n *\n * SString:\n *   A fixed-length string using heap memory with size set at construction time\n *   or when install() member is called.\n *\n * S2bDnaString:\n *   Like SString, but stores a list uint32_t words where each word is divided\n *   into 16 2-bit slots interpreted as holding one A/C/G/T nucleotide each.\n *\n * TODO: S3bDnaString allowing N.  S4bDnaString allowing nucleotide masks.\n *\n * SStringExpandable:\n *   A string using heap memory where the size of the backing store is\n *   automatically resized as needed.  Supports operations like append, insert,\n *   erase, etc.\n *\n * SStringFixed:\n *   A fixed-length string using stack memory where size is set at compile\n *   time.\n *\n * All string classes have some extra facilities that make it easy to print the\n * string, including when the string uses an encoded alphabet.  
See toZBuf()\n * and toZBufXForm().\n *\n * Global sstr_lt, sstr_eq, and sstr_gt template functions (and related\n * variants) are supplied.  They are capable of doing lexicographical\n * comparisons between any of the four kinds of strings defined here.\n */\n\ntemplate<typename T>\nclass Class_sstr_len {\npublic:\n\tstatic inline size_t sstr_len(const T& s) {\n\t\treturn s.length();\n\t}\n};\n\ntemplate<unsigned N>\nclass Class_sstr_len<const char[N]> {\npublic:\n\tstatic inline size_t sstr_len(const char s[N]) {\n\t\treturn strlen(s);\n\t}\n};\n\ntemplate<>\nclass Class_sstr_len<const char *> {\npublic:\n\tstatic inline size_t sstr_len(const char *s) {\n\t\treturn strlen(s);\n\t}\n};\n\ntemplate<>\nclass Class_sstr_len<const unsigned char *> {\npublic:\n\tstatic inline size_t sstr_len(const unsigned char *s) {\n\t\treturn strlen((const char *)s);\n\t}\n};\n\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_eq(const T1& s1, const T2& s2) {\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1);\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2);\n\tif(len1 != len2) return false;\n\tfor(size_t i = 0; i < len1; i++) {\n\t\tif(s1[i] != s2[i]) return false;\n\t}\n\treturn true;\n}\n\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_neq(const T1& s1, const T2& s2) {\n\treturn !sstr_eq(s1, s2);\n}\n\n/**\n * Return true iff the given suffix of s1 is equal to the given suffix of s2 up\n * to upto characters.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_upto_eq(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tsize_t upto,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(suf2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1) - suf1;\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2) - suf2;\n\tif(len1 > upto) len1 = upto;\n\tif(len2 > upto) len2 = upto;\n\tif(len1 != len2) return false;\n\tfor(size_t i = 0; i < len1; i++) {\n\t\tif(s1[suf1+i] != s2[suf2+i]) 
{\n\t\t\treturn false;\n\t\t}\n\t}\n\treturn true;\n}\n\n/**\n * Return true iff the given suffix of s1 is not equal to the given suffix of\n * s2 up to upto characters.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_upto_neq(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tsize_t upto,\n\tbool endlt = true)\n{\n\treturn !sstr_suf_upto_eq(s1, suf1, s2, suf2, upto, endlt);\n}\n\n/**\n * Return true iff s1 is less than s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_lt(const T1& s1, const T2& s2, bool endlt = true) {\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1);\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2);\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] < s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] > s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff the given suffix of s1 is less than the given suffix of s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_lt(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(suf2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1) - suf1;\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2) - suf2;\n\tsize_t minlen = (len1 < len2 ? 
len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[suf1+i] < s2[suf2+i]) {\n\t\t\treturn true;\n\t\t} else if(s1[suf1+i] > s2[suf2+i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff the given suffix of s1 is less than the given suffix of s2.\n * Treat s1 and s2 as though they have lengths len1/len2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_lt(\n\tconst T1& s1, size_t suf1, size_t len1,\n\tconst T2& s2, size_t suf2, size_t len2,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, len1);\n\tassert_leq(suf2, len2);\n\tsize_t left1 = len1 - suf1;\n\tsize_t left2 = len2 - suf2;\n\tsize_t minleft = (left1 < left2 ? left1 : left2);\n\tfor(size_t i = 0; i < minleft; i++) {\n\t\tif(s1[suf1+i] < s2[suf2+i]) {\n\t\t\treturn true;\n\t\t} else if(s1[suf1+i] > s2[suf2+i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(left1 == left2) return false;\n\treturn (left1 < left2) == endlt;\n}\n\n/**\n * Return true iff the given suffix of s1 is less than the given suffix of s2\n * up to upto characters.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_upto_lt(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tsize_t upto,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(suf2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1) - suf1;\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2) - suf2;\n\tif(len1 > upto) len1 = upto;\n\tif(len2 > upto) len2 = upto;\n\tsize_t minlen = (len1 < len2 ? 
len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[suf1+i] < s2[suf2+i]) {\n\t\t\treturn true;\n\t\t} else if(s1[suf1+i] > s2[suf2+i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff the given prefix of s1 is less than the given prefix of s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_pre_lt(\n\tconst T1& s1, size_t pre1,\n\tconst T2& s2, size_t pre2,\n\tbool endlt = true)\n{\n\tassert_leq(pre1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(pre2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = pre1;\n\tsize_t len2 = pre2;\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] < s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] > s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff s1 is less than or equal to s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_leq(const T1& s1, const T2& s2, bool endlt = true) {\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1);\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2);\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] < s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] > s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return true;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff the given suffix of s1 is less than or equal to the given\n * suffix of s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_leq(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(suf2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1) - suf1;\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2) - suf2;\n\tsize_t minlen = (len1 < len2 ? 
len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[suf1+i] < s2[suf2+i]) {\n\t\t\treturn true;\n\t\t} else if(s1[suf1+i] > s2[suf2+i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return true;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff the given prefix of s1 is less than or equal to the given\n * prefix of s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_pre_leq(\n\tconst T1& s1, size_t pre1,\n\tconst T2& s2, size_t pre2,\n\tbool endlt = true)\n{\n\tassert_leq(pre1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(pre2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = pre1;\n\tsize_t len2 = pre2;\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] < s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] > s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return true;\n\treturn (len1 < len2) == endlt;\n}\n\n/**\n * Return true iff s1 is greater than s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_gt(const T1& s1, const T2& s2, bool endlt = true) {\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1);\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2);\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] > s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] < s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 > len2) == endlt;\n}\n\n/**\n * Return true iff the given suffix of s1 is greater than the given suffix of\n * s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_gt(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(suf2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1) - suf1;\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2) - suf2;\n\tsize_t minlen = (len1 < len2 ? 
len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[suf1+i] > s2[suf2+i]) {\n\t\t\treturn true;\n\t\t} else if(s1[suf1+i] < s2[suf2+i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 > len2) == endlt;\n}\n\n/**\n * Return true iff the given prefix of s1 is greater than the given prefix of\n * s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_pre_gt(\n\tconst T1& s1, size_t pre1,\n\tconst T2& s2, size_t pre2,\n\tbool endlt = true)\n{\n\tassert_leq(pre1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(pre2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = pre1;\n\tsize_t len2 = pre2;\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] > s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] < s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return false;\n\treturn (len1 > len2) == endlt;\n}\n\n/**\n * Return true iff s1 is greater than or equal to s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_geq(const T1& s1, const T2& s2, bool endlt = true) {\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1);\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2);\n\tsize_t minlen = (len1 < len2 ? 
len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] > s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] < s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return true;\n\treturn (len1 > len2) == endlt;\n}\n\n/**\n * Return true iff the given suffix of s1 is greater than or equal to the given\n * suffix of s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_suf_geq(\n\tconst T1& s1, size_t suf1,\n\tconst T2& s2, size_t suf2,\n\tbool endlt = true)\n{\n\tassert_leq(suf1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(suf2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = Class_sstr_len<T1>::sstr_len(s1) - suf1;\n\tsize_t len2 = Class_sstr_len<T2>::sstr_len(s2) - suf2;\n\tsize_t minlen = (len1 < len2 ? len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[suf1+i] > s2[suf2+i]) {\n\t\t\treturn true;\n\t\t} else if(s1[suf1+i] < s2[suf2+i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return true;\n\treturn (len1 > len2) == endlt;\n}\n\n/**\n * Return true iff the given prefix of s1 is greater than or equal to the given\n * prefix of s2.\n */\ntemplate<typename T1, typename T2>\nstatic inline bool sstr_pre_geq(\n\tconst T1& s1, size_t pre1,\n\tconst T2& s2, size_t pre2,\n\tbool endlt = true)\n{\n\tassert_leq(pre1, Class_sstr_len<T1>::sstr_len(s1));\n\tassert_leq(pre2, Class_sstr_len<T2>::sstr_len(s2));\n\tsize_t len1 = pre1;\n\tsize_t len2 = pre2;\n\tsize_t minlen = (len1 < len2 ? 
len1 : len2);\n\tfor(size_t i = 0; i < minlen; i++) {\n\t\tif(s1[i] > s2[i]) {\n\t\t\treturn true;\n\t\t} else if(s1[i] < s2[i]) {\n\t\t\treturn false;\n\t\t}\n\t}\n\tif(len1 == len2) return true;\n\treturn (len1 > len2) == endlt;\n}\n\ntemplate<typename T>\nstatic inline const char * sstr_to_cstr(const T& s) {\n\treturn s.toZBuf();\n}\n\ntemplate<>\ninline const char * sstr_to_cstr<std::basic_string<char> >(\n\tconst std::basic_string<char>& s)\n{\n\treturn s.c_str();\n}\n\n/**\n * Simple string class with backing memory whose size is managed by the user\n * using the constructor and install() member function.  No behind-the-scenes\n * reallocation or copying takes place.\n */\ntemplate<typename T>\nclass SString {\n\npublic:\n\n\texplicit SString() :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{ }\n\n\texplicit SString(size_t sz) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tresize(sz);\n\t}\n\n\t/**\n\t * Create an SString from another SString.\n\t */\n\tSString(const SString<T>& o) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Create an SString from a std::basic_string of the appropriate\n\t * type.\n\t */\n\texplicit SString(const std::basic_string<T>& str) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tinstall(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Create an SString from an array and size.\n\t */\n\texplicit SString(const T* b, size_t sz) 
:\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tinstall(b, sz);\n\t}\n\n\t/**\n\t * Create an SString from a zero-terminated array.\n\t */\n\texplicit SString(const T* b) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tinstall(b, strlen(b));\n\t}\n\n\t/**\n\t * Destroy the string object.\n\t */\n\tvirtual ~SString() {\n\t\tif(cs_ != NULL) {\n\t\t\tdelete[] cs_;\n\t\t\tcs_ = NULL;\n\t\t}\n\t\tif(printcs_ != NULL) {\n\t\t\tdelete[] printcs_;\n\t\t\tprintcs_ = NULL;\n\t\t}\n\t\tlen_ = 0;\n\t}\n\n\t/**\n\t * Assignment from another SString.\n\t */\n\tSString<T>& operator=(const SString<T>& o) {\n\t\tinstall(o.cs_, o.len_);\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Assignment from a std::basic_string.\n\t */\n\tSString<T>& operator=(const std::basic_string<T>& o) {\n\t\tinstall(o);\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Resizes the string without preserving its contents.  Both buffers\n\t * are allocated with new[], so they must be released with delete[].\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(cs_ != NULL) {\n\t\t\tdelete[] cs_;\n\t\t\tcs_ = NULL;\n\t\t}\n\t\tif(printcs_ != NULL) {\n\t\t\tdelete[] printcs_;\n\t\t\tprintcs_ = NULL;\n\t\t}\n\t\tif(sz != 0) {\n\t\t\tcs_ = new T[sz+1];\n\t\t}\n\t\tlen_ = sz;\n\t}\n\n\t/**\n\t * Return ith character from the left of either the forward or the\n\t * reverse version of the read.\n\t */\n\tT windowGet(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, len_ - depth);\n\t\treturn fw ? cs_[depth+i] : cs_[depth+len-i-1];\n\t}\n\n\t/**\n\t * Copy the window defined by depth and len into 'ret', taken from\n\t * either the forward or the reverse version of the read.\n\t */\n\tvoid windowGet(\n\t\tT& ret,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_leq(len, len_ - depth);\n\t\tret.resize(len);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tret.set(fw ? 
cs_[depth+i] : cs_[depth+len-i-1], i);\n\t\t}\n\t}\n\n\t/**\n\t * Set character at index 'idx' to 'c'.\n\t */\n\tinline void set(int c, size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tcs_[idx] = c;\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tinline const T& operator[](size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Retrieve mutable version of element i.\n\t */\n\tinline T& operator[](size_t i) {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tinline const T& get(size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.  memcpy is used, not\n\t * operator=.\n\t */\n\tvirtual void install(const T* b, size_t sz) {\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tmemcpy(cs_, b, sz * sizeof(T));\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.  memcpy is used, not\n\t * operator=.\n\t */\n\tvirtual void install(const std::basic_string<T>& b) {\n\t\tsize_t sz = b.length();\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tmemcpy(cs_, b.c_str(), sz * sizeof(T));\n\t}\n\n\t/**\n\t * Copy all bytes from zero-terminated buffer 'b' into this string.\n\t */\n\tvoid install(const T* b) {\n\t\tinstall(b, strlen(b));\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const char* b, size_t sz) {\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tcs_[i] = b[sz-i-1];\n\t\t}\n\t\tlen_ = sz;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const SString<T>& b) {\n\t\tinstallReverse(b.cs_, b.len_);\n\t}\n\t\n\t/**\n\t * Return true iff the two strings are equal.\n\t */\n\tbool operator==(const SString<T>& o) {\n\t\treturn sstr_eq(*this, o);\n\t}\n\n\t/**\n\t * Return 
true iff the two strings are not equal.\n\t */\n\tbool operator!=(const SString<T>& o) {\n\t\treturn sstr_neq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than given string.\n\t */\n\tbool operator<(const SString<T>& o) {\n\t\treturn sstr_lt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than given string.\n\t */\n\tbool operator>(const SString<T>& o) {\n\t\treturn sstr_gt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than or equal to given string.\n\t */\n\tbool operator<=(const SString<T>& o) {\n\t\treturn sstr_leq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than or equal to given string.\n\t */\n\tbool operator>=(const SString<T>& o) {\n\t\treturn sstr_geq(*this, o);\n\t}\n\n\t/**\n\t * Reverse the buffer in place.\n\t */\n\tvoid reverse() {\n\t\tfor(size_t i = 0; i < (len_ >> 1); i++) {\n\t\t\tT tmp = get(i);\n\t\t\tset(get(len_-i-1), i);\n\t\t\tset(tmp, len_-i-1);\n\t\t}\n\t}\n\n\t/**\n\t * Reverse a substring of the buffer in place.\n\t */\n\tvoid reverseWindow(size_t off, size_t len) {\n\t\tassert_leq(off, len_);\n\t\tassert_leq(off + len, len_);\n\t\tsize_t mid = len >> 1;\n\t\tfor(size_t i = 0; i < mid; i++) {\n\t\t\tT tmp = get(off+i);\n\t\t\tset(get(off+len-i-1), off+i);\n\t\t\tset(tmp, off+len-i-1);\n\t\t}\n\t}\n\n\t/**\n\t * Set the first len elements of the buffer to el.\n\t */\n\tvoid fill(size_t len, const T& el) {\n\t\tassert_leq(len, len_);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tset(el, i);\n\t\t}\n\t}\n\n\t/**\n\t * Set all elements of the buffer to el.\n\t */\n\tvoid fill(const T& el) {\n\t\tfill(len_, el);\n\t}\n\n\t/**\n\t * Return the length of the string.\n\t */\n\tinline size_t length() const { return len_; }\n\n\t/**\n\t * Clear the buffer.\n\t */\n\tvoid clear() { len_ = 0; }\n\n\t/**\n\t * Return true iff the buffer is empty.\n\t */\n\tinline bool empty() const { return len_ == 0; }\n\n\t/**\n\t * Put a terminator in the 'len_'th element 
and then return a\n\t * pointer to the buffer.  Useful for printing.\n\t */\n\tconst char* toZBufXForm(const char *xform) const {\n\t\tASSERT_ONLY(size_t xformElts = strlen(xform));\n\t\t// Lazily allocate space for print buffer\n\t\tif(printcs_ == NULL) {\n\t\t\tconst_cast<char*&>(printcs_) = new char[len_+1];\n\t\t}\n\t\tchar* printcs = const_cast<char*>(printcs_);\n\t\tassert(printcs != NULL);\n\t\tfor(size_t i = 0; i < len_; i++) {\n\t\t\tassert_lt(cs_[i], (int)xformElts);\n\t\t\tprintcs[i] = xform[cs_[i]];\n\t\t}\n\t\tprintcs[len_] = 0;\n\t\treturn printcs_;\n\t}\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  Useful for printing.\n\t */\n\tvirtual const T* toZBuf() const {\n\t\tconst_cast<T*>(cs_)[len_] = 0;\n\t\treturn cs_;\n\t}\n\n\t/**\n\t * Return a const version of the raw buffer.\n\t */\n\tconst T* buf() const { return cs_; }\n\n\t/**\n\t * Return a writeable version of the raw buffer.\n\t */\n\tT* wbuf() { return cs_; }\n\nprotected:\n\n\tT *cs_;      // +1 so that we have the option of dropping in a terminating \"\\0\"\n\tchar *printcs_; // +1 so that we have the option of dropping in a terminating \"\\0\"\n\tsize_t len_; // # elements\n};\n\n/**\n * Simple string class with backing memory whose size is managed by the user\n * using the constructor and install() member function.  
No behind-the-scenes\n * reallocation or copying takes place.\n */\nclass S2bDnaString {\n\npublic:\n\n\texplicit S2bDnaString() :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{ }\n\n\texplicit S2bDnaString(size_t sz) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tresize(sz);\n\t}\n\n\t/**\n\t * Copy another object of the same class.\n\t */\n\tS2bDnaString(const S2bDnaString& o) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Create an SStringExpandable from a std::basic_string of the\n\t * appropriate type.\n\t */\n\texplicit S2bDnaString(\n\t\tconst std::basic_string<char>& str,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(str.c_str(), str.length());\n\t\t\t} else {\n\t\t\t\tinstallChars(str.c_str(), str.length());\n\t\t\t}\n\t\t} else {\n\t\t\tinstall(str.c_str(), str.length());\n\t\t}\n\t}\n\n\t/**\n\t * Create an SStringExpandable from an array and size.\n\t */\n\texplicit S2bDnaString(\n\t\tconst char* b,\n\t\tsize_t sz,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(b, sz);\n\t\t\t} else {\n\t\t\t\tinstallChars(b, sz);\n\t\t\t}\n\t\t} else {\n\t\t\tinstall(b, sz);\n\t\t}\n\t}\n\n\t/**\n\t * Create an SStringFixed from a zero-terminated string.\n\t */\n\texplicit S2bDnaString(\n\t\tconst char* b,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0)\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(b, strlen(b));\n\t\t\t} else {\n\t\t\t\tinstallChars(b, strlen(b));\n\t\t\t}\n\t\t} else {\n\t\t\tinstall(b, strlen(b));\n\t\t}\n\t}\n\n\t/**\n\t * Destroy the expandable string object.\n\t */\n\tvirtual ~S2bDnaString() {\n\t\tif(cs_ != NULL) {\n\t\t\tdelete[] cs_;\n\t\t\tcs_ = 
NULL;\n\t\t}\n\t\tif(printcs_ != NULL) {\n\t\t\tdelete[] printcs_;\n\t\t\tprintcs_ = NULL;\n\t\t}\n\t\tlen_ = 0;\n\t}\n\n\t/**\n\t * Assignment from any string-like object that provides c_str() and\n\t * length().\n\t */\n\ttemplate<typename T>\n\tS2bDnaString& operator=(const T& o) {\n\t\tinstall(o.c_str(), o.length());\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Assignment from a std::basic_string\n\t */\n\ttemplate<typename T>\n\tS2bDnaString& operator=(const std::basic_string<char>& o) {\n\t\tinstall(o);\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Resizes the string without preserving its contents.  Both buffers\n\t * are allocated with new[], so they must be released with delete[].\n\t */\n\tvoid resize(size_t sz) {\n\t\tif(cs_ != NULL) {\n\t\t\tdelete[] cs_;\n\t\t\tcs_ = NULL;\n\t\t}\n\t\tif(printcs_ != NULL) {\n\t\t\tdelete[] printcs_;\n\t\t\tprintcs_ = NULL;\n\t\t}\n\t\tlen_ = sz;\n\t\tif(sz != 0) {\n\t\t\tcs_ = new uint32_t[nwords()];\n\t\t}\n\t}\n\n\t/**\n\t * Return DNA character corresponding to element 'idx'.\n\t */\n\tchar toChar(size_t idx) const {\n\t\tint c = (int)get(idx);\n\t\tassert_range(0, 3, c);\n\t\treturn \"ACGT\"[c];\n\t}\n\n\t/**\n\t * Return color character corresponding to element 'idx'.\n\t */\n\tchar toColor(size_t idx) const {\n\t\tint c = (int)get(idx);\n\t\tassert_range(0, 3, c);\n\t\treturn \"0123\"[c];\n\t}\n\n\t/**\n\t * Return ith character from the left of either the forward or the\n\t * reverse version of the read.\n\t */\n\tchar windowGet(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, len_ - depth);\n\t\treturn fw ? get(depth+i) : get(depth+len-i-1);\n\t}\n\n\t/**\n\t * Copy the window defined by depth and len into 'ret', taken from\n\t * either the forward or the reverse version of the read.\n\t */\n\ttemplate<typename T>\n\tvoid windowGet(\n\t\tT& ret,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_leq(len, len_ - depth);\n\t\tret.resize(len);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tret.set((fw ? 
get(depth+i) : get(depth+len-i-1)), i);\n\t\t}\n\t}\n\t\n\t/**\n\t * Return length in 32-bit words.\n\t */\n\tsize_t nwords() const {\n\t\treturn (len_ + 15) >> 4;\n\t}\n\n\t/**\n\t * Set character at index 'idx' to 'c'.\n\t */\n\tvoid set(int c, size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tassert_range(0, 3, c);\n\t\tsize_t word = idx >> 4;\n\t\tsize_t bpoff = (idx & 15) << 1;\n\t\tcs_[word] = cs_[word] & ~(uint32_t)(3 << bpoff);\n\t\tcs_[word] = cs_[word] |  (uint32_t)(c << bpoff);\n\t}\n\n\t/**\n\t * Set character at index 'idx' to DNA char 'c'.\n\t */\n\tvoid setChar(int c, size_t idx) {\n\t\tassert_in(toupper(c), \"ACGT\");\n\t\tint bp = asc2dna[c];\n\t\tset(bp, idx);\n\t}\n\n\t/**\n\t * Set character at index 'idx' to color char 'c'.\n\t */\n\tvoid setColor(int c, size_t idx) {\n\t\tassert_in(toupper(c), \"0123\");\n\t\tint co = asc2col[c];\n\t\tset(co, idx);\n\t}\n\n\t/**\n\t * Set the ith 32-bit word to given word.\n\t */\n\tvoid setWord(uint32_t w, size_t i) {\n\t\tassert_lt(i, nwords());\n\t\tcs_[i] = w;\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tchar operator[](size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn get(i);\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tchar get(size_t i) const {\n\t\tassert_lt(i, len_);\n\t\tsize_t word = i >> 4;\n\t\tsize_t bpoff = (i & 15) << 1;\n\t\treturn (char)((cs_[word] >> bpoff) & 3);\n\t}\n\n\t/**\n\t * Copy packed words from string 'b' into this packed string.\n\t */\n\tvoid install(const uint32_t* b, size_t sz) {\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tmemcpy(cs_, b, sizeof(uint32_t)*nwords());\n\t}\n\n\t/**\n\t * Copy 'sz' DNA characters encoded as integers from buffer 'b' into this\n\t * packed string.\n\t */\n\tvoid install(const char* b, size_t sz) {\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tsize_t wordi = 0;\n\t\tfor(size_t i = 0; i < sz; i += 16) {\n\t\t\tuint32_t word = 0;\n\t\t\tfor(int j = 0; j < 16 && (size_t)(i+j) < sz; j++) 
{\n\t\t\t\tuint32_t bp = (int)b[i+j];\n\t\t\t\tuint32_t shift = (uint32_t)j << 1;\n\t\t\t\tassert_range(0, 3, (int)bp);\n\t\t\t\tword |= (bp << shift);\n\t\t\t}\n\t\t\tcs_[wordi++] = word;\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' DNA characters from buffer 'b' into this packed string.\n\t */\n\tvoid installChars(const char* b, size_t sz) {\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tsize_t wordi = 0;\n\t\tfor(size_t i = 0; i < sz; i += 16) {\n\t\t\tuint32_t word = 0;\n\t\t\tfor(int j = 0; j < 16 && (size_t)(i+j) < sz; j++) {\n\t\t\t\tchar c = b[i+j];\n\t\t\t\tassert_in(toupper(c), \"ACGT\");\n\t\t\t\tint bp = asc2dna[(int)c];\n\t\t\t\tassert_range(0, 3, (int)bp);\n\t\t\t\tuint32_t shift = (uint32_t)j << 1;\n\t\t\t\tword |= (bp << shift);\n\t\t\t}\n\t\t\tcs_[wordi++] = word;\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' color characters from buffer 'b' into this packed string.\n\t */\n\tvoid installColors(const char* b, size_t sz) {\n\t\tif(sz == 0) return;\n\t\tresize(sz);\n\t\tsize_t wordi = 0;\n\t\tfor(size_t i = 0; i < sz; i += 16) {\n\t\t\tuint32_t word = 0;\n\t\t\tfor(int j = 0; j < 16 && (size_t)(i+j) < sz; j++) {\n\t\t\t\tchar c = b[i+j];\n\t\t\t\tassert_in(c, \"0123\");\n\t\t\t\tint bp = asc2col[(int)c];\n\t\t\t\tassert_range(0, 3, (int)bp);\n\t\t\t\tuint32_t shift = (uint32_t)j << 1;\n\t\t\t\tword |= (bp << shift);\n\t\t\t}\n\t\t\tcs_[wordi++] = word;\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' DNA characters from buffer 'b' into this packed string.\n\t */\n\tvoid install(const char* b) {\n\t\tinstall(b, strlen(b));\n\t}\n\n\t/**\n\t * Copy 'sz' DNA characters from buffer 'b' into this packed string.\n\t */\n\tvoid installChars(const char* b) {\n\t\tinstallChars(b, strlen(b));\n\t}\n\n\t/**\n\t * Copy 'sz' DNA characters from buffer 'b' into this packed string.\n\t */\n\tvoid installColors(const char* b) {\n\t\tinstallColors(b, strlen(b));\n\t}\n\n\t/**\n\t * Copy 'sz' DNA characters from buffer 'b' into this packed string.\n\t */\n\tvoid install(const 
std::basic_string<char>& b) {\n\t\tinstall(b.c_str(), b.length());\n\t}\n\n\t/**\n\t * Copy all DNA characters from string 'b' into this packed string.\n\t */\n\tvoid installChars(const std::basic_string<char>& b) {\n\t\tinstallChars(b.c_str(), b.length());\n\t}\n\n\t/**\n\t * Copy all color characters from string 'b' into this packed string.\n\t */\n\tvoid installColors(const std::basic_string<char>& b) {\n\t\tinstallColors(b.c_str(), b.length());\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const char* b, size_t sz) {\n\t\tresize(sz);\n\t\tif(sz == 0) return;\n\t\tsize_t wordi = 0;\n\t\tsize_t bpi   = 0;\n\t\tcs_[0] = 0;\n\t\tfor(size_t i =sz; i > 0; i--) {\n\t\t\tassert_range(0, 3, (int)b[i-1]);\n\t\t\tcs_[wordi] |= ((int)b[i-1] << (bpi<<1));\n\t\t\tif(bpi == 15) {\n\t\t\t\twordi++;\n\t\t\t\t// Don't write past the last word when sz is a multiple of 16\n\t\t\t\tif(wordi < nwords()) cs_[wordi] = 0;\n\t\t\t\tbpi = 0;\n\t\t\t} else bpi++;\n\t\t}\n\t}\n\n\t/**\n\t * Copy all chars from zero-terminated buffer 'b' into this string,\n\t * reversing them in the process.\n\t */\n\tvoid installReverse(const char* b) {\n\t\tinstallReverse(b, strlen(b));\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer of DNA characters 'b' into this string,\n\t * reversing them in the process.\n\t */\n\tvoid installReverseChars(const char* b, size_t sz) {\n\t\tresize(sz);\n\t\tif(sz == 0) return;\n\t\tsize_t wordi = 0;\n\t\tsize_t bpi   = 0;\n\t\tcs_[0] = 0;\n\t\tfor(size_t i =sz; i > 0; i--) {\n\t\t\tchar c = b[i-1];\n\t\t\tassert_in(toupper(c), \"ACGT\");\n\t\t\tint bp = asc2dna[(int)c];\n\t\t\tassert_range(0, 3, bp);\n\t\t\tcs_[wordi] |= (bp << (bpi<<1));\n\t\t\tif(bpi == 15) {\n\t\t\t\twordi++;\n\t\t\t\t// Don't write past the last word when sz is a multiple of 16\n\t\t\t\tif(wordi < nwords()) cs_[wordi] = 0;\n\t\t\t\tbpi = 0;\n\t\t\t} else bpi++;\n\t\t}\n\t}\n\n\t/**\n\t * Copy all chars from buffer of DNA characters 'b' into this string,\n\t * reversing them in the process.\n\t */\n\tvoid installReverseChars(const char* b) {\n\t\tinstallReverseChars(b, 
strlen(b));\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer of color characters 'b' into this string,\n\t * reversing them in the process.\n\t */\n\tvoid installReverseColors(const char* b, size_t sz) {\n\t\tresize(sz);\n\t\tif(sz == 0) return;\n\t\tsize_t wordi = 0;\n\t\tsize_t bpi   = 0;\n\t\tcs_[0] = 0;\n\t\tfor(size_t i =sz; i > 0; i--) {\n\t\t\tchar c = b[i-1];\n\t\t\tassert_in(c, \"0123\");\n\t\t\tint bp = asc2col[(int)c];\n\t\t\tassert_range(0, 3, bp);\n\t\t\tcs_[wordi] |= (bp << (bpi<<1));\n\t\t\tif(bpi == 15) {\n\t\t\t\twordi++;\n\t\t\t\t// Don't write past the last word when sz is a multiple of 16\n\t\t\t\tif(wordi < nwords()) cs_[wordi] = 0;\n\t\t\t\tbpi = 0;\n\t\t\t} else bpi++;\n\t\t}\n\t}\n\n\t/**\n\t * Copy all chars from buffer of color characters 'b' into this string,\n\t * reversing them in the process.\n\t */\n\tvoid installReverseColors(const char* b) {\n\t\tinstallReverseColors(b, strlen(b));\n\t}\n\n\t/**\n\t * Copy the contents of packed string 'b' into this string, reversing\n\t * them in the process.\n\t */\n\tvoid installReverse(const S2bDnaString& b) {\n\t\tresize(b.len_);\n\t\tif(b.len_ == 0) return;\n\t\tsize_t wordi = 0;\n\t\tsize_t bpi   = 0;\n\t\tsize_t wordb = b.nwords()-1;\n\t\tsize_t bpb   = (b.len_-1) & 15;\n\t\tcs_[0] = 0;\n\t\tfor(size_t i = b.len_; i > 0; i--) {\n\t\t\t// Read from b's packed words, not through operator[]\n\t\t\tint bbp = (int)((b.cs_[wordb] >> (bpb << 1)) & 3);\n\t\t\tassert_range(0, 3, bbp);\n\t\t\tcs_[wordi] |= (bbp << (bpi << 1));\n\t\t\tif(bpi == 15) {\n\t\t\t\twordi++;\n\t\t\t\tif(wordi < nwords()) cs_[wordi] = 0;\n\t\t\t\tbpi = 0;\n\t\t\t} else bpi++;\n\t\t\t// Step the source cursor (wordb, bpb) backward\n\t\t\tif(bpb == 0) {\n\t\t\t\twordb--;\n\t\t\t\tbpb = 15;\n\t\t\t} else bpb--;\n\t\t}\n\t}\n\n\t/**\n\t * Return true iff the two strings are equal.\n\t */\n\tbool operator==(const S2bDnaString& o) {\n\t\treturn sstr_eq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff the two strings are not equal.\n\t */\n\tbool operator!=(const S2bDnaString& o) {\n\t\treturn sstr_neq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than given string.\n\t */\n\tbool operator<(const S2bDnaString& o) {\n\t\treturn sstr_lt(*this, 
o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than given string.\n\t */\n\tbool operator>(const S2bDnaString& o) {\n\t\treturn sstr_gt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than or equal to given string.\n\t */\n\tbool operator<=(const S2bDnaString& o) {\n\t\treturn sstr_leq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than or equal to given string.\n\t */\n\tbool operator>=(const S2bDnaString& o) {\n\t\treturn sstr_geq(*this, o);\n\t}\n\n\t/**\n\t * Reverse the 2-bit encoded DNA string in-place.\n\t */\n\tvoid reverse() {\n\t\tif(len_ <= 1) return;\n\t\tsize_t wordf = nwords()-1;\n\t\tsize_t bpf   = (len_-1) & 15;\n\t\tsize_t wordi = 0;\n\t\tsize_t bpi   = 0;\n\t\twhile(wordf > wordi || (wordf == wordi && bpf > bpi)) {\n\t\t\tint f = (cs_[wordf] >> (bpf << 1)) & 3;\n\t\t\tint i = (cs_[wordi] >> (bpi << 1)) & 3;\n\t\t\tcs_[wordf] &= ~(uint32_t)(3 << (bpf << 1));\n\t\t\tcs_[wordi] &= ~(uint32_t)(3 << (bpi << 1));\n\t\t\tcs_[wordf] |=  (uint32_t)(i << (bpf << 1));\n\t\t\tcs_[wordi] |=  (uint32_t)(f << (bpi << 1));\n\t\t\tif(bpf == 0) {\n\t\t\t\tbpf = 15;\n\t\t\t\twordf--;\n\t\t\t} else bpf--;\n\t\t\tif(bpi == 15) {\n\t\t\t\tbpi = 0;\n\t\t\t\twordi++;\n\t\t\t} else bpi++;\n\t\t}\n\t}\n\t\n\t/**\n\t * Reverse a substring of the buffer in place.\n\t */\n\tvoid reverseWindow(size_t off, size_t len) {\n\t\tassert_leq(off, len_);\n\t\tassert_leq(off+len, len_);\n\t\tif(len <= 1) return;\n\t\tsize_t wordf = (off+len-1) >> 4;\n\t\tsize_t bpf   = (off+len-1) & 15;\n\t\tsize_t wordi = (off      ) >> 4;\n\t\tsize_t bpi   = (off      ) & 15;\n\t\twhile(wordf > wordi || (wordf == wordi && bpf > bpi)) {\n\t\t\tint f = (cs_[wordf] >> (bpf << 1)) & 3;\n\t\t\tint i = (cs_[wordi] >> (bpi << 1)) & 3;\n\t\t\tcs_[wordf] &= ~(uint32_t)(3 << (bpf << 1));\n\t\t\tcs_[wordi] &= ~(uint32_t)(3 << (bpi << 1));\n\t\t\tcs_[wordf] |=  (uint32_t)(i << (bpf << 1));\n\t\t\tcs_[wordi] |=  (uint32_t)(f << (bpi << 1));\n\t\t\tif(bpf 
== 0) {\n\t\t\t\tbpf = 15;\n\t\t\t\twordf--;\n\t\t\t} else bpf--;\n\t\t\tif(bpi == 15) {\n\t\t\t\tbpi = 0;\n\t\t\t\twordi++;\n\t\t\t} else bpi++;\n\t\t}\n\t}\n\n\n\t/**\n\t * Set the first len elements of the buffer to el.\n\t */\n\tvoid fill(size_t len, char el) {\n\t\tassert_leq(len, len_);\n\t\tassert_range(0, 3, (int)el);\n\t\tsize_t word = 0;\n\t\tif(len > 32) {\n\t\t\t// Copy el throughout block\n\t\t\tuint32_t bl = (uint32_t)el;\n\t\t\tbl |= (bl << 2);\n\t\t\tbl |= (bl << 4);\n\t\t\tbl |= (bl << 8);\n\t\t\tbl |= (bl << 16);\n\t\t\t// Fill with blocks\n\t\t\tsize_t blen = len >> 4;\n\t\t\tfor(; word < blen; word++) {\n\t\t\t\tcs_[word] = bl;\n\t\t\t}\n\t\t\tlen = len & 15;\n\t\t}\n\t\tsize_t bp = 0;\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tcs_[word] &= ~(uint32_t)(3  << (bp << 1));\n\t\t\tcs_[word] |=  (uint32_t)(el << (bp << 1));\n\t\t\tif(bp == 15) {\n\t\t\t\tword++;\n\t\t\t\tbp = 0;\n\t\t\t} else bp++;\n\t\t}\n\t}\n\n\t/**\n\t * Set all elements of the buffer to el.\n\t */\n\tvoid fill(char el) {\n\t\tfill(len_, el);\n\t}\n\n\t/**\n\t * Return the ith character in the window defined by fw, color, depth and\n\t * len.\n\t */\n\tchar windowGetDna(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, len_ - depth);\n\t\tif(fw) {\n\t\t\treturn get(depth+i);\n\t\t} else {\n\t\t\treturn\n\t\t\t\tcolor ?\n\t\t\t\t\tget(depth+len-i-1) :\n\t\t\t\t\tcompDna(get(depth+len-i-1));\n\t\t}\n\t}\n\n\t/**\n\t * Fill the given DNA buffer with the substring specified by fw,\n\t * color, depth and len.\n\t */\n\ttemplate<typename T>\n\tvoid windowGetDna(\n\t\tT&     buf,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_leq(len, len_ - depth);\n\t\tbuf.resize(len);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tbuf.set(\n\t\t\t\t(fw ?\n\t\t\t\t\tget(depth+i) 
:\n\t\t\t\t\t(color ?\n\t\t\t\t\t\tget(depth+len-i-1) :\n\t\t\t\t\t\tcompDna(get(depth+len-i-1)))), i);\n\t\t}\n\t}\n\n\t/**\n\t * Return the length of the string.\n\t */\n\tinline size_t length() const { return len_; }\n\n\t/**\n\t * Clear the buffer.\n\t */\n\tvoid clear() { len_ = 0; }\n\n\t/**\n\t * Return true iff the buffer is empty.\n\t */\n\tinline bool empty() const { return len_ == 0; }\n\n\t/**\n\t * Return a const version of the raw buffer.\n\t */\n\tconst uint32_t* buf() const { return cs_; }\n\n\t/**\n\t * Return a writeable version of the raw buffer.\n\t */\n\tuint32_t* wbuf() { return cs_; }\n\n\t/**\n\t * Note: the size of the string once it's stored in the print buffer is 4\n\t * times as large as the string as stored in compact 2-bit-per-char words.\n\t */\n\tconst char* toZBuf() const {\n\t\tif(printcs_ == NULL) {\n\t\t\tconst_cast<char*&>(printcs_) = new char[len_+1];\n\t\t}\n\t\tchar *printcs = const_cast<char*>(printcs_);\n\t\tsize_t word = 0, bp = 0;\n\t\tfor(size_t i = 0; i < len_; i++) {\n\t\t\tint c = (cs_[word] >> (bp << 1)) & 3;\n\t\t\tprintcs[i] = \"ACGT\"[c];\n\t\t\tif(bp == 15) {\n\t\t\t\tword++;\n\t\t\t\tbp = 0;\n\t\t\t} else bp++;\n\t\t}\n\t\tprintcs[len_] = '\\0';\n\t\treturn printcs_;\n\t}\n\nprotected:\n\n\tuint32_t *cs_; // 2-bit packed words\n\tchar *printcs_;\n\tsize_t len_;   // # elements\n};\n\n/**\n * Simple string class with backing memory that automatically expands as needed.\n */\ntemplate<typename T, int S = 1024, int M = 2>\nclass SStringExpandable {\n\npublic:\n\n\texplicit SStringExpandable() :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0),\n\t\tsz_(0)\n\t{ }\n\n\texplicit SStringExpandable(size_t sz) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0),\n\t\tsz_(0)\n\t{\n\t\texpandNoCopy(sz);\n\t}\n\n\t/**\n\t * Create an SStringExpandable from another SStringExpandable.\n\t */\n\tSStringExpandable(const SStringExpandable<T, S>& o) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0),\n\t\tsz_(0)\n\t{\n\t\t*this 
= o;\n\t}\n\n\t/**\n\t * Create an SStringExpandable from a std::basic_string of the\n\t * appropriate type.\n\t */\n\texplicit SStringExpandable(const std::basic_string<T>& str) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0),\n\t\tsz_(0)\n\t{\n\t\tinstall(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Create an SStringExpandable from an array and size.\n\t */\n\texplicit SStringExpandable(const T* b, size_t sz) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0),\n\t\tsz_(0)\n\t{\n\t\tinstall(b, sz);\n\t}\n\n\t/**\n\t * Create an SStringExpandable from a zero-terminated array.\n\t */\n\texplicit SStringExpandable(const T* b) :\n\t\tcs_(NULL),\n\t\tprintcs_(NULL),\n\t\tlen_(0),\n\t\tsz_(0)\n\t{\n\t\tinstall(b, strlen(b));\n\t}\n\n\t/**\n\t * Destroy the expandable string object.\n\t */\n\tvirtual ~SStringExpandable() {\n\t\tif(cs_ != NULL) {\n\t\t\tdelete[] cs_;\n\t\t\tcs_ = NULL;\n\t\t}\n\t\tif(printcs_ != NULL) {\n\t\t\tdelete[] printcs_;\n\t\t\tprintcs_ = NULL;\n\t\t}\n\t\tsz_ = len_ = 0;\n\t}\n\n\t/**\n\t * Return ith character from the left of either the forward or the\n\t * reverse-complement version of the read.\n\t */\n\tT windowGet(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, len_ - depth);\n\t\treturn fw ? cs_[depth+i] : cs_[depth+len-i-1];\n\t}\n\n\t/**\n\t * Return ith character from the left of either the forward or the\n\t * reverse-complement version of the read.\n\t */\n\tvoid windowGet(\n\t\tT& ret,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_leq(len, len_ - depth);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tret.append(fw ? 
cs_[depth+i] : cs_[depth+len-i-1]);\n\t\t}\n\t}\n\n\t/**\n\t * Assignment from another SStringExpandable.\n\t */\n\tSStringExpandable<T,S>& operator=(const SStringExpandable<T,S>& o) {\n\t\tinstall(o.cs_, o.len_);\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Assignment from a std::basic_string\n\t */\n\tSStringExpandable<T,S>& operator=(const std::basic_string<T>& o) {\n\t\tinstall(o.c_str(), o.length());\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Insert char c before position 'idx'; slide subsequent chars down.\n\t */\n\tvoid insert(const T& c, size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tif(sz_ < len_ + 1) expandCopy((len_ + 1 + S) * M);\n\t\tlen_++;\n\t\t// Move everyone down by 1\n\t\t// len_ is the *new* length\n\t\tfor(size_t i = len_; i > idx+1; i--) {\n\t\t\tcs_[i-1] = cs_[i-2];\n\t\t}\n\t\tcs_[idx] = c;\n\t}\n\n\t/**\n\t * Set character at index 'idx' to 'c'.\n\t */\n\tvoid set(int c, size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tcs_[idx] = c;\n\t}\n\n\t/**\n\t * Append char c.\n\t */\n\tvoid append(const T& c) {\n\t\tif(sz_ < len_ + 1) expandCopy((len_ + 1 + S) * M);\n\t\tcs_[len_++] = c;\n\t}\n\n\t/**\n\t * Delete char at position 'idx'; slide subsequent chars up.\n\t */\n\tvoid remove(size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tassert_gt(len_, 0);\n\t\tfor(size_t i = idx; i < len_-1; i++) {\n\t\t\tcs_[i] = cs_[i+1];\n\t\t}\n\t\tlen_--;\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tconst T& operator[](size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Retrieve mutable version of element i.\n\t */\n\tT& operator[](size_t i) {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tconst T& get(size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Retrieve a const pointer to element i.\n\t */\n\tconst T* get_ptr(size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn cs_+i;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into 
this string.\n\t */\n\tvirtual void install(const T* b, size_t sz) {\n\t\tif(sz_ < sz) expandNoCopy((sz + S) * M);\n\t\tmemcpy(cs_, b, sz * sizeof(T));\n\t\tlen_ = sz;\n\t}\n\n\n\t/**\n\t * Copy all bytes from zero-terminated buffer 'b' into this string.\n\t */\n\tvoid install(const T* b) { install(b, strlen(b)); }\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const char* b, size_t sz) {\n\t\tif(sz_ < sz) expandNoCopy((sz + S) * M);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tcs_[i] = b[sz-i-1];\n\t\t}\n\t\tlen_ = sz;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const SStringExpandable<T, S>& b) {\n\t\tif(sz_ < b.len_) expandNoCopy((b.len_ + S) * M);\n\t\tfor(size_t i = 0; i < b.len_; i++) {\n\t\t\tcs_[i] = b.cs_[b.len_ - i - 1];\n\t\t}\n\t\tlen_ = b.len_;\n\t}\n\n\t/**\n\t * Return true iff the two strings are equal.\n\t */\n\tbool operator==(const SStringExpandable<T, S>& o) {\n\t\treturn sstr_eq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff the two strings are not equal.\n\t */\n\tbool operator!=(const SStringExpandable<T, S>& o) {\n\t\treturn sstr_neq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than given string.\n\t */\n\tbool operator<(const SStringExpandable<T, S>& o) {\n\t\treturn sstr_lt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than given string.\n\t */\n\tbool operator>(const SStringExpandable<T, S>& o) {\n\t\treturn sstr_gt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than or equal to given string.\n\t */\n\tbool operator<=(const SStringExpandable<T, S>& o) {\n\t\treturn sstr_leq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than or equal to given string.\n\t */\n\tbool operator>=(const SStringExpandable<T, S>& o) {\n\t\treturn sstr_geq(*this, o);\n\t}\n\n\t/**\n\t * Reverse the 
buffer in place.\n\t */\n\tvoid reverse() {\n\t\tfor(size_t i = 0; i < (len_ >> 1); i++) {\n\t\t\tT tmp = get(i);\n\t\t\tset(get(len_-i-1), i);\n\t\t\tset(tmp, len_-i-1);\n\t\t}\n\t}\n\n\t/**\n\t * Reverse a substring of the buffer in place.\n\t */\n\tvoid reverseWindow(size_t off, size_t len) {\n\t\tassert_leq(off, len_);\n\t\tassert_leq(off + len, len_);\n\t\tsize_t mid = len >> 1;\n\t\tfor(size_t i = 0; i < mid; i++) {\n\t\t\tT tmp = get(off+i);\n\t\t\tset(get(off+len-i-1), off+i);\n\t\t\tset(tmp, off+len-i-1);\n\t\t}\n\t}\n\n\t/**\n\t * Simply resize the buffer.  If the buffer is resized to be\n\t * longer, the newly-added elements will contain garbage and should\n\t * be initialized immediately.\n\t */\n\tvoid resize(size_t len) {\n\t\tif(sz_ < len) expandCopy((len + S) * M);\n\t\tlen_ = len;\n\t}\n\n\t/**\n\t * Simply resize the buffer.  If the buffer is resized to be\n\t * longer, new elements will be initialized with 'el'.\n\t */\n\tvoid resize(size_t len, const T& el) {\n\t\tif(sz_ < len) expandCopy((len + S) * M);\n\t\tif(len > len_) {\n\t\t\tfor(size_t i = len_; i < len; i++) {\n\t\t\t\tcs_[i] = el;\n\t\t\t}\n\t\t}\n\t\tlen_ = len;\n\t}\n\n\t/**\n\t * Set the first len elements of the buffer to el.\n\t */\n\tvoid fill(size_t len, const T& el) {\n\t\tassert_leq(len, len_);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tcs_[i] = el;\n\t\t}\n\t}\n\n\t/**\n\t * Set all elements of the buffer to el.\n\t */\n\tvoid fill(const T& el) {\n\t\tfill(len_, el);\n\t}\n\n\t/**\n\t * Trim len characters from the beginning of the string.\n\t */\n\tvoid trimBegin(size_t len) {\n\t\tassert_leq(len, len_);\n\t\tif(len == len_) {\n\t\t\tlen_ = 0; return;\n\t\t}\n\t\tfor(size_t i = 0; i < len_-len; i++) {\n\t\t\tcs_[i] = cs_[i+len];\n\t\t}\n\t\tlen_ -= len;\n\t}\n\n\t/**\n\t * Trim len characters from the end of the string.\n\t */\n\tvoid trimEnd(size_t len) {\n\t\tif(len >= len_) len_ = 0;\n\t\telse len_ -= len;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this 
string.\n\t */\n\tvoid append(const T* b, size_t sz) {\n\t\tif(sz_ < len_ + sz) expandCopy((len_ + sz + S) * M);\n\t\tmemcpy(cs_ + len_, b, sz * sizeof(T));\n\t\tlen_ += sz;\n\t}\n\n\t/**\n\t * Copy bytes from zero-terminated buffer 'b' into this string.\n\t */\n\tvoid append(const T* b) {\n\t\tappend(b, strlen(b));\n\t}\n\n\t/**\n\t * Return the length of the string.\n\t */\n\tsize_t length() const { return len_; }\n\n\t/**\n\t * Clear the buffer.\n\t */\n\tvoid clear() { len_ = 0; }\n\n\t/**\n\t * Return true iff the buffer is empty.\n\t */\n\tbool empty() const { return len_ == 0; }\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  Useful for printing.\n\t */\n\tconst char* toZBufXForm(const char *xform) const {\n\t\tASSERT_ONLY(size_t xformElts = strlen(xform));\n\t\tif(empty()) {\n\t\t\tconst_cast<char&>(zero_) = 0;\n\t\t\treturn &zero_;\n\t\t}\n\t\tchar* printcs = const_cast<char*>(printcs_);\n\t\t// Lazily allocate space for print buffer\n\t\tfor(size_t i = 0; i < len_; i++) {\n\t\t\tassert_lt(cs_[i], (int)xformElts);\n\t\t\tprintcs[i] = xform[(int)cs_[i]];\n\t\t}\n\t\tprintcs[len_] = 0;\n\t\treturn printcs_;\n\t}\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  
Useful for printing.\n\t */\n\tvirtual const T* toZBuf() const {\n\t\tif(empty()) {\n\t\t\tconst_cast<T&>(zeroT_) = 0;\n\t\t\treturn &zeroT_;\n\t\t}\n\t\tassert_leq(len_, sz_);\n\t\tconst_cast<T*>(cs_)[len_] = 0;\n\t\treturn cs_;\n\t}\n\n\t/**\n\t * Return true iff this DNA string matches the given nucleotide\n\t * character string.\n\t */\n\tbool eq(const char *str) const {\n\t\tconst char *self = toZBuf();\n\t\treturn strcmp(str, self) == 0;\n\t}\n\n\t/**\n\t * Return a const version of the raw buffer.\n\t */\n\tconst T* buf() const { return cs_; }\n\n\t/**\n\t * Return a writeable version of the raw buffer.\n\t */\n\tT* wbuf() { return cs_; }\n\nprotected:\n\t/**\n\t * Allocate new, bigger buffer and copy old contents into it.  If\n\t * requested size can be accommodated by current buffer, do nothing.\n\t */\n\tvoid expandCopy(size_t sz) {\n\t\tif(sz_ >= sz) return; // done!\n\t\tT *tmp  = new T[sz + 1];\n\t\tchar *ptmp = new char[sz + 1];\n\t\tif(cs_ != NULL) {\n\t\t\tmemcpy(tmp, cs_, sizeof(T)*len_);\n\t\t\tdelete[] cs_;\n\t\t}\n\t\tif(printcs_ != NULL) {\n\t\t\tmemcpy(ptmp, printcs_, sizeof(char)*len_);\n\t\t\tdelete[] printcs_;\n\t\t}\n\t\tcs_ = tmp;\n\t\tprintcs_ = ptmp;\n\t\tsz_ = sz;\n\t}\n\n\t/**\n\t * Allocate new, bigger buffer.  
If requested size can be\n\t * accommodated by current buffer, do nothing.\n\t */\n\tvoid expandNoCopy(size_t sz) {\n\t\tif(sz_ >= sz) return; // done!\n\t\tif(cs_      != NULL) delete[] cs_;\n\t\tif(printcs_ != NULL) delete[] printcs_;\n\t\tcs_ = new T[sz + 1];\n\t\tprintcs_ = new char[sz + 1];\n\t\tsz_ = sz;\n\t}\n\n\tT *cs_;      // +1 so that we have the option of dropping in a terminating \"\\0\"\n\tchar *printcs_; // +1 so that we have the option of dropping in a terminating \"\\0\"\n\tchar zero_;  // 0 terminator for empty string\n\tT zeroT_;    // 0 terminator for empty string\n\tsize_t len_; // # filled-in elements\n\tsize_t sz_;  // size capacity of cs_\n};\n\n/**\n * Simple string class with in-object storage.\n *\n * All copies induced by, e.g., operator=, the copy constructor,\n * install() and append(), are shallow (using memcpy/sizeof).  If deep\n * copies are needed, use a different class.\n *\n * Reading from an uninitialized element results in an assert as long\n * as NDEBUG is not defined.  
If NDEBUG is defined, the result is\n * undefined.\n */\ntemplate<typename T, int S>\nclass SStringFixed {\npublic:\n\texplicit SStringFixed() : len_(0) { }\n\n\t/**\n\t * Create an SStringFixed from another SStringFixed.\n\t */\n\tSStringFixed(const SStringFixed<T, S>& o) {\n\t\t*this = o;\n\t}\n\n\t/**\n\t * Create an SStringFixed from a std::basic_string of the\n\t * appropriate type.\n\t */\n\texplicit SStringFixed(const std::basic_string<T>& str) {\n\t\tinstall(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Create an SStringFixed from an array and size.\n\t */\n\texplicit SStringFixed(const T* b, size_t sz) {\n\t\tinstall(b, sz);\n\t}\n\n\t/**\n\t * Create an SStringFixed from a zero-terminated string.\n\t */\n\texplicit SStringFixed(const T* b) {\n\t\tinstall(b, strlen(b));\n\t}\n\n\tvirtual ~SStringFixed() { } // C++ needs this\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tinline const T& operator[](size_t i) const {\n\t\treturn get(i);\n\t}\n\n\t/**\n\t * Retrieve mutable version of element i.\n\t */\n\tinline T& operator[](size_t i) {\n\t\treturn get(i);\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tinline const T& get(size_t i) const {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Retrieve mutable version of element i.\n\t */\n\tinline T& get(size_t i) {\n\t\tassert_lt(i, len_);\n\t\treturn cs_[i];\n\t}\n\n\t/**\n\t * Return ith character from the left of either the forward or the\n\t * reverse-complement version of the read.\n\t */\n\tT windowGet(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, len_ - depth);\n\t\treturn fw ? 
cs_[depth+i] : cs_[depth+len-i-1];\n\t}\n\n\t/**\n\t * Return ith character from the left of either the forward or the\n\t * reverse-complement version of the read.\n\t */\n\tvoid windowGet(\n\t\tT& ret,\n\t\tbool   fw,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = len_;\n\t\tassert_leq(len, len_ - depth);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tret.append(fw ? cs_[depth+i] : cs_[depth+len-i-1]);\n\t\t}\n\t}\n\n\t/**\n\t * Assignment to other SStringFixed.\n\t */\n\tSStringFixed<T,S>& operator=(const SStringFixed<T,S>& o) {\n\t\tinstall(o.cs_, o.len_);\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Assignment from a std::basic_string\n\t */\n\tSStringFixed<T,S>& operator=(const std::basic_string<T>& o) {\n\t\tinstall(o);\n\t\treturn *this;\n\t}\n\n\t/**\n\t * Insert char c before position 'idx'; slide subsequent chars down.\n\t */\n\tvoid insert(const T& c, size_t idx) {\n\t\tassert_lt(len_, S);\n\t\tassert_lt(idx, len_);\n\t\t// Move everyone down by 1\n\t\tfor(int i = len_; i > idx; i--) {\n\t\t\tcs_[i] = cs_[i-1];\n\t\t}\n\t\tcs_[idx] = c;\n\t\tlen_++;\n\t}\n\n\t/**\n\t * Set character at index 'idx' to 'c'.\n\t */\n\tvoid set(int c, size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tcs_[idx] = c;\n\t}\n\n\t/**\n\t * Append char c.\n\t */\n\tvoid append(const T& c) {\n\t\tassert_lt(len_, S);\n\t\tcs_[len_++] = c;\n\t}\n\n\t/**\n\t * Delete char at position 'idx'; slide subsequent chars up.\n\t */\n\tvoid remove(size_t idx) {\n\t\tassert_lt(idx, len_);\n\t\tassert_gt(len_, 0);\n\t\tfor(size_t i = idx; i < len_-1; i++) {\n\t\t\tcs_[i] = cs_[i+1];\n\t\t}\n\t\tlen_--;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.\n\t */\n\tvirtual void install(const T* b, size_t sz) {\n\t\tassert_leq(sz, S);\n\t\tmemcpy(cs_, b, sz * sizeof(T));\n\t\tlen_ = sz;\n\t}\n\n\t/**\n\t * Copy all bytes from zero-terminated buffer 'b' into this string.\n\t */\n\tvoid install(const T* b) { install(b, strlen(b)); }\n\n\t/**\n\t * Copy 'sz' 
bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const char* b, size_t sz) {\n\t\tassert_leq(sz, S);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tcs_[i] = b[sz-i-1];\n\t\t}\n\t\tlen_ = sz;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reversing them\n\t * in the process.\n\t */\n\tvoid installReverse(const SStringFixed<T, S>& b) {\n\t\tassert_leq(b.len_, S);\n\t\tfor(size_t i = 0; i < b.len_; i++) {\n\t\t\tcs_[i] = b.cs_[b.len_ - i - 1];\n\t\t}\n\t\tlen_ = b.len_;\n\t}\n\n\t/**\n\t * Return true iff the two strings are equal.\n\t */\n\tbool operator==(const SStringFixed<T, S>& o) {\n\t\treturn sstr_eq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff the two strings are not equal.\n\t */\n\tbool operator!=(const SStringFixed<T, S>& o) {\n\t\treturn sstr_neq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than given string.\n\t */\n\tbool operator<(const SStringFixed<T, S>& o) {\n\t\treturn sstr_lt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than given string.\n\t */\n\tbool operator>(const SStringFixed<T, S>& o) {\n\t\treturn sstr_gt(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is less than or equal to given string.\n\t */\n\tbool operator<=(const SStringFixed<T, S>& o) {\n\t\treturn sstr_leq(*this, o);\n\t}\n\n\t/**\n\t * Return true iff this string is greater than or equal to given string.\n\t */\n\tbool operator>=(const SStringFixed<T, S>& o) {\n\t\treturn sstr_geq(*this, o);\n\t}\n\n\t/**\n\t * Reverse the buffer in place.\n\t */\n\tvoid reverse() {\n\t\tfor(size_t i = 0; i < (len_ >> 1); i++) {\n\t\t\tT tmp = get(i);\n\t\t\tset(get(len_-i-1), i);\n\t\t\tset(tmp, len_-i-1);\n\t\t}\n\t}\n\n\t/**\n\t * Reverse a substring of the buffer in place.\n\t */\n\tvoid reverseWindow(size_t off, size_t len) {\n\t\tassert_leq(off, len_);\n\t\tassert_leq(off + len, len_);\n\t\tsize_t mid = len >> 1;\n\t\tfor(size_t i = 0; i < mid; i++) 
{\n\t\t\tT tmp = get(off+i);\n\t\t\tset(get(off+len-i-1), off+i);\n\t\t\tset(tmp, off+len-i-1);\n\t\t}\n\t}\n\n\t/**\n\t * Simply resize the buffer.  If the buffer is resized to be\n\t * longer, the newly-added elements will contain garbage and should\n\t * be initialized immediately.\n\t */\n\tvoid resize(size_t len) {\n\t\tassert_lt(len, S);\n\t\tlen_ = len;\n\t}\n\n\t/**\n\t * Simply resize the buffer.  If the buffer is resized to be\n\t * longer, new elements will be initialized with 'el'.\n\t */\n\tvoid resize(size_t len, const T& el) {\n\t\tassert_lt(len, S);\n\t\tif(len > len_) {\n\t\t\tfor(size_t i = len_; i < len; i++) {\n\t\t\t\tcs_[i] = el;\n\t\t\t}\n\t\t}\n\t\tlen_ = len;\n\t}\n\n\t/**\n\t * Set the first len elements of the buffer to el.\n\t */\n\tvoid fill(size_t len, const T& el) {\n\t\tassert_leq(len, len_);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tcs_[i] = el;\n\t\t}\n\t}\n\n\t/**\n\t * Set all elements of the buffer to el.\n\t */\n\tvoid fill(const T& el) {\n\t\tfill(len_, el);\n\t}\n\n\t/**\n\t * Trim len characters from the beginning of the string.\n\t */\n\tvoid trimBegin(size_t len) {\n\t\tassert_leq(len, len_);\n\t\tif(len == len_) {\n\t\t\tlen_ = 0; return;\n\t\t}\n\t\tfor(size_t i = 0; i < len_-len; i++) {\n\t\t\tcs_[i] = cs_[i+len];\n\t\t}\n\t\tlen_ -= len;\n\t}\n\n\t/**\n\t * Trim len characters from the end of the string.\n\t */\n\tvoid trimEnd(size_t len) {\n\t\tif(len >= len_) len_ = 0;\n\t\telse len_ -= len;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.\n\t */\n\tvoid append(const T* b, size_t sz) {\n\t\tassert_leq(sz + len_, S);\n\t\tmemcpy(cs_ + len_, b, sz * sizeof(T));\n\t\tlen_ += sz;\n\t}\n\n\t/**\n\t * Copy bytes from zero-terminated buffer 'b' into this string.\n\t */\n\tvoid append(const T* b) {\n\t\tappend(b, strlen(b));\n\t}\n\n\t/**\n\t * Return the length of the string.\n\t */\n\tsize_t length() const { return len_; }\n\n\t/**\n\t * Clear the buffer.\n\t */\n\tvoid clear() { len_ = 0; 
}\n\n\t/**\n\t * Return true iff the buffer is empty.\n\t */\n\tbool empty() const { return len_ == 0; }\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  Useful for printing.\n\t */\n\tvirtual const T* toZBuf() const {\n\t\tconst_cast<T*>(cs_)[len_] = 0;\n\t\treturn cs_;\n\t}\n\n\t/**\n\t * Return true iff this DNA string matches the given nucleotide\n\t * character string.\n\t */\n\tbool eq(const char *str) const {\n\t\tconst char *self = toZBuf();\n\t\treturn strcmp(str, self) == 0;\n\t}\n\t\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  Useful for printing.\n\t */\n\tconst char* toZBufXForm(const char *xform) const {\n\t\tASSERT_ONLY(size_t xformElts = strlen(xform));\n\t\tchar* printcs = const_cast<char*>(printcs_);\n\t\tfor(size_t i = 0; i < len_; i++) {\n\t\t\tassert_lt(cs_[i], (int)xformElts);\n\t\t\tprintcs[i] = xform[cs_[i]];\n\t\t}\n\t\tprintcs[len_] = 0;\n\t\treturn printcs_;\n\t}\n\n\t/**\n\t * Return a const version of the raw buffer.\n\t */\n\tconst T* buf() const { return cs_; }\n\n\t/**\n\t * Return a writeable version of the raw buffer.\n\t */\n\tT* wbuf() { return cs_; }\n\nprotected:\n\tT cs_[S+1]; // +1 so that we have the option of dropping in a terminating \"\\0\"\n\tchar printcs_[S+1]; // +1 so that we have the option of dropping in a terminating \"\\0\"\n\tsize_t len_;\n};\n\n//\n// Stream put operators\n//\n\ntemplate <typename T, int S, int M>\nstd::ostream& operator<< (std::ostream& os, const SStringExpandable<T, S, M>& str) {\n\tos << str.toZBuf();\n\treturn os;\n}\n\ntemplate <typename T, int S>\nstd::ostream& operator<< (std::ostream& os, const SStringFixed<T, S>& str) {\n\tos << str.toZBuf();\n\treturn os;\n}\n\nextern uint8_t asc2dna[];\nextern uint8_t asc2col[];\n\n/**\n * Encapsulates a fixed-length DNA string with characters encoded as\n * chars.  Only capable of encoding A, C, G, T and N.  
The length is\n * specified via the template parameter S.\n */\ntemplate<int S>\nclass SDnaStringFixed : public SStringFixed<char, S> {\npublic:\n\n\texplicit SDnaStringFixed() : SStringFixed<char, S>() { }\n\n\t/**\n\t * Create an SStringFixed from another SStringFixed.\n\t */\n\tSDnaStringFixed(const SDnaStringFixed<S>& o) :\n\t\tSStringFixed<char, S>(o) { }\n\n\t/**\n\t * Create an SStringFixed from a C++ basic_string.\n\t */\n\texplicit SDnaStringFixed(const std::basic_string<char>& str) :\n\t\tSStringFixed<char, S>(str) { }\n\n\t/**\n\t * Create an SStringFixed from an array and size.\n\t */\n\texplicit SDnaStringFixed(const char* b, size_t sz) :\n\t\tSStringFixed<char, S>(b, sz) { }\n\n\t/**\n\t * Create an SStringFixed from a zero-terminated string.\n\t */\n\texplicit SDnaStringFixed(\n\t\tconst char* b,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tSStringFixed<char, S>()\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(b, strlen(b));\n\t\t\t} else {\n\t\t\t\tinstallChars(b, strlen(b));\n\t\t\t}\n\t\t} else {\n\t\t\tinstall(b, strlen(b));\n\t\t}\n\t}\n\n\tvirtual ~SDnaStringFixed() { } // C++ needs this\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reverse-\n\t * complementing them in the process, assuming an encoding where\n\t * 0=A, 1=C, 2=G, 3=T, 4=N.\n\t */\n\tvoid installReverseComp(const char* b, size_t sz) {\n\t\tassert_leq(sz, S);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tthis->cs_[i] = (b[sz-i-1] == 4 ? 4 : b[sz-i-1] ^ 3);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reverse-\n\t * complementing them in the process, assuming an encoding where\n\t * 0=A, 1=C, 2=G, 3=T, 4=N.\n\t */\n\tvoid installReverseComp(const SDnaStringFixed<S>& b) {\n\t\tassert_leq(b.len_, S);\n\t\tfor(size_t i = 0; i < b.len_; i++) {\n\t\t\tthis->cs_[i] = (b.cs_[b.len_-i-1] == 4 ? 
4 : b.cs_[b.len_-i-1] ^ 3);\n\t\t}\n\t\tthis->len_ = b.len_;\n\t}\n\n\t/**\n\t * Either reverse or reverse-complement (depending on \"color\") this\n\t * DNA buffer in-place.\n\t */\n\tvoid reverseComp(bool color = false) {\n\t\tif(color) {\n\t\t\tthis->reverse();\n\t\t} else {\n\t\t\tfor(size_t i = 0; i < (this->len_ >> 1); i++) {\n\t\t\t\tchar tmp1 = (this->cs_[i] == 4 ? 4 : this->cs_[i] ^ 3);\n\t\t\t\tchar tmp2 = (this->cs_[this->len_-i-1] == 4 ? 4 : this->cs_[this->len_-i-1] ^ 3);\n\t\t\t\tthis->cs_[i] = tmp2;\n\t\t\t\tthis->cs_[this->len_-i-1] = tmp1;\n\t\t\t}\n\t\t\t// Do middle element iff there are an odd number\n\t\t\tif((this->len_ & 1) != 0) {\n\t\t\t\tchar tmp = this->cs_[this->len_ >> 1];\n\t\t\t\ttmp = (tmp == 4 ? 4 : tmp ^ 3);\n\t\t\t\tthis->cs_[this->len_ >> 1] = tmp;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.\n\t */\n\tvirtual void install(const char* b, size_t sz) {\n\t\tassert_leq(sz, S);\n\t\tmemcpy(this->cs_, b, sz);\n#ifndef NDEBUG\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_leq(this->cs_[i], 4);\n\t\t\tassert_geq(this->cs_[i], 0);\n\t\t}\n#endif\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy buffer 'b' of ASCII DNA characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installChars(const char* b, size_t sz) {\n\t\tassert_leq(sz, S);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_in(toupper(b[i]), \"ACGTN-\");\n\t\t\tthis->cs_[i] = asc2dna[(int)b[i]];\n\t\t\tassert_geq(this->cs_[i], 0);\n\t\t\tassert_leq(this->cs_[i], 4);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy buffer 'b' of ASCII color characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installColors(const char* b, size_t sz) {\n\t\tassert_leq(sz, S);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_in(b[i], \"0123.\");\n\t\t\tthis->cs_[i] = asc2col[(int)b[i]];\n\t\t\tassert_geq(this->cs_[i], 0);\n\t\t\tassert_leq(this->cs_[i], 4);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy C++ 
string of ASCII DNA characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installChars(const std::basic_string<char>& str) {\n\t\tinstallChars(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Copy C++ string of ASCII color characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installColors(const std::basic_string<char>& str) {\n\t\tinstallColors(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Set DNA character at index 'idx' to 'c'.\n\t */\n\tvoid set(int c, size_t idx) {\n\t\tassert_lt(idx, this->len_);\n\t\tassert_leq(c, 4);\n\t\tassert_geq(c, 0);\n\t\tthis->cs_[idx] = c;\n\t}\n\n\t/**\n\t * Append DNA char c.\n\t */\n\tvoid append(const char& c) {\n\t\tassert_lt(this->len_, S);\n\t\tassert_leq(c, 4);\n\t\tassert_geq(c, 0);\n\t\tthis->cs_[this->len_++] = c;\n\t}\n\n\t/**\n\t * Set DNA character at index 'idx' to 'c'.\n\t */\n\tvoid setChar(char c, size_t idx) {\n\t\tassert_lt(idx, this->len_);\n\t\tassert_in(toupper(c), \"ACGTN\");\n\t\tthis->cs_[idx] = asc2dna[(int)c];\n\t}\n\n\t/**\n\t * Append DNA character.\n\t */\n\tvoid appendChar(char c) {\n\t\tassert_lt(this->len_, S);\n\t\tassert_in(toupper(c), \"ACGTN\");\n\t\tthis->cs_[this->len_++] = asc2dna[(int)c];\n\t}\n\n\t/**\n\t * Return DNA character corresponding to element 'idx'.\n\t */\n\tchar toChar(size_t idx) const {\n\t\tassert_geq((int)this->cs_[idx], 0);\n\t\tassert_leq((int)this->cs_[idx], 4);\n\t\treturn \"ACGTN\"[(int)this->cs_[idx]];\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tconst char& operator[](size_t i) const {\n\t\treturn this->get(i);\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tconst char& get(size_t i) const {\n\t\tassert_lt(i, this->len_);\n\t\tassert_leq(this->cs_[i], 4);\n\t\tassert_geq(this->cs_[i], 0);\n\t\treturn this->cs_[i];\n\t}\n\n\t/**\n\t * Return the ith character in the window defined by fw, color,\n\t * depth and len.\n\t */\n\tchar windowGetDna(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tbool   
color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = this->len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, this->len_ - depth);\n\t\tif(fw) return this->cs_[depth+i];\n\t\telse   return color ? this->cs_[depth+len-i-1] :\n\t\t                      compDna(this->cs_[depth+len-i-1]);\n\t}\n\n\t/**\n\t * Fill the given DNA buffer with the substring specified by fw,\n\t * color, depth and len.\n\t */\n\tvoid windowGetDna(\n\t\tSDnaStringFixed<S>& buf,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = this->len_;\n\t\tassert_leq(len, this->len_ - depth);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tbuf.append(fw ? this->cs_[depth+i] :\n\t\t\t                (color ? this->cs_[depth+len-i-1] :\n\t\t\t                         compDna(this->cs_[depth+len-i-1])));\n\t\t}\n\t}\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  Useful for printing.\n\t */\n\tvirtual const char* toZBuf() const { return this->toZBufXForm(\"ACGTN\"); }\n};\n\n/**\n * Encapsulates a fixed-length DNA string with characters encoded as\n * chars.  Only capable of encoding A, C, G, T and N.  
The length is\n * specified via the template parameter S.\n */\n\ntemplate<int S = 1024, int M = 2>\nclass SDnaStringExpandable : public SStringExpandable<char, S, M> {\npublic:\n\n\texplicit SDnaStringExpandable() : SStringExpandable<char, S, M>() { }\n\n\t/**\n\t * Create an SStringFixed from another SStringFixed.\n\t */\n\tSDnaStringExpandable(const SDnaStringExpandable<S, M>& o) :\n\t\tSStringExpandable<char, S, M>(o) { }\n\n\t/**\n\t * Create an SStringFixed from a C++ basic_string.\n\t */\n\texplicit SDnaStringExpandable(\n\t\tconst std::basic_string<char>& str,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tSStringExpandable<char, S, M>()\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(str);\n\t\t\t} else {\n\t\t\t\tinstallChars(str);\n\t\t\t}\n\t\t} else {\n\t\t\t//FIXME FB: Commented out install(str) as it does not conform with the function definition\n\t\t\t//install(str);\n\t\t\tthrow std::invalid_argument(\"chars=false, colors=false not implemented\");\n\t\t}\n\t}\n\n\t/**\n\t * Create an SStringFixed from an array and size.\n\t */\n\texplicit SDnaStringExpandable(\n\t\tconst char* b,\n\t\tsize_t sz,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tSStringExpandable<char, S, M>()\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(b, sz);\n\t\t\t} else {\n\t\t\t\tinstallChars(b, sz);\n\t\t\t}\n\t\t} else {\n\t\t\tinstall(b, sz);\n\t\t}\n\t}\n\n\t/**\n\t * Create an SStringFixed from a zero-terminated string.\n\t */\n\texplicit SDnaStringExpandable(\n\t\tconst char* b,\n\t\tbool chars = false,\n\t\tbool colors = false) :\n\t\tSStringExpandable<char, S, M>()\n\t{\n\t\tinstall(b, chars, colors);\n\t}\n\n\tvirtual ~SDnaStringExpandable() { } // C++ needs this\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reverse-\n\t * complementing them in the process, assuming an encoding where\n\t * 0=A, 1=C, 2=G, 3=T, 4=N.\n\t */\n\tvoid installReverseComp(const char* b, size_t sz) {\n\t\tif(this->sz_ 
< sz) this->expandCopy((sz + S) * M);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tthis->cs_[i] = (b[sz-i-1] == 4 ? 4 : b[sz-i-1] ^ 3);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reverse-\n\t * complementing them in the process, assuming an encoding where\n\t * 0=A, 1=C, 2=G, 3=T, 4=N.\n\t */\n\tvoid installReverseComp(const SDnaStringExpandable<S, M>& b) {\n\t\tif(this->sz_ < b.len_) this->expandCopy((b.len_ + S) * M);\n\t\tfor(size_t i = 0; i < b.len_; i++) {\n\t\t\tthis->cs_[i] = (b.cs_[b.len_-i-1] == 4 ? 4 : b.cs_[b.len_-i-1] ^ 3);\n\t\t}\n\t\tthis->len_ = b.len_;\n\t}\n\n\t/**\n\t * Either reverse or reverse-complement (depending on \"color\") this\n\t * DNA buffer in-place.\n\t */\n\tvoid reverseComp(bool color = false) {\n\t\tif(color) {\n\t\t\tthis->reverse();\n\t\t} else {\n\t\t\tfor(size_t i = 0; i < (this->len_ >> 1); i++) {\n\t\t\t\tchar tmp1 = (this->cs_[i] == 4 ? 4 : this->cs_[i] ^ 3);\n\t\t\t\tchar tmp2 = (this->cs_[this->len_-i-1] == 4 ? 4 : this->cs_[this->len_-i-1] ^ 3);\n\t\t\t\tthis->cs_[i] = tmp2;\n\t\t\t\tthis->cs_[this->len_-i-1] = tmp1;\n\t\t\t}\n\t\t\t// Do middle element iff there are an odd number\n\t\t\tif((this->len_ & 1) != 0) {\n\t\t\t\tchar tmp = this->cs_[this->len_ >> 1];\n\t\t\t\ttmp = (tmp == 4 ? 
4 : tmp ^ 3);\n\t\t\t\tthis->cs_[this->len_ >> 1] = tmp;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.\n\t */\n\tvirtual void install(\n\t\tconst char* b,\n\t\tbool chars = false,\n\t\tbool colors = false)\n\t{\n\t\tif(chars) {\n\t\t\tif(colors) {\n\t\t\t\tinstallColors(b, strlen(b));\n\t\t\t} else {\n\t\t\t\tinstallChars(b, strlen(b));\n\t\t\t}\n\t\t} else {\n\t\t\tinstall(b, strlen(b));\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.\n\t */\n\tvirtual void install(const char* b, size_t sz) {\n\t\tif(this->sz_ < sz) this->expandCopy((sz + S) * M);\n\t\tmemcpy(this->cs_, b, sz);\n#ifndef NDEBUG\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_range(0, 4, (int)this->cs_[i]);\n\t\t}\n#endif\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy buffer 'b' of ASCII DNA characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installChars(const char* b, size_t sz) {\n\t\tif(this->sz_ < sz) this->expandCopy((sz + S) * M);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_in(toupper(b[i]), \"ACGTN-\");\n\t\t\tthis->cs_[i] = asc2dna[(int)b[i]];\n\t\t\tassert_range(0, 4, (int)this->cs_[i]);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy buffer 'b' of ASCII color characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installColors(const char* b, size_t sz) {\n\t\tif(this->sz_ < sz) this->expandCopy((sz + S) * M);\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_in(b[i], \"0123.\");\n\t\t\tthis->cs_[i] = asc2col[(int)b[i]];\n\t\t\tassert_range(0, 4, (int)this->cs_[i]);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy C++ string of ASCII DNA characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installChars(const std::basic_string<char>& str) {\n\t\tinstallChars(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Copy C++ string of ASCII color characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installColors(const std::basic_string<char>& str) 
{\n\t\tinstallColors(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Set DNA character at index 'idx' to 'c'.\n\t */\n\tvoid set(int c, size_t idx) {\n\t\tassert_lt(idx, this->len_);\n\t\tassert_range(0, 4, c);\n\t\tthis->cs_[idx] = c;\n\t}\n\n\t/**\n\t * Append DNA char c.\n\t */\n\tvoid append(const char& c) {\n\t\tif(this->sz_ < this->len_ + 1) {\n\t\t\tthis->expandCopy((this->len_ + 1 + S) * M);\n\t\t}\n\t\tassert_range(0, 4, (int)c);\n\t\tthis->cs_[this->len_++] = c;\n\t}\n\n\t/**\n\t * Set DNA character at index 'idx' to 'c'.\n\t */\n\tvoid setChar(char c, size_t idx) {\n\t\tassert_lt(idx, this->len_);\n\t\tassert_in(toupper(c), \"ACGTN\");\n\t\tthis->cs_[idx] = asc2dna[(int)c];\n\t}\n\n\t/**\n\t * Append DNA character.\n\t */\n\tvoid appendChar(char c) {\n\t\tif(this->sz_ < this->len_ + 1) {\n\t\t\tthis->expandCopy((this->len_ + 1 + S) * M);\n\t\t}\n\t\tassert_in(toupper(c), \"ACGTN\");\n\t\tthis->cs_[this->len_++] = asc2dna[(int)c];\n\t}\n\n\t/**\n\t * Return DNA character corresponding to element 'idx'.\n\t */\n\tchar toChar(size_t idx) const {\n\t\tassert_range(0, 4, (int)this->cs_[idx]);\n\t\treturn \"ACGTN\"[(int)this->cs_[idx]];\n\t}\n//\n//\t// call with uint32_t or uint64_t\n//\ttemplate<typename T>\n//\tT* uint_kmers (size_t begin, size_t end) {\n//\t\tsize_t t_size = sizeof(T) * 8;\n//\t\tassert_lt(end, this->len_);\n//\t\tend = min(end, this->len_-1);\n//\n//\t\t// number of kmers: ceiling of len / t_size\n//\t\tsize_t n_kmers = ((end-begin) % t_size) ? 
\n//\t\t\t(end-begin) / t_size + 1 : \n//\t\t\t(end-begin) / t_size;\n//\t\tT kmers [n_kmers];\n//\n//\t\t// go through _cs in steps of t_size (16 or 32 for uint32_t and uint64_t, resp)\n//\t\tfor(size_t i = 0; i <= end; i += t_size) {\n//\n//\t\t\t// each step gives one word / kmer\n//\t\t\tT word = 0;\n//\t\t\tint bp = (int)this->cs_[begin+i+j];\n//\t\t\tassert_range(0, 3, (int)bp);\n//\n//\t\t\t// create bitmask, and combine word with new bitmask\n//\t\t\tT shift = (T)j << 1;\n//\t\t\tword |= (bp << shift);\n//\t\t\tif (i % t_size == 0) {\n//\t\t\t\tkmers[i] = word;\n//\t\t\t\tword\n//\t\t\t}\n//\t\t}\n//\t}\n//\n\t//\n\t/**\n\t * Update word to the next k-mer: shift it left by two bits (dropping the\n\t * oldest base of a full word) and OR in the code of the base at pos.\n\t * @param word current packed k-mer\n\t * @param pos  index in cs_ of the base to shift in\n\t */\n\ttemplate<typename UINT>\n\tUINT next_kmer(UINT word, size_t pos) const {\n\t\t// shift the first two bits off the word\n\t\tword = word << 2;\n\t\t// put the base-pair code from pos at that position\n\t\tUINT bp = (UINT)this->cs_[pos];\n\n\t\treturn (word |= bp);\n\t}\n\n\t/**\n\t * Pack bases from cs_[begin, end) into a word, two bits per base,\n\t * stopping after sizeof(UINT)*4 positions.  Non-ACGT bases are skipped.\n\t * @param begin start position of the k-mer\n\t * @param end end position of the k-mer (exclusive)\n\t */\n\ttemplate<typename UINT>\n\tUINT int_kmer(size_t begin,size_t end) const {\n\t\tconst size_t k_size = sizeof(UINT) * 4;  // size of the kmer, two bits are used per nucleotide\n\t\tassert_leq(end, this->len_);\n\n\t\tUINT word = 0;\n\t\t// go through cs_ until end or kmer-size is reached\n\t\tfor (size_t j = 0; j < k_size && (size_t)(begin+j) < end; j++) {\n\t\t\t\tint bp = (int)this->cs_[begin+j];\n\t\t\t\t// assert_range(0, 3, (int)bp); //\n\t\t\t\t// cerr << (begin+j) << \":\" << \"ACGTXYZ\"[bp] << \" \"; //\n\t\t\t\tif (bp < 0 || bp > 3) {\n\t\t\t\t\t// skip non-ACGT bases\n\t\t\t\t\tcontinue;\n\t\t\t\t}\n\t\t\t\t// shift the first two bits off the word\n\t\t\t\tword = word << 2;\n\t\t\t\t// put the base-pair code from pos at that position\n\t\t\t\tword |= bp;\n\t\t}\n\t\t//cerr << 
endl;\n\t\treturn (word);\n\t}\n\n\t/**\n\t * Return the packed 32-mers of the window [begin, begin+len): the first\n\t * via int_kmer, each later one by rolling in the next base with\n\t * next_kmer.  'rev' is accepted but currently ignored.\n\t */\n\tvector<uint64_t> get_all_kmers(size_t begin,size_t len,bool rev=false) const {\n\t\t(void)rev;\n\t\tconst size_t k = 32; // bases per uint64_t word\n\t\tvector<uint64_t> kmers(len > k ? len - k + 1 : 1);\n\t\tkmers[0] = this->int_kmer<uint64_t>(begin,begin+len);\n\t\tfor(size_t j = 1; j < kmers.size(); j++) {\n\t\t\t// roll in the base that follows the previous 32-mer\n\t\t\tkmers[j] = this->next_kmer(kmers[j-1],begin+k+j-1);\n\t\t}\n\t\treturn kmers;\n\t}\n\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tinline const char& operator[](size_t i) const {\n\t\treturn this->get(i);\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tinline const char& get(size_t i) const {\n\t\tassert_lt(i, this->len_);\n\t\tassert_range(0, 4, (int)this->cs_[i]);\n\t\treturn this->cs_[i];\n\t}\n\n\t/**\n\t * Return the ith character in the window defined by fw, color,\n\t * depth and len.\n\t */\n\tchar windowGetDna(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = this->len_;\n\t\tassert_lt(i, len);\n\t\tassert_leq(len, this->len_ - depth);\n\t\tif(fw) return this->cs_[depth+i];\n\t\telse   return color ? this->cs_[depth+len-i-1] :\n\t\t                      compDna(this->cs_[depth+len-i-1]);\n\t}\n\n\t/**\n\t * Fill the given DNA buffer with the substring specified by fw,\n\t * color, depth and len.\n\t */\n\tvoid windowGetDna(\n\t\tSDnaStringExpandable<S, M>& buf,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = this->len_;\n\t\tassert_leq(len, this->len_ - depth);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tbuf.append(fw ? this->cs_[depth+i] :\n\t\t\t                (color ? this->cs_[depth+len-i-1] :\n\t\t\t                         compDna(this->cs_[depth+len-i-1])));\n\t\t}\n\t}\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to the buffer.  
Useful for printing.\n\t */\n\tvirtual const char* toZBuf() const { return this->toZBufXForm(\"ACGTN\"); }\n};\n\n/**\n * Encapsulates an expandable DNA string with characters encoded as\n * char-sized masks.  Encodes A, C, G, T, and all IUPAC, as well as the\n * empty mask indicating \"matches nothing.\"\n */\ntemplate<int S = 16, int M = 2>\nclass SDnaMaskString : public SStringExpandable<char, S, M> {\npublic:\n\n\texplicit SDnaMaskString() : SStringExpandable<char, S, M>() { }\n\n\t/**\n\t * Create an SDnaMaskString from another SDnaMaskString.\n\t */\n\tSDnaMaskString(const SDnaMaskString<S, M>& o) :\n\t\tSStringExpandable<char, S, M>(o) { }\n\n\t/**\n\t * Create an SDnaMaskString from a C++ basic_string.\n\t */\n\texplicit SDnaMaskString(const std::basic_string<char>& str) :\n\t\tSStringExpandable<char, S, M>(str) { }\n\n\t/**\n\t * Create an SDnaMaskString from an array and size.\n\t */\n\texplicit SDnaMaskString(const char* b, size_t sz) :\n\t\tSStringExpandable<char, S, M>(b, sz) { }\n\n\t/**\n\t * Create an SDnaMaskString from a zero-terminated string.  If\n\t * 'chars' is true, 'b' holds ASCII IUPAC characters; otherwise it\n\t * holds already-encoded masks.\n\t */\n\texplicit SDnaMaskString(const char* b, bool chars = false) :\n\t\tSStringExpandable<char, S, M>()\n\t{\n\t\tif(chars) {\n\t\t\tinstallChars(b, strlen(b));\n\t\t} else {\n\t\t\tinstall(b, strlen(b));\n\t\t}\n\t}\n\n\tvirtual ~SDnaMaskString() { }\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string, reverse-\n\t * complementing them in the process.  Each byte of 'b' is treated\n\t * as a DNA mask and complemented via the maskcomp table.\n\t */\n\tvoid installReverseComp(const char* b, size_t sz) {\n\t\twhile(this->sz_ < sz) {\n\t\t\tthis->expandNoCopy((sz + S) * M);\n\t\t}\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tthis->cs_[i] = maskcomp[(int)b[sz-i-1]];\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy the contents of mask string 'b' into this string, reverse-\n\t * complementing them in the process via the maskcomp table.\n\t */\n\tvoid installReverseComp(const SDnaMaskString<S, M>& b) 
{\n\t\twhile(this->sz_ < b.len_) {\n\t\t\tthis->expandNoCopy((b.len_ + S) * M);\n\t\t}\n\t\tfor(size_t i = 0; i < b.len_; i++) {\n\t\t\tthis->cs_[i] = maskcomp[(int)b.cs_[b.len_-i-1]];\n\t\t}\n\t\tthis->len_ = b.len_;\n\t}\n\n\t/**\n\t * Either reverse or reverse-complement (depending on \"color\") this\n\t * DNA buffer in-place.\n\t */\n\tvoid reverseComp(bool color = false) {\n\t\tif(color) {\n\t\t\tthis->reverse();\n\t\t} else {\n\t\t\tfor(size_t i = 0; i < (this->len_ >> 1); i++) {\n\t\t\t\tchar tmp1 = maskcomp[(int)this->cs_[i]];\n\t\t\t\tchar tmp2 = maskcomp[(int)this->cs_[this->len_-i-1]];\n\t\t\t\tthis->cs_[i] = tmp2;\n\t\t\t\tthis->cs_[this->len_-i-1] = tmp1;\n\t\t\t}\n\t\t\t// Do middle element iff there are an odd number\n\t\t\tif((this->len_ & 1) != 0) {\n\t\t\t\tchar tmp = this->cs_[this->len_ >> 1];\n\t\t\t\ttmp = maskcomp[(int)tmp];\n\t\t\t\tthis->cs_[this->len_ >> 1] = tmp;\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Copy 'sz' bytes from buffer 'b' into this string.\n\t */\n\tvirtual void install(const char* b, size_t sz) {\n\t\twhile(this->sz_ < sz) {\n\t\t\tthis->expandNoCopy((sz + S) * M);\n\t\t}\n\t\tmemcpy(this->cs_, b, sz);\n#ifndef NDEBUG\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_range((int)this->cs_[i], 0, 15);\n\t\t}\n#endif\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy buffer 'b' of ASCII DNA characters into DNA masks.\n\t */\n\tvirtual void installChars(const char* b, size_t sz) {\n\t\twhile(this->sz_ < sz) {\n\t\t\tthis->expandNoCopy((sz + S) * M);\n\t\t}\n\t\tfor(size_t i = 0; i < sz; i++) {\n\t\t\tassert_in(b[i], iupacs);\n\t\t\tthis->cs_[i] = asc2dnamask[(int)b[i]];\n\t\t\tassert_range((int)this->cs_[i], 0, 15);\n\t\t}\n\t\tthis->len_ = sz;\n\t}\n\n\t/**\n\t * Copy C++ string of ASCII DNA characters into normal DNA\n\t * characters.\n\t */\n\tvirtual void installChars(const std::basic_string<char>& str) {\n\t\tinstallChars(str.c_str(), str.length());\n\t}\n\n\t/**\n\t * Set DNA character at index 'idx' to 'c'.\n\t */\n\tvoid 
set(int c, size_t idx) {\n\t\tassert_lt(idx, this->len_);\n\t\tassert_range(c, 0, 15);\n\t\tthis->cs_[idx] = c;\n\t}\n\n\t/**\n\t * Append DNA mask c, preserving the existing contents if the buffer\n\t * has to grow.\n\t */\n\tvoid append(const char& c) {\n\t\twhile(this->sz_ < this->len_+1) {\n\t\t\tthis->expandCopy((this->len_ + 1 + S) * M);\n\t\t}\n\t\tassert_range((int)c, 0, 15);\n\t\tthis->cs_[this->len_++] = c;\n\t}\n\n\t/**\n\t * Set DNA character at index 'idx' to 'c'.\n\t */\n\tvoid setChar(char c, size_t idx) {\n\t\tassert_lt(idx, this->len_);\n\t\tassert_in(toupper(c), iupacs);\n\t\tthis->cs_[idx] = asc2dnamask[(int)c];\n\t}\n\n\t/**\n\t * Append ASCII DNA character, preserving the existing contents if\n\t * the buffer has to grow.\n\t */\n\tvoid appendChar(char c) {\n\t\twhile(this->sz_ < this->len_+1) {\n\t\t\tthis->expandCopy((this->len_ + 1 + S) * M);\n\t\t}\n\t\tassert_in(toupper(c), iupacs);\n\t\tthis->cs_[this->len_++] = asc2dnamask[(int)c];\n\t}\n\n\t/**\n\t * Return DNA character corresponding to element 'idx'.\n\t */\n\tchar toChar(size_t idx) const {\n\t\tassert_range((int)this->cs_[idx], 0, 15);\n\t\treturn mask2iupac[(int)this->cs_[idx]];\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tconst char& operator[](size_t i) const {\n\t\treturn this->get(i);\n\t}\n\n\t/**\n\t * Retrieve mutable version of element i.\n\t */\n\tchar& operator[](size_t i) {\n\t\treturn this->get(i);\n\t}\n\n\t/**\n\t * Retrieve constant version of element i.\n\t */\n\tconst char& get(size_t i) const {\n\t\tassert_lt(i, this->len_);\n\t\tassert_range((int)this->cs_[i], 0, 15);\n\t\treturn this->cs_[i];\n\t}\n\n\t/**\n\t * Retrieve mutable version of element i.\n\t */\n\tchar& get(size_t i) {\n\t\tassert_lt(i, this->len_);\n\t\tassert_range((int)this->cs_[i], 0, 15);\n\t\treturn this->cs_[i];\n\t}\n\n\t/**\n\t * Return the ith character in the window defined by fw, color,\n\t * depth and len.\n\t */\n\tchar windowGetDna(\n\t\tsize_t i,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = this->len_;\n\t\tassert_lt(i, 
len);\n\t\tassert_leq(len, this->len_ - depth);\n\t\tif(fw) return this->cs_[depth+i];\n\t\telse   return color ? this->cs_[depth+len-i-1] :\n\t\t                      maskcomp[this->cs_[depth+len-i-1]];\n\t}\n\n\t/**\n\t * Fill the given DNA buffer with the substring specified by fw,\n\t * color, depth and len.\n\t */\n\tvoid windowGetDna(\n\t\tSDnaStringFixed<S>& buf,\n\t\tbool   fw,\n\t\tbool   color,\n\t\tsize_t depth = 0,\n\t\tsize_t len = 0) const\n\t{\n\t\tif(len == 0) len = this->len_;\n\t\tassert_leq(len, this->len_ - depth);\n\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\tbuf.append(fw ? this->cs_[depth+i] :\n\t\t\t                (color ? this->cs_[depth+len-i-1] :\n\t\t\t                         maskcomp[this->cs_[depth+len-i-1]]));\n\t\t}\n\t}\n\n\t/**\n\t * Sample a random substring of the given length from this DNA\n\t * string and install the result in 'dst'.\n\t */\n\ttemplate<typename T>\n\tvoid randSubstr(\n\t\tRandomSource& rnd,  // pseudo-random generator\n\t\tT& dst,             // put sampled substring here\n\t\tsize_t len,         // length of substring to extract\n\t\tbool watson = true, // true -> possibly extract from Watson strand\n\t\tbool crick = true)  // true -> possibly extract from Crick strand\n\t{\n\t\tassert(watson || crick);\n\t\tassert_geq(this->len_, len);\n\t\tsize_t poss = this->len_ - len + 1;\n\t\tassert_gt(poss, 0);\n\t\tuint32_t rndoff = (uint32_t)(rnd.nextU32() % poss);\n\t\tbool fw;\n\t\tif     (watson && !crick) fw = true;\n\t\telse if(!watson && crick) fw = false;\n\t\telse {\n\t\t\tfw = rnd.nextBool();\n\t\t}\n\t\tif(fw) {\n\t\t\t// Install Watson substring\n\t\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\t\tdst[i] = this->cs_[i + rndoff];\n\t\t\t}\n\t\t} else {\n\t\t\t// Install Crick substring: read the window back to front,\n\t\t\t// complementing each mask\n\t\t\tfor(size_t i = 0; i < len; i++) {\n\t\t\t\tdst[i] = maskcomp[(int)this->cs_[rndoff + (len - i - 1)]];\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Put a terminator in the 'len_'th element and then return a\n\t * pointer to 
the buffer.  Useful for printing.\n\t */\n\tvirtual const char* toZBuf() const { return this->toZBufXForm(iupacs); }\n};\n\ntypedef SStringExpandable<char, 1024, 2> BTString;\ntypedef SDnaStringExpandable<1024, 2>    BTDnaString;\ntypedef SDnaMaskString<32, 2>            BTDnaMask;\n\n#endif /* SSTRING_H_ */\n"
  },
  {
    "path": "str_util.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef STR_UTIL_H_\n#define STR_UTIL_H_\n\n#include <string>\n\n/**\n * Given a string, return an int hash for it.\n */\nstatic inline int\nhash_string(const std::string& s) {\n\tint ret = 0;\n\tint a = 63689;\n\tint b = 378551;\n\tfor(size_t i = 0; i < s.length(); i++) {\n\t\tret = (ret * a) + (int)s[i];\n\t\tif(a == 0) {\n\t\t\ta += b;\n\t\t} else {\n\t\t\ta *= b;\n\t\t}\n\t\tif(a == 0) {\n\t\t\ta += b;\n\t\t}\n\t}\n\treturn ret;\n}\n\n#endif /* STR_UTIL_H_ */\n"
  },
  {
    "path": "taxonomy.h",
    "content": "/*\n * taxonomy.h\n *\n *  Created on: Feb 10, 2016\n *      Author: fbreitwieser\n */\n\n#ifndef TAXONOMY_H_\n#define TAXONOMY_H_\n\n#include<map>\n#include<utility>\n#include<string>\n\nenum {\n    RANK_UNKNOWN = 0,\n    RANK_STRAIN,\n    RANK_SPECIES,\n    RANK_GENUS,\n    RANK_FAMILY,\n    RANK_ORDER,\n    RANK_CLASS,\n    RANK_PHYLUM,\n    RANK_KINGDOM,\n    RANK_DOMAIN,\n    RANK_FORMA,\n    RANK_INFRA_CLASS,\n    RANK_INFRA_ORDER,\n    RANK_PARV_ORDER,\n    RANK_SUB_CLASS,\n    RANK_SUB_FAMILY,\n    RANK_SUB_GENUS,\n    RANK_SUB_KINGDOM,\n    RANK_SUB_ORDER,\n    RANK_SUB_PHYLUM,\n    RANK_SUB_SPECIES,\n    RANK_SUB_TRIBE,\n    RANK_SUPER_CLASS,\n    RANK_SUPER_FAMILY,\n    RANK_SUPER_KINGDOM,\n    RANK_SUPER_ORDER,\n    RANK_SUPER_PHYLUM,\n    RANK_TRIBE,\n    RANK_VARIETAS,\n    RANK_LIFE,\n    RANK_MAX\n};\n\nextern uint8_t tax_rank_num[RANK_MAX];\n\nstruct TaxonomyNode {\n    uint64_t parent_tid;\n    uint8_t  rank;\n    uint8_t  leaf;\n\n    TaxonomyNode(uint64_t _parent_tid, uint8_t  _rank, uint8_t _leaf):\n    \tparent_tid(_parent_tid), rank(_rank), leaf(_leaf) {};\n\n    TaxonomyNode(): parent_tid(0), rank(RANK_UNKNOWN), leaf(false) {};\n};\n\nstruct TaxonomyPathTable {\n    static const size_t nranks = 10;\n\n    map<uint64_t, uint32_t> tid_to_pid;  // from taxonomic ID to path ID\n    ELList<uint64_t> paths;\n\n    static uint8_t rank_to_pathID(uint8_t rank) {\n        switch(rank) {\n            case RANK_STRAIN:\n            case RANK_SUB_SPECIES:\n                return 0;\n            case RANK_SPECIES:\n                return 1;\n            case RANK_GENUS:\n                return 2;\n            case RANK_FAMILY:\n                return 3;\n            case RANK_ORDER:\n                return 4;\n            case RANK_CLASS:\n                return 5;\n            case RANK_PHYLUM:\n                return 6;\n            case RANK_KINGDOM:\n                return 7;\n            case RANK_SUPER_KINGDOM:\n                
return 8;\n            case RANK_DOMAIN:\n                return 9;\n            default:\n                return std::numeric_limits<uint8_t>::max();\n        }\n    }\n\n    void buildPaths(const EList<pair<string, uint64_t> >& uid_to_tid,\n                    const std::map<uint64_t, TaxonomyNode>& tree)\n    {\n        map<uint32_t, uint32_t> rank_map;\n        rank_map[RANK_STRAIN]        = 0;\n        rank_map[RANK_SUB_SPECIES]   = 0;\n        rank_map[RANK_SPECIES]       = 1;\n        rank_map[RANK_GENUS]         = 2;\n        rank_map[RANK_FAMILY]        = 3;\n        rank_map[RANK_ORDER]         = 4;\n        rank_map[RANK_CLASS]         = 5;\n        rank_map[RANK_PHYLUM]        = 6;\n        rank_map[RANK_KINGDOM]       = 7;\n        rank_map[RANK_SUPER_KINGDOM] = 8;\n        rank_map[RANK_DOMAIN]        = 9;\n\n        tid_to_pid.clear();\n        paths.clear();\n        for(size_t i = 0; i < uid_to_tid.size(); i++) {\n            uint64_t tid = uid_to_tid[i].second;\n            if(tid_to_pid.find(tid) != tid_to_pid.end())\n                continue;\n            if(tree.find(tid) == tree.end())\n                continue;\n            tid_to_pid[tid] = (uint32_t)paths.size();\n            paths.expand();\n            EList<uint64_t>& path = paths.back();\n            path.resizeExact(nranks);\n            path.fillZero();\n            bool first = true;\n            while(true) {\n                std::map<uint64_t, TaxonomyNode>::const_iterator itr = tree.find(tid);\n                if(itr == tree.end()) {\n                    break;\n                }\n                const TaxonomyNode& node = itr->second;\n                uint32_t rank = std::numeric_limits<uint32_t>::max();\n                if(first && node.rank == RANK_UNKNOWN) {\n                    rank = rank_map[RANK_STRAIN];\n                } else if(rank_map.find(node.rank) != rank_map.end()) {\n                    rank = rank_map[node.rank];\n                }\n                if(rank < 
path.size() && path[rank] == 0) {\n                    path[rank] = tid;\n                }\n\n                first = false;\n                if(node.parent_tid == tid) {\n                    break;\n                }\n                tid = node.parent_tid;\n            }\n        }\n    }\n\n    void getPath(uint64_t tid, EList<uint64_t>& path) const {\n        map<uint64_t, uint32_t>::const_iterator itr = tid_to_pid.find(tid);\n        if(itr != tid_to_pid.end()) {\n            uint32_t pid = itr->second;\n            assert_lt(pid, paths.size());\n            path = paths[pid];\n        } else {\n            path.clear();\n        }\n    }\n};\n\ntypedef std::map<uint64_t, TaxonomyNode> TaxonomyTree;\n\ninline static void initial_tax_rank_num() {\n    uint8_t rank = 0;\n    \n    tax_rank_num[RANK_SUB_SPECIES] = rank;\n    tax_rank_num[RANK_STRAIN] = rank++;\n    \n    tax_rank_num[RANK_SPECIES] = rank++;\n    \n    tax_rank_num[RANK_SUB_GENUS] = rank;\n    tax_rank_num[RANK_GENUS] = rank++;\n    \n    tax_rank_num[RANK_SUB_FAMILY] = rank;\n    tax_rank_num[RANK_FAMILY] = rank;\n    tax_rank_num[RANK_SUPER_FAMILY] = rank++;\n    \n    tax_rank_num[RANK_SUB_ORDER] = rank;\n    tax_rank_num[RANK_INFRA_ORDER] = rank;\n    tax_rank_num[RANK_PARV_ORDER] = rank;\n    tax_rank_num[RANK_ORDER] = rank;\n    tax_rank_num[RANK_SUPER_ORDER] = rank++;\n    \n    tax_rank_num[RANK_INFRA_CLASS] = rank;\n    tax_rank_num[RANK_SUB_CLASS] = rank;\n    tax_rank_num[RANK_CLASS] = rank;\n    tax_rank_num[RANK_SUPER_CLASS] = rank++;\n    \n    tax_rank_num[RANK_SUB_PHYLUM] = rank;\n    tax_rank_num[RANK_PHYLUM] = rank;\n    tax_rank_num[RANK_SUPER_PHYLUM] = rank++;\n    \n    tax_rank_num[RANK_SUB_KINGDOM] = rank;\n    tax_rank_num[RANK_KINGDOM] = rank;\n    tax_rank_num[RANK_SUPER_KINGDOM] = rank++;\n    \n    tax_rank_num[RANK_DOMAIN] = rank;\n    tax_rank_num[RANK_FORMA] = rank;\n    tax_rank_num[RANK_SUB_TRIBE] = rank;\n    tax_rank_num[RANK_TRIBE] = rank;\n    
tax_rank_num[RANK_VARIETAS] = rank;\n    tax_rank_num[RANK_UNKNOWN] = rank;\n}\n\ninline static const char* get_tax_rank_string(uint8_t rank) {\n    switch(rank) {\n        case RANK_STRAIN:        return \"strain\";\n        case RANK_SPECIES:       return \"species\";\n        case RANK_GENUS:         return \"genus\";\n        case RANK_FAMILY:        return \"family\";\n        case RANK_ORDER:         return \"order\";\n        case RANK_CLASS:         return \"class\";\n        case RANK_PHYLUM:        return \"phylum\";\n        case RANK_KINGDOM:       return \"kingdom\";\n        case RANK_FORMA:         return \"forma\";\n        case RANK_INFRA_CLASS:   return \"infraclass\";\n        case RANK_INFRA_ORDER:   return \"infraorder\";\n        case RANK_PARV_ORDER:    return \"parvorder\";\n        case RANK_SUB_CLASS:     return \"subclass\";\n        case RANK_SUB_FAMILY:    return \"subfamily\";\n        case RANK_SUB_GENUS:     return \"subgenus\";\n        case RANK_SUB_KINGDOM:   return \"subkingdom\";\n        case RANK_SUB_ORDER:     return \"suborder\";\n        case RANK_SUB_PHYLUM:    return \"subphylum\";\n        case RANK_SUB_SPECIES:   return \"subspecies\";\n        case RANK_SUB_TRIBE:     return \"subtribe\";\n        case RANK_SUPER_CLASS:   return \"superclass\";\n        case RANK_SUPER_FAMILY:  return \"superfamily\";\n        case RANK_SUPER_KINGDOM: return \"superkingdom\";\n        case RANK_SUPER_ORDER:   return \"superorder\";\n        case RANK_SUPER_PHYLUM:  return \"superphylum\";\n        case RANK_TRIBE:         return \"tribe\";\n        case RANK_VARIETAS:      return \"varietas\";\n        case RANK_LIFE:          return \"life\";\n        default:                 return \"no rank\";\n    };\n}\n\ninline static uint8_t get_tax_rank_id(const char* rank) {\n    if(strcmp(rank, \"strain\") == 0) {\n        return RANK_STRAIN;\n    } else if(strcmp(rank, \"species\") == 0) {\n        return RANK_SPECIES;\n    } else 
if(strcmp(rank, \"genus\") == 0) {\n        return RANK_GENUS;\n    } else if(strcmp(rank, \"family\") == 0) {\n        return RANK_FAMILY;\n    } else if(strcmp(rank, \"order\") == 0) {\n        return RANK_ORDER;\n    } else if(strcmp(rank, \"class\") == 0) {\n        return RANK_CLASS;\n    } else if(strcmp(rank, \"phylum\") == 0) {\n        return RANK_PHYLUM;\n    } else if(strcmp(rank, \"kingdom\") == 0) {\n        return RANK_KINGDOM;\n    } else if(strcmp(rank, \"forma\") == 0) {\n        return RANK_FORMA;\n    } else if(strcmp(rank, \"infraclass\") == 0) {\n        return RANK_INFRA_CLASS;\n    } else if(strcmp(rank, \"infraorder\") == 0) {\n        return RANK_INFRA_ORDER;\n    } else if(strcmp(rank, \"parvorder\") == 0) {\n        return RANK_PARV_ORDER;\n    } else if(strcmp(rank, \"subclass\") == 0) {\n        return RANK_SUB_CLASS;\n    } else if(strcmp(rank, \"subfamily\") == 0) {\n        return RANK_SUB_FAMILY;\n    } else if(strcmp(rank, \"subgenus\") == 0) {\n        return RANK_SUB_GENUS;\n    } else if(strcmp(rank, \"subkingdom\") == 0) {\n        return RANK_SUB_KINGDOM;\n    } else if(strcmp(rank, \"suborder\") == 0) {\n        return RANK_SUB_ORDER;\n    } else if(strcmp(rank, \"subphylum\") == 0) {\n        return RANK_SUB_PHYLUM;\n    } else if(strcmp(rank, \"subspecies\") == 0) {\n        return RANK_SUB_SPECIES;\n    } else if(strcmp(rank, \"subtribe\") == 0) {\n        return RANK_SUB_TRIBE;\n    } else if(strcmp(rank, \"superclass\") == 0) {\n        return RANK_SUPER_CLASS;\n    } else if(strcmp(rank, \"superfamily\") == 0) {\n        return RANK_SUPER_FAMILY;\n    } else if(strcmp(rank, \"superkingdom\") == 0) {\n        return RANK_SUPER_KINGDOM;\n    } else if(strcmp(rank, \"superorder\") == 0) {\n        return RANK_SUPER_ORDER;\n    } else if(strcmp(rank, \"superphylum\") == 0) {\n        return RANK_SUPER_PHYLUM;\n    } else if(strcmp(rank, \"tribe\") == 0) {\n        return RANK_TRIBE;\n    } else if(strcmp(rank, \"varietas\") 
== 0) {\n        return RANK_VARIETAS;\n    } else if(strcmp(rank, \"life\") == 0) {\n        return RANK_LIFE;\n    } else {\n        return RANK_UNKNOWN;\n    }\n}\n\n/**\n * Walk from 'taxid' toward the root and return the first taxonomic ID\n * (possibly 'taxid' itself) whose rank equals 'at_rank'.  Return 0 if\n * 'taxid' is not in the tree, if a node whose rank value is greater\n * than 'at_rank' is reached first, or if the root is reached.\n */\ninline static uint64_t get_taxid_at_parent_rank(const TaxonomyTree& tree, uint64_t taxid, uint8_t at_rank) {\n\twhile (true) {\n\t\tTaxonomyTree::const_iterator itr = tree.find(taxid);\n\t\tif(itr == tree.end()) {\n\t\t\tbreak;\n\t\t}\n\t\tconst TaxonomyNode& node = itr->second;\n\n\t\tif (node.rank == at_rank) {\n\t\t\treturn taxid;\n\t\t} else if (node.rank > at_rank || node.parent_tid == taxid) {\n\t\t\treturn 0;\n\t\t}\n\n\t\ttaxid = node.parent_tid;\n\t}\n\treturn 0;\n}\n\n/**\n * Read a taxonomy tree from 'taxonomy_fname'.  Each non-empty line that\n * does not start with '#' is parsed as: taxonomic ID, a one-character\n * separator, parent taxonomic ID, a one-character separator, rank name.\n * Duplicate IDs are skipped with a warning.  Throws 1 if the file\n * cannot be opened.\n */\ninline static TaxonomyTree read_taxonomy_tree(string taxonomy_fname) {\n\tTaxonomyTree tree;\n\tifstream taxonomy_file(taxonomy_fname.c_str(), ios::in);\n\tif(taxonomy_file.is_open()) {\n\t\tchar line[1024];\n\t\t// Stop on any read failure (including an over-long line) rather\n\t\t// than looping until eof, which a failed stream never reaches.\n\t\twhile(taxonomy_file.getline(line, sizeof(line))) {\n\t\t\tif(line[0] == 0 || line[0] == '#') continue;\n\t\t\tistringstream cline(line);\n\t\t\tuint64_t tid, parent_tid;\n\t\t\tchar dummy; string rank_string;\n\t\t\tcline >> tid >> dummy >> parent_tid >> dummy >> rank_string;\n\t\t\tif(tree.find(tid) != tree.end()) {\n\t\t\t\tcerr << \"Warning: \" << tid << \" already has a parent!\" << endl;\n\t\t\t\tcontinue;\n\t\t\t}\n\n\t\t\ttree[tid] = TaxonomyNode(parent_tid, get_tax_rank_id(rank_string.c_str()), false);\n\t\t}\n\t\ttaxonomy_file.close();\n\t} else {\n\t\tcerr << \"Error: \" << taxonomy_fname << \" could not be opened!\" << endl;\n\t\tthrow 1;\n\t}\n\treturn tree;\n}\n\n\n#endif /* TAXONOMY_H_ */\n"
  },
  {
    "path": "third_party/MurmurHash3.cpp",
    "content": "//-----------------------------------------------------------------------------\n// MurmurHash3 was written by Austin Appleby, and is placed in the public\n// domain. The author hereby disclaims copyright to this source code.\n\n// Note - The x86 and x64 versions do _not_ produce the same results, as the\n// algorithms are optimized for their respective platforms. You can still\n// compile and run any of them on any platform, but your performance with the\n// non-native version will be less than optimal.\n\n#include \"MurmurHash3.h\"\n\n//-----------------------------------------------------------------------------\n// Platform-specific functions and macros\n\n// Microsoft Visual Studio\n\n#if defined(_MSC_VER)\n\n#define FORCE_INLINE\t__forceinline\n\n#include <stdlib.h>\n\n#define ROTL32(x,y)\t_rotl(x,y)\n#define ROTL64(x,y)\t_rotl64(x,y)\n\n#define BIG_CONSTANT(x) (x)\n\n// Other compilers\n\n#else\t// defined(_MSC_VER)\n\n#define\tFORCE_INLINE inline __attribute__((always_inline))\n\ninline uint32_t rotl32 ( uint32_t x, int8_t r )\n{\n  return (x << r) | (x >> (32 - r));\n}\n\ninline uint64_t rotl64 ( uint64_t x, int8_t r )\n{\n  return (x << r) | (x >> (64 - r));\n}\n\n#define\tROTL32(x,y)\trotl32(x,y)\n#define ROTL64(x,y)\trotl64(x,y)\n\n#define BIG_CONSTANT(x) (x##LLU)\n\n#endif // !defined(_MSC_VER)\n\n//-----------------------------------------------------------------------------\n// Block read - if your platform needs to do endian-swapping or can only\n// handle aligned reads, do the conversion here\n\nFORCE_INLINE uint32_t getblock32 ( const uint32_t * p, int i )\n{\n  return p[i];\n}\n\nFORCE_INLINE uint64_t getblock64 ( const uint64_t * p, int i )\n{\n  return p[i];\n}\n\n//-----------------------------------------------------------------------------\n// Finalization mix - force all bits of a hash block to avalanche\n\nFORCE_INLINE uint32_t fmix32 ( uint32_t h )\n{\n  h ^= h >> 16;\n  h *= 0x85ebca6b;\n  h ^= h >> 13;\n  h *= 
0xc2b2ae35;\n  h ^= h >> 16;\n\n  return h;\n}\n\n//----------\n\nFORCE_INLINE uint64_t fmix64 ( uint64_t k )\n{\n  k ^= k >> 33;\n  k *= BIG_CONSTANT(0xff51afd7ed558ccd);\n  k ^= k >> 33;\n  k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);\n  k ^= k >> 33;\n\n  return k;\n}\n\n//-----------------------------------------------------------------------------\n\nvoid MurmurHash3_x86_32 ( const void * key, int len,\n                          uint32_t seed, void * out )\n{\n  const uint8_t * data = (const uint8_t*)key;\n  const int nblocks = len / 4;\n\n  uint32_t h1 = seed;\n\n  const uint32_t c1 = 0xcc9e2d51;\n  const uint32_t c2 = 0x1b873593;\n\n  //----------\n  // body\n\n  const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);\n\n  for(int i = -nblocks; i; i++)\n  {\n    uint32_t k1 = getblock32(blocks,i);\n\n    k1 *= c1;\n    k1 = ROTL32(k1,15);\n    k1 *= c2;\n    \n    h1 ^= k1;\n    h1 = ROTL32(h1,13); \n    h1 = h1*5+0xe6546b64;\n  }\n\n  //----------\n  // tail\n\n  const uint8_t * tail = (const uint8_t*)(data + nblocks*4);\n\n  uint32_t k1 = 0;\n\n  switch(len & 3)\n  {\n  case 3: k1 ^= tail[2] << 16;\n  case 2: k1 ^= tail[1] << 8;\n  case 1: k1 ^= tail[0];\n          k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;\n  };\n\n  //----------\n  // finalization\n\n  h1 ^= len;\n\n  h1 = fmix32(h1);\n\n  *(uint32_t*)out = h1;\n} \n\n//-----------------------------------------------------------------------------\n\nvoid MurmurHash3_x86_128 ( const void * key, const int len,\n                           uint32_t seed, void * out )\n{\n  const uint8_t * data = (const uint8_t*)key;\n  const int nblocks = len / 16;\n\n  uint32_t h1 = seed;\n  uint32_t h2 = seed;\n  uint32_t h3 = seed;\n  uint32_t h4 = seed;\n\n  const uint32_t c1 = 0x239b961b; \n  const uint32_t c2 = 0xab0e9789;\n  const uint32_t c3 = 0x38b34ae5; \n  const uint32_t c4 = 0xa1e38b93;\n\n  //----------\n  // body\n\n  const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);\n\n  for(int i = 
-nblocks; i; i++)\n  {\n    uint32_t k1 = getblock32(blocks,i*4+0);\n    uint32_t k2 = getblock32(blocks,i*4+1);\n    uint32_t k3 = getblock32(blocks,i*4+2);\n    uint32_t k4 = getblock32(blocks,i*4+3);\n\n    k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;\n\n    h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;\n\n    k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;\n\n    h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;\n\n    k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;\n\n    h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;\n\n    k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;\n\n    h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;\n  }\n\n  //----------\n  // tail\n\n  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);\n\n  uint32_t k1 = 0;\n  uint32_t k2 = 0;\n  uint32_t k3 = 0;\n  uint32_t k4 = 0;\n\n  switch(len & 15)\n  {\n  case 15: k4 ^= tail[14] << 16;\n  case 14: k4 ^= tail[13] << 8;\n  case 13: k4 ^= tail[12] << 0;\n           k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;\n\n  case 12: k3 ^= tail[11] << 24;\n  case 11: k3 ^= tail[10] << 16;\n  case 10: k3 ^= tail[ 9] << 8;\n  case  9: k3 ^= tail[ 8] << 0;\n           k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;\n\n  case  8: k2 ^= tail[ 7] << 24;\n  case  7: k2 ^= tail[ 6] << 16;\n  case  6: k2 ^= tail[ 5] << 8;\n  case  5: k2 ^= tail[ 4] << 0;\n           k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;\n\n  case  4: k1 ^= tail[ 3] << 24;\n  case  3: k1 ^= tail[ 2] << 16;\n  case  2: k1 ^= tail[ 1] << 8;\n  case  1: k1 ^= tail[ 0] << 0;\n           k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;\n  };\n\n  //----------\n  // finalization\n\n  h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;\n\n  h1 += h2; h1 += h3; h1 += h4;\n  h2 += h1; h3 += h1; h4 += h1;\n\n  h1 = fmix32(h1);\n  h2 = fmix32(h2);\n  h3 = fmix32(h3);\n  h4 = fmix32(h4);\n\n  h1 += h2; h1 += h3; h1 += h4;\n  h2 += h1; h3 += h1; h4 += h1;\n\n  ((uint32_t*)out)[0] = h1;\n  
((uint32_t*)out)[1] = h2;\n  ((uint32_t*)out)[2] = h3;\n  ((uint32_t*)out)[3] = h4;\n}\n\n//-----------------------------------------------------------------------------\n\nvoid MurmurHash3_x64_128 ( const void * key, const int len,\n                           const uint32_t seed, void * out )\n{\n  const uint8_t * data = (const uint8_t*)key;\n  const int nblocks = len / 16;\n\n  uint64_t h1 = seed;\n  uint64_t h2 = seed;\n\n  const uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);\n  const uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);\n\n  //----------\n  // body\n\n  const uint64_t * blocks = (const uint64_t *)(data);\n\n  for(int i = 0; i < nblocks; i++)\n  {\n    uint64_t k1 = getblock64(blocks,i*2+0);\n    uint64_t k2 = getblock64(blocks,i*2+1);\n\n    k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;\n\n    h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;\n\n    k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;\n\n    h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;\n  }\n\n  //----------\n  // tail\n\n  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);\n\n  uint64_t k1 = 0;\n  uint64_t k2 = 0;\n\n  switch(len & 15)\n  {\n  case 15: k2 ^= ((uint64_t)tail[14]) << 48;\n  case 14: k2 ^= ((uint64_t)tail[13]) << 40;\n  case 13: k2 ^= ((uint64_t)tail[12]) << 32;\n  case 12: k2 ^= ((uint64_t)tail[11]) << 24;\n  case 11: k2 ^= ((uint64_t)tail[10]) << 16;\n  case 10: k2 ^= ((uint64_t)tail[ 9]) << 8;\n  case  9: k2 ^= ((uint64_t)tail[ 8]) << 0;\n           k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;\n\n  case  8: k1 ^= ((uint64_t)tail[ 7]) << 56;\n  case  7: k1 ^= ((uint64_t)tail[ 6]) << 48;\n  case  6: k1 ^= ((uint64_t)tail[ 5]) << 40;\n  case  5: k1 ^= ((uint64_t)tail[ 4]) << 32;\n  case  4: k1 ^= ((uint64_t)tail[ 3]) << 24;\n  case  3: k1 ^= ((uint64_t)tail[ 2]) << 16;\n  case  2: k1 ^= ((uint64_t)tail[ 1]) << 8;\n  case  1: k1 ^= ((uint64_t)tail[ 0]) << 0;\n           k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;\n  };\n\n  
//----------\n  // finalization\n\n  h1 ^= len; h2 ^= len;\n\n  h1 += h2;\n  h2 += h1;\n\n  h1 = fmix64(h1);\n  h2 = fmix64(h2);\n\n  h1 += h2;\n  h2 += h1;\n\n  ((uint64_t*)out)[0] = h1;\n  ((uint64_t*)out)[1] = h2;\n}\n\n//-----------------------------------------------------------------------------\n\n"
  },
  {
    "path": "third_party/MurmurHash3.h",
    "content": "//-----------------------------------------------------------------------------\n// MurmurHash3 was written by Austin Appleby, and is placed in the public\n// domain. The author hereby disclaims copyright to this source code.\n\n#ifndef _MURMURHASH3_H_\n#define _MURMURHASH3_H_\n\n//-----------------------------------------------------------------------------\n// Platform-specific functions and macros\n\n// Microsoft Visual Studio\n\n#if defined(_MSC_VER) && (_MSC_VER < 1600)\n\ntypedef unsigned char uint8_t;\ntypedef unsigned int uint32_t;\ntypedef unsigned __int64 uint64_t;\n\n// Other compilers\n\n#else\t// defined(_MSC_VER)\n\n#include <stdint.h>\n\n#endif // !defined(_MSC_VER)\n\n//-----------------------------------------------------------------------------\n\nvoid MurmurHash3_x86_32  ( const void * key, int len, uint32_t seed, void * out );\n\nvoid MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );\n\nvoid MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );\n\n//-----------------------------------------------------------------------------\n\n#endif // _MURMURHASH3_H_\n"
  },
  {
    "path": "third_party/cpuid.h",
    "content": "/*\n * Copyright (C) 2007, 2008, 2009 Free Software Foundation, Inc.\n *\n * This file is free software; you can redistribute it and/or modify it\n * under the terms of the GNU General Public License as published by the\n * Free Software Foundation; either version 3, or (at your option) any\n * later version.\n * \n * This file is distributed in the hope that it will be useful, but\n * WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU\n * General Public License for more details.\n * \n * Under Section 7 of GPL version 3, you are granted additional\n * permissions described in the GCC Runtime Library Exception, version\n * 3.1, as published by the Free Software Foundation.\n * \n * You should have received a copy of the GNU General Public License and\n * a copy of the GCC Runtime Library Exception along with this program;\n * see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see\n * <http://www.gnu.org/licenses/>.\n */\n\n/* %ecx */\n#define bit_SSE3\t(1 << 0)\n#define bit_PCLMUL\t(1 << 1)\n#define bit_SSSE3\t(1 << 9)\n#define bit_FMA\t\t(1 << 12)\n#define bit_CMPXCHG16B\t(1 << 13)\n#define bit_SSE4_1\t(1 << 19)\n#define bit_SSE4_2\t(1 << 20)\n#define bit_MOVBE\t(1 << 22)\n#define bit_POPCNT\t(1 << 23)\n#define bit_AES\t\t(1 << 25)\n#define bit_XSAVE\t(1 << 26)\n#define bit_OSXSAVE\t(1 << 27)\n#define bit_AVX\t\t(1 << 28)\n#define bit_F16C\t(1 << 29)\n#define bit_RDRND\t(1 << 30)\n\n/* %edx */\n#define bit_CMPXCHG8B\t(1 << 8)\n#define bit_CMOV\t(1 << 15)\n#define bit_MMX\t\t(1 << 23)\n#define bit_FXSAVE\t(1 << 24)\n#define bit_SSE\t\t(1 << 25)\n#define bit_SSE2\t(1 << 26)\n\n/* Extended Features */\n/* %ecx */\n#define bit_LAHF_LM\t(1 << 0)\n#define bit_ABM\t\t(1 << 5)\n#define bit_SSE4a\t(1 << 6)\n#define bit_XOP         (1 << 11)\n#define bit_LWP \t(1 << 15)\n#define bit_FMA4        (1 << 16)\n#define bit_TBM         (1 << 21)\n\n/* %edx */\n#define 
bit_LM\t\t(1 << 29)\n#define bit_3DNOWP\t(1 << 30)\n#define bit_3DNOW\t(1 << 31)\n\n/* Extended Features (%eax == 7) */\n#define bit_FSGSBASE\t(1 << 0)\n#define bit_BMI\t\t(1 << 3)\n\n#if defined(__i386__) && defined(__PIC__)\n/* %ebx may be the PIC register.  */\n#if __GNUC__ >= 3\n#define __cpuid(level, a, b, c, d)\t\t\t\\\n  __asm__ (\"xchg{l}\\t{%%}ebx, %1\\n\\t\"\t\t\t\\\n\t   \"cpuid\\n\\t\"\t\t\t\t\t\\\n\t   \"xchg{l}\\t{%%}ebx, %1\\n\\t\"\t\t\t\\\n\t   : \"=a\" (a), \"=r\" (b), \"=c\" (c), \"=d\" (d)\t\\\n\t   : \"0\" (level))\n\n#define __cpuid_count(level, count, a, b, c, d)\t\t\\\n  __asm__ (\"xchg{l}\\t{%%}ebx, %1\\n\\t\"\t\t\t\\\n\t   \"cpuid\\n\\t\"\t\t\t\t\t\\\n\t   \"xchg{l}\\t{%%}ebx, %1\\n\\t\"\t\t\t\\\n\t   : \"=a\" (a), \"=r\" (b), \"=c\" (c), \"=d\" (d)\t\\\n\t   : \"0\" (level), \"2\" (count))\n#else\n/* Host GCCs older than 3.0 weren't supporting Intel asm syntax\n   nor alternatives in i386 code.  */\n#define __cpuid(level, a, b, c, d)\t\t\t\\\n  __asm__ (\"xchgl\\t%%ebx, %1\\n\\t\"\t\t\t\\\n\t   \"cpuid\\n\\t\"\t\t\t\t\t\\\n\t   \"xchgl\\t%%ebx, %1\\n\\t\"\t\t\t\\\n\t   : \"=a\" (a), \"=r\" (b), \"=c\" (c), \"=d\" (d)\t\\\n\t   : \"0\" (level))\n\n#define __cpuid_count(level, count, a, b, c, d)\t\t\\\n  __asm__ (\"xchgl\\t%%ebx, %1\\n\\t\"\t\t\t\\\n\t   \"cpuid\\n\\t\"\t\t\t\t\t\\\n\t   \"xchgl\\t%%ebx, %1\\n\\t\"\t\t\t\\\n\t   : \"=a\" (a), \"=r\" (b), \"=c\" (c), \"=d\" (d)\t\\\n\t   : \"0\" (level), \"2\" (count))\n#endif\n#else\n#define __cpuid(level, a, b, c, d)\t\t\t\\\n  __asm__ (\"cpuid\\n\\t\"\t\t\t\t\t\\\n\t   : \"=a\" (a), \"=b\" (b), \"=c\" (c), \"=d\" (d)\t\\\n\t   : \"0\" (level))\n\n#define __cpuid_count(level, count, a, b, c, d)\t\t\\\n  __asm__ (\"cpuid\\n\\t\"\t\t\t\t\t\\\n\t   : \"=a\" (a), \"=b\" (b), \"=c\" (c), \"=d\" (d)\t\\\n\t   : \"0\" (level), \"2\" (count))\n#endif\n\n/* Return highest supported input value for cpuid instruction.  
ext can\n   be either 0x0 or 0x80000000 to return highest supported value for\n   basic or extended cpuid information.  Function returns 0 if cpuid\n   is not supported or whatever cpuid returns in eax register.  If sig\n   pointer is non-null, then first four bytes of the signature\n   (as found in ebx register) are returned in location pointed by sig.  */\n\nstatic __inline unsigned int\n__get_cpuid_max (unsigned int __ext, unsigned int *__sig)\n{\n  unsigned int __eax, __ebx, __ecx, __edx;\n\n#ifndef __x86_64__\n#if __GNUC__ >= 3\n  /* See if we can use cpuid.  On AMD64 we always can.  */\n  __asm__ (\"pushf{l|d}\\n\\t\"\n\t   \"pushf{l|d}\\n\\t\"\n\t   \"pop{l}\\t%0\\n\\t\"\n\t   \"mov{l}\\t{%0, %1|%1, %0}\\n\\t\"\n\t   \"xor{l}\\t{%2, %0|%0, %2}\\n\\t\"\n\t   \"push{l}\\t%0\\n\\t\"\n\t   \"popf{l|d}\\n\\t\"\n\t   \"pushf{l|d}\\n\\t\"\n\t   \"pop{l}\\t%0\\n\\t\"\n\t   \"popf{l|d}\\n\\t\"\n\t   : \"=&r\" (__eax), \"=&r\" (__ebx)\n\t   : \"i\" (0x00200000));\n#else\n/* Host GCCs older than 3.0 weren't supporting Intel asm syntax\n   nor alternatives in i386 code.  */\n  __asm__ (\"pushfl\\n\\t\"\n\t   \"pushfl\\n\\t\"\n\t   \"popl\\t%0\\n\\t\"\n\t   \"movl\\t%0, %1\\n\\t\"\n\t   \"xorl\\t%2, %0\\n\\t\"\n\t   \"pushl\\t%0\\n\\t\"\n\t   \"popfl\\n\\t\"\n\t   \"pushfl\\n\\t\"\n\t   \"popl\\t%0\\n\\t\"\n\t   \"popfl\\n\\t\"\n\t   : \"=&r\" (__eax), \"=&r\" (__ebx)\n\t   : \"i\" (0x00200000));\n#endif\n\n  if (!((__eax ^ __ebx) & 0x00200000))\n    return 0;\n#endif\n\n  /* Host supports cpuid.  Return highest supported cpuid input value.  */\n  __cpuid (__ext, __eax, __ebx, __ecx, __edx);\n\n  if (__sig)\n    *__sig = __ebx;\n\n  return __eax;\n}\n\n/* Return cpuid data for requested cpuid level, as found in returned\n   eax, ebx, ecx and edx registers.  The function checks if cpuid is\n   supported and returns 1 for valid cpuid information or 0 for\n   unsupported cpuid level.  All pointers are required to be non-null.  
*/\n\nstatic __inline int\n__get_cpuid (unsigned int __level,\n\t     unsigned int *__eax, unsigned int *__ebx,\n\t     unsigned int *__ecx, unsigned int *__edx)\n{\n  unsigned int __ext = __level & 0x80000000;\n\n  if (__get_cpuid_max (__ext, 0) < __level)\n    return 0;\n\n  __cpuid (__level, *__eax, *__ebx, *__ecx, *__edx);\n  return 1;\n}\n"
  },
  {
    "path": "threading.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef THREADING_H_\n#define THREADING_H_\n\n#include <iostream>\n#include \"tinythread.h\"\n#include \"fast_mutex.h\"\n\n#ifdef NO_SPINLOCK\n#   define MUTEX_T tthread::mutex\n#else\n#  \tdefine MUTEX_T tthread::fast_mutex\n#endif /* NO_SPINLOCK */\n\n\n/**\n * Wrap a lock; obtain lock upon construction, release upon destruction.\n */\nclass ThreadSafe {\npublic:\n    ThreadSafe(MUTEX_T* ptr_mutex, bool locked = true) {\n\t\tif(locked) {\n\t\t    this->ptr_mutex = ptr_mutex;\n\t\t    ptr_mutex->lock();\n\t\t}\n\t\telse\n\t\t    this->ptr_mutex = NULL;\n\t}\n\n\t~ThreadSafe() {\n\t    if (ptr_mutex != NULL)\n\t        ptr_mutex->unlock();\n\t}\n    \nprivate:\n\tMUTEX_T *ptr_mutex;\n};\n\n#endif\n"
  },
  {
    "path": "timer.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef TIMER_H_\n#define TIMER_H_\n\n#include <ctime>\n#include <iostream>\n#include <sstream>\n#include <iomanip>\n\nusing namespace std;\n\n/**\n * Use time() call to keep track of elapsed time between creation and\n * destruction.  
If verbose is true, Timer will print a message showing\n * elapsed time to the given output stream upon destruction.\n */\nclass Timer {\npublic:\n\tTimer(ostream& out = cout, const char *msg = \"\", bool verbose = true) :\n\t\t_t(time(0)), _out(out), _msg(msg), _verbose(verbose) { }\n\n\t/// Optionally print message\n\t~Timer() {\n\t\tif(_verbose) write(_out);\n\t}\n\t\n\t/// Return elapsed time since Timer object was created\n\ttime_t elapsed() const {\n\t\treturn time(0) - _t;\n\t}\n\t\n\tvoid write(ostream& out) {\n\t\ttime_t passed = elapsed();\n\t\t// Print the message supplied at construction time followed\n\t\t// by time elapsed formatted HH:MM:SS \n\t\ttime_t hours   = (passed / 60) / 60;\n\t\ttime_t minutes = (passed / 60) % 60;\n\t\ttime_t seconds = (passed % 60);\n\t\tstd::ostringstream oss;\n\t\toss << _msg << setfill ('0') << setw (2) << hours << \":\"\n\t\t           << setfill ('0') << setw (2) << minutes << \":\"\n\t\t           << setfill ('0') << setw (2) << seconds << endl;\n\t\tout << oss.str().c_str();\n\t}\n\t\nprivate:\n\ttime_t      _t;\n\tostream&    _out;\n\tconst char *_msg;\n\tbool        _verbose;\n};\n\nstatic inline void logTime(std::ostream& os, bool nl = true) {\n\tstruct tm *current;\n\ttime_t now;\n\ttime(&now);\n\tcurrent = localtime(&now);\n\tstd::ostringstream oss;\n\toss << setfill('0') << setw(2)\n\t    << current->tm_hour << \":\"\n\t    << setfill('0') << setw(2)\n\t    << current->tm_min << \":\"\n\t    << setfill('0') << setw(2)\n\t    << current->tm_sec;\n\tif(nl) oss << std::endl;\n\tos << oss.str().c_str();\n}\n\n#endif /*TIMER_H_*/\n"
  },
  {
    "path": "tinythread.cpp",
    "content": "/* -*- mode: c++; tab-width: 2; indent-tabs-mode: nil; -*-\nCopyright (c) 2010-2012 Marcus Geelnard\n\nThis software is provided 'as-is', without any express or implied\nwarranty. In no event will the authors be held liable for any damages\narising from the use of this software.\n\nPermission is granted to anyone to use this software for any purpose,\nincluding commercial applications, and to alter it and redistribute it\nfreely, subject to the following restrictions:\n\n    1. The origin of this software must not be misrepresented; you must not\n    claim that you wrote the original software. If you use this software\n    in a product, an acknowledgment in the product documentation would be\n    appreciated but is not required.\n\n    2. Altered source versions must be plainly marked as such, and must not be\n    misrepresented as being the original software.\n\n    3. This notice may not be removed or altered from any source\n    distribution.\n*/\n\n#include <exception>\n#include \"tinythread.h\"\n\n#if defined(_TTHREAD_POSIX_)\n  #include <unistd.h>\n  #include <map>\n#elif defined(_TTHREAD_WIN32_)\n  #include <process.h>\n#endif\n\n\nnamespace tthread {\n\n//------------------------------------------------------------------------------\n// condition_variable\n//------------------------------------------------------------------------------\n// NOTE 1: The Win32 implementation of the condition_variable class is based on\n// the corresponding implementation in GLFW, which in turn is based on a\n// description by Douglas C. 
Schmidt and Irfan Pyarali:\n// http://www.cs.wustl.edu/~schmidt/win32-cv-1.html\n//\n// NOTE 2: Windows Vista actually has native support for condition variables\n// (InitializeConditionVariable, WakeConditionVariable, etc), but we want to\n// be portable with pre-Vista Windows versions, so TinyThread++ does not use\n// Vista condition variables.\n//------------------------------------------------------------------------------\n\n#if defined(_TTHREAD_WIN32_)\n  #define _CONDITION_EVENT_ONE 0\n  #define _CONDITION_EVENT_ALL 1\n#endif\n\n#if defined(_TTHREAD_WIN32_)\ncondition_variable::condition_variable() : mWaitersCount(0)\n{\n  mEvents[_CONDITION_EVENT_ONE] = CreateEvent(NULL, FALSE, FALSE, NULL);\n  mEvents[_CONDITION_EVENT_ALL] = CreateEvent(NULL, TRUE, FALSE, NULL);\n  InitializeCriticalSection(&mWaitersCountLock);\n}\n#endif\n\n#if defined(_TTHREAD_WIN32_)\ncondition_variable::~condition_variable()\n{\n  CloseHandle(mEvents[_CONDITION_EVENT_ONE]);\n  CloseHandle(mEvents[_CONDITION_EVENT_ALL]);\n  DeleteCriticalSection(&mWaitersCountLock);\n}\n#endif\n\n#if defined(_TTHREAD_WIN32_)\nvoid condition_variable::_wait()\n{\n  // Wait for either event to become signaled due to notify_one() or\n  // notify_all() being called\n  int result = WaitForMultipleObjects(2, mEvents, FALSE, INFINITE);\n\n  // Check if we are the last waiter\n  EnterCriticalSection(&mWaitersCountLock);\n  -- mWaitersCount;\n  bool lastWaiter = (result == (WAIT_OBJECT_0 + _CONDITION_EVENT_ALL)) &&\n                    (mWaitersCount == 0);\n  LeaveCriticalSection(&mWaitersCountLock);\n\n  // If we are the last waiter to be notified to stop waiting, reset the event\n  if(lastWaiter)\n    ResetEvent(mEvents[_CONDITION_EVENT_ALL]);\n}\n#endif\n\n#if defined(_TTHREAD_WIN32_)\nvoid condition_variable::notify_one()\n{\n  // Are there any waiters?\n  EnterCriticalSection(&mWaitersCountLock);\n  bool haveWaiters = (mWaitersCount > 0);\n  LeaveCriticalSection(&mWaitersCountLock);\n\n  // If we have any 
waiting threads, send them a signal\n  if(haveWaiters)\n    SetEvent(mEvents[_CONDITION_EVENT_ONE]);\n}\n#endif\n\n#if defined(_TTHREAD_WIN32_)\nvoid condition_variable::notify_all()\n{\n  // Are there any waiters?\n  EnterCriticalSection(&mWaitersCountLock);\n  bool haveWaiters = (mWaitersCount > 0);\n  LeaveCriticalSection(&mWaitersCountLock);\n\n  // If we have any waiting threads, send them a signal\n  if(haveWaiters)\n    SetEvent(mEvents[_CONDITION_EVENT_ALL]);\n}\n#endif\n\n\n//------------------------------------------------------------------------------\n// POSIX pthread_t to unique thread::id mapping logic.\n// Note: Here we use a global thread safe std::map to convert instances of\n// pthread_t to small thread identifier numbers (unique within one process).\n// This method should be portable across different POSIX implementations.\n//------------------------------------------------------------------------------\n\n#if defined(_TTHREAD_POSIX_)\nstatic thread::id _pthread_t_to_ID(const pthread_t &aHandle)\n{\n  static mutex idMapLock;\n  static std::map<pthread_t, unsigned long int> idMap;\n  static unsigned long int idCount(1);\n\n  lock_guard<mutex> guard(idMapLock);\n  if(idMap.find(aHandle) == idMap.end())\n    idMap[aHandle] = idCount ++;\n  return thread::id(idMap[aHandle]);\n}\n#endif // _TTHREAD_POSIX_\n\n\n//------------------------------------------------------------------------------\n// thread\n//------------------------------------------------------------------------------\n\n/// Information to pass to the new thread (what to run).\nstruct _thread_start_info {\n  void (*mFunction)(void *); ///< Pointer to the function to be executed.\n  void * mArg;               ///< Function argument for the thread function.\n  thread * mThread;          ///< Pointer to the thread object.\n};\n\n// Thread wrapper function.\n#if defined(_TTHREAD_WIN32_)\nunsigned WINAPI thread::wrapper_function(void * aArg)\n#elif defined(_TTHREAD_POSIX_)\nvoid * 
thread::wrapper_function(void * aArg)\n#endif\n{\n  // Get thread startup information\n  _thread_start_info * ti = (_thread_start_info *) aArg;\n\n  try\n  {\n    // Call the actual client thread function\n    ti->mFunction(ti->mArg);\n  }\n  catch(...)\n  {\n    // Uncaught exceptions will terminate the application (default behavior\n    // according to C++11)\n    std::terminate();\n  }\n\n  // The thread is no longer executing\n  lock_guard<mutex> guard(ti->mThread->mDataMutex);\n  ti->mThread->mNotAThread = true;\n\n  // The thread is responsible for freeing the startup information\n  delete ti;\n\n  return 0;\n}\n\nthread::thread(void (*aFunction)(void *), void * aArg)\n{\n  // Serialize access to this thread structure\n  lock_guard<mutex> guard(mDataMutex);\n\n  // Fill out the thread startup information (passed to the thread wrapper,\n  // which will eventually free it)\n  _thread_start_info * ti = new _thread_start_info;\n  ti->mFunction = aFunction;\n  ti->mArg = aArg;\n  ti->mThread = this;\n\n  // The thread is now alive\n  mNotAThread = false;\n\n  // Create the thread\n#if defined(_TTHREAD_WIN32_)\n  mHandle = (HANDLE) _beginthreadex(0, 0, wrapper_function, (void *) ti, 0, &mWin32ThreadID);\n#elif defined(_TTHREAD_POSIX_)\n  if(pthread_create(&mHandle, NULL, wrapper_function, (void *) ti) != 0)\n    mHandle = 0;\n#endif\n\n  // Did we fail to create the thread?\n  if(!mHandle)\n  {\n    mNotAThread = true;\n    delete ti;\n  }\n}\n\nthread::~thread()\n{\n  if(joinable())\n    std::terminate();\n}\n\nvoid thread::join()\n{\n  if(joinable())\n  {\n#if defined(_TTHREAD_WIN32_)\n    WaitForSingleObject(mHandle, INFINITE);\n    CloseHandle(mHandle);\n#elif defined(_TTHREAD_POSIX_)\n    pthread_join(mHandle, NULL);\n#endif\n  }\n}\n\nbool thread::joinable() const\n{\n  mDataMutex.lock();\n  bool result = !mNotAThread;\n  mDataMutex.unlock();\n  return result;\n}\n\nvoid thread::detach()\n{\n  mDataMutex.lock();\n  if(!mNotAThread)\n  {\n#if 
defined(_TTHREAD_WIN32_)\n    CloseHandle(mHandle);\n#elif defined(_TTHREAD_POSIX_)\n    pthread_detach(mHandle);\n#endif\n    mNotAThread = true;\n  }\n  mDataMutex.unlock();\n}\n\nthread::id thread::get_id() const\n{\n  if(!joinable())\n    return id();\n#if defined(_TTHREAD_WIN32_)\n  return id((unsigned long int) mWin32ThreadID);\n#elif defined(_TTHREAD_POSIX_)\n  return _pthread_t_to_ID(mHandle);\n#endif\n}\n\nunsigned thread::hardware_concurrency()\n{\n#if defined(_TTHREAD_WIN32_)\n  SYSTEM_INFO si;\n  GetSystemInfo(&si);\n  return (int) si.dwNumberOfProcessors;\n#elif defined(_SC_NPROCESSORS_ONLN)\n  return (int) sysconf(_SC_NPROCESSORS_ONLN);\n#elif defined(_SC_NPROC_ONLN)\n  return (int) sysconf(_SC_NPROC_ONLN);\n#else\n  // The standard requires this function to return zero if the number of\n  // hardware cores could not be determined.\n  return 0;\n#endif\n}\n\n\n//------------------------------------------------------------------------------\n// this_thread\n//------------------------------------------------------------------------------\n\nthread::id this_thread::get_id()\n{\n#if defined(_TTHREAD_WIN32_)\n  return thread::id((unsigned long int) GetCurrentThreadId());\n#elif defined(_TTHREAD_POSIX_)\n  return _pthread_t_to_ID(pthread_self());\n#endif\n}\n\n}\n"
  },
  {
    "path": "tinythread.h",
    "content": "/* -*- mode: c++; tab-width: 2; indent-tabs-mode: nil; -*-\nCopyright (c) 2010-2012 Marcus Geelnard\n\nThis software is provided 'as-is', without any express or implied\nwarranty. In no event will the authors be held liable for any damages\narising from the use of this software.\n\nPermission is granted to anyone to use this software for any purpose,\nincluding commercial applications, and to alter it and redistribute it\nfreely, subject to the following restrictions:\n\n    1. The origin of this software must not be misrepresented; you must not\n    claim that you wrote the original software. If you use this software\n    in a product, an acknowledgment in the product documentation would be\n    appreciated but is not required.\n\n    2. Altered source versions must be plainly marked as such, and must not be\n    misrepresented as being the original software.\n\n    3. This notice may not be removed or altered from any source\n    distribution.\n*/\n\n#ifndef _TINYTHREAD_H_\n#define _TINYTHREAD_H_\n\n/// @file\n/// @mainpage TinyThread++ API Reference\n///\n/// @section intro_sec Introduction\n/// TinyThread++ is a minimal, portable implementation of basic threading\n/// classes for C++.\n///\n/// They closely mimic the functionality and naming of the C++11 standard, and\n/// should be easily replaceable with the corresponding std:: variants.\n///\n/// @section port_sec Portability\n/// The Win32 variant uses the native Win32 API for implementing the thread\n/// classes, while for other systems, the POSIX threads API (pthread) is used.\n///\n/// @section class_sec Classes\n/// In order to mimic the threading API of the C++11 standard, subsets of\n/// several classes are provided. 
The fundamental classes are:\n/// @li tthread::thread\n/// @li tthread::mutex\n/// @li tthread::recursive_mutex\n/// @li tthread::condition_variable\n/// @li tthread::lock_guard\n/// @li tthread::fast_mutex\n///\n/// @section misc_sec Miscellaneous\n/// The following special keywords are available: #thread_local.\n///\n/// For more detailed information (including additional classes), browse the\n/// different sections of this documentation. A good place to start is:\n/// tinythread.h.\n\n// Which platform are we on?\n#if !defined(_TTHREAD_PLATFORM_DEFINED_)\n  #if defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)\n    #define _TTHREAD_WIN32_\n  #else\n    #define _TTHREAD_POSIX_\n  #endif\n  #define _TTHREAD_PLATFORM_DEFINED_\n#endif\n\n// Platform specific includes\n#if defined(_TTHREAD_WIN32_)\n  #ifndef WIN32_LEAN_AND_MEAN\n    #define WIN32_LEAN_AND_MEAN\n    #define __UNDEF_LEAN_AND_MEAN\n  #endif\n  #include <windows.h>\n  #ifdef __UNDEF_LEAN_AND_MEAN\n    #undef WIN32_LEAN_AND_MEAN\n    #undef __UNDEF_LEAN_AND_MEAN\n  #endif\n#else\n  #include <pthread.h>\n  #include <signal.h>\n  #include <sched.h>\n  #include <unistd.h>\n#endif\n\n// Generic includes\n#include <ostream>\n\n/// TinyThread++ version (major number).\n#define TINYTHREAD_VERSION_MAJOR 1\n/// TinyThread++ version (minor number).\n#define TINYTHREAD_VERSION_MINOR 1\n/// TinyThread++ version (full version).\n#define TINYTHREAD_VERSION (TINYTHREAD_VERSION_MAJOR * 100 + TINYTHREAD_VERSION_MINOR)\n\n// Do we have a fully featured C++11 compiler?\n#if (__cplusplus > 199711L) || (defined(__STDCXX_VERSION__) && (__STDCXX_VERSION__ >= 201001L))\n  #define _TTHREAD_CPP11_\n#endif\n\n// ...at least partial C++11?\n#if defined(_TTHREAD_CPP11_) || defined(__GXX_EXPERIMENTAL_CXX0X__) || defined(__GXX_EXPERIMENTAL_CPP0X__)\n  #define _TTHREAD_CPP11_PARTIAL_\n#endif\n\n// Macro for disabling assignments of objects.\n#ifdef _TTHREAD_CPP11_PARTIAL_\n  #define _TTHREAD_DISABLE_ASSIGNMENT(name) \\\n     
 name(const name&) = delete; \\\n      name& operator=(const name&) = delete;\n#else\n  #define _TTHREAD_DISABLE_ASSIGNMENT(name) \\\n      name(const name&); \\\n      name& operator=(const name&);\n#endif\n\n/// @def thread_local\n/// Thread local storage keyword.\n/// A variable that is declared with the @c thread_local keyword makes the\n/// value of the variable local to each thread (known as thread-local storage,\n/// or TLS). Example usage:\n/// @code\n/// // This variable is local to each thread.\n/// thread_local int variable;\n/// @endcode\n/// @note The @c thread_local keyword is a macro that maps to the corresponding\n/// compiler directive (e.g. @c __declspec(thread)). While the C++11 standard\n/// allows for non-trivial types (e.g. classes with constructors and\n/// destructors) to be declared with the @c thread_local keyword, most pre-C++11\n/// compilers only allow for trivial types (e.g. @c int). So, to guarantee\n/// portable code, only use trivial types for thread local storage.\n/// @note This directive is currently not supported on Mac OS X (it will give\n/// a compiler error), since compile-time TLS is not supported in the Mac OS X\n/// executable format. Also, some older versions of MinGW (before GCC 4.x) do\n/// not support this directive.\n/// @hideinitializer\n\n#if !defined(_TTHREAD_CPP11_) && !defined(thread_local)\n #if defined(__GNUC__) || defined(__INTEL_COMPILER) || defined(__SUNPRO_CC) || defined(__IBMCPP__)\n  #define thread_local __thread\n #else\n  #define thread_local __declspec(thread)\n #endif\n#endif\n\n\n/// Main name space for TinyThread++.\n/// This namespace is more or less equivalent to the @c std namespace for the\n/// C++11 thread classes. For instance, the tthread::mutex class corresponds to\n/// the std::mutex class.\nnamespace tthread {\n\n/// Mutex class.\n/// This is a mutual exclusion object for synchronizing access to shared\n/// memory areas for several threads. The mutex is non-recursive (i.e. 
a\n/// program may deadlock if the thread that owns a mutex object calls lock()\n/// on that object).\n/// @see recursive_mutex\nclass mutex {\n  public:\n    /// Constructor.\n    mutex()\n#if defined(_TTHREAD_WIN32_)\n      : mAlreadyLocked(false)\n#endif\n    {\n#if defined(_TTHREAD_WIN32_)\n      InitializeCriticalSection(&mHandle);\n#else\n      pthread_mutex_init(&mHandle, NULL);\n#endif\n    }\n\n    /// Destructor.\n    ~mutex()\n    {\n#if defined(_TTHREAD_WIN32_)\n      DeleteCriticalSection(&mHandle);\n#else\n      pthread_mutex_destroy(&mHandle);\n#endif\n    }\n\n    /// Lock the mutex.\n    /// The method will block the calling thread until a lock on the mutex can\n    /// be obtained. The mutex remains locked until @c unlock() is called.\n    /// @see lock_guard\n    inline void lock()\n    {\n#if defined(_TTHREAD_WIN32_)\n      EnterCriticalSection(&mHandle);\n      while(mAlreadyLocked) Sleep(1000); // Simulate deadlock...\n      mAlreadyLocked = true;\n#else\n      pthread_mutex_lock(&mHandle);\n#endif\n    }\n\n    /// Try to lock the mutex.\n    /// The method will try to lock the mutex. If it fails, the function will\n    /// return immediately (non-blocking).\n    /// @return @c true if the lock was acquired, or @c false if the lock could\n    /// not be acquired.\n    inline bool try_lock()\n    {\n#if defined(_TTHREAD_WIN32_)\n      bool ret = (TryEnterCriticalSection(&mHandle) ? true : false);\n      if(ret && mAlreadyLocked)\n      {\n        LeaveCriticalSection(&mHandle);\n        ret = false;\n      }\n      return ret;\n#else\n      return (pthread_mutex_trylock(&mHandle) == 0) ? 
true : false;\n#endif\n    }\n\n    /// Unlock the mutex.\n    /// If any threads are waiting for the lock on this mutex, one of them will\n    /// be unblocked.\n    inline void unlock()\n    {\n#if defined(_TTHREAD_WIN32_)\n      mAlreadyLocked = false;\n      LeaveCriticalSection(&mHandle);\n#else\n      pthread_mutex_unlock(&mHandle);\n#endif\n    }\n\n    _TTHREAD_DISABLE_ASSIGNMENT(mutex)\n\n  private:\n#if defined(_TTHREAD_WIN32_)\n    CRITICAL_SECTION mHandle;\n    bool mAlreadyLocked;\n#else\n    pthread_mutex_t mHandle;\n#endif\n\n    friend class condition_variable;\n};\n\n/// Recursive mutex class.\n/// This is a mutual exclusion object for synchronizing access to shared\n/// memory areas for several threads. The mutex is recursive (i.e. a thread\n/// may lock the mutex several times, as long as it unlocks the mutex the same\n/// number of times).\n/// @see mutex\nclass recursive_mutex {\n  public:\n    /// Constructor.\n    recursive_mutex()\n    {\n#if defined(_TTHREAD_WIN32_)\n      InitializeCriticalSection(&mHandle);\n#else\n      pthread_mutexattr_t attr;\n      pthread_mutexattr_init(&attr);\n      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);\n      pthread_mutex_init(&mHandle, &attr);\n#endif\n    }\n\n    /// Destructor.\n    ~recursive_mutex()\n    {\n#if defined(_TTHREAD_WIN32_)\n      DeleteCriticalSection(&mHandle);\n#else\n      pthread_mutex_destroy(&mHandle);\n#endif\n    }\n\n    /// Lock the mutex.\n    /// The method will block the calling thread until a lock on the mutex can\n    /// be obtained. The mutex remains locked until @c unlock() is called.\n    /// @see lock_guard\n    inline void lock()\n    {\n#if defined(_TTHREAD_WIN32_)\n      EnterCriticalSection(&mHandle);\n#else\n      pthread_mutex_lock(&mHandle);\n#endif\n    }\n\n    /// Try to lock the mutex.\n    /// The method will try to lock the mutex. 
If it fails, the function will\n    /// return immediately (non-blocking).\n    /// @return @c true if the lock was acquired, or @c false if the lock could\n    /// not be acquired.\n    inline bool try_lock()\n    {\n#if defined(_TTHREAD_WIN32_)\n      return TryEnterCriticalSection(&mHandle) ? true : false;\n#else\n      return (pthread_mutex_trylock(&mHandle) == 0) ? true : false;\n#endif\n    }\n\n    /// Unlock the mutex.\n    /// If any threads are waiting for the lock on this mutex, one of them will\n    /// be unblocked.\n    inline void unlock()\n    {\n#if defined(_TTHREAD_WIN32_)\n      LeaveCriticalSection(&mHandle);\n#else\n      pthread_mutex_unlock(&mHandle);\n#endif\n    }\n\n    _TTHREAD_DISABLE_ASSIGNMENT(recursive_mutex)\n\n  private:\n#if defined(_TTHREAD_WIN32_)\n    CRITICAL_SECTION mHandle;\n#else\n    pthread_mutex_t mHandle;\n#endif\n\n    friend class condition_variable;\n};\n\n/// Lock guard class.\n/// The constructor locks the mutex, and the destructor unlocks the mutex, so\n/// the mutex will automatically be unlocked when the lock guard goes out of\n/// scope. Example usage:\n/// @code\n/// mutex m;\n/// int counter;\n///\n/// void increment()\n/// {\n///   lock_guard<mutex> guard(m);\n///   ++ counter;\n/// }\n/// @endcode\n\ntemplate <class T>\nclass lock_guard {\n  public:\n    typedef T mutex_type;\n\n    lock_guard() : mMutex(0) {}\n\n    /// The constructor locks the mutex.\n    explicit lock_guard(mutex_type &aMutex)\n    {\n      mMutex = &aMutex;\n      mMutex->lock();\n    }\n\n    /// The destructor unlocks the mutex.\n    ~lock_guard()\n    {\n      if(mMutex)\n        mMutex->unlock();\n    }\n\n  private:\n    mutex_type * mMutex;\n};\n\n/// Condition variable class.\n/// This is a signalling object for synchronizing the execution flow for\n/// several threads. 
Example usage:\n/// @code\n/// // Shared data and associated mutex and condition variable objects\n/// int count;\n/// mutex m;\n/// condition_variable cond;\n///\n/// // Wait for the counter to reach a certain number\n/// void wait_counter(int targetCount)\n/// {\n///   lock_guard<mutex> guard(m);\n///   while(count < targetCount)\n///     cond.wait(m);\n/// }\n///\n/// // Increment the counter, and notify waiting threads\n/// void increment()\n/// {\n///   lock_guard<mutex> guard(m);\n///   ++ count;\n///   cond.notify_all();\n/// }\n/// @endcode\nclass condition_variable {\n  public:\n    /// Constructor.\n#if defined(_TTHREAD_WIN32_)\n    condition_variable();\n#else\n    condition_variable()\n    {\n      pthread_cond_init(&mHandle, NULL);\n    }\n#endif\n\n    /// Destructor.\n#if defined(_TTHREAD_WIN32_)\n    ~condition_variable();\n#else\n    ~condition_variable()\n    {\n      pthread_cond_destroy(&mHandle);\n    }\n#endif\n\n    /// Wait for the condition.\n    /// The function will block the calling thread until the condition variable\n    /// is woken by @c notify_one(), @c notify_all() or a spurious wake up.\n    /// @param[in] aMutex A mutex that will be unlocked when the wait operation\n    ///   starts, and locked again as soon as the wait operation is finished.\n    template <class _mutexT>\n    inline void wait(_mutexT &aMutex)\n    {\n#if defined(_TTHREAD_WIN32_)\n      // Increment number of waiters\n      EnterCriticalSection(&mWaitersCountLock);\n      ++ mWaitersCount;\n      LeaveCriticalSection(&mWaitersCountLock);\n\n      // Release the mutex while waiting for the condition (will decrease\n      // the number of waiters when done)...\n      aMutex.unlock();\n      _wait();\n      aMutex.lock();\n#else\n      pthread_cond_wait(&mHandle, &aMutex.mHandle);\n#endif\n    }\n\n    /// Notify one thread that is waiting for the condition.\n    /// If at least one thread is blocked waiting for this condition variable,\n    /// one will be woken 
up.\n    /// @note Only threads that started waiting prior to this call will be\n    /// woken up.\n#if defined(_TTHREAD_WIN32_)\n    void notify_one();\n#else\n    inline void notify_one()\n    {\n      pthread_cond_signal(&mHandle);\n    }\n#endif\n\n    /// Notify all threads that are waiting for the condition.\n    /// All threads that are blocked waiting for this condition variable will\n    /// be woken up.\n    /// @note Only threads that started waiting prior to this call will be\n    /// woken up.\n#if defined(_TTHREAD_WIN32_)\n    void notify_all();\n#else\n    inline void notify_all()\n    {\n      pthread_cond_broadcast(&mHandle);\n    }\n#endif\n\n    _TTHREAD_DISABLE_ASSIGNMENT(condition_variable)\n\n  private:\n#if defined(_TTHREAD_WIN32_)\n    void _wait();\n    HANDLE mEvents[2];                  ///< Signal and broadcast event HANDLEs.\n    unsigned int mWaitersCount;         ///< Count of the number of waiters.\n    CRITICAL_SECTION mWaitersCountLock; ///< Serialize access to mWaitersCount.\n#else\n    pthread_cond_t mHandle;\n#endif\n};\n\n\n/// Thread class.\nclass thread {\n  public:\n#if defined(_TTHREAD_WIN32_)\n    typedef HANDLE native_handle_type;\n#else\n    typedef pthread_t native_handle_type;\n#endif\n\n    class id;\n\n    /// Default constructor.\n    /// Construct a @c thread object without an associated thread of execution\n    /// (i.e. non-joinable).\n    thread() : mHandle(0), mNotAThread(true)\n#if defined(_TTHREAD_WIN32_)\n    , mWin32ThreadID(0)\n#endif\n    {}\n\n    /// Thread starting constructor.\n    /// Construct a @c thread object with a new thread of execution.\n    /// @param[in] aFunction A function pointer to a function of type:\n    ///          <tt>void fun(void * arg)</tt>\n    /// @param[in] aArg Argument to the thread function.\n    /// @note This constructor is not fully compatible with the standard C++\n    /// thread class. 
It is more similar to the pthread_create() (POSIX) and\n    /// CreateThread() (Windows) functions.\n    thread(void (*aFunction)(void *), void * aArg);\n\n    /// Destructor.\n    /// @note If the thread is joinable upon destruction, @c std::terminate()\n    /// will be called, which terminates the process. It is always wise to do\n    /// @c join() before deleting a thread object.\n    ~thread();\n\n    /// Wait for the thread to finish (join execution flows).\n    /// After calling @c join(), the thread object is no longer associated with\n    /// a thread of execution (i.e. it is not joinable, and you may not join\n    /// with it nor detach from it).\n    void join();\n\n    /// Check if the thread is joinable.\n    /// A thread object is joinable if it has an associated thread of execution.\n    bool joinable() const;\n\n    /// Detach from the thread.\n    /// After calling @c detach(), the thread object is no longer associated with\n    /// a thread of execution (i.e. it is not joinable). 
The thread continues\n    /// execution without the calling thread blocking, and when the thread\n    /// ends execution, any owned resources are released.\n    void detach();\n\n    /// Return the thread ID of a thread object.\n    id get_id() const;\n\n    /// Get the native handle for this thread.\n    /// @note Under Windows, this is a @c HANDLE, and under POSIX systems, this\n    /// is a @c pthread_t.\n    inline native_handle_type native_handle()\n    {\n      return mHandle;\n    }\n\n    /// Determine the number of threads which can possibly execute concurrently.\n    /// This function is useful for determining the optimal number of threads to\n    /// use for a task.\n    /// @return The number of hardware thread contexts in the system.\n    /// @note If this value is not defined, the function returns zero (0).\n    static unsigned hardware_concurrency();\n\n    _TTHREAD_DISABLE_ASSIGNMENT(thread)\n\n  private:\n    native_handle_type mHandle;   ///< Thread handle.\n    mutable mutex mDataMutex;     ///< Serializer for access to the thread private data.\n    bool mNotAThread;             ///< True if this object is not a thread of execution.\n#if defined(_TTHREAD_WIN32_)\n    unsigned int mWin32ThreadID;  ///< Unique thread ID (filled out by _beginthreadex).\n#endif\n\n    // This is the internal thread wrapper function.\n#if defined(_TTHREAD_WIN32_)\n    static unsigned WINAPI wrapper_function(void * aArg);\n#else\n    static void * wrapper_function(void * aArg);\n#endif\n};\n\n/// Thread ID.\n/// The thread ID is a unique identifier for each thread.\n/// @see thread::get_id()\nclass thread::id {\n  public:\n    /// Default constructor.\n    /// The default constructed ID is that of thread without a thread of\n    /// execution.\n    id() : mId(0) {};\n\n    id(unsigned long int aId) : mId(aId) {};\n\n    id(const id& aId) : mId(aId.mId) {};\n\n    inline id & operator=(const id &aId)\n    {\n      mId = aId.mId;\n      return *this;\n    }\n\n    inline 
friend bool operator==(const id &aId1, const id &aId2)\n    {\n      return (aId1.mId == aId2.mId);\n    }\n\n    inline friend bool operator!=(const id &aId1, const id &aId2)\n    {\n      return (aId1.mId != aId2.mId);\n    }\n\n    inline friend bool operator<=(const id &aId1, const id &aId2)\n    {\n      return (aId1.mId <= aId2.mId);\n    }\n\n    inline friend bool operator<(const id &aId1, const id &aId2)\n    {\n      return (aId1.mId < aId2.mId);\n    }\n\n    inline friend bool operator>=(const id &aId1, const id &aId2)\n    {\n      return (aId1.mId >= aId2.mId);\n    }\n\n    inline friend bool operator>(const id &aId1, const id &aId2)\n    {\n      return (aId1.mId > aId2.mId);\n    }\n\n    inline friend std::ostream& operator <<(std::ostream &os, const id &obj)\n    {\n      os << obj.mId;\n      return os;\n    }\n\n  private:\n    unsigned long int mId;\n};\n\n\n// Related to <ratio> - minimal to be able to support chrono.\ntypedef long long __intmax_t;\n\n/// Minimal implementation of the @c ratio class. This class provides enough\n/// functionality to implement some basic @c chrono classes.\ntemplate <__intmax_t N, __intmax_t D = 1> class ratio {\n  public:\n    static double _as_double() { return double(N) / double(D); }\n};\n\n/// Minimal implementation of the @c chrono namespace.\n/// The @c chrono namespace provides types for specifying time intervals.\nnamespace chrono {\n  /// Duration template class. 
This class provides enough functionality to\n  /// implement @c this_thread::sleep_for().\n  template <class _Rep, class _Period = ratio<1> > class duration {\n    private:\n      _Rep rep_;\n    public:\n      typedef _Rep rep;\n      typedef _Period period;\n\n      /// Construct a duration object with the given duration.\n      template <class _Rep2>\n        explicit duration(const _Rep2& r) : rep_(r) {};\n\n      /// Return the value of the duration object.\n      rep count() const\n      {\n        return rep_;\n      }\n  };\n\n  // Standard duration types.\n  typedef duration<__intmax_t, ratio<1, 1000000000> > nanoseconds; ///< Duration with the unit nanoseconds.\n  typedef duration<__intmax_t, ratio<1, 1000000> > microseconds;   ///< Duration with the unit microseconds.\n  typedef duration<__intmax_t, ratio<1, 1000> > milliseconds;      ///< Duration with the unit milliseconds.\n  typedef duration<__intmax_t> seconds;                            ///< Duration with the unit seconds.\n  typedef duration<__intmax_t, ratio<60> > minutes;                ///< Duration with the unit minutes.\n  typedef duration<__intmax_t, ratio<3600> > hours;                ///< Duration with the unit hours.\n}\n\n/// The namespace @c this_thread provides methods for dealing with the\n/// calling thread.\nnamespace this_thread {\n  /// Return the thread ID of the calling thread.\n  thread::id get_id();\n\n  /// Yield execution to another thread.\n  /// Offers the operating system the opportunity to schedule another thread\n  /// that is ready to run on the current processor.\n  inline void yield()\n  {\n#if defined(_TTHREAD_WIN32_)\n    Sleep(0);\n#else\n    sched_yield();\n#endif\n  }\n\n  /// Blocks the calling thread for a period of time.\n  /// @param[in] aTime Minimum time to put the thread to sleep.\n  /// Example usage:\n  /// @code\n  /// // Sleep for 100 milliseconds\n  /// this_thread::sleep_for(chrono::milliseconds(100));\n  /// @endcode\n  /// @note Supported duration 
types are: nanoseconds, microseconds,\n  /// milliseconds, seconds, minutes and hours.\n  template <class _Rep, class _Period> void sleep_for(const chrono::duration<_Rep, _Period>& aTime)\n  {\n#if defined(_TTHREAD_WIN32_)\n    Sleep(int(double(aTime.count()) * (1000.0 * _Period::_as_double()) + 0.5));\n#else\n    usleep(int(double(aTime.count()) * (1000000.0 * _Period::_as_double()) + 0.5));\n#endif\n  }\n}\n\n}\n\n// Define/macro cleanup\n#undef _TTHREAD_DISABLE_ASSIGNMENT\n\n#endif // _TINYTHREAD_H_\n"
  },
  {
    "path": "tokenize.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef TOKENIZE_H_\n#define TOKENIZE_H_\n\n#include <string>\n#include <sstream>\n#include <limits>\n\nusing namespace std;\n\n/**\n * Split string s according to given delimiters.  Mostly borrowed\n * from C++ Programming HOWTO 7.3.\n */\ntemplate<typename T>\nstatic inline void tokenize(\n\tconst string& s,\n\tconst string& delims,\n\tT& ss,\n\tsize_t max = std::numeric_limits<size_t>::max())\n{\n\t//string::size_type lastPos = s.find_first_not_of(delims, 0);\n\tstring::size_type lastPos = 0;\n\tstring::size_type pos = s.find_first_of(delims, lastPos);\n\twhile (string::npos != pos || string::npos != lastPos) {\n\t\tss.push_back(s.substr(lastPos, pos - lastPos));\n\t\tlastPos = s.find_first_not_of(delims, pos);\n\t\tpos = s.find_first_of(delims, lastPos);\n\t\tif(ss.size() == (max - 1)) {\n\t\t\tpos = string::npos;\n\t\t}\n\t}\n}\n\ntemplate<typename T>\nstatic inline void tokenize(const std::string& s, char delim, T& ss) {\n\tstd::string token;\n\tstd::istringstream iss(s);\n\twhile(getline(iss, token, delim)) {\n\t\tss.push_back(token);\n\t}\n}\n\n#endif /*TOKENIZE_H_*/\n"
  },
  {
    "path": "util.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef UTIL_H_\n#define UTIL_H_\n\n#include <stdint.h>\n#include <stdlib.h>\n#include <algorithm>\n#include <limits>\n#include <map>\n#include <string>\n#include <sstream>\n\n/**\n * C++ version char* style \"itoa\": Convert integer to a base-10 string.\n * Returns a pointer to the terminating null character.\n */\ntemplate<typename T>\nchar* itoa10(const T& value, char* result) {\n\tchar* out = result;\n\tT quotient = value;\n\tif(std::numeric_limits<T>::is_signed) {\n\t\tif(quotient <= 0) quotient = -quotient;\n\t}\n\t// Write each digit from least to most significant; reversed below\n\tdo {\n\t\t*out = \"0123456789\"[quotient % 10];\n\t\t++out;\n\t\tquotient /= 10;\n\t} while (quotient > 0);\n\t// Append the sign for negative values; it comes first after the reverse\n\tif(std::numeric_limits<T>::is_signed) {\n\t\t// Avoid compiler warning in cases where T is unsigned\n\t\tif (value <= 0 && value != 0) *out++ = '-';\n\t}\n\tstd::reverse( result, out );\n\t*out = 0; // terminator\n\treturn out;\n}\n\n// extract numeric ID from the beginning of a string\ninline\nuint64_t extractIDFromRefName(const std::string& refName) {\n    uint64_t id = 0;\n    for (size_t ni = 0; ni < refName.length(); ni++) {\n        if (refName[ni] < '0' || refName[ni] > '9')\n            break;\n\n        id *= 10;\n        id += (refName[ni] - '0');\n    }\n    
return id;\n}\n\n// Converts a numeric value to std::string (std::to_string provides this since C++11)\ntemplate <typename T>\nstd::string to_string(T value) {\n std::ostringstream ss;\n ss << value;\n return ss.str();\n}\n\n/**\n * Return the value mapped to query in my_map, or default_value if\n * query is not present.\n */\ntemplate<typename K,typename V>\ninline\nV find_or_use_default(const std::map<K, V>& my_map, const K& query, const V default_value) {\n\ttypedef typename std::map<K,V>::const_iterator MapIterator;\n\tMapIterator itr = my_map.find(query);\n\n\tif (itr == my_map.end()) {\n\t\treturn default_value;\n\t}\n\n\treturn itr->second;\n}\n\n#endif /*ifndef UTIL_H_*/\n"
  },
  {
    "path": "word_io.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef WORD_IO_H_\n#define WORD_IO_H_\n\n#include <stdint.h>\n#include <stdio.h>\n#include <unistd.h>\n#include <iostream>\n#include <fstream>\n#include \"assert_helpers.h\"\n#include \"endian_swap.h\"\n\n/**\n * Write a 32-bit unsigned to an output stream being careful to\n * re-endianize if caller-requested endianness differs from current\n * host.\n */\nstatic inline void writeU32(std::ostream& out, uint32_t x, bool toBigEndian) {\n\tuint32_t y = endianizeU32(x, toBigEndian);\n\tout.write((const char*)&y, 4);\n}\n\n/**\n * Write a 32-bit unsigned to an output stream using the native\n * endianness.\n */\nstatic inline void writeU32(std::ostream& out, uint32_t x) {\n\tout.write((const char*)&x, 4);\n}\n\n/**\n * Write a 32-bit signed int to an output stream being careful to\n * re-endianize if caller-requested endianness differs from current\n * host.\n */\nstatic inline void writeI32(std::ostream& out, int32_t x, bool toBigEndian) {\n\tint32_t y = endianizeI32(x, toBigEndian);\n\tout.write((const char*)&y, 4);\n}\n\n/**\n * Write a 32-bit signed int to an output stream using the native\n * endianness.\n */\nstatic inline void writeI32(std::ostream& out, int32_t x) {\n\tout.write((const char*)&x, 4);\n}\n\n/**\n * Write a 16-bit 
unsigned to an output stream being careful to\n * re-endianize if caller-requested endianness differs from current\n * host.\n */\nstatic inline void writeU16(std::ostream& out, uint16_t x, bool toBigEndian) {\n\tuint16_t y = endianizeU16(x, toBigEndian);\n\tout.write((const char*)&y, 2);\n}\n\n/**\n * Write a 16-bit unsigned to an output stream using the native\n * endianness.\n */\nstatic inline void writeU16(std::ostream& out, uint16_t x) {\n\tout.write((const char*)&x, 2);\n}\n\n/**\n * Write a 16-bit signed int to an output stream being careful to\n * re-endianize if caller-requested endianness differs from current\n * host.\n */\nstatic inline void writeI16(std::ostream& out, int16_t x, bool toBigEndian) {\n\tint16_t y = endianizeI16(x, toBigEndian);\n\tout.write((const char*)&y, 2);\n}\n\n/**\n * Write a 16-bit signed int to an output stream using the native\n * endianness.\n */\nstatic inline void writeI16(std::ostream& out, int16_t x) {\n\tout.write((const char*)&x, 2);\n}\n\n/**\n * Read a 32-bit unsigned from an input stream, inverting endianness\n * if necessary.\n */\nstatic inline uint32_t readU32(std::istream& in, bool swap) {\n\tuint32_t x;\n\tin.read((char *)&x, 4);\n\tassert_eq(4, in.gcount());\n\tif(swap) {\n\t\treturn endianSwapU32(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n/**\n * Read a 32-bit unsigned from a file descriptor, optionally inverting\n * endianness.\n */\n#ifdef BOWTIE_MM\nstatic inline uint32_t readU32(int in, bool swap) {\n\tuint32_t x;\n\tif(read(in, (void *)&x, 4) != 4) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapU32(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n#endif\n\n/**\n * Read a 32-bit unsigned from a FILE*, optionally inverting\n * endianness.\n */\nstatic inline uint32_t readU32(FILE* in, bool swap) {\n\tuint32_t x;\n\tif(fread((void *)&x, 1, 4, in) != 4) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapU32(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n\n/**\n * Read a 32-bit signed from an input 
stream, inverting endianness\n * if necessary.\n */\nstatic inline int32_t readI32(std::istream& in, bool swap) {\n\tint32_t x;\n\tin.read((char *)&x, 4);\n\tassert_eq(4, in.gcount());\n\tif(swap) {\n\t\treturn endianSwapI32(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n/**\n * Read a 32-bit signed from a file descriptor, optionally inverting\n * endianness.\n */\n#ifdef BOWTIE_MM\nstatic inline int32_t readI32(int in, bool swap) {\n\tint32_t x;\n\tif(read(in, (void *)&x, 4) != 4) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapI32(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n#endif\n\n/**\n * Read a 32-bit signed from a FILE*, optionally inverting\n * endianness.\n */\nstatic inline int32_t readI32(FILE* in, bool swap) {\n\tint32_t x;\n\tif(fread((void *)&x, 1, 4, in) != 4) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapI32(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n\n/**\n * Read a 16-bit unsigned from an input stream, inverting endianness\n * if necessary.\n */\nstatic inline uint16_t readU16(std::istream& in, bool swap) {\n\tuint16_t x;\n\tin.read((char *)&x, 2);\n\tassert_eq(2, in.gcount());\n\tif(swap) {\n\t\treturn endianSwapU16(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n/**\n * Read a 16-bit unsigned from a file descriptor, optionally inverting\n * endianness.\n */\n#ifdef BOWTIE_MM\nstatic inline uint16_t readU16(int in, bool swap) {\n\tuint16_t x;\n\tif(read(in, (void *)&x, 2) != 2) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapU16(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n#endif\n\n/**\n * Read a 16-bit unsigned from a FILE*, optionally inverting\n * endianness.\n */\nstatic inline uint16_t readU16(FILE* in, bool swap) {\n\tuint16_t x;\n\tif(fread((void *)&x, 1, 2, in) != 2) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapU16(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n\n/**\n * Read a 16-bit signed from an input stream, inverting endianness\n * if necessary.\n */\nstatic inline int16_t 
readI16(std::istream& in, bool swap) {\n\tint16_t x;\n\tin.read((char *)&x, 2);\n\tassert_eq(2, in.gcount());\n\tif(swap) {\n\t\treturn endianSwapI16(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n/**\n * Read a 16-bit signed from a file descriptor, optionally inverting\n * endianness.\n */\n#ifdef BOWTIE_MM\nstatic inline int16_t readI16(int in, bool swap) {\n\tint16_t x;\n\tif(read(in, (void *)&x, 2) != 2) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapI16(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n#endif\n\n/**\n * Read a 16-bit signed from a FILE*, optionally inverting\n * endianness.\n */\nstatic inline int16_t readI16(FILE* in, bool swap) {\n\tint16_t x;\n\tif(fread((void *)&x, 1, 2, in) != 2) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\treturn endianSwapI16(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\ntemplate <typename index_t>\nvoid writeIndex(std::ostream& out, index_t x, bool toBigEndian) {\n\tindex_t y = endianizeIndex(x, toBigEndian);\n\tout.write((const char*)&y, sizeof(index_t));\n}\n\n/**\n * Read an unsigned from an input stream, inverting endianness\n * if necessary.\n */\ntemplate <typename index_t>\nstatic inline index_t readIndex(std::istream& in, bool swap) {\n\tindex_t x;\n\tin.read((char *)&x, sizeof(index_t));\n\tassert_eq(sizeof(index_t), in.gcount());\n\tif(swap) {\n\t\treturn endianSwapIndex(x);\n\t} else {\n\t\treturn x;\n\t}\n}\n\n/**\n * Read an unsigned from a file descriptor, optionally inverting\n * endianness.\n */\n#ifdef BOWTIE_MM\ntemplate <typename index_t>\nstatic inline index_t readIndex(int in, bool swap) {\n\tindex_t x;\n\tif(read(in, (void *)&x, sizeof(index_t)) != sizeof(index_t)) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\tif(sizeof(index_t) == 8) {\n\t\t\tassert(false);\n\t\t\treturn 0;\n\t\t} else if(sizeof(index_t) == 4) {\n\t\t\treturn endianSwapU32(x);\n\t\t} else {\n\t\t\tassert_eq(sizeof(index_t), 2);\n\t\t\treturn endianSwapU16(x);\n\t\t}\n\t} else {\n\t\treturn x;\n\t}\n}\n#endif\n\n/**\n * Read an 
unsigned from a FILE*, optionally inverting\n * endianness.\n */\ntemplate <typename index_t>\nstatic inline index_t readIndex(FILE* in, bool swap) {\n\tindex_t x;\n\tif(fread((void *)&x, 1, sizeof(index_t), in) != sizeof(index_t)) {\n\t\tassert(false);\n\t}\n\tif(swap) {\n\t\tif(sizeof(index_t) == 8) {\n\t\t\tassert(false);\n\t\t\treturn 0;\n\t\t} else if(sizeof(index_t) == 4) {\n\t\t\treturn endianSwapU32((uint32_t)x);\n\t\t} else {\n\t\t\tassert_eq(sizeof(index_t), 2);\n\t\t\treturn endianSwapU16(x);\n\t\t}\n\t} else {\n\t\treturn x;\n\t}\n}\n\n\n#endif /*WORD_IO_H_*/\n"
  },
  {
    "path": "zbox.h",
    "content": "/*\n * Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>\n *\n * This file is part of Bowtie 2.\n *\n * Bowtie 2 is free software: you can redistribute it and/or modify\n * it under the terms of the GNU General Public License as published by\n * the Free Software Foundation, either version 3 of the License, or\n * (at your option) any later version.\n *\n * Bowtie 2 is distributed in the hope that it will be useful,\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n * GNU General Public License for more details.\n *\n * You should have received a copy of the GNU General Public License\n * along with Bowtie 2.  If not, see <http://www.gnu.org/licenses/>.\n */\n\n#ifndef ZBOX_H_\n#define ZBOX_H_\n\n#include \"btypes.h\"\n\n/**\n * Fill z with Z-box information for s.  String z will not be resized\n * and will only be filled up to its size cap.  This is the linear-time\n * algorithm from Gusfield.  
An optional sanity-check uses a naive\n * algorithm to double-check results.\n */\ntemplate<typename T>\nvoid calcZ(const T& s,\n           TIndexOffU off,\n           EList<TIndexOffU>& z,\n           bool verbose = false,\n           bool sanityCheck = false)\n{\n\tsize_t lCur = 0, rCur = 0;\n\tsize_t zlen = z.size();\n\tsize_t slen = s.length();\n\tassert_gt(zlen, 0);\n\tassert_eq(z[0], 0);\n\t//assert_leq(zlen, slen);\n\tfor (size_t k = 1; k < zlen && k+off < slen; k++) {\n\t\tassert_lt(lCur, k);\n\t\tassert(z[lCur] == 0 || z[lCur] == rCur - lCur + 1);\n\t\tif(k > rCur) {\n\t\t\t// compare starting at k with prefix starting at 0\n\t\t\tsize_t ki = k;\n\t\t\twhile(off+ki < s.length() && s[off+ki] == s[off+ki-k]) ki++;\n\t\t\tz[k] = (TIndexOffU)(ki - k);\n\t\t\tassert_lt(off+z[k], slen);\n\t\t\tif(z[k] > 0) {\n\t\t\t\tlCur = k;\n\t\t\t\trCur = k + z[k] - 1;\n\t\t\t}\n\t\t} else {\n\t\t\t// position k is contained in a Z-box\n\t\t\tsize_t betaLen = rCur - k + 1;\n\t\t\tsize_t kPrime = k - lCur;\n\t\t\tassert_eq(s[off+k], s[off+kPrime]);\n\t\t\tif(z[kPrime] < betaLen) {\n\t\t\t\tz[k] = z[kPrime];\n\t\t\t\tassert_lt(off+z[k], slen);\n\t\t\t\t// lCur, rCur unchanged\n\t\t\t} else if (z[kPrime] > 0) {\n\t\t\t\tint q = 0;\n\t\t\t\twhile (off+q+rCur+1 < s.length() && s[off+q+rCur+1] == s[off+betaLen+q]) q++;\n\t\t\t\tz[k] = (TIndexOffU)(betaLen + q);\n\t\t\t\tassert_lt(off+z[k], slen);\n\t\t\t\trCur = rCur + q;\n\t\t\t\tassert_geq(k, lCur);\n\t\t\t\tlCur = k;\n\t\t\t} else {\n\t\t\t\tz[k] = 0;\n\t\t\t\tassert_lt(off+z[k], slen);\n\t\t\t\t// lCur, rCur unchanged\n\t\t\t}\n\t\t}\n\t}\n#ifndef NDEBUG\n\tif(sanityCheck) {\n\t\t// Recalculate Z-boxes using naive quadratic-time algorithm and\n\t\t// compare to linear-time result\n\t\tassert_eq(0, z[0]);\n\t\tfor(size_t i = 1; i < z.size(); i++) {\n\t\t\tsize_t j;\n\t\t\tfor(j = i; off+j < s.length(); j++) {\n\t\t\t\tif(s[off+j] != s[off+j-i]) break;\n\t\t\t}\n\t\t\tassert_eq(j-i, z[i]);\n\t\t}\n\t}\n#endif\n}\n\n#endif 
/*ZBOX_H_*/\n"
  }
]